Parsing is the first step that converts unstructured text into a structured, spreadsheet-like form for ease of analysis
Parsing involves the following key tasks:
Extraction of words (Tokenization)
Normalization of tokens (words / terms)
Stemming
Lemmatization (involves the use of synonyms)
Part of Speech (POS) tagging
Text Filtering:
Word Filtering (filtering of “no value” terms using stop and start lists)
Tokenization
Involves taking a stream of characters (such as a sequence of sentences in a text document) and breaking it down into tokens or terms (e.g. words, phrases of words, numbers, punctuation marks)
A term identified through tokenization might not be just a single word, but can be a group of words (for example, noun groups)
Bag-of-Tokens Approach
Uses the tokenized words for each observation and counts the frequency of each token
The tokenization process involves applying delimiters such as a “period”, “space”, “tab”, and “new line” to unstructured text, and then figuring out instances (occurrences) of each token.
For example, “I love to run” yields 4 tokens when the delimiter is the space
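As a minimal sketch of the bag-of-tokens approach (the function names are illustrative, not from any particular toolkit), whitespace tokenization followed by frequency counting can look like:

```python
from collections import Counter

def tokenize(text):
    # Break the character stream into tokens using whitespace as the delimiter
    return text.split()

def bag_of_tokens(text):
    # Count the occurrences of each token: the "bag-of-tokens" representation
    return Counter(tokenize(text))
```

For example, `tokenize("I love to run")` produces 4 tokens, and `bag_of_tokens` records how often each one occurs across an observation.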
The purpose of various punctuation characters is also deciphered during parsing
Examples:
A comma delimiter, e.g. “2,566” (is this 1 token or 2?)
An apostrophe, e.g. “I’ll” (is this 1 token, or 2 tokens “I” and “will”?)
A dash, e.g. the product ID “765-7544” (is this 1 token or 2?)
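The punctuation questions above are design choices a tokenizer must make. The toy sketch below (with a hypothetical contraction table) commits to one possible set of answers: it keeps “2,566” as a single token, expands “I'll” into two tokens, and splits a dashed product ID in two:

```python
import re

# Hypothetical contraction-expansion table (illustrative only)
CONTRACTIONS = {"I'll": ["I", "will"]}

def tokenize_with_rules(text):
    tokens = []
    for raw in text.split():
        raw = raw.strip(".,!?")  # drop leading/trailing punctuation
        if raw in CONTRACTIONS:
            tokens.extend(CONTRACTIONS[raw])          # "I'll" -> "I", "will"
        elif re.fullmatch(r"\d{1,3}(,\d{3})*", raw):
            tokens.append(raw)                        # "2,566" stays one token
        elif re.fullmatch(r"\d+-\d+", raw):
            tokens.extend(raw.split("-"))             # "765-7544" becomes two tokens
        else:
            tokens.append(raw)
    return tokens
```

Other tokenizers could defensibly make the opposite choices; the point is that each punctuation character needs an explicit rule.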
Normalization of tokens
The purpose is to reduce the complexity of words by reducing variants (inflectional forms) of the same word to their root or base form
Examples:
“run”, “ran”, “running” are normalized to “run”
Two common techniques of Normalization
Stemming
Normalization restricted to grammatical variants such as tense forms or singular and plural
The algorithm that performs stemming is known as a stemmer, which is based on rules or a dictionary
Rules identify and strip suffixes from the words and deduce stems
E.g. “dogs” -> “dog”
E.g. “ponies” -> “poni”
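A minimal rule-based stemmer sketch (only a handful of suffix rules, nothing like the full Porter algorithm) that reproduces the two examples above:

```python
# Suffix rules are tried in order; each strips or replaces a suffix to deduce a stem
RULES = [("ies", "i"), ("sses", "ss"), ("ing", ""), ("ed", ""), ("s", "")]

def stem(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word  # no rule matched: the word is already its own stem
```

Note that a real stemmer needs many more rules and conditions; this sketch handles “dogs” and “ponies” but would mangle words like “running” (producing “runn”).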
Lemmatization
Using morphological analysis, the meaning and interpretation of words (semantics) can be analyzed
Words which are semantically equivalent can be traced to their base form (e.g. synonyms)
Morphological analysis is the study of how words are formed
E.g. “vehicles”, “automobiles”, and “cars” can all be mapped to the base word “car”
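As a toy illustration (the lemma table below is hand-built for this example; real lemmatizers combine morphological analysis with a dictionary such as WordNet), synonym-based lemmatization can be sketched as:

```python
# Hand-built synonym/lemma table mapping semantically equivalent
# terms to one base form (illustrative only)
LEMMAS = {
    "vehicles": "car", "automobiles": "car", "cars": "car",
    "ran": "run", "running": "run",
}

def lemmatize(word):
    # Look up the base form; unknown words pass through unchanged
    return LEMMAS.get(word.lower(), word)
```

Unlike the suffix-stripping stemmer, this table can unify words that share no surface form at all (“automobiles” and “cars”), because it works on meaning rather than spelling.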
Part of Speech (POS) tagging
To determine the Parts of Speech (POS) for each token to perform further linguistic analyses and to extract more sophisticated features
POS tagging is the task of labeling each token (word) in a sentence with its appropriate part of speech
For each word in a sentence, the algorithm decides whether it is a noun, verb, adjective, adverb, preposition, or conjunction etc.
Sentiment analysis, for example, cannot drop adjectives and conjunctions
POS tagging can be complicated because the same word can represent different parts of speech depending on the context
E.g. The word “institute” is lexically ambiguous because it can either be a noun or a verb
To resolve the ambiguity issue, an understanding of the adjacent terms in the sentence or in the paragraph is required to identify the part of speech for a term correctly
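A toy disambiguation sketch for the “institute” example, using only the immediately preceding token as context (real POS taggers learn such context statistically from annotated corpora):

```python
def tag_institute(tokens):
    # Tag only the ambiguous word "institute": after "to" treat it as a
    # verb (VB), otherwise as a noun (NN); other tokens are left untagged
    # in this sketch
    tags = []
    for i, tok in enumerate(tokens):
        if tok.lower() == "institute":
            prev = tokens[i - 1].lower() if i > 0 else ""
            tags.append("VB" if prev == "to" else "NN")
        else:
            tags.append(None)
    return tags
```

For example, in “they plan to institute reforms” the word is tagged VB, while in “the institute opened” it is tagged NN, purely from the adjacent term.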
The Role by Freq window indicates that “be” and “state” have a higher frequency than the other terms
The former occurs 6476 times while the latter occurs 1343 times
The term “abolish” has been stemmed (normalized) to its base word, which is indicated by a (+) sign
The term “abuse” has the role of Verb in 17 documents and the role of Noun in 10 documents
The term “accidental cause” has the role of Noun Group
To understand Text Parsing without stemming, refer to the diagram below:
The terms “abandoned” and “abandoning” are not stemmed. (They are inflectional forms of the base word “abandon”.)
Running the Text Parsing node (Synonyms) traces a word to its base form
Suppose the word “abuse”, regardless of whether it is a noun or a verb, is to be treated as a noun in all documents; then “abuse” (verb) can be added as a synonym of “abuse” (noun)
Word Filtering
In parsing, there are always terms which are of no or little value to text analysis
Examples are frequently occurring words in the English language such as “a”, “the”, “be”, “of”, “in”, “at”, and “to”
A start or stop list helps control the terms that are used in text mining analysis
A stop list consists of stop words that have little or no value in identifying a document or in comparing documents.
Stop lists contain stop words that are articles (the, a, this), conjunctions (and, but, or), and prepositions (of, from, by)
A start list contains the words that you want to include in the analysis
Start lists are mainly used when documents are dominated by technical jargon or in situations where adequate domain expertise is available
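A minimal stop-list filter sketch (the stop list here is a small illustrative subset of the articles, conjunctions, and prepositions mentioned above):

```python
# Small illustrative stop list of high-frequency, low-value terms
STOP_LIST = {"a", "the", "this", "and", "but", "or",
             "of", "from", "by", "be", "in", "at", "to"}

def filter_terms(tokens):
    # Keep only tokens that are not on the stop list
    return [t for t in tokens if t.lower() not in STOP_LIST]
```

A start list would work the other way around: keep a token only if it appears on the list, which is useful when the analysis should focus on a known technical vocabulary.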