Lexical Processing: first, convert the raw text into words and, depending on the application's needs, into sentences or paragraphs as well.
Syntactic Processing: use the grammar of the language to understand the structure of sentences and the relationships between words.
Semantic Processing: synonyms, antonyms, semantic relations
encoding standards:
American Standard Code for Information Interchange (ASCII)
Unicode
UTF-8
UTF-16
The ‘?’ operator
matches the preceding character zero or one time
optional presence of a character
The ‘*’ operator
presence of the preceding character zero or more times.
The ‘+’ operator
one or more times
present at least once
The ‘{m, n}’ operator
matches the preceding character at least ‘m’ times and at most ‘n’ times.
‘^’ specifies the start of the string
‘$’ specifies the end of the string
‘\w+’ matches one or more word characters (alphanumeric characters and underscore).
‘[\w\s]+’ matches both alphanumeric characters and whitespaces.
match function will only match if the pattern is present at the very start of the string.
search function scans the string from left to right and returns the first occurrence of the pattern, wherever it appears
re.sub(pattern, replacement, string)
findall() returns all non-overlapping matches of the pattern as a list.
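A minimal sketch of these functions (the pattern and text are illustrative):

import re

text = "NLP is fun. NLP 101 starts at 9 am."

# match: succeeds only if the pattern is at the very start of the string
print(re.match(r'\w+', text))          # matches 'NLP'

# search: scans the string and returns the first occurrence of the pattern
print(re.search(r'\d+', text))         # matches '101'

# sub: replaces every occurrence of the pattern
print(re.sub(r'\d+', '#', text))       # 'NLP is fun. NLP # starts at # am.'

# findall: returns all non-overlapping matches as a list
print(re.findall(r'\w+', text))        # ['NLP', 'is', 'fun', ...]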
Zipf's law states that the frequency of a word is inversely proportional to the rank of the word, where rank 1 is given to the most frequent word, 2 to the second most frequent and so on. This is also called the power law distribution.
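A quick way to inspect this on a corpus (the snippet uses a toy token list; the law only shows up clearly on a large corpus):

from collections import Counter

tokens = "to be or not to be that is the question".split()
freq = Counter(tokens)

# rank words by frequency; under Zipf's law, frequency is roughly proportional to 1/rank
for rank, (word, count) in enumerate(sorted(freq.items(), key=lambda kv: -kv[1]), start=1):
    print(rank, word, count)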
tokenisation - a technique that’s used to split the text into smaller elements. These elements can be characters, words, sentences, or even paragraphs depending on the application you’re working on.
Word tokeniser splits text into different words.
Sentence tokeniser splits text into different sentences.
Tweet tokeniser handles emojis and hashtags that you see in social media texts
Regex tokeniser lets you build your own custom tokeniser using regex patterns of your choice.
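A minimal sketch with NLTK's tokenisers (assumes nltk and its 'punkt' data are installed; the text is illustrative):

from nltk.tokenize import word_tokenize, sent_tokenize, TweetTokenizer, regexp_tokenize

text = "I love NLP! Don't you? #nlp :)"

print(word_tokenize(text))                    # word tokeniser
print(sent_tokenize(text))                    # sentence tokeniser
print(TweetTokenizer().tokenize(text))        # keeps '#nlp' and ':)' as single tokens
print(regexp_tokenize(text, pattern=r'\w+'))  # custom regex tokeniser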
Bag-of-words model
number of rows in a bag-of-words model is equal to the number of documents.
number of columns is equal to the number of unique words in the documents, i.e. the vocabulary size.
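A minimal bag-of-words sketch with scikit-learn's CountVectorizer (the documents are illustrative):

from sklearn.feature_extraction.text import CountVectorizer

documents = ["the cat sat on the mat", "the dog sat on the log"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(documents)   # one row per document

# columns = vocabulary (get_feature_names() in older scikit-learn versions)
print(vectorizer.get_feature_names_out())
print(bow.toarray())                        # rows = documents, columns = unique words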
Stemming makes sure that different variations of a word represent the same information
rule-based technique that just chops off the suffix of a word to get its root form
Porter stemmer:
Snowball stemmer:
faster than the lemmatizer
stemmer typically gives less accurate results than a lemmatizer.
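A minimal sketch with NLTK's stemmers (the words are illustrative):

from nltk.stem import PorterStemmer, SnowballStemmer

words = ["driving", "drives", "driver", "happily"]

porter = PorterStemmer()
snowball = SnowballStemmer("english")

print([porter.stem(w) for w in words])    # e.g. 'driving' -> 'drive', 'happily' -> 'happili'
print([snowball.stem(w) for w in words])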
Lemmatization
it takes an input word and searches for its base word by going recursively through all the variations of dictionary words.
The base word in this case is called the lemma.
WordNet lemmatizer
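A minimal sketch with NLTK's WordNet lemmatizer (requires the WordNet data to be downloaded):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# defaults to treating the word as a noun; pass pos='v' for verbs
print(lemmatizer.lemmatize("feet"))              # 'foot'
print(lemmatizer.lemmatize("driving", pos="v"))  # 'drive'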
TF-IDF Representation
TF stands for term frequency and IDF stands for inverse document frequency.
tf(t, d) = log( frequency of term ′t′ in document ′d′ / total terms in document ′d′ )
idf(t) = log( total number of documents / number of documents that contain the term ′t′ )
tf-idf score: tf-idf(t, d) = tf(t, d) * idf(t)
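A minimal sketch with scikit-learn's TfidfVectorizer (its exact weighting and smoothing differ slightly from the formulas above; the documents are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["the cat sat on the mat", "the dog sat on the log"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray())   # higher scores for terms frequent in a document but rare across documents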
canonicalisation. Simply put, canonicalisation means to reduce a word to its base form.
phonetic hashing
Soundex algorithm
words are reduced to a four-character long code
phonetic hashing buckets all words with a similar sound or pronunciation into a single bucket and
gives all these variations a single hash code
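A rough, simplified Soundex-style sketch (real implementations handle some edge cases, such as 'h'/'w' between letters with the same code, differently):

def soundex(word):
    """Return a simplified four-character Soundex-style code for a word."""
    codes = {
        'b': '1', 'f': '1', 'p': '1', 'v': '1',
        'c': '2', 'g': '2', 'j': '2', 'k': '2', 'q': '2', 's': '2', 'x': '2', 'z': '2',
        'd': '3', 't': '3',
        'l': '4',
        'm': '5', 'n': '5',
        'r': '6',
    }
    word = word.lower()
    first = word[0].upper()                            # retain the first letter
    digits = [codes.get(ch, '') for ch in word[1:]]    # drop vowels, 'h', 'w', 'y'
    result = []
    for d in digits:
        if d and (not result or result[-1] != d):      # remove consecutive duplicate digits
            result.append(d)
    return (first + ''.join(result) + '000')[:4]       # pad/truncate to four characters

print(soundex("Bangalore"))   # e.g. 'B524'
print(soundex("Bengaluru"))   # same hash code - same bucket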
Levenshtein edit distance
edit distance is the minimum number of edits (insertions, deletions or substitutions) needed to convert a source string into a target string
def lev_distance(source='', target=''):
    """Compute the Levenshtein distance between two strings using a distance matrix."""
    # get length of both strings
    n1, n2 = len(source), len(target)

    # create matrix using length of both strings - source string sits on columns, target string sits on rows
    matrix = [[0 for i1 in range(n1 + 1)] for i2 in range(n2 + 1)]

    # fill the first row - (0 to n1)
    for i1 in range(1, n1 + 1):
        matrix[0][i1] = i1

    # fill the first column - (0 to n2)
    for i2 in range(1, n2 + 1):
        matrix[i2][0] = i2

    # fill the rest of the matrix
    for i2 in range(1, n2 + 1):
        for i1 in range(1, n1 + 1):
            # check whether the letters being compared are the same
            if source[i1 - 1] == target[i2 - 1]:
                value = matrix[i2 - 1][i1 - 1]           # diagonal (top-left) cell value
            else:
                value = min(matrix[i2 - 1][i1] + 1,      # cell above + 1
                            matrix[i2][i1 - 1] + 1,      # cell to the left + 1
                            matrix[i2 - 1][i1 - 1] + 1)  # diagonal cell + 1
            matrix[i2][i1] = value

    # the bottom-right cell holds the edit distance
    return matrix[-1][-1]
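For example:

print(lev_distance('cat', 'cats'))        # 1 - one insertion
print(lev_distance('kitten', 'sitting'))  # 3 - two substitutions and one insertion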
pointwise mutual information, also called the PMI
helps decide whether a multi-word term (e.g. 'New Delhi') should be replaced with a single token
PMI(x, y) = log ( P(x, y)/P(x)P(y) )
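A rough sketch of computing PMI for a word pair from counts (the corpus counts below are purely illustrative; log base 2 is used):

import math

total_words = 10000     # size of the hypothetical corpus
count_new = 150         # occurrences of 'new'
count_york = 80         # occurrences of 'york'
count_new_york = 60     # occurrences of the bigram 'new york'

p_x = count_new / total_words
p_y = count_york / total_words
p_xy = count_new_york / total_words

pmi = math.log(p_xy / (p_x * p_y), 2)
print(pmi)   # a high PMI suggests 'new york' behaves as a single term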
Syntactic Processing: use the grammar of the language to understand the structure of a sentence and, through it, the meaning.
Word order and meaning
Retaining stopwords
Morphology of words
Parts-of-speech of words in a sentence
Part-of-speech tagging
task of assigning a part of speech tag (POS tag) to each word
Ways of POS tagging:
Lexicon-based
for each word, it assigns the POS tag that most frequently occurs for that word in some training corpus
cannot handle unknown/ambiguous words
Rule-based
rules that are applied to the entire text
Probabilistic (or stochastic) techniques
Hidden Markov Model (HMM).
Markov assumption states that the probability of a state depends only on the probability of the previous state leading to it.
for a sentence with n words and t possible tags, there are t^n possible tag sequences
states are hidden and they emit observations
The transition and the emission probabilities specify the probabilities of transition between tags (states) and emission of words (observations) from tags.
Viterbi heuristic / Viterbi algorithm.
given a list of observations (words) to be tagged, rather than computing the probabilities of all possible tag sequences, you assign tags sequentially, i.e. assign the most likely tag to each word using the previous tag.
the tag assigned to a word is assumed to depend only on the current word and the previous tag
parameters required for defining an HMM
Emission and Transition Probability
Initial state and initial state distribution
Emission probability = P(observation|state).
P(w|t) = Number of times w has been tagged t/Number of times t appears
Transition Probability of tag t1 followed by tag t2:
P(t2|t1) = Number of times t1 is followed by tag t2/ Number of times t1 appears
P(tag|word) ∝ P(word|tag) * P(tag|previous tag) = Emission probability * Transition probability
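A minimal sketch of this sequential (greedy) assignment, assuming the emission and transition probabilities have already been estimated from a tagged corpus; the probability tables, tag set and sentence below are purely illustrative:

# illustrative probabilities (normally estimated from a tagged corpus)
emission = {                   # P(word | tag)
    ('the', 'DET'): 0.3, ('dog', 'NOUN'): 0.02, ('barks', 'VERB'): 0.01,
}
transition = {                 # P(tag | previous tag); '<s>' marks the start of a sentence
    ('<s>', 'DET'): 0.4, ('DET', 'NOUN'): 0.5, ('NOUN', 'VERB'): 0.3,
}
tags = ['DET', 'NOUN', 'VERB']

def greedy_tag(words):
    """Assign tags left to right, choosing the tag that maximises emission * transition."""
    prev_tag, result = '<s>', []
    for word in words:
        best_tag = max(tags, key=lambda t: emission.get((word, t), 1e-6)
                                           * transition.get((prev_tag, t), 1e-6))
        result.append((word, best_tag))
        prev_tag = best_tag
    return result

print(greedy_tag(['the', 'dog', 'barks']))   # [('the', 'DET'), ('dog', 'NOUN'), ('barks', 'VERB')]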
Deep learning techniques
Constituency parsing
divide the sentence into constituent phrases such as noun phrase, verb phrase, prepositional phrase etc
Constituency parsing checks whether the sentence is syntactically well formed, i.e. whether it conforms to the grammar; a grammatically valid sentence need not be meaningful.
Noun Phrases (NP),
Verb Phrases (VP), and
Prepositional Phrases (PP)
CFG
Context-Free Grammars
A grammar defines a set of rules which parse the sentence into constituents - groups of words that act as a single unit
Top-Down Parsing
Start from the starting symbol S and produce each word in the sentence
can get stuck in an infinite loop (e.g. with left-recursive rules)
Bottom-up Parsing
Start from the individual words and reduce them to the sentence
shift-reduce parser
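A minimal sketch of both strategies on a toy grammar in NLTK (the grammar and sentence are illustrative; RecursiveDescentParser works top-down, ShiftReduceParser bottom-up):

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> 'the' N
VP -> V NP
N -> 'dog' | 'ball'
V -> 'chased'
""")

sentence = ['the', 'dog', 'chased', 'the', 'ball']

# top-down: start from S and expand rules until the words are produced
for tree in nltk.RecursiveDescentParser(grammar).parse(sentence):
    print(tree)

# bottom-up (shift-reduce): start from the words and reduce them to S
for tree in nltk.ShiftReduceParser(grammar).parse(sentence):
    print(tree)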
Probabilistic CFG
used when we want to find the most probable parsed structure of the sentence
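A minimal PCFG sketch with NLTK's ViterbiParser, which returns the most probable parse (the rules and probabilities below are illustrative):

import nltk

pcfg = nltk.PCFG.fromstring("""
S -> NP VP [1.0]
NP -> 'the' N [1.0]
VP -> V NP [0.7] | V [0.3]
N -> 'dog' [0.5] | 'ball' [0.5]
V -> 'chased' [1.0]
""")

parser = nltk.ViterbiParser(pcfg)
for tree in parser.parse(['the', 'dog', 'chased', 'the', 'ball']):
    print(tree)   # the most probable parse, annotated with its probability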
Chomsky Normal Form
normalized version of the CFG with a standard set of rules defining how production rules must be written (each rule is either of the form A -> B C or A -> a).
Dependency parsing
establish relationships directly between the words themselves
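A minimal dependency-parsing sketch with spaCy (assumes the en_core_web_sm model has been downloaded):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The dog chased the ball")

# each word points to its head word via a dependency relation
for token in doc:
    print(token.text, token.dep_, token.head.text)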
Infer the meaning of a given piece of text
Entities are grouped into what is known as an entity type.
Entities exist in the physical world.
A predicate is a function which takes in some parameters and asserts whether the relationship between the parameters is True or False
Reification?
Representation of a complex set of associations as an instance of an abstract entity type.
Reified entity is a virtual entity
In unsupervised techniques, such as the Lesk algorithm, you assign to the ambiguous word the sense whose dictionary definition overlaps maximally with the surrounding words.
In supervised techniques, such as naive Bayes (or any classifier for that matter), you take the context-sense set as the training data: the label is the 'sense' and the input is the context words.
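NLTK ships a Lesk implementation that can serve as a sketch of the unsupervised approach (requires the WordNet data; the sentence is illustrative and the chosen sense may not always match intuition):

from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

sentence = "I went to the bank to deposit my money"
sense = lesk(word_tokenize(sentence), 'bank')

print(sense, '-', sense.definition())   # WordNet synset whose definition overlaps most with the context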
words which appear in the same contexts have similar meanings
A word can be identified by the company it keeps.
occurrence matrix: terms on one axis, occurrence contexts (e.g. documents) on the other
co-occurrence matrix: terms on both axes, i.e. a square matrix
Latent Semantic Analysis (LSA)
Singular Value Decomposition (SVD) to reduce the dimensionality of the matrix.
resulting dimensions are not interpretable
cannot deal with issues such as polysemy
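A minimal LSA sketch: build a term-document matrix and reduce it with truncated SVD (the documents and the number of components are illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

documents = ["the cat sat on the mat",
             "the dog sat on the log",
             "cats and dogs are pets"]

term_doc = CountVectorizer().fit_transform(documents)

# project the documents onto 2 latent 'topic' dimensions
lsa = TruncatedSVD(n_components=2)
doc_vectors = lsa.fit_transform(term_doc)
print(doc_vectors)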
Word2Vec
used to compute word-embeddings (or word vectors) using some large corpora as the training data.
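A minimal sketch with gensim (parameter names follow gensim 4.x; the toy sentences stand in for a large training corpus):

from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "log"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["cat"])                 # the learned word vector
print(model.wv.most_similar("cat"))    # nearest words in the embedding space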