NLP

Lexical Processing: First, you will just convert the raw text into words and, depending on your application's needs, into sentences or paragraphs as well.

Syntactic Processing: uses the grammar of the language to understand the structure of a sentence and how that structure contributes to its meaning.

Semantic Processing: synonyms, antonyms, semantic relations


Lexical Processing

encoding standards:

    1. American Standard Code for Information Interchange (ASCII)
    2. Unicode
      • UTF-8
      • UTF-16


    • The ‘?’ operator
      • matches the preceding character zero or one time
      • optional presence of a character
    • The ‘*’ operator
      • presence of the preceding character zero or more times.
    • The ‘+’ operator
      • one or more times
      • present at least once
    • The ‘{m, n}’ operator
      • Matches the preceding character at least ‘m’ and at most ‘n’ times.
    • ‘^’ specifies the start of the string
    • ‘$’ specifies the end of the string
    • ‘\w+’ matches one or more word characters (letters, digits and underscore).
    • ‘[\w\s]+’ matches one or more word characters or whitespace characters.
    • The match function only matches if the pattern is present at the very start of the string.
    • The search function scans the string from left to right and returns the first match it finds.
    • re.sub(pattern, replacement, string)
    • findall() returns all non-overlapping matches of the pattern as a list.
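
A quick sketch of these operators and functions with Python's re module (the example strings below are made up for illustration):

import re

# '?', '*', '+' and '{m,n}' control how often the preceding character may repeat
print(re.search(r'colou?r', 'color').group())       # 'u' is optional, so 'color' matches
print(re.findall(r'\w+', 'hello, world!'))          # ['hello', 'world']
print(re.match(r'\d+', 'abc 123'))                  # None - match() anchors at the start
print(re.search(r'\d+', 'abc 123').group())         # '123' - search() scans the whole string
print(re.sub(r'\s+', ' ', 'too    many   spaces'))  # 'too many spaces'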



    • Zipf's law states that the frequency of a word is inversely proportional to the rank of the word, where rank 1 is given to the most frequent word, 2 to the second most frequent and so on. This is also called the power law distribution.
    • tokenisation - a technique that’s used to split the text into smaller elements. These elements can be characters, words, sentences, or even paragraphs depending on the application you’re working on.
      • Word tokeniser splits text into different words.
      • Sentence tokeniser splits text into different sentences.
      • Tweet tokeniser handles emojis and hashtags that you see in social media texts
      • Regex tokeniser lets you build your own custom tokeniser using regex patterns of your choice.
    • Bag-of-words model
      • number of rows in a bag-of-words model is equal to the number of documents.
      • number of columns is equal to the number of unique words in the documents, i.e. the vocabulary size.
    • Stemming makes sure that different variations of a word represent the same information
      • rule-based technique that just chops off the suffix of a word to get its root form
      • Porter stemmer: the classic rule-based stemmer for English
      • Snowball stemmer: an improved version of the Porter stemmer that also supports several other languages
      • faster than the lemmatizer
      • stemmer typically gives less accurate results than a lemmatizer.
    • Lemmatization
      • it takes an input word and searches for its base word by going recursively through all the variations of dictionary words.
      • The base word in this case is called the lemma.
      • WordNet lemmatizer
    • TF-IDF Representation
      • the term TF stands for term frequency, and the term IDF stands for inverse document frequency.
      • tf(t,d) = log( frequency of term t in document d / total number of terms in document d )
      • idf(t) = log( total number of documents / number of documents that contain the term t )
      • tf-idf score: tf-idf(t,d) = tf(t,d) * idf(t)
    • canonicalisation. Simply put, canonicalisation means to reduce a word to its base form.
    • phonetic hashing
      • Soundex algorithm
      • words are reduced to a four-character long code
      • phonetic hashing buckets all the similar phonemes (words with similar sound or pronunciation) into a single bucket and
      • gives all these variations a single hash code
    • Levenshtein edit distance
      • edit distance is the number of edits that are needed to convert a source string to a target string

def lev_distance(source='', target=''):
    """Build the Levenshtein distance matrix and return the edit distance."""
    # get length of both strings
    n1, n2 = len(source), len(target)

    # create matrix using length of both strings - source string sits on columns, target string sits on rows
    matrix = [[0 for i1 in range(n1 + 1)] for i2 in range(n2 + 1)]

    # fill the first row - distance from each source prefix to an empty target
    for i1 in range(1, n1 + 1):
        matrix[0][i1] = i1

    # fill the first column - distance from an empty source to each target prefix
    for i2 in range(1, n2 + 1):
        matrix[i2][0] = i2

    # fill the rest of the matrix
    for i2 in range(1, n2 + 1):
        for i1 in range(1, n1 + 1):
            # check whether the letters being compared are the same
            if source[i1 - 1] == target[i2 - 1]:
                value = matrix[i2 - 1][i1 - 1]              # top-left cell value
            else:
                value = min(matrix[i2 - 1][i1] + 1,         # cell above + 1 (insertion)
                            matrix[i2][i1 - 1] + 1,         # cell to the left + 1 (deletion)
                            matrix[i2 - 1][i1 - 1] + 1)     # top-left cell + 1 (substitution)
            matrix[i2][i1] = value

    # the bottom-right cell holds the edit distance
    return matrix[-1][-1]
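
For example, using the function above:

print(lev_distance('kitten', 'sitting'))   # 3: substitute k->s, substitute e->i, insert g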
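
A minimal sketch of the four tokenisers listed above, assuming NLTK is installed (the word and sentence tokenisers also need the 'punkt' data via nltk.download('punkt')):

from nltk.tokenize import word_tokenize, sent_tokenize, TweetTokenizer, RegexpTokenizer

text = "NLP is fun. Loving it :) #nlp"

print(word_tokenize(text))                      # split into words
print(sent_tokenize(text))                      # split into sentences
print(TweetTokenizer().tokenize(text))          # keeps ':)' and '#nlp' intact
print(RegexpTokenizer(r'\w+').tokenize(text))   # custom pattern: runs of word characters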
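
A sketch of the bag-of-words and tf-idf representations described above, using scikit-learn with two toy documents (note that scikit-learn uses a smoothed variant of the tf-idf formulas, not exactly the ones given above):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog ate the bone"]

# bag-of-words: one row per document, one column per vocabulary word
bow = CountVectorizer()
print(bow.fit_transform(docs).toarray())
print(bow.get_feature_names_out())   # vocabulary (get_feature_names() in older scikit-learn versions)

# tf-idf: the same matrix, reweighted so that rare, document-specific terms score higher
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray().round(2))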
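
A sketch contrasting the stemmers and the WordNet lemmatizer mentioned above, assuming NLTK is installed (the lemmatizer needs the 'wordnet' data via nltk.download('wordnet')):

from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

for word in ["running", "flies", "studies", "better"]:
    # stemmers chop suffixes by rule; the lemmatizer looks the word up in WordNet
    print(word, porter.stem(word), snowball.stem(word), lemmatizer.lemmatize(word))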
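
A minimal sketch of the Soundex-style phonetic hashing described above; the digit mapping follows the classic Soundex table, but edge cases (such as 'h' and 'w' between consonants) are simplified here:

def soundex(word):
    """Reduce a word to a four-character phonetic hash (simplified Soundex)."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    hash_code = word[0].upper()          # retain the first letter
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")         # vowels and h/w/y get no code
        if code and code != prev:        # collapse adjacent identical codes
            hash_code += code
        prev = code
    return (hash_code + "000")[:4]       # pad/truncate to four characters

print(soundex("Bangalore"), soundex("Bengaluru"))   # similar-sounding words get the same hash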

    • pointwise mutual information, also called the PMI
      • used to decide whether words that frequently occur together (e.g. ‘new york’) should be replaced with a single term
      • PMI(x, y) = log ( P(x, y)/P(x)P(y) )
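
A sketch of scoring bigrams by PMI with NLTK's collocation utilities; high-PMI pairs such as ('new', 'york') are the ones worth replacing with a single term (the toy sentence is made up for illustration):

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = "new york is a big city and new york never sleeps".split()

finder = BigramCollocationFinder.from_words(tokens)
measures = BigramAssocMeasures()

# PMI(x, y) = log( P(x, y) / (P(x) * P(y)) ), estimated from the bigram and unigram counts
for bigram, score in finder.score_ngrams(measures.pmi):
    print(bigram, round(score, 2))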




Syntactic Processing

Syntactic Processing: uses the grammar of the language to understand the structure of a sentence and how that structure contributes to its meaning.

    • Word order and meaning
    • Retaining stopwords
    • Morphology of words
    • Parts-of-speech of words in a sentence
  • Part-of-speech tagging
    • task of assigning a part of speech tag (POS tag) to each word
    • Approaches
      • Lexicon-based
        • for each word, it assigns the POS tag that most frequently occurs for that word in some training corpus
        • cannot handle unknown/ambiguous words
      • Rule-based
        • rules that are applied to the entire text
      • Probabilistic (or stochastic) techniques
        • Hidden Markov Model (HMM).
          • Markov assumption states that the probability of a state depends only on the probability of the previous state leading to it.
          • for a sentence with n words and t possible tags, there are t^n possible tag sequences
          • states are hidden and they emit observations
          • The transition and emission probabilities specify the probability of moving from one hidden state (tag) to the next and of a state emitting a particular observation (word).
        • Viterbi heuristic / Viterbi algorithm.
          • given a list of observations (words) to be tagged, rather than computing the probabilities of all possible tag sequences, you assign tags sequentially, i.e. assign the most likely tag to each word using the previous tag.
          • the tag assigned to a word is assumed to depend only on the current word and the previous tag
        • parameters required for defining an HMM
          • Emission and Transition Probability
          • Initial state and initial state distribution
        • Emission probability = P(observation|state).
            • P(w|t) = Number of times w has been tagged t/Number of times t appears
        • Transition Probability of tag t1 followed by tag t2:
            • P(t2|t1) = Number of times t1 is followed by tag t2/ Number of times t1 appears
        • P(tag|word) = P(word|tag) * P(tag|previous tag) = Emission probability * Transition probability

      • Deep learning techniques
  • Constituency parsing
    • divide the sentence into constituent phrases such as noun phrase, verb phrase, prepositional phrase etc
    • Constituency parsing checks whether the sentence is grammatically correct, i.e. whether it conforms to the rules of the grammar (a grammatical sentence is not necessarily meaningful).
    • Noun Phrases (NP),
    • Verb Phrases (VP), and
    • Prepositional Phrases (PP)
    • CFG
      • Context-Free Grammars
      • A grammar defines a set of production rules that group the words of a sentence into constituents, each of which acts as a single unit
        • Top-Down Parsing
          • Start from the starting symbol S and produce each word in the sentence
          • can get stuck in an infinite loop on left-recursive rules
        • Bottom-up Parsing
          • Start from the individual words and reduce them to the sentence
          • shift-reduce parser
      • Probabilistic CFG
        • used when we want to find the most probable parsed structure of the sentence
      • Chomsky Normal Form
        • a normalised form of a CFG with a standard restriction on how production rules may be written (each rule expands to either two non-terminals or a single terminal)
  • Dependency parsing
    • establish relationships directly between the words themselves
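
A toy sketch of estimating the HMM emission and transition probabilities defined in the part-of-speech tagging notes above, using a tiny hand-tagged corpus made up for illustration:

from collections import Counter

# a toy tagged corpus: (word, tag) pairs treated as one sequence
tagged = [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB"),
          ("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]

tag_counts = Counter(tag for _, tag in tagged)
emission_counts = Counter(tagged)
transition_counts = Counter((t1, t2) for (_, t1), (_, t2) in zip(tagged, tagged[1:]))

# P(word | tag) = count(word tagged as tag) / count(tag)
print(emission_counts[("dog", "NOUN")] / tag_counts["NOUN"])   # 0.5
# P(tag2 | tag1) = count(tag1 followed by tag2) / count(tag1)
print(transition_counts[("DET", "NOUN")] / tag_counts["DET"])  # 1.0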
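
A sketch of constituency parsing with a toy context-free grammar in NLTK; the grammar and the sentence are made up for illustration:

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'cat'
V -> 'chased'
""")

# the chart parser builds every parse tree licensed by the grammar
parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chased the cat".split()):
    print(tree)   # (S (NP (Det the) (N dog)) (VP (V chased) (NP (Det the) (N cat))))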
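
A sketch of dependency parsing with spaCy, assuming the small English model has been installed (python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The dog chased the cat")

# each word is linked to its head word by a typed dependency relation
for token in doc:
    print(token.text, token.dep_, token.head.text)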

Semantic Processing

  • Infer the meaning of a given piece of text
  • Entities are grouped into what is known as an entity type.
    • Entities exist in the physical world.
    • A predicate is a function which takes in some parameters and asserts whether the relationship between those parameters is True or False
  • Reification?
    • Representation of a complex set of associations as an instance of an abstract entity type.
    • Reified entity is a virtual entity
  • In unsupervised techniques, such as the Lesk algorithm, you assign to the ambiguous word the dictionary definition that overlaps maximally with the surrounding words
  • In supervised techniques, such as naive Bayes (or any other classifier), you take the context-sense pairs as the training data: the label is the 'sense' and the input is the context words
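
A sketch of the Lesk algorithm via NLTK's built-in implementation, assuming the 'wordnet' data has been downloaded; the returned synset is the sense whose definition overlaps most with the context words:

from nltk.wsd import lesk

context = "I went to the bank to deposit my money".split()
sense = lesk(context, "bank")
print(sense, "-", sense.definition())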

Distributional Semantics

  • words which appear in the same contexts have similar meanings
  • A word can be identified by the company it keeps.
  • occurrence matrix: a terms x documents matrix recording which terms occur in which documents
  • co-occurrence matrix: a terms x terms matrix recording how often two terms occur together in the same context, hence a square matrix
  • Latent Semantic Analysis (LSA)
    • Singular Value Decomposition (SVD) to reduce the dimensionality of the matrix.
    • resulting dimensions are not interpretable
    • cannot deal with issues such as polysemy
  • Word2Vec
    • used to compute word-embeddings (or word vectors) using some large corpora as the training data.
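
A sketch of LSA on a tiny term-document matrix using scikit-learn's truncated SVD; the corpus and the number of latent dimensions are made up for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat", "the dog sat on the log", "stocks fell on bad news"]

X = CountVectorizer().fit_transform(docs)        # occurrence (term-document) matrix
lsa = TruncatedSVD(n_components=2)               # SVD down to 2 latent dimensions
print(lsa.fit_transform(X).round(2))             # each document as a 2-d latent vector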
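
A sketch of training word vectors with gensim's Word2Vec (gensim 4.x API); the two-sentence corpus is only for illustration, as real embeddings need a large corpus:

from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "log"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["cat"][:5])            # first few dimensions of the 'cat' vector
print(model.wv.most_similar("cat"))   # nearest neighbours in the embedding space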