Applied Text Mining in Python

November 2018

Module 1: Working with Text in Python

Introduction to text mining

A tweet has a topic and sentiment.

What can be done with text:

  • Parse
  • Find / identify / extract info
  • Classify text documents
  • Search
  • Sentiment analysis
  • Topic modeling

Handling text in Python

Sentences (input strings) are made up of words, which are made up of characters. At the other end, sentences make up documents.

words = text.split(' ') # Find words. Can give empty strings as well.
[w for w in words if len(w) > 3] # words longer than 3 characters
[w for w in words if w.istitle()] # words whose first letter is capitalized
[w for w in words if w.endswith('s')] # words that end in 's'
set(words) # unique words
set(w.lower() for w in words) # unique words regardless of capitalization
text.startswith('s')
t in text # substring in string
text.isupper(), text.istitle() # all caps; first letter of each word capitalized
text.isalpha(), text.isdigit(), text.isalnum()
text.splitlines()
' '.join(words) # join all words back into one string
text.strip() # Remove whitespace from both ends
text.rstrip() # Remove whitespace from the end
text.find('t'); text.rfind('t') # first instance from the front; first instance from the back
text.replace('hello', 'yo')

list(text) # To get all characters

Reading files by line

f = open('a.txt', 'r')
f.readline()

f.seek(0) # Set position back to 0
text = f.read() # Read all text
text.splitlines() # Split lines

for line in f:
   print(line)

f.write(text) # requires the file to be opened in write mode, e.g. open('a.txt', 'w')
f.close()
f.closed # True once the file is closed (attribute, not a method)

Regular expressions

Find call outs and hash tags

print([word for word in tweet.split() if word.startswith('#')])

import re
re.search('@[A-Za-z0-9_]+', word) # look for this pattern in word; + means one or more occurrences
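
As a quick check, the callout pattern can be applied to a made-up tweet (the example string below is an assumption, not from the course):

tweet = "@nltk is great for #nlp, says @py_user"  # hypothetical example text
callouts = [w for w in tweet.split() if re.search('@[A-Za-z0-9_]+', w)]
# ['@nltk', '@py_user']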

. # wildcard matches a single character
^ # start of a string
xyz$ # end of a string
[^abc] # not a, b or c

a|b # a or b

() # scoping for operators
\ # escape character
\b # word boundary
\d # any digit
\D # not digit
\s # any whitespace
\w # any alphanumeric
\W # not alphanumeric

* # matches zero or more occurrences
+ # one or more
? # zero or one
{n} # n times
{n,} # at least n times
{,n} # at most n times
{m,n} # between m and n times

re.findall(r'[aeiou]', text) # find vowels
re.findall(r'[^aeiou]', text) # find consonants

regular expressions for dates

re.findall(r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}', datestring) # find numeric dates
re.findall(r'(?:\d{1,2} )?(?:Jan|Feb|etc.)[a-z]* (?:\d{1,2}, )?\d{4}', datestring) # Find text dates
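
A quick sanity check of both patterns, assuming a made-up date string (and spelling out the month alternatives abbreviated with "etc." above):

datestring = "Due 23/10/2002, revised Oct 2015, published 14 Feb 2009"  # hypothetical example
re.findall(r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}', datestring)
# ['23/10/2002']
re.findall(r'(?:\d{1,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{1,2}, )?\d{4}', datestring)
# ['Oct 2015', '14 Feb 2009']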

English and ASCII

ASCII: American Standard Code for Information Interchange

7-bit character encoding: 128 valid codes

range: 0x00 - 0x7F [(0000 0000) to (0111 1111)]

Includes all English letters and digits.

Diacritic: an accent mark added to a letter that changes how it is pronounced.

ASCII cannot represent international languages, musical symbols, or emoticons.

Unicode and UTF-8 are most common now: the industry standard for encoding and representing text.

Over 128,000 characters; each character takes one to four bytes. UTF-8 = Unicode Transformation Format, 8-bit.
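
A small sketch of the variable-width encoding (the example string is an assumption):

text = 'résumé'            # 'é' is outside ASCII
len(text)                  # 6 characters
len(text.encode('utf-8'))  # 8 bytes: each 'é' takes 2 bytes in UTF-8
'A'.encode('utf-8')        # b'A': ASCII characters still take 1 byte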


Assignment

Extract dates from text, normalize them to a common format, and put them in ascending order. The score is calculated using Kendall's tau (a correlation measure for ordinal data).
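
Kendall's tau is available in scipy; a minimal sketch, assuming two hypothetical orderings of date indices:

from scipy import stats

predicted_order = [0, 1, 2, 4, 3]  # hypothetical output with the last two dates swapped
true_order = [0, 1, 2, 3, 4]       # correct chronological order
tau, p_value = stats.kendalltau(predicted_order, true_order)
print(tau)  # 0.8; closer to 1 means the orderings agree more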

Module 2: Basic Natural Language Processing

What is natural language?

Language used by humans, as opposed to artificial computer languages.

What is natural language processing?

Computation on, and manipulation of, natural language.

Natural languages evolve: new words appear, old words drop out, meanings change, and word order changes (e.g. the position of the verb).

NLP tasks: count words, count word frequencies, count unique words, find sentence boundaries, part-of-speech tagging, parse sentence structure, identify semantic roles, identify entities in a sentence, and work out which pronoun refers to which entity.

NLTK - Natural Language Tool Kit

text corpora - large and structured set of texts.

nltk.download()
from nltk.book import *
len(text1) # Number of words
dist = FreqDist(text1) # Frequency distribution of words
vocab = dist.keys() # Unique words
dist[u'Hello'] # How many times does 'Hello' occur
freqwords = [w for w in vocab if len(w) > 5 and dist[w] > 100]

Normalizing and stemming

Different forms of the same word

input = "List listing listed listening"
# Normalization - make all lower
words = input.lower().split(' ')

# Stemming
porter = nltk.PorterStemmer()
[porter.stem(t) for t in words]

Lemmatization (the lemma of a word, e.g. 'better' has 'good' as its lemma).

Like stemming, but the resulting lemmas are all valid words.

WNlemma = nltk.WordNetLemmatizer()
[WNlemma.lemmatize(t) for t in nltk.corpus.udhr.words('English-Latin1')[:20]]

Tokenization

Split a sentence into words / tokens

nltk.word_tokenize(text) # Splits words better, keeping punctuation like '.' and ',' as separate tokens and splitting negations, e.g. "n't"

Sentence Splitting

sentences = nltk.sent_tokenize(text)
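
A quick sketch of both tokenizers on a made-up string (the example text is an assumption):

import nltk

text = "Children shouldn't drink a sugary drink before bed. They won't sleep."
nltk.word_tokenize(text)[:5]   # ['Children', 'should', "n't", 'drink', 'a']
len(nltk.sent_tokenize(text))  # 2 sentences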

Advanced NLP Tasks with NLTK

Part-of-speech (POS) Tagging

Provides insights into the word classes / types in a sentence

Conjunction (CC), Noun (NN), Verb (VB)

nltk.help.upenn_tagset('MD')

First split the sentence into words / tokens, then tag them:

nltk.pos_tag(text_tokenized)
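
For example, on a tokenized sentence (the sentence and the exact tags shown are assumptions about the default tagger's output):

text_tokenized = nltk.word_tokenize("Visiting aunts can be a nuisance")
nltk.pos_tag(text_tokenized)
# roughly [('Visiting', 'VBG'), ('aunts', 'NNS'), ('can', 'MD'), ('be', 'VB'), ('a', 'DT'), ('nuisance', 'NN')]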

Parsing Sentence Structure

grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> 'Alice' | 'Bob'
V -> 'loves'
""")

parser = nltk.ChartParser(grammar)
trees = parser.parse_all(nltk.word_tokenize("Alice loves Bob"))
for tree in trees:
    print(tree)

text = nltk.word_tokenize("I saw the man with a telescope")
grammar = nltk.data.load('mygrammar.cfg')
#S -> NP VP
#...
#P -> 'with'

from nltk.corpus import treebank
text = treebank.parsed_sents('wsj_0001.mrg')[0]

Module 3: Classification of Text

'Supervised learning for text'

Given a set of classes e.g. A, B, C. Then assign the correct label to the given input.

Topic identification e.g. type of news (politics, sport); Spam detection; Sentiment analysis; Spelling correction.

Learn a classification model on properties (features; X {x1, ..., xm}) and their importance (weights) from labeled instances (class; y {y1, ..., yk}).

Binary classification (|Y| = 2); multi-class classification (|Y| > 2); when the data has multiple labels per instance, multi-label classification.

Training phase: What are the features? How to represent them? What model? What model parameters?

Inference phase: What is the expected performance? What is a good measure?

Identifying Features from Text

For supervised learning.

Features can be pulled from text at different granularities.

e.g. Words (commonly occurring words e.g. 'the', stop word; Normalization: lower case vs. leave as-is; Stemming / Lemmatization.)

Characteristics of words, e.g. capitalization.

Parts of speech e.g. the weather and not the whether

Grammatical structure, sentence parsing (verb from noun)

Grouping words of similar meaning (semantic) {buy, purchase}, Dates.

Word sequences e.g. bigrams "White House"

Character sub-sequences (e.g. '-ing' suggests a verb, '-ion' suggests a noun).

Naive Bayes Classifiers

See here for a good overview of Naive Bayes Classifier

Case study: search queries (entertainment, computer science, zoology). Most common is entertainment.

The query is "python" snake (zoology), programming language (computer), Monty python (entertainment). However most common class for python in zoology.

The query is "python download", now most likely to be comp sci.

Probabilistic model: update the likelihood of the class given new information.

Prior probabilities: Pr(y = Entertainment), Pr(y = CS), Pr(y = Zoology). These sum to 1.

Posterior probability Pr( y = Entertainment | x = "python"). Probability of entertainment given python is probably lower.

Bayes' rule:

  • Posterior probability = (Prior probability x Likelihood) / Evidence; Pr(y | X) = (Pr(y) x Pr(X | y)) / Pr(X)
  • Naive Bayes Classification: Pr(y = CS | "Python") = (Pr(y = CS) x Pr ("Python" | y = CS)) / Pr("Python").
  • Same again but replace CS with Zoology.
  • Last step: Pr(y = CS | "Python") > Pr(y = Zoology | "Python") => y = CS
  • Can drop Pr(X), since we are not particularly interested in how often that query arises; the rule then becomes
  • y* = argmax Pr(y | X) = argmax Pr(y) x Pr(X | y). Where y* is the predicted class
  • Naive assumption: given the class label, the features are assumed to be independent of each other:
  • y* = argmax Pr(y | X) = argmax Pr(y) x Prod_{i=1..n} Pr(xi | y)

e.g. "Python download"

y* = argmax Pr(y) x Pr("Python" | y) x Pr("download" | y). Where y is zoology, CS or Entertainment.

e.g. probability of zoology queries (low) x Probability of python given zoology (high) x Probability of download given zoology (very low).

What are the parameters?

  • Prior probabilities Pr(y)
  • Likelihood Pr(xi | y) for all features xi

3 classes (|Y| = 3), 100 binary features in X. Number of parameters = |Y| + 2 x |X| x |Y| = 3 + 2 x 100 x 3 = 603.

Learning parameters:

  • Prior probabilities: from the training data, count the number of instances of each class; Pr(y) = n_y / N.
  • Likelihood Pr(xi | y): count how often xi appears among instances of class y, divided by the number of instances of class y.

What happens if Pr(xi | y) = 0? The posterior probability becomes 0.

Smooth the parameters (Laplace / additive smoothing: add a dummy count), e.g. Pr(xi | y) = (k + 1) / (p + n), where k is the count of xi in class y, p is the total count for class y, and n is the number of features.
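
A minimal sketch of these counts with Laplace smoothing, assuming a toy training set (the queries and labels below are made up for illustration, not from the course):

from collections import Counter

# hypothetical toy training data: (query words, class label)
train = [(['python', 'snake'], 'zoology'),
         (['python', 'download'], 'cs'),
         (['python', 'code'], 'cs'),
         (['monty', 'python'], 'entertainment')]

classes = set(label for _, label in train)
prior = {y: sum(1 for _, c in train if c == y) / len(train) for y in classes}  # Pr(y) = n_y / N

word_counts = {y: Counter() for y in classes}
for words, y in train:
    word_counts[y].update(words)
vocab = set(w for words, _ in train for w in words)

def likelihood(word, y):
    # Laplace smoothing: add 1 to every count so unseen words never give probability 0
    return (word_counts[y][word] + 1) / (sum(word_counts[y].values()) + len(vocab))

def predict(words):
    # y* = argmax Pr(y) x Prod_i Pr(xi | y)
    scores = {y: prior[y] for y in classes}
    for y in classes:
        for w in words:
            scores[y] *= likelihood(w, y)
    return max(scores, key=scores.get)

print(predict(['python', 'download']))  # 'cs' with this toy data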

Naive Bayes Variations

Two ways to identify features

  • Multinomial distribution (each feature value is a count, e.g. word occurrences). Gives weight to how often a word occurs.
  • Bernoulli distribution (each feature is binary: a word is either present or absent); see the sketch below.
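
In scikit-learn these roughly correspond to MultinomialNB on word counts and BernoulliNB on presence/absence features; a minimal sketch, assuming a tiny hypothetical corpus:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

docs = ["python download error", "python snake habitat", "monty python sketch"]  # hypothetical
labels = ['cs', 'zoology', 'entertainment']

X_counts = CountVectorizer().fit_transform(docs)              # word counts -> multinomial NB
X_binary = CountVectorizer(binary=True).fit_transform(docs)   # presence/absence -> Bernoulli NB

MultinomialNB().fit(X_counts, labels)
BernoulliNB().fit(X_binary, labels)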

Support Vector Machines

A classifier is a function from the input text to a class label, e.g. a topic or positive/negative sentiment.

Choosing a decision boundary:

  • Data overfitting: the model fits the training data too closely and doesn't do well on test data.
  • Occam's razor: simpler models generalize better.

Find a linear boundary: learn the weights w (the slope of the line), e.g. via linear least squares.

Consider a band (margin) instead of a line. Maximum-margin hyperplane. Base classifier on support vectors (a few points). Support vector machines are maximum-margin classifiers.

SVMs natively only work for binary classification.

For multi-class:

  • One vs. rest (ovr)
  • n-class SVM.

Parameter C (regularization parameter): larger values mean less regularization, and vice versa.

Linear kernels tend to work best for text data (rather than rbf).

class_weight: handles class imbalance, e.g. spam (80%) vs. not-spam.

Convert categorical features to numeric features.

Normalize features.
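
One possible way to do both in scikit-learn (the features and the choice of OneHotEncoder / MinMaxScaler are assumptions for illustration):

from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

pos_tags = [['noun'], ['verb'], ['noun']]   # hypothetical categorical feature
lengths = [[4], [12], [7]]                  # hypothetical numeric feature (word length)

OneHotEncoder().fit_transform(pos_tags).toarray()  # categorical -> 0/1 indicator columns
MinMaxScaler().fit_transform(lengths)              # scale numeric values into [0, 1]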

The learned hyperplane is hard to interpret.

Learning text classifiers in python

scikit-learn

nltk (interfaces with sklearn and other ML toolkits e.g. Weka).

from sklearn import naive_bayes
from sklearn import metrics

clfrNB = naive_bayes.MultinomialNB()
clfrNB.fit(train_data, train_labels)
predicted_labels = clfrNB.predict(test_data)
metrics.f1_score(test_labels, predicted_labels, average='micro')

from sklearn import svm
clfrSVM = svm.SVC(kernel='linear', C=0.1)
clfrSVM.fit(train_data, train_labels)
predicted_labels = clfrSVM.predict(test_data)

Training phase, Inference phase

from sklearn import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(train_data, train_labels, test_size=0.333, random_state=0)
predicted_labels = model_selection.cross_val_predict(clfrSVM, train_data, train_labels, cv=5)

NLTK has some classification algorithms: SklearnClassifier

from nltk.classify import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_set)
classifier.classify(unlabeled_instance)
classifier.classify_many(unlabeled_instances)
nltk.classify.util.accuracy(classifier, test_set)
classifier.labels()
classifier.show_most_informative_features()

from nltk.classify import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
clfrNB = SklearnClassifier(MultinomialNB()).train(train_set)
clfrSVM = SklearnClassifier(SVC(kernel='linear')).train(train_set)

Tf-idf: term frequency-inverse document frequency
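
A minimal sketch with scikit-learn's TfidfVectorizer (the documents are assumptions):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the python snake", "the python language", "the monty python show"]  # hypothetical corpus
vect = TfidfVectorizer()
X = vect.fit_transform(docs)  # rows = documents, columns = tf-idf weighted terms
vect.vocabulary_              # term -> column index mapping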

Module 4: Topic Modeling

Semantic text similarity

Grouping words with similar meanings.

Useful for understanding tasks such as paraphrase recognition.

WordNet: a semantic dictionary of linked words, organized hierarchically (think of a tree).

Path similarity: find the shortest path between two words, or use the lowest common subsumer (LCS); Lin similarity is based on the information content of the LCS.

from nltk.corpus import wordnet as wn
deer = wn.synset('deer.n.01')
elk = wn.synset('elk.n.01')
deer.path_similarity(elk)

from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat')
deer.lin_similarity(elk, brown_ic)

Collocations and distributional similarity.

Distributional similarity: words that appear in similar contexts (before, after, or within a small window) tend to have similar meanings.

Pointwise mutual information

import nltk
from nltk.collocations import *

bigram_measures = nltk.collocations.BigramAssocMeasures()

finder = BigramCollocationFinder.from_words(text)
finder.nbest(bigram_measures.pmi, 10)
finder.apply_freq_filter(10)

Topic Modelling

Latent Dirichlet Allocation.

Documents are a mixture of topics.

Coarse-level analysis of what the text represents.

Topics are represented by a word distribution.

What's known: the text collection and the number of topics. What's not known: the actual topics and the topic distribution for each document.

Text clustering. Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA)

Generative Models for Text

How likely you are to see a word in a text: the unigram model.

Mixture model: a mixture of topics.

LDA. Generative model for a document d:

  • choose length of d
  • choose a mixture of topics for d
  • use topic's multinomial distribution to output words to fill that topic's quota.

How many topics? Interpreting topics.

Can use gensim or lda

Pre-process text: tokenize, normalize (lower-case), remove stop words (e.g. 'the'), stem. Convert the tokenized documents to a document-term matrix. Build the LDA model on the doc-term matrix.

# doc_set is pre-processed text documents
import gensim
from gensim import corpora, models
dictionary = corpora.Dictionary(doc_set)
corpus = [dictionary.doc2bow(doc) for doc in doc_set] # doc 2 bag-of-words
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=4, id2word=dictionary, passes=50)
ldamodel.print_topics(num_topics=4, num_words=5)

Information extraction

Extract fields of interest, e.g. metadata (author, date, location).

Named entity recognition (NER), relations.

Tag/classify named entities.

NER typically a four-class model:

  • PER (person), ORG (organization), LOC/GPE (location / geo-political entity), Other/Outside (any other class).
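
NLTK's built-in chunker gives a rough version of this four-class tagging; a sketch, assuming an example sentence (ne_chunk needs the 'maxent_ne_chunker' and 'words' resources downloaded):

import nltk

sentence = "Barack Obama visited Paris"     # hypothetical example
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)               # the chunker takes POS-tagged tokens as input
tree = nltk.ne_chunk(tagged)                # labels chunks such as PERSON, GPE, ORGANIZATION
print(tree)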

Co-reference resolution (e.g. working out which name a pronoun like he/she refers to).

Question answering.

Further reading

David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 3(Jan):993-1022, 2003.

https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

https://en.wikipedia.org/wiki/Plate_notation

http://www.nltk.org/howto/wordnet.html

https://code.google.com/archive/p/word2vec/