Applied Text Mining in Python
November 2018
Module 1: Working with Text in Python
Introduction to text mining
A tweet has a topic and sentiment.
What can be done with text:
- Parse
- Find / identify / extract info
- Classify text documents
- Search
- Sentiment analysis
- Topic modeling
Handling text in Python
Sentences (input strings) are made up of words, which are made up of characters. At the other end, sentences make up documents.
words = text.split(' ') # Find words. Can give empty strings as well.
[w for w in words if len(w) > 3] # Words longer than 3 characters
[w for w in words if w.istitle()] # Words with a capital first letter
[w for w in words if w.endswith('s')] # Words that end in s
set(words) # Unique words
set(w.lower() for w in words) # Unique words regardless of capital letters
text.startswith('s')
t in text # Substring in string
text.isupper(), text.istitle() # All upper case; title case (first letter of each word capital)
text.isalpha(), text.isdigit(), text.isalnum()
text.splitlines()
' '.join(words) # Join all words with a space
text.strip() # Remove leading and trailing whitespace
text.rstrip() # Remove whitespace at the back only
text.find('t'); text.rfind('t') # First instance from the front; first instance from the back
text.replace('hello', 'yo')
list(text) # To get all characters
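The operations above can be tried on a short string. This is a minimal sketch; the sentence is a made-up example (not from the notes) and has a trailing space to show where the empty string comes from.

```python
# Hypothetical example sentence with a trailing space.
text = "Ethics are built right into the ideals and objectives of the United Nations "

words = text.split(' ')                              # trailing space gives '' as the last item
long_words = [w for w in words if len(w) > 3]        # words longer than 3 characters
title_words = [w for w in words if w.istitle()]      # words starting with a capital letter
s_words = [w for w in words if w.endswith('s')]      # words ending in s
unique_lower = set(w.lower() for w in words if w)    # unique words, ignoring case and ''
clean_words = text.strip().split(' ')                # strip first to avoid the empty string
rejoined = ' '.join(clean_words)                     # back to a single string

print(words[-1] == '', len(words), len(clean_words))
print(title_words)
print(sorted(unique_lower))
```

Stripping before splitting (or calling `text.split()` with no argument) is the usual way to avoid empty strings in the result.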
Reading files by line
f = open('a.txt', 'r')
f.readline()
f.seek(0) # Set position back to 0
text = f.read() # Read all text
text.splitlines() # Split lines
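for line in f:
    print(line)
f.write(text) # Only works if the file was opened in a write mode, e.g. 'w' or 'a'
f.close()
f.closed # True once the file has been closed (attribute, not a method)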
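A self-contained sketch of the read operations above. The file name and contents are invented, and the file is written to a temporary directory first so that the example runs anywhere.

```python
import os
import tempfile

# Create a small throwaway file (hypothetical contents).
path = os.path.join(tempfile.mkdtemp(), 'a.txt')
with open(path, 'w') as f:
    f.write('first line\nsecond line\n')

with open(path, 'r') as f:
    first = f.readline()                       # reads one line, newline included
    f.seek(0)                                  # set position back to 0
    lines = f.read().splitlines()              # read everything, split without newlines
    f.seek(0)
    stripped = [line.rstrip() for line in f]   # iterate over the file line by line

is_closed = f.closed                           # the with block closes the file on exit
print(repr(first), lines, is_closed)
```

Using `with` is a design choice here: it closes the file even if an error occurs, so an explicit `f.close()` is not needed.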
Regular expressions
Find callouts and hashtags
print([word for word in tweet.split() if word.startswith('#')])
import re
re.search(r'@[A-Za-z0-9_]+', word) # Look for a callout in word. + means one or more of the preceding characters
. # wildcard matches a single character
^ # start of a string
xyz$ # end of a string
[^abc] # not a, b or c
a|b # a or b
() # scoping for operators
\ # escape character
\b # word boundary
\d # any digit
\D # not digit
\s # any whitespace
\w # any alphanumeric
\W # not alphanumeric
* # matches zero or more occurrences
+ # one or more
? # zero or one
{n} # n times
{n,} # at least n times
{,n} # at most n times
{m,n} # between m and n times
re.findall(r'[aeiou]', text) # find vowels
re.findall(r'[^aeiou]', text) # find everything that is not a lower-case vowel (consonants, but also spaces, digits and punctuation)
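A short sketch of the callout, hashtag and vowel patterns in use. The tweet text is a made-up example, and `\w` is used as shorthand for `[A-Za-z0-9_]` (in Python 3 it also matches non-ASCII letters).

```python
import re

# Hypothetical tweet, invented for illustration.
tweet = "@UN @UN_Women Ethics are built into the ideals of the UN #UNSG #ethics"

hashtags = [w for w in tweet.split() if w.startswith('#')]
callouts = [w for w in tweet.split() if re.search(r'@\w+', w)]

vowels = re.findall(r'[aeiou]', 'ouagadougou')        # every lower-case vowel
non_vowels = re.findall(r'[^aeiou]', 'ouagadougou')   # everything else

print(hashtags)
print(callouts)
print(len(vowels), non_vowels)
```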
Regular expressions for dates
re.findall(r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}', datestring) # find numeric dates
re.findall(r'(?:\d{1,2} )?(?:Jan|Feb|etc.)[a-z]* (?:\d{1,2}, )?\d{4}', datestring) # Find text dates ('etc.' stands for the remaining month abbreviations)
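A runnable sketch of both date patterns. The `datestring` value is invented, and the month alternation is written out in full on the assumption that every three-letter English abbreviation is wanted.

```python
import re

# Hypothetical input with several date styles.
datestring = '23-10-2002 and 23/10/02, then 23 Oct 2002, October 23, 2002 and Oct 23, 2002'

# Numeric dates: day and month of 1-2 digits, year of 2-4 digits, separated by / or -
numeric = re.findall(r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}', datestring)

# Text dates: optional leading day, month name, optional "day, ", then a 4-digit year
months = 'Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec'
textual = re.findall(r'(?:\d{1,2} )?(?:' + months + r')[a-z]* (?:\d{1,2}, )?\d{4}', datestring)

print(numeric)
print(textual)
```

The `(?:...)` groups are non-capturing, so `findall` returns the whole match rather than only the grouped part.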
English and ASCII
ASCII: American Standard Code for Information Interchange
7-bit character encoding: 128 valid codes
Range: 0x00 - 0x7F [(0000 0000) to (0111 1111)]
Includes the English alphabet (upper and lower case), digits and punctuation
Diacritics: a word is written and pronounced differently based on its accent marks (e.g. résumé vs. resume). ASCII cannot represent these.
ASCII also cannot represent international languages, music symbols or emoticons.
Unicode with UTF-8 is most common now. Unicode is the industry standard for encoding and representing text.
Unicode has over 128,000 characters. UTF-8 (Unicode Transformation Format, 8-bit) is variable length: one byte up to 4 bytes per character.
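A small sketch of the variable-length encoding. The four sample characters are chosen for illustration and written as escapes so the example does not depend on the file's own encoding.

```python
# One sample character for each UTF-8 length.
samples = [
    'a',            # ASCII letter
    '\u00e9',       # e with acute accent
    '\u20ac',       # euro sign
    '\U0001F600',   # an emoji
]

byte_lengths = [len(ch.encode('utf-8')) for ch in samples]   # bytes needed in UTF-8
fits_in_ascii = [ord(ch) < 128 for ch in samples]            # code point within 0x00-0x7F

print(byte_lengths)
print(fits_in_ascii)
```

Only the first character falls inside the 7-bit ASCII range; the others need two, three and four bytes respectively.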
Further reading:
Assignment
Extract dates from text. Convert them to a common format and sort them in ascending order. The score is calculated using Kendall's tau (a correlation measure for ordinal data).
Module 2: Basic Natural Language Processing
What is natural language?
Language used by humans, as opposed to artificial computer languages.
What is natural language processing?
Any computation on, or manipulation of, natural language.
Natural languages evolve (new words appear, old words drop out, meanings change, the position of the verb changes).
NLP tasks:
- Count words, count frequency, count unique words
- Find sentence boundaries
- Part-of-speech tagging
- Parse sentence structure
- Identify semantic roles
- Identify entities in a sentence
- Find which pronoun refers to which entity
NLTK - Natural Language Tool Kit
Text corpora: large and structured sets of texts.
import nltk
nltk.download() # Opens the NLTK downloader to fetch corpora
from nltk.book import *
len(text1) # Number of words
dist = FreqDist(text1) # Frequency distribution; len(dist) gives the number of unique words
vocab = dist.keys()
dist['Hello'] # How many times does Hello occur
freqwords = [w for w in vocab if len(w) > 5 and dist[w] > 100]
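The same counting can be sketched without downloading any corpus. Here `collections.Counter` from the standard library stands in for NLTK's `FreqDist` (which supports the same dictionary-style lookups), and the text and thresholds are made up.

```python
from collections import Counter

# Hypothetical tokenized text.
text = "the cat sat on the mat and the dog sat on the log".split()

dist = Counter(text)          # word -> number of occurrences
vocab = dist.keys()           # unique words

n_words = len(text)           # total number of words
n_unique = len(dist)          # number of unique words
freqwords = [w for w in vocab if len(w) > 2 and dist[w] > 1]

print(n_words, n_unique, dist['the'], freqwords)
```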
Normalizing and stemming
Different forms of the same words
input1 = "List listing listed listening"
# Normalization - make all lower case
words = input1.lower().split(' ')
# Stemming
import nltk
porter = nltk.PorterStemmer()
[porter.stem(t) for t in words]
Lemmatization: reduce a word to its lemma (e.g. 'better' has 'good' as its lemma).
Lemmatization is stemming where the resulting stems are all valid words.
WNlemma = nltk.WordNetLemmatizer()
[WNlemma.lemmatize(t) for t in nltk.corpus.udhr.words('English-Latin1')[:20]]
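To see what normalization and stemming do to the example string, here is a deliberately crude suffix stripper. It is not the Porter algorithm; the suffix list and the minimum stem length of 3 are assumptions picked for this toy input.

```python
def crude_stem(word):
    """Strip one common suffix, keeping at least 3 characters (toy rule, not Porter)."""
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

input1 = "List listing listed listening"
words = input1.lower().split(' ')            # normalization: lower case, then split
stems = [crude_stem(w) for w in words]       # stemming: different forms collapse together

print(stems)
```

A real stemmer applies many more rules, and its output is not guaranteed to be a valid word, which is the gap lemmatization fills.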
Tokenization
Split a sentence into words / tokens
nltk.word_tokenize(text) # Splits better than str.split: separates punctuation such as . and , and negations such as "n't"
Sentence Splitting
sentences = nltk.sent_tokenize(text)
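A rough idea of what these tokenizers do can be sketched with regular expressions. This is only an approximation under simple assumptions: the patterns below are illustrative, not NLTK's rules, and the sentences are invented.

```python
import re

def simple_word_tokenize(text):
    """Split into words, the negation "n't", and single punctuation marks."""
    return re.findall(r"\w+(?=n't)|n't|\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    """Split after ., ! or ? when followed by whitespace."""
    return re.split(r'(?<=[.!?])\s+', text)

tokens = simple_word_tokenize("Children shouldn't drink a sugary drink before bed.")
sentences = simple_sent_tokenize("This is the first sentence. Is this the second? Yes, it is!")

print(tokens)
print(sentences)
```

The sentence rule would wrongly split after abbreviations such as "U.S.", which is one reason to prefer a trained sentence tokenizer.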
Advanced NLP Tasks with NLTK
Part-of-speech (POS) Tagging
Provides insights into the word classes / types in a sentence
Conjunction (CC), Noun (NN), Verb (VB)
nltk.help.upenn_tagset('MD')
Split the sentence into words / tokens first, then tag the tokens.
text_tokenized = nltk.word_tokenize(text)
nltk.pos_tag(text_tokenized)
Parsing Sentence Structure
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> 'Alice' | 'Bob'
V -> 'loves'
""")
parser = nltk.ChartParser(grammar)
trees = parser.parse_all(nltk.word_tokenize("Alice loves Bob"))
for tree in trees:
    print(tree)
# An ambiguous sentence, parsed with a grammar loaded from a file
text = nltk.word_tokenize("I saw the man with a telescope")
grammar = nltk.data.load('mygrammar.cfg')
#S -> NP VP
#...
#P -> 'with'
# Parsed sentences from the Penn Treebank sample
from nltk.corpus import treebank
text = treebank.parsed_sents('wsj_0001.mrg')[0]
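To show how a grammar like the Alice/Bob one turns tokens into a tree, here is a small top-down parser in plain Python. It is a sketch only and does not reproduce how NLTK's `ChartParser` works; the dict layout and function names are invented for this example.

```python
# The toy grammar as a dict: nonterminal -> list of right-hand sides.
grammar = {
    'S': [['NP', 'VP']],
    'VP': [['V', 'NP']],
    'NP': [['Alice'], ['Bob']],
    'V': [['loves']],
}

def parse(symbol, tokens, pos):
    """Yield (tree, next_pos) for every way symbol can cover tokens[pos:next_pos]."""
    if symbol not in grammar:                          # terminal: must equal the next token
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in grammar[symbol]:                        # try each production in turn
        yield from expand(rhs, tokens, pos, [symbol])

def expand(rhs, tokens, pos, partial):
    """Match the symbols of rhs left to right, growing the partial tree."""
    if not rhs:
        yield tuple(partial), pos
        return
    for subtree, next_pos in parse(rhs[0], tokens, pos):
        yield from expand(rhs[1:], tokens, next_pos, partial + [subtree])

tokens = "Alice loves Bob".split()
trees = [tree for tree, end in parse('S', tokens, 0) if end == len(tokens)]
print(trees)
```

This simple approach would loop forever on left-recursive rules (for example NP -> NP PP), which chart parsers handle.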
Module 3: Classification of Text
'Supervised learning for text'
Given a set of classes (e.g. A, B, C), assign the correct class label to the given input.
Topic identification e.g. type of news (politics, sport); Spam detection; Sentiment analysis; Spelling correction.
Learn a classification model on properties (features; X {x1, ..., xm}) and their importance (weights) from labeled instances (class; y {y1, ..., yk}).
Binary classification (|Y| = 2); multi-class classification (|Y| > 2); when an instance can have multiple labels: multi-label classification.
Training phase: What are the features? How to represent them? What model? What model parameters?
Inference phase: What is the expected performance? What is a good measure?
Identifying Features from Text
For supervised learning.
Features can be pulled from text in different granularity.
Words: commonly occurring words (e.g. 'the') are stop words; normalization (lower case vs. leave as-is); stemming / lemmatization.
Characteristics of words, e.g. capital letters.
Parts of speech, e.g. after the determiner 'the', expect the noun 'weather' and not 'whether'.
Grammatical structure, sentence parsing (verb from noun)
Grouping words of similar meaning (semantic) {buy, purchase}, Dates.
Word sequences e.g. bigrams "White House"
Character sub-sequences ('ing' often indicates a verb, 'ion' often indicates a noun).
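Word sequences and character sub-sequences can be extracted with a few lines of Python. A minimal sketch; the function names and the example phrase are made up.

```python
def word_ngrams(tokens, n):
    """All runs of n consecutive words, joined with spaces."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(word, n):
    """All runs of n consecutive characters."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

tokens = "The White House press briefing".lower().split()   # normalized to lower case
bigrams = word_ngrams(tokens, 2)
trigrams_of_word = char_ngrams('briefing', 3)

print(bigrams)
print(trigrams_of_word)
```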
Naive Bayes Classifiers
Case study: search queries (entertainment, computer science, zoology). Most common is entertainment.
The query is "python": it could mean the snake (zoology), the programming language (computer science) or Monty Python (entertainment). The most likely class for "python" on its own is zoology.
The query is "python download": now the most likely class is computer science.
Probabilistic model: update the likelihood of the class given new information.
Prior probability Pr(y = Entertainment), Pr(y = comp sci), Pr(y = zoo). Sum of these is 1.
Posterior probability Pr( y = Entertainment | x = "python"). Probability of entertainment given python is probably lower.
Bayes' rule:
- Posterior probability = (Prior probability x Likelihood) / Evidence; Pr(y | X) = (Pr(y) x Pr(X | y)) / Pr(X)
- Naive Bayes Classification: Pr(y = CS | "Python") = (Pr(y = CS) x Pr ("Python" | y = CS)) / Pr("Python").
- Same again but replace CS with Zoology.
- Last step: Pr(y = CS | "Python") > Pr(y = Zoology | "Python") => y = CS
- Pr(X) can be removed, as we are not particularly interested in how often that query arises and it is the same for every class. This gives:
- y* = argmax Pr(y | X) = argmax Pr(y) x Pr(X | y), where y* is the predicted class and the argmax is taken over the classes y.
- Naive assumption: given the class label, features are assumed to be independent of each other:
- y* = argmax Pr(y | X) = argmax Pr(y) x Π(i = 1..n) Pr(xi | y)
e.g. "Python download"
y* = argmax Pr(y) x Pr("Python" | y) x Pr("download" | y). Where y is zoology, CS or Entertainment.
e.g. probability of zoology queries (low) x Probability of python given zoology (high) x Probability of download given zoology (very low).
What are the parameters?
- Prior probabilities Pr(y)
- Likelihood Pr(xi | y) for all features xi
3 classes (|Y| = 3), 100 binary features in X. Number of parameters = |Y| + 2 x |X| x |Y| = 3 + 2 x 100 x 3 = 603.
Learning parameters:
- Prior probabilities: count the number of instances of each class in the training data. Pr(y) = n / N, where n of the N training instances have class y.
- Likelihood Pr(xi | y): count how many times xi appears in class y.
What happens if Pr(xi | y) = 0? The posterior probability will be 0.
Smooth the parameters (Laplace / additive smoothing: add a dummy count): Pr(xi | y) = (k + 1) / (p + n), where k is the count of xi in class y, p is the total count for class y and n is the number of features.
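The counting and smoothing steps above can be put together in a small from-scratch classifier. This is a sketch under stated assumptions: the five training queries are invented, features are word counts, k is the number of times a word appears in a class, p is the total number of words in that class, and n is the vocabulary size. Log probabilities are summed instead of multiplying raw probabilities to avoid underflow.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled queries, invented for illustration.
train = [
    ("python download", "cs"),
    ("python code tutorial", "cs"),
    ("python snake habitat", "zoology"),
    ("monty python film", "entertainment"),
    ("film download", "entertainment"),
]

class_counts = Counter(label for _, label in train)     # n per class, for the priors
word_counts = defaultdict(Counter)                      # class -> word -> k
for query, label in train:
    word_counts[label].update(query.split())
vocab = {w for counts in word_counts.values() for w in counts}

def log_posterior(query, label):
    """log Pr(y) + sum of log Pr(xi | y), with add-one smoothing."""
    p = sum(word_counts[label].values())                # total words seen in this class
    score = math.log(class_counts[label] / len(train))  # prior Pr(y) = n / N
    for word in query.split():
        k = word_counts[label][word]                    # 0 for unseen words
        score += math.log((k + 1) / (p + len(vocab)))
    return score

def predict(query):
    """y* = the class with the highest (unnormalized) posterior."""
    return max(class_counts, key=lambda label: log_posterior(query, label))

print(predict("python download"), predict("snake habitat"), predict("monty film"))
```

Without the added dummy count, any query containing a word never seen with a class would give that class a probability of exactly zero.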
Naive Bayes Variations
Two ways to identify features
- Multinomial distribution (each feature value is a count, e.g. word occurrence). Can give weighting based on word occurrence.
- Bernoulli distribution (each feature is binary: the word is either present or absent).
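The difference between the two variations shows up in how a document is turned into a feature vector. A minimal sketch with a made-up three-word vocabulary:

```python
from collections import Counter

vocab = ['python', 'download', 'snake']        # hypothetical feature set
doc = "python python download".split()

counts = Counter(doc)
multinomial_vector = [counts[w] for w in vocab]        # how often each word occurs
bernoulli_vector = [int(w in counts) for w in vocab]   # whether each word occurs at all

print(multinomial_vector, bernoulli_vector)
```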
Support Vector Machines
Classifier = function on the text -> type or positive/negative
Choosing a decision boundary:
- Data overfitting: the model fits the training data too closely and doesn't do well on test data.
- Occam's razor: simple models generalize well.
Find a linear boundary. Find w(eight) (slope of the line). Linear least squares etc.
Consider a band (margin) instead of a line. Maximum-margin hyperplane. Base classifier on support vectors (a few points). Support vector machines are maximum-margin classifiers.
SVMs only work for binary classification problems directly.
For multi-class:
- One vs. rest (ovr)
- n-class SVM.
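Parameter C (regularization parameter). Larger values mean less regularization (fit the training data as closely as possible); smaller values mean more regularization.
Linear kernels usually work best for text data (rather than rbf).
class_weight: weight classes differently when they are imbalanced, e.g. spam (80%) / not-spam.
Convert categorical features to numeric features.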
Normalize features.
Hyperplane hard to interpret.
Learning text classifiers in python
scikit-learn
nltk (interfaces with sklearn and other ML toolkits e.g. Weka).
from sklearn import naive_bayes, metrics
clfrNB = naive_bayes.MultinomialNB()
clfrNB.fit(train_data, train_labels)
predicted_labels = clfrNB.predict(test_data)
metrics.f1_score(test_labels, predicted_labels, average='micro')
from sklearn import svm
clfrSVM = svm.SVC(kernel='linear', C=0.1)
clfrSVM.fit(train_data, train_labels)
predicted_labels = clfrSVM.predict(test_data)
Training phase, Inference phase
from sklearn import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(train_data, train_labels, test_size=0.333, random_state=0)
predicted_labels = model_selection.cross_val_predict(clfrSVM, train_data, train_labels, cv=5)
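What a hold-out split and a 5-fold style split do to the data can be sketched in plain Python. This is not scikit-learn's implementation: `holdout_split` and `k_fold` are hypothetical helper names, and the fold assignment (every k-th item) is an assumption chosen for simplicity.

```python
import random

def holdout_split(items, test_size, seed=0):
    """Shuffle, then set aside a fraction of the items as the test set."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

def k_fold(items, k):
    """Yield (train, test) pairs; every item is in exactly one test fold."""
    for i in range(k):
        test = items[i::k]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test

data = list(range(9))                       # stand-in for 9 labeled instances
train, test = holdout_split(data, test_size=1 / 3)
folds = list(k_fold(data, 3))

print(len(train), len(test))
print([fold_test for _, fold_test in folds])
```

With cross-validation every instance is used for testing once, so the performance estimate depends less on one lucky or unlucky split.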
NLTK has some classification algorithms: SklearnClassifier
import nltk
from nltk.classify import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_set)
classifier.classify(unlabeled_instance)
classifier.classify_many(unlabeled_instances)
nltk.classify.util.accuracy(classifier, test_set)
classifier.labels()
classifier.show_most_informative_features()
from nltk.classify import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
clfrNB = SklearnClassifier(MultinomialNB()).train(train_set)
clfrSVM = SklearnClassifier(SVC(kernel='linear')).train(train_set)
Tf-idf - Term frequency inverse document frequency
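A from-scratch sketch of one common tf-idf variant, assuming tf is the term's share of the document and idf is log(N / df), where N is the number of documents and df the number of documents containing the term. Libraries often use smoothed or normalized variants, so exact numbers will differ. The three documents are made up.

```python
import math
from collections import Counter

# Hypothetical tokenized documents.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]

def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))   # document frequency
    weights = []
    for doc in docs:
        tf = Counter(doc)                                     # raw term counts
        weights.append({t: (tf[t] / len(doc)) * math.log(n_docs / df[t]) for t in tf})
    return weights

weights = tf_idf(docs)
print(sorted(weights[0].items(), key=lambda item: -item[1]))
```

In the first document 'cat' outranks 'the' even though 'the' occurs twice, because 'the' also appears in another document and is therefore less distinctive.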
Module 4: Topic Modeling
Semantic text similarity
Grouping words with similar meanings.
A building block for language understanding tasks, e.g. paraphrasing.
WordNet: semantic dictionary of linked words. Think tree.
Path similarity: find shortest path between two words or lowest common subsumer (LCS); Lin Similarity
from nltk.corpus import wordnet as wn
deer = wn.synset('deer.n.01')
elk = wn.synset('elk.n.01')
deer.path_similarity(elk)
from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat')
deer.lin_similarity(elk, brown_ic)
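Path similarity can be illustrated on a hand-made taxonomy. The tree below is hypothetical and is not real WordNet data, so the values will not match NLTK's output in general. The score used is 1 / (1 + length of the shortest path), and in a tree the shortest path runs through the lowest common subsumer.

```python
# Hypothetical mini-taxonomy, stored as child -> parent.
parent = {
    'elk': 'deer', 'deer': 'ruminant', 'giraffe': 'ruminant',
    'ruminant': 'ungulate', 'horse': 'ungulate', 'ungulate': None,
}

def ancestors(node):
    """Return [node, parent, grandparent, ...] up to the root."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path

def path_similarity(a, b):
    """Return (similarity, lowest common subsumer) for two nodes of the tree."""
    up_a, up_b = ancestors(a), ancestors(b)
    lcs = next(n for n in up_a if n in up_b)           # first shared ancestor
    distance = up_a.index(lcs) + up_b.index(lcs)       # edges on the shortest path
    return 1 / (1 + distance), lcs

print(path_similarity('deer', 'elk'))
print(path_similarity('deer', 'giraffe'))
print(path_similarity('deer', 'horse'))
```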
Collocations and distributional similarity.
Distributional similarity: context. Before, after, within a small window.
Pointwise mutual information
import nltk
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(text)
finder.apply_freq_filter(10) # Keep only bigrams that occur at least 10 times
finder.nbest(bigram_measures.pmi, 10) # Top 10 bigrams by PMI
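Pointwise mutual information for a bigram can be computed directly from counts as PMI(w1, w2) = log2( Pr(w1, w2) / (Pr(w1) x Pr(w2)) ). The sketch below uses an invented sentence and simple relative-frequency estimates, so its numbers will not match NLTK's exactly.

```python
import math
from collections import Counter

# Hypothetical token sequence.
tokens = "new york is big and new york is busy but the new car is small".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

def pmi(w1, w2):
    """log2 of how much more often w1 w2 co-occur than expected by chance."""
    p_joint = bigrams[(w1, w2)] / n_bi
    return math.log2(p_joint / ((unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)))

# Frequency filter: keep only bigrams seen at least twice.
frequent = [bg for bg, count in bigrams.items() if count >= 2]

print(round(pmi('but', 'the'), 2), round(pmi('new', 'york'), 2))
print(frequent)
```

The pair 'but the' occurs once yet scores higher than 'new york', because both of its words are rare. That bias towards rare pairs is why a frequency filter is applied before ranking by PMI.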
Topic Modelling
Latent Dirichlet Allocation.
Documents are a mixture of topics.
Coarse-level analysis of what the text represents.
Topics are represented by a word distribution.
What's known: the text collection and the number of topics. What's not known: the actual topics and the topic distribution for each document.
Text clustering. Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA)
Generative Models for Text
How likely you see a word in a text. Unigram model
Mixture model. Mixture of topics.
LDA. Generative model for a document d:
- choose length of d
- choose a mixture of topics for d
- use topic's multinomial distribution to output words to fill that topic's quota.
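The generative story above can be sketched as a sampling procedure. This is a simplification with assumptions: the two topics and their word probabilities are invented, the document length and topic mixture are fixed by hand rather than drawn from distributions, and each word slot picks a topic from the mixture before picking a word from that topic.

```python
import random

rng = random.Random(0)   # fixed seed so runs are repeatable

# Hypothetical topics: each is a word distribution (word -> probability).
topics = {
    'sports': {'game': 0.5, 'team': 0.3, 'score': 0.2},
    'politics': {'vote': 0.5, 'party': 0.3, 'law': 0.2},
}

def generate_document(length, topic_mixture):
    """For each word slot: choose a topic from the mixture, then a word from that topic."""
    names = list(topic_mixture)
    words = []
    for _ in range(length):
        topic = rng.choices(names, weights=[topic_mixture[t] for t in names])[0]
        vocab = list(topics[topic])
        words.append(rng.choices(vocab, weights=[topics[topic][w] for w in vocab])[0])
    return words

doc = generate_document(length=8, topic_mixture={'sports': 0.75, 'politics': 0.25})
pure = generate_document(length=5, topic_mixture={'sports': 1.0, 'politics': 0.0})
print(doc)
```

Topic modeling runs this story in reverse: given only the documents, it infers the topics and each document's mixture.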
How many topics? Interpreting topics.
Can use gensim or lda
pre-process text: tokenize, normalize (lower-case), stop word removal (e.g. the), stemming. Convert tokenized document to a document-term matrix. Build LDA models on the doc-term matrix.
# doc_set is pre-processed text documents
import gensim
from gensim import corpora, models
dictionary = corpora.Dictionary(doc_set)
corpus = [dictionary.doc2bow(doc) for doc in doc_set] # doc 2 bag-of-words
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=4, id2word=dictionary, passes=50)
ldamodel.print_topics(num_topics=4, num_words=5)
Information extraction
Extract fields of interest e.g. meta data (author, date, location).
Named entity recognition (NER), relations.
Tag/classify named entities.
NER is typically a four-class model:
- PER (person), ORG (organization), LOC/GPE (location), Other/Outside (any other class).
Co-reference resolution (e.g. resolving he/she to the names they refer to).
Question answering.
Further reading
https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
https://en.wikipedia.org/wiki/Plate_notation