November 2018
A tweet has a topic and sentiment.
What can be done with text:
Text has structure at several levels: characters make up words, words make up sentences (the input strings), and at the other end sentences make up documents.
words = text.split(' ')                      # split into words; can give empty strings as well
[w for w in words if len(w) > 3]             # words longer than 3 characters
[w for w in words if w.istitle()]            # words whose first letter is capitalized
[w for w in words if w.endswith('s')]        # words that end in 's'
set(words)                                   # unique words
set(w.lower() for w in words)                # unique words regardless of capitalization
text.startswith('s')
't' in text                                  # substring in string
text.isupper(); text.istitle()               # all upper case; first letter capitalized
text.isalpha(); text.isdigit(); text.isalnum()
text.splitlines()
' '.join(words)                              # join the words back into a string
text.strip()                                 # remove leading/trailing whitespace
text.rstrip()                                # remove whitespace at the back only
text.find('t'); text.rfind('t')              # first occurrence from the front / from the back
text.replace('hello', 'yo')
list(text)                                   # to get all characters
Reading files by line
f = open('a.txt', 'r')
f.readline()                                 # read one line
f.seek(0)                                    # set position back to 0
text = f.read()                              # read all text
text.splitlines()                            # split into lines
for line in f:
    print(line)
f.write(text)                                # only if the file was opened for writing
f.close()
f.closed                                     # True once the file is closed
Find call-outs and hashtags
print([word for word in tweet.split() if word.startswith('#')])
import re
re.search('@[A-Za-z0-9_]+', word)            # look for an '@' followed by one or more of these characters
.        # wildcard, matches a single character
^        # start of a string
xyz$     # matches 'xyz' at the end of a string
[^abc]   # not a, b or c
a|b      # a or b
()       # scoping for operators
\        # escape character
\b       # word boundary
\d       # any digit
\D       # any non-digit
\s       # any whitespace
\w       # any alphanumeric character
\W       # any non-alphanumeric character
*        # zero or more occurrences
+        # one or more occurrences
?        # zero or one occurrence
{n}      # exactly n times
{n,}     # at least n times
{,n}     # at most n times
{m,n}    # between m and n times
re.findall(r'[aeiou]', text)                 # find vowels
re.findall(r'[^aeiou]', text)                # find consonants
Regular expressions for dates
re.findall(r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}', datestring)                              # find numeric dates
re.findall(r'(?:\d{1,2} )?(?:Jan|Feb|etc.)[a-z]* (?:\d{1,2}, )?\d{4}', datestring)    # find text dates
ASCII: American Standard Code for Information Interchange
7-bit character encoding: 128 valid codes.
Range: 0x00 - 0x7F [(0000 0000) to (0111 1111)].
Includes all the letters of the English alphabet, the digits, and common punctuation.
Diacritic: a mark on a letter indicating the word is pronounced differently (an accent).
ASCII cannot cover international languages, musical symbols, or emoticons.
Unicode with the UTF-8 encoding is the most common choice now: the industry standard for encoding and representing text.
Over 128,000 characters; each character takes one to four bytes. UTF-8 stands for Unicode Transformation Format, 8-bit.
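A quick way to see the variable-length encoding in practice (a minimal sketch; the example string is just an illustration):

text = 'résumé ☺'
encoded = text.encode('utf-8')               # bytes object
len(text), len(encoded)                      # 8 characters but 12 bytes: 'é' takes 2 bytes, '☺' takes 3
'A'.encode('utf-8')                          # b'A': plain ASCII characters still take a single byte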
Further reading:
Assignment: extract dates from text, normalize them to a common format, and put them in ascending order. The score is calculated using Kendall's tau (a correlation measure for ordinal data).
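A minimal sketch of the idea (the helper, the regexes, and the pandas calls below are illustrative assumptions rather than the full assignment solution; `lines` is assumed to be the list of raw text lines):

import re
import pandas as pd

def extract_date(line):
    # try a numeric date first, e.g. 04/20/2009 or 4-3-09
    m = re.search(r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}', line)
    if not m:
        # fall back to text dates, e.g. 20 Mar 2009 or March 20, 2009
        m = re.search(r'(?:\d{1,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{1,2}, )?\d{4}', line)
    return m.group(0) if m else None

dates = pd.to_datetime([extract_date(line) for line in lines])   # normalize to a common format
order = dates.argsort()                                          # ascending chronological order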
What is natural language?
Language used by humans, as opposed to artificial computer languages.
What is natural language processing?
Computation on, and manipulation of, natural language.
Natural languages evolve: new words appear, old words drop out, meanings change, and grammar shifts (e.g. the position of the verb).
NLP tasks: count words and their frequencies, count unique words, find sentence boundaries, part-of-speech tagging, parse sentence structure, identify semantic roles, identify entities in a sentence, and work out which pronoun refers to which entity.
NLTK - Natural Language Toolkit
Text corpora - large and structured sets of texts.
import nltk
nltk.download()
from nltk.book import *
len(text1)                                   # number of words (tokens)
dist = FreqDist(text1)                       # frequency distribution over the unique words
vocab = dist.keys()                          # the unique words
dist[u'Hello']                               # how many times does 'Hello' occur
freqwords = [w for w in vocab if len(w) > 5 and dist[w] > 100]
Different forms of the same word
input = "List listing listed listening"# Normalization - make all lowerwords = input.lower().split(' ')# Stemmingporter = nltk.PorterStemmer()[porter.stem(t) for t in words]Words that come out to be meaningful. Stemming by resulting stems are all valid works
WNlemma = nltk.WordNetLemmatizer()
[WNlemma.lemmatize(t) for t in nltk.corpus.udhr.words('English-Latin1')[:20]]
Split a sentence into words / tokens (tokenization)
nltk.word_tokenize(text)                     # splits words better, handling '.' and ',' and negatives such as "n't"
sentences = nltk.sent_tokenize(text)         # split the text into sentences
Part-of-speech (POS) tagging provides insight into the word classes / types in a sentence
Conjunction (CC), Noun (NN), Verb (VB)
nltk.help.upenn_tagset('MD')                 # look up what a tag (here 'MD', modal) means
Tag each token with its part of speech
nltk.pos_tag(text_tokenized)
Parsing sentence structure
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> 'Alice' | 'Bob'
V -> 'loves'
""")
parser = nltk.ChartParser(grammar)
trees = parser.parse_all(nltk.word_tokenize("Alice loves Bob"))
for tree in trees:
    print(tree)

text = nltk.word_tokenize("I saw the man with a telescope")     # structurally ambiguous sentence
grammar = nltk.data.load('mygrammar.cfg')                       # S -> NP VP ... P -> 'with'

from nltk.corpus import treebank
text = treebank.parsed_sents('wsj_0001.mrg')[0]
Supervised learning for text
Given a set of classes (e.g. A, B, C), assign the correct label to a given input.
Topic identification e.g. type of news (politics, sport); Spam detection; Sentiment analysis; Spelling correction.
Learn a classification model over properties (features X = {x1, ..., xm}) and their importance (weights) from labeled instances (classes Y = {y1, ..., yk}).
Binary classification (|Y| = 2); multi-class classification (|Y| > 2); when the data has multiple labels per instance, multi-label classification.
Training phase: what are the features? How should they be represented? Which model? Which model parameters?
Inference phase: what is the expected performance? What is a good measure?
For supervised learning, features can be pulled from text at different granularities (a small sketch follows the list below):
e.g. Words (commonly occurring words e.g. 'the', stop word; Normalization: lower case vs. leave as-is; Stemming / Lemmatization.)
Characteristics of words, e.g. capitalization.
Parts of speech e.g. the weather and not the whether
Grammatical structure, sentence parsing (verb from noun)
Grouping words of similar meaning (semantics), e.g. {buy, purchase}; dates.
Word sequences e.g. bigrams "White House"
Character sub-sequences, e.g. 'ing' suggests a verb, 'ion' suggests a noun.
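A rough sketch of how some of these features could be pulled out with scikit-learn's CountVectorizer (the parameter choices and the example documents are illustrative, not prescriptive):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["The White House announced a new policy.",
        "I want to buy a white house."]

# unigram + bigram counts, lower-cased, with common English stop words removed
vect = CountVectorizer(lowercase=True, stop_words='english', ngram_range=(1, 2))
X = vect.fit_transform(docs)                 # document-term matrix
vect.get_feature_names_out()                 # includes bigrams such as 'white house' (get_feature_names() on older scikit-learn)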
See here for a good overview of Naive Bayes Classifier
Case study: search queries (entertainment, computer science, zoology). Most common is entertainment.
The query is "python" snake (zoology), programming language (computer), Monty python (entertainment). However most common class for python in zoology.
The query is "python download", now most likely to be comp sci.
Probabilistic model: update the likelihood of each class given new information.
Prior probability Pr(y = Entertainment), Pr(y = comp sci), Pr(y = zoo). Sum of these is 1.
Posterior probability Pr( y = Entertainment | x = "python"). Probability of entertainment given python is probably lower.
Bayes' rule: Pr(y | X) = Pr(X | y) × Pr(y) / Pr(X). To classify, pick y* = argmax_y Pr(y | X); since Pr(X) is the same for every class, this is argmax_y Pr(y) × Pr(X | y). Naive Bayes additionally assumes the features are independent given the class, so Pr(X | y) factorizes into a product over the individual features.
e.g. "Python download"
y* = argmax_y Pr(y) × Pr("Python" | y) × Pr("download" | y), where y ranges over zoology, CS, and entertainment.
e.g. probability of zoology queries (low) x Probability of python given zoology (high) x Probability of download given zoology (very low).
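A toy calculation of the same decision (all probabilities below are made-up numbers purely for illustration):

# made-up priors and per-class word likelihoods
prior = {'zoology': 0.1, 'comp_sci': 0.3, 'entertainment': 0.6}
likelihood = {
    'zoology':       {'python': 0.20, 'download': 0.001},
    'comp_sci':      {'python': 0.10, 'download': 0.10},
    'entertainment': {'python': 0.05, 'download': 0.01},
}

query = ['python', 'download']
scores = {}
for y in prior:
    p = prior[y]
    for word in query:
        p *= likelihood[y][word]             # naive independence assumption
    scores[y] = p
best = max(scores, key=scores.get)           # 'comp_sci' for these made-up numbers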
What are the parameters?
3 classes (|Y| = 3), 100 features in X. Number of parameters = |Y| + 2 × |X| × |Y| = 3 + 2 × 100 × 3 = 603 (one prior per class, plus Pr(xi | y) and Pr(not xi | y) for every feature-class pair).
Learning parameters: estimate the prior probabilities and the per-class feature likelihoods from counts in the labeled training data (maximum likelihood estimation).
What happens if Pr(xi | y) = 0? The whole posterior probability for that class becomes 0.
Smooth the parameters (Laplace / additive smoothing: add a dummy count): Pr(xi | y) = (k + 1) / (p + n), where k is the number of occurrences of xi in class y, p is the number of instances of class y, and n is the number of features.
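Example (using those definitions of k, p, and n): if a feature never occurs in a class (k = 0) while the class has p = 100 training instances and there are n = 100 features, the smoothed estimate is (0 + 1) / (100 + 100) = 0.005 rather than 0.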
Two ways to represent the features: multinomial Naive Bayes uses word counts / frequencies, Bernoulli Naive Bayes uses binary word presence or absence.
Classifier = a function from the text to a label, e.g. a topic or positive/negative sentiment.
Choosing a decision boundary:
Find a linear boundary: learn the weights w (the slope/orientation of the line), e.g. with linear least squares.
Consider a band (margin) instead of a line. Maximum-margin hyperplane. Base classifier on support vectors (a few points). Support vector machines are maximum-margin classifiers.
SVMs only work directly for binary classification.
For multi-class: train multiple binary classifiers, e.g. one-vs-rest (one classifier per class) or one-vs-one.
Parameter C (regularization parameter): larger values mean less regularization, smaller values mean more.
Linear kernels tend to work best for text data (rather than rbf).
class_weight: useful for imbalanced classes, e.g. spam (80%) vs. not-spam (20%).
Convert categorical features to numeric features.
Normalize features.
The learned hyperplane is hard to interpret.
scikit-learn
nltk (interfaces with sklearn and other ML toolkits e.g. Weka).
from sklearn import naive_bayes, svm, metrics

clfrNB = naive_bayes.MultinomialNB()
clfrNB.fit(train_data, train_labels)
predicted_labels = clfrNB.predict(test_data)
metrics.f1_score(test_labels, predicted_labels, average='micro')

clfrSVM = svm.SVC(kernel='linear', C=0.1)
clfrSVM.fit(train_data, train_labels)
predicted_labels = clfrSVM.predict(test_data)
Training phase, inference phase
from sklearn import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    train_data, train_labels, test_size=0.333, random_state=0)
predicted_labels = model_selection.cross_val_predict(clfrSVM, train_data, train_labels, cv=5)
NLTK also has some classification algorithms, plus a SklearnClassifier wrapper around scikit-learn models
from nltk.classify import NaiveBayesClassifier, SklearnClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

classifier = NaiveBayesClassifier.train(train_set)
classifier.classify(unlabeled_instance)
classifier.classify_many(unlabeled_instances)
nltk.classify.util.accuracy(classifier, test_set)
classifier.labels()
classifier.show_most_informative_features()

clfrNB = SklearnClassifier(MultinomialNB()).train(train_set)
clfrSVM = SklearnClassifier(SVC(kernel='linear')).train(train_set)
Tf-idf - term frequency inverse document frequency
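Tf-idf weights a term by how frequent it is in a document, discounted by how many documents in the collection contain it. A minimal scikit-learn sketch (the min_df value and the reuse of clfrSVM / train_data from above are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(min_df=5)             # ignore terms that appear in fewer than 5 documents
X_train_tfidf = vect.fit_transform(train_data)
X_test_tfidf = vect.transform(test_data)     # reuse the same vocabulary for the test set
clfrSVM.fit(X_train_tfidf, train_labels)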
Grouping words with similar meanings (semantic similarity).
Useful for understanding tasks such as paraphrasing.
WordNet: semantic dictionary of linked words. Think tree.
Path similarity: based on the shortest path between two concepts in the hierarchy. Lowest common subsumer (LCS): the closest ancestor shared by two concepts. Lin similarity: similarity based on the information content of the LCS.
from nltk.corpus import wordnet as wn
deer = wn.synset('deer.n.01')
elk = wn.synset('elk.n.01')
deer.path_similarity(elk)

from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat')
deer.lin_similarity(elk, brown_ic)
Collocations and distributional similarity.
Distributional similarity: words that occur in similar contexts (the words before, after, or within a small window) tend to have similar meanings.
Pointwise mutual information: PMI(w1, w2) = log( P(w1, w2) / (P(w1) × P(w2)) ).
import nltk
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(text)
finder.apply_freq_filter(10)                 # only keep bigrams that occur at least 10 times
finder.nbest(bigram_measures.pmi, 10)        # top 10 bigrams by PMI
Latent Dirichlet Allocation (LDA).
Documents are a mixture of topics.
Coarse-level analysis of what the text represents.
Topics are represented by a word distribution.
What's known: the text collection (corpus) and the number of topics. What's not known: the actual topics and the topic distribution for each document.
Text clustering. Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA)
Unigram model: how likely you are to see each word in a text.
Mixture model: a mixture of topics.
LDA: a generative model for a document d: choose the document's length, choose a mixture of topics for the document, then generate each word by picking a topic from that mixture and drawing a word from that topic's word distribution.
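A rough sketch of that generative story with numpy (the vocabulary, topic-word distributions, and Dirichlet parameter below are tiny made-up examples):

import numpy as np

vocab = ['gene', 'dna', 'ball', 'game']
topics = np.array([[0.5, 0.5, 0.0, 0.0],     # topic 0: genetics words
                   [0.0, 0.0, 0.5, 0.5]])    # topic 1: sports words
alpha = [0.5, 0.5]

theta = np.random.dirichlet(alpha)           # per-document topic mixture
doc = []
for _ in range(10):                          # a toy document of 10 words
    z = np.random.choice(len(alpha), p=theta)        # pick a topic from the mixture
    w = np.random.choice(len(vocab), p=topics[z])    # pick a word from that topic's distribution
    doc.append(vocab[w])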
How many topics? Interpreting topics.
Can use the gensim or lda Python packages.
Pre-process the text: tokenize, normalize (lower-case), remove stop words (e.g. 'the'), stem. Convert the tokenized documents into a document-term matrix. Build the LDA model on the doc-term matrix.
# doc_set is a set of pre-processed (tokenized) text documents
import gensim
from gensim import corpora, models
dictionary = corpora.Dictionary(doc_set)
corpus = [dictionary.doc2bow(doc) for doc in doc_set]        # doc to bag-of-words
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=4, id2word=dictionary, passes=50)
ldamodel.print_topics(num_topics=4, num_words=5)
Information extraction: extract fields of interest, e.g. metadata (author, date, location).
Named entity recognition (NER), relation extraction.
Tag / classify named entities.
NER is typically a four-class model: PER (person), ORG (organization), LOC / GPE (location), and Other / Outside (everything else).
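NLTK's built-in chunker can be used for a quick named-entity pass (a minimal sketch; the sentence is just an example, and the chunker may first need nltk.download('maxent_ne_chunker') and nltk.download('words')):

import nltk

sentence = "Barack Obama visited Paris in 2016."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)                # POS tags are needed before NE chunking
entities = nltk.ne_chunk(tagged)             # tree with PERSON / ORGANIZATION / GPE chunks
print(entities)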
Co-reference resolution (e.g. resolving which name a pronoun like he/she refers to).
Question answering.
https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
https://en.wikipedia.org/wiki/Plate_notation