Week ending on 13 mar, 2019:
- There are a few popular frameworks for text analysis tasks in NLP. NLTK provides several learning algorithms for text classification, including naive Bayes (which I implemented last week), decision trees, and maximum entropy models, all located in the nltk.classify module.
- This week I worked on the maximum entropy model, NLTK's MaxentClassifier, to classify the same dataset. It seems to be much more accurate than naive Bayes, and naturally it takes much, much longer to train.
- the big difference between naive Bayes and maximum entropy is that maximum entropy doesn't assume that the features are conditionally independent of each other given the label
- On the default run I couldn't figure out when (if ever) the training stops (oops). It turns out there are 3 cutoffs you can specify via the cutoffs parameters: max_iter (maximum number of iterations), min_ll (stop once the negative average log-likelihood drops below a given value), and min_lldelta (stop if a single iteration improves log likelihood by less than a given value). See the sketch below.
- question: do I need to train the maximum entropy classifier every single time I run the program? Can I just train it once and continue to use the same trained model?
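A minimal sketch of both points, assuming `train_set` is the same kind of list of (featureset, label) tuples used for the Naive Bayes work (the toy data here is made up); the cutoff keywords come from NLTK's MaxentClassifier.train, and pickling is one standard way to reuse a trained model across runs:

```python
import pickle

from nltk.classify import MaxentClassifier

# Toy stand-in for the real (featureset, label) training data.
train_set = [({'contains(hi)': True}, 'Greet'),
             ({'contains(bye)': True}, 'Bye')]

# Train with explicit stopping cutoffs so training is guaranteed to halt.
classifier = MaxentClassifier.train(
    train_set,
    max_iter=20,       # hard cap on the number of iterations
    min_lldelta=0.01,  # stop if an iteration improves log-likelihood by < 0.01
)

# Save the trained model once...
with open('maxent_model.pickle', 'wb') as f:
    pickle.dump(classifier, f)

# ...then reload it on later runs instead of retraining.
with open('maxent_model.pickle', 'rb') as f:
    classifier = pickle.load(f)
```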
- Goals for next time:
- would like to learn how to package my work into a Python module that Shimi can use
- this might take some guidance from Ryan since I have limited knowledge of how to do this and how it would fit into the existing architecture
Week ending on 1 mar, 2019:
- Completed a program that uses NLTK's Naive Bayes Classifier to classify text as 1 of 15 dialogue act types, including Statement, ynQuestion, whQuestion, Emotion, etc.
- Trained this classifier on the NPS Chat Corpus, which contains 10,000+ posts from instant messaging sessions
- took a look at some of these messages and they were quite colloquial and informal
- perhaps we can look for different sets to train the Naive Bayes Classifier on?
- The classifier trains on a list of (featureset, sentence type) tuples,
- Each featureset is a dict like { 'contains(word)': True, 'contains(otherword)': True, ... }, so I needed a helper method to tokenize sentences and process them into a format the classifier could understand
- Sentence type is the string name of the dialogue act (e.g. 'whQuestion')
- if you don't give the classifier a sentence type, you can call classifier.classify(featureset) to get its guess for the dialogue act type
- you can also use nltk.classify.accuracy(classifier, testset) to find the accuracy of the classifier (what portion of the test cases does it guess correctly?). In my experience, it was around 65-70%.
- Was intrigued to find that when calling the method classifier.show_most_informative_features(), not only do we get the most informative features (word tokens), but also the pair of labels each feature most strongly separates. For example, contains(yes) gets mapped to (yAnswer : Emotion), so I'm not sure if yAnswer is 1 of the 15 dialogue types. (Pipeline sketched below.)
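For reference, a sketch of this pipeline, closely following the NLTK book's NPS Chat example (the helper name `dialogue_act_features` is illustrative, not necessarily my exact code; assumes the `punkt` and `nps_chat` data packages are downloaded):

```python
import nltk

def dialogue_act_features(post):
    """Turn a post into the {'contains(word)': True, ...} featureset format."""
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

posts = nltk.corpus.nps_chat.xml_posts()
featuresets = [(dialogue_act_features(p.text), p.get('class')) for p in posts]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))  # roughly 0.65-0.70
classifier.show_most_informative_features()
print(classifier.classify(dialogue_act_features('what time is it')))  # e.g. 'whQuestion'
```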
- Exploring other classifiers or training sets? The accuracy wasn't the best (around 70%), which means there's definitely room for improvement
- Working with the data structures the classifier wants was tedious, so it might be hard to create our own data or source other people's data
- Goals:
- Looking for other datasets to try
- Create a Python module and abstract the features for use on Shimi
Week ending on 22 feb, 2019:
- Basic sentiment analysis research
- basic text classification uses binary labels; something is either x or not x based on a list of words for a tag
- matches # of words quite naively (note: the "naive" in naive Bayes actually refers to the assumption that the features are independent of one another, not to the matching itself)
- Unfortunately, this really doesn't help us with Shimi because trigger phrases will need to be analyzed more specifically than just by a binary label
- Part of speech tagging and sentence segmentation can help us identify dialogue act types (e.g. statement, emotion, question, continuer).
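For example, NLTK's off-the-shelf tagger labels each token with a part of speech (assumes the `averaged_perceptron_tagger` data package; the example phrase and exact tags below are illustrative):

```python
import nltk

tokens = nltk.word_tokenize('Shimi, play something happy')
print(nltk.pos_tag(tokens))
# Output along the lines of:
# [('Shimi', 'NNP'), (',', ','), ('play', 'VB'), ('something', 'NN'), ('happy', 'JJ')]
# A base-form verb (VB) with no preceding subject suggests a command (imperative).
```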
- There is a (pseudo) workaround to the binary problem: create multiple labels (e.g. a color could be red, green, blue, white, or orange), make each one a binary feature like "color-is-red", and then compare.
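A tiny sketch of that workaround using the color example (all names here are hypothetical):

```python
COLORS = ['red', 'green', 'blue', 'white', 'orange']

def color_features(color):
    # One binary 'color-is-X' feature per possible value of the label.
    return {'color-is-{}'.format(c): (color == c) for c in COLORS}

print(color_features('red'))
# {'color-is-red': True, 'color-is-green': False, 'color-is-blue': False, ...}
```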
- Question: what kind of commands do we want this to support? Should Shimi only respond to commands / statements? That would make this a lot easier.
- Once we find that something is a question/command, we can scrape it for "play ____" and then pass whatever this argument is to another method.
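A hypothetical sketch of that scraping step, assuming the phrase has already been classified as a command:

```python
import re

def extract_play_argument(command):
    """Return whatever follows 'play' in a command, or None if absent."""
    match = re.search(r'\bplay\s+(.+)', command.lower())
    return match.group(1) if match else None

print(extract_play_argument('Shimi, play some jazz'))  # 'some jazz'
```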
- Trigger phrases and sentence similarity possibilities? Try measuring how similar each sentence is to a trigger phrase, and if it passes a certain threshold, treat it as a trigger (???)
- reference: "Perhaps by using a pre-trained encoder like Facebook's InferSent or Google's Sentence Encoder we can get a number which tells us how similar a phrase is to a trigger phrase, and activate the action if it passes a certain threshold."
- Goals:
- use the part of speech tagging to find out what kind of sentence a phrase is
- if time, explore sentence similarity to trigger words and find some kind of threshold
Week of 8 feb, 2019:
- Researched ways to interpret trigger words without matching string literals
- Stemming and lemmatization both attempt to reduce the inflectional forms of a word to a common base form.
- Stemming just chops off the ends of words in the hope of finding the base word most of the time, whereas lemmatization uses vocabulary and morphological analysis to try to find the base word, known as the lemma (the two are compared in the sketch below).
- The most common algorithm for stemming English is Porter's algorithm, which applies 5 phases of word reductions sequentially (like plural to singular). Many rules use the concept of the "measure" of a word, which roughly counts its syllables to check whether it is long enough to keep stemming.
- A lemmatizer will do the full morphological analysis to identify the lemma of each word.
- Now the challenge is, post-lemmatization, how to use the lemmatized words to determine whether a string matches the trigger words closely enough in meaning/definition to trigger an action from Shimi.
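A quick comparison of the two in NLTK (the lemmatizer needs the `wordnet` data package; the second argument tells it which part of speech to analyze):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem('studies'))               # 'studi'  -- crude suffix chop
print(lemmatizer.lemmatize('studies', 'v'))  # 'study'  -- real base form
print(stemmer.stem('better'))                # 'better' -- stemming can't help here
print(lemmatizer.lemmatize('better', 'a'))   # 'good'   -- morphological analysis can
```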
- Sentence Similarity Methods
- Word embeddings are used in NLP to compute the semantic similarity between words or to find synonyms for a target word.
- Baseline: take the average of the word embeddings of all words in a sentence and compare it with the average for another sentence. Obviously not the optimal solution, but the simplest (sketched below, after this list).
- Word Mover's Distance: uses the word embeddings of the words in 2 texts to measure the minimum distance that the words in 1 text need to "travel" in semantic space to reach the words of the other text
- Smooth Inverse Frequency addresses the problem that averaging all word embeddings in a sentence gives too much weight to irrelevant words. SIF takes a weighted average of the word embeddings in a sentence (downweighting frequent words like "but" and "just") and then removes the common component shared across sentences.
- Perhaps by using a pre-trained encoder like Facebook's InferSent or Google's Sentence Encoder we can get a number which tells us how similar a phrase is to a trigger phrase, and activate the action if it passes a certain threshold.
- The NLTK corpus module includes WordNet, a lexical database whose word-to-word similarity scores can be combined into a symmetric sentence similarity measure.
- Gensim is a similar alternative (https://radimrehurek.com/gensim/index.html).
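A sketch of the baseline mentioned above, assuming some pre-trained word-vector lookup `vectors` (e.g. a loaded gensim KeyedVectors model; the 0.8 threshold is a placeholder to be tuned):

```python
import numpy as np

def sentence_vector(sentence, vectors):
    """Average the embeddings of the in-vocabulary words of a sentence."""
    words = [w for w in sentence.lower().split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

def similarity(s1, s2, vectors):
    """Cosine similarity between the two averaged sentence vectors."""
    a, b = sentence_vector(s1, vectors), sentence_vector(s2, vectors)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_trigger(phrase, trigger_phrase, vectors, threshold=0.8):
    # Placeholder threshold; would need tuning on real phrases.
    return similarity(phrase, trigger_phrase, vectors) >= threshold
```

(For Word Mover's Distance, gensim's KeyedVectors also exposes a wmdistance method directly.)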
- Noodling around with NLTK
- Worked with stopwords, tokenizing, and the Porter stemmer (a popular stemmer)
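The kind of thing I was trying (needs the `punkt` and `stopwords` NLTK data packages; output is approximate):

```python
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

sentence = 'Shimi, can you play some upbeat music?'
tokens = word_tokenize(sentence.lower())
filtered = [t for t in tokens
            if t.isalpha() and t not in stopwords.words('english')]
stemmed = [PorterStemmer().stem(t) for t in filtered]
print(stemmed)  # roughly ['shimi', 'play', 'upbeat', 'music']
```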
- Goals for next week:
- learn to extract meaning and compare the meaning of 2 phrases quantitatively
- Create a bank of meanings/definitions to match intent with phrases
- for future: integrate with Shimi architecture