Get the starter code on GitHub Classroom.
Assuming you're done with Lab 4, you should clean up your Docker image (which contains the massive GloVe file). The easiest way to do this is to remove all of your Docker images (using docker system prune -a) and then re-pull the Docker image following the Docker instructions.
In this lab you will investigate the English Lexical Sample Task (Task 6) data taken from Senseval-3.
History: Senseval used to be a regular workshop for a set of shared tasks, in which a group of organizers would help prepare and release a shared dataset and groups from different institutions would compete to see who could design the system with the highest performance for that task. Conference-affiliated shared tasks and government-sponsored "bakeoffs" (where you could receive a federal grant for computing resources and research travel in exchange for participating in a shared task like this) have been a significant anchor for the NLP community over the past several decades. Benefits of these tasks include shared high-quality datasets that can continue to be used as benchmarks, increased interest and hype around challenging problems that might otherwise be intimidating to approach, clear channels of funding for work on hard-to-tackle problems, and an invitation for cross-institution interaction and collaboration with other researchers in your specific subfield. (If you want to see a very meta paper about this history, here's a 2012 analysis of the history of NLP using topic models.)
For today, you'll build a supervised decision list classifier, trained on the provided training data and tested on the provided test data.
Answers to written questions should be added to analysis.md, which should be converted into a pdf (analysis.pdf) for your final submission using pandoc. If you don't have pandoc installed, you can use Docker to do the conversion.
The data provided by the Senseval-3 competition is in /cs/cs159/data/senseval3/. Training and testing data are in separate subdirectories. Before you begin work on the lab, you need to spend a few minutes looking at the source data. In particular, examine the structure of the files /cs/cs159/data/senseval3/train/EnglishLS.train and /cs/cs159/data/senseval3/test/EnglishLS.test.
When you look through the training data file, you will see the text <lexelt item="activate.v">. This indicates that until the next lexelt tag, all of the training examples are for the ambiguous verb “activate”. There are 228 training examples provided for the lexelt activate.v. Each training example, called an instance, is given an instance id. The first instance has the id activate.v.bnc.00024693. The next line in the file indicates the sense label for the word “activate” in the text that follows. The first instance is labeled with sense 38201.
After the instance id and answer are provided, there is a short piece of text labeled with the tag <context>. This text contains the word “activate” and provides a few sentences of context to help you determine which sense is the correct one. In addition, the context indicates the word that you are trying to disambiguate. In the first context, you will find <head>activate</head>. It might seem strange to label the “head” word like this, but there are a few good reasons to do so. First, sometimes the word in the context isn’t the same as the lexelt. For example, in the second instance, the head word is “activated”, not “activate”. Second, it is possible that the word you are trying to disambiguate appears multiple times in the same piece of text, so marking the head word lets you know which one of the words you should be determining the sense of. Finally, sometimes there are multiple head words marked in the same context, indicating that those words all share the same sense.
Spend some time looking through the training data file. You should find that some instances are labeled with two senses, indicating that annotators either disagreed on what the correct sense was or believed that both senses were represented. In addition, some instances are labeled with the sense “U”, meaning “Unclear”: the annotators weren’t sure which sense to assign to the word. There are even some cases where a word was given two sense labels but one of them was “U”, meaning that some of the annotators agreed on one sense but other annotators were unclear on how to assign a sense to it.
The source data for the Senseval task is tricky to parse correctly. The starter code includes a file, Lexelt.py, that will do this parsing for you. Before you add to the file, make sure that you understand it, and that you can use it directly in the Python 3 interpreter. Here is a sample of what commands you can use with these data:
$ python3
>>> from Lexelt import get_data, get_key
>>> train_fp = open('/cs/cs159/data/senseval3/train/EnglishLS.train', 'r')
>>> train_data = get_data(train_fp) # This will take a few seconds to load in
>>> train_data.keys()
dict_keys(['activate.v', 'add.v', 'appear.v', ..., 'write.v'])
>>> train_data['smell.v'].keys() # Find all training example ids for the verb 'smell'
dict_keys(['smell.v.bnc.00006854', 'smell.v.bnc.00006855', 'smell.v.bnc.00018122', 'smell.v.bnc.00027848', 'smell.v.bnc.00029806', 'smell.v.bnc.00032140', 'smell.v.bnc.00033773', 'smell.v.bnc.00037068', ..., 'smell.v.bnc.00687246'])
>>> this_instance = train_data['smell.v'].get('smell.v.bnc.00018122')
>>> this_instance
<Lexelt.LexEltInstance object at 0x7f47a2624fd0>
>>> print(' '.join(this_instance.words)) # See the text of the example
Used to speed the germination of seeds and potatoes , said the inspector , and as a cleaning solvent . And it 's odourless , chimed in Constable Bewman helpfully . So there was no need to have anything highly scented or smelling strongly on the table , said Henry at once . I hope you never take it into your head to commit a murder , sir , said the Inspector . You do seem to have an eye for essentials .
>>> heads = this_instance.heads # Extract which token(s) in the example are the head word
>>> heads
[42]
>>> this_instance.words[heads[0]]
'smelling'
>>> this_instance = train_data['smell.v'].get('smell.v.bnc.00006855')
>>> this_instance.heads # Unusually, this one has two!
[19, 25]
>>> trainkey_fp = open('/cs/cs159/data/senseval3/train/EnglishLS.train.key', 'r')
>>> get_key(trainkey_fp, train_data)
>>> train_data['smell.v'].get('smell.v.bnc.00018122').answers
['3893505']
>>> train_data['smell.v'].get('smell.v.bnc.00006855').answers
['3893503', 'U']
This section asks you to write some helper functions that will make it easier to answer the questions in the following section. All of these functions should be new member functions of the LexElt class.
pos(self): Returns the part of speech of a LexElt object.
num_headwords(self): Returns a list with the number of headwords in each instance of a LexElt object.
num_answers(self): Returns a list with the number of answers in each instance of a LexElt object.
get_all_senses(self): Returns a single list containing all of the senses of all of the instances of a LexElt object, excluding U senses. Note that senses can be listed either as numerical IDs or as longer codes, depending on the data source (Wordsmyth uses an integer ID like 38201, while WordNet encodes metadata in the sense ID, like organization%1:04:02::).
count_unique_senses(self): Returns the number of unique senses used by at least one of the instances of a LexElt object. Excludes U.
most_frequent_sense(self): Returns the sense that is associated with the largest number of instances of a LexElt object (again, excluding U). Ties may be broken arbitrarily (that is, any way you like).
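If you're unsure where to start, here is a rough sketch of two of these; it relies only on the keys(), get(), and answers interfaces shown in the interpreter session above, plus your own get_all_senses, and you may need to adapt it to how Lexelt.py actually stores its data.

from collections import Counter

# Inside class LexElt (a sketch):

def num_answers(self):
    # One entry per instance: how many senses that instance was labeled with.
    return [len(self.get(iid).answers) for iid in self.keys()]

def most_frequent_sense(self):
    # Counter.most_common(1) breaks ties arbitrarily, which is allowed here.
    return Counter(self.get_all_senses()).most_common(1)[0][0]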
Check in. Confirm that your functions work before moving on:
>>> lexelt = train_data['activate.v']
>>> lexelt.pos()
'v'
>>> lexelt.num_headwords()
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
>>> lexelt.num_answers()
[1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 1]
>>> lexelt.get_all_senses()
['38201', '38201', '38202', '38203', '38201', '38203', '38203', '38201', '38201', '38201', '38202', '38202', '38201', '38203', '38201', '38203', '38201', '38203', '38201', '38201', '38201', '38203', '38201', '38203', '38201', '38202', '38201', '38201', '38201', '38201', '38203', '38201', '38201', '38203', '38201', '38203', '38201', '38203', '38203', '38201', '38201', '38202', '38201', '38201', '38201', '38202', '38201', '38201', '38202', '38201', '38201', '38203', '38201', '38201', '38201', '38201', '38201', '38202', '38201', '38201', '38201', '38201', '38201', '38201', '38203', '38201', '38201', '38201', '38203', '38201', '38203', '38201', '38201', '38201', '38201', '38201', '38203', '38201', '38203', '38201', '38201', '38201', '38203', '38201', '38201', '38201', '38201', '38201', '38202', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38202', '38201', '38202', '38202', '38201', '38202', '38201', '38201', '38201', '38201', '38202', '38203', '38203', '38202', '38204', '38202', '38202', '38201', '38202', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38203', '38202', '38203', '38201', '38201', '38201', '38201', '38203', '38201', '38203', '38201', '38201', '38201', '38202', '38204', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38203', '38201', '38201', '38201', '38201', '38203', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38203', '38201', '38201', '38201', '38201', '38201', '38201', '38203', '38201', '38203', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38203', '38201', '38201', '38201', '38203', '38201', '38201', '38201', '38201', '38201', '38201', '38203', '38201', '38201', '38201', '38201', '38201', '38201', '38202', '38201', '38201', '38201', '38201', '38203', '38201', '38203', '38203', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38202', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38202', '38201', '38202', '38202', '38201', '38202', '38202', '38202', '38202', '38202', '38202', '38202', '38201', '38202', '38201', '38202', '38201', '38202', '38201', '38202', '38202', '38202', '38202', '38202', '38201', '38202', '38201', '38202', '38202', '38201', '38201', '38201', '38202', '38201', '38202', '38202']
>>> lexelt.count_unique_senses()
4
>>> lexelt.most_frequent_sense()
'38201'
You should write the solutions to the following questions in the main function of Lexelt.py, and include your answers in your analysis.md file. The first question has been answered for you in the starter code. Note that all of these questions should only require the training data; don't touch the test data yet!
How many different “lexelts” (lexical elements) are there? The lexelts represent the polysemous word types that you are being asked to disambiguate.
What is the breakdown of part of speech (verbs, nouns and adjectives) of these lexelts? You can determine the part of speech by looking at the suffix attached to the lexelt: activate.v is a verb, shelter.n is a noun, and hot.a is an adjective.
How many instances (training examples) are there for the lexelt organization.n?
The introduction to the lab mentioned that an instance can have multiple head words and that an instance can have multiple answers. Make a small table showing the breakdown of the number of head words per instance. Make another table showing the breakdown of the number of answers per instance. Don't break this down per lexelt - each table should summarize all of the data.
How many senses of activate.v are represented in the training data? (Do not count “U” as a sense for this question.)
One common baseline for this task is to assume that you guess uniformly at random among the possible senses (that is, with equal probability for each sense). However, rather than actually making random guesses, you can just assume that you will get 1/n of the instances for a particular lexelt correct, where n is the number of senses for that lexelt. For example, if there are 5 senses for a lexelt w, the random baseline will get 1/5 = 20% of the instances of the lexelt w correct. It doesn't matter if the number of instances for w is not a multiple of n: if there are 37 instances for a lexelt with 5 senses, you will say that a random guess will get 37/5 = 7.4 of them correct.
The question for you is as follows: what percentage of the instances in the training data would you be able to label correctly if you just guessed randomly? Two important notes:
You should ignore U when counting how many senses a lexelt has for this question.
This question should not require you to actually look at the correct sense labels in answers for any examples. It doesn’t matter what the actual answers are: you are just trying to get an estimate for how many you would expect to get right if you just guessed randomly.
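As a purely made-up example of the arithmetic (these lexelts and counts are hypothetical): suppose the training data had one lexelt with 100 instances and 4 senses and another with 50 instances and 2 senses. Then:

# Hypothetical counts, just to illustrate the expected-accuracy arithmetic.
instances = {'fake.v': 100, 'madeup.n': 50}
senses = {'fake.v': 4, 'madeup.n': 2}

expected_correct = sum(instances[w] / senses[w] for w in instances)  # 100/4 + 50/2 = 50.0
total_instances = sum(instances.values())                            # 150
print(expected_correct / total_instances)                            # 0.333..., i.e. about 33%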
Another common baseline is the “most frequent sense”, or MFS, baseline. In the MFS baseline, you find the most frequent sense for each lexelt and then label every instance with that most frequent sense. What percentage of the instances in the training data would you label correctly using the most frequent sense baseline? For this question, you can assume that only the first sense listed in answers is the correct label for an instance. However, you should use all of the listed senses for all of the instances to determine the most frequent sense.
Is the MFS baseline higher or lower than you expected? Discuss.
Now, we'll start to use our functions to inspect the test data.
How many instances of activate.v are there in the test data?
There are two similar questions that are not repeats of one another:
Are there any senses of organization.n that are represented in the training data but not in the test data? Assuming the answer is yes, would this be a problem? Discuss briefly.
Are there any senses of organization.n that are represented in the test data but not in the training data? Assuming the answer is yes, would this be a problem? Discuss briefly.
For each lexelt in the training data, find the most frequent sense. Do the same for the test data. How many (and what percentage of) lexelts have the same most frequent sense in the training data and the test data? You can choose how to handle ties, but you must explain in your answer what method that you used. Discuss the implications of any disparities you found between the MFS in the training and the MFS in the test.
Use the MFS sense from the training data to label the test data. What is your accuracy labeling the test data using the MFS sense baseline? Compare to how the random baseline and the MFS baseline performed when applied to the training data. How does this match your intuition? Discuss. Note that if an instance is labeled with U, then U is the correct labeling for the sense. Your system is supposed to be able to predict when U is appropriate for the instance, so if you predict an actual sense label (e.g., '38201' or 'organization%1:04:02::') but the true label is U, you are incorrect.
In this week’s lab, you will be writing a decision list classifier. Later in the semester, you will use scikit-learn, which has many prebuilt classifiers for you to use. One of your goals this week is to build your decision list classifier with an interface similar to that of the classifiers in scikit-learn, so that you can compare the performance of your classifier against other classifiers you could have used, and so that you start to become familiar with the scikit-learn API.
One very important thing to keep in mind as you work through this lab is that you are not training a classifier for all of the lexelts at once. Rather, you will train a classifier on each lexelt (e.g. activate.v), then use that classifier to label the test instances for that lexelt. You can then throw that classifier away, train a new classifier on the next lexelt (e.g. add.v), and use it to label the test instances for that lexelt, looping through the list of lexelts.
In order to train a supervised classifier, you will need training data. For this task, you are provided with multiple instances for each lexelt. You will represent each instance as a feature vector. A feature vector is a numerical representation for a particular training instance. Each feature vector in the training data will have an accompanying target, a numerical representation of the sense that the instance was labeled with.
In order to conform with the scikit-learn interface, you will put all of your feature vectors (for a single lexelt, e.g. 'activate.v') into a numpy array, where each row of the array is an individual feature vector. In the scikit-learn documentation, this array of feature vectors is denoted by the variable X. If there are n training examples and m features, X will be an n × m array. Similarly, you will put all of your targets (for a single lexelt) into a 1-dimensional numpy array. The scikit-learn documentation refers to this array of targets as y. Each element y[i] is the target (often called the class label) for the ith training example in X, indexed from 0.
For the word sense disambiguation problem, you will extract two types of features:
bag of words features
collocational features
Bag of words features are simply the count for each word within some window size of the head word. For this lab, we will set the window size large enough to include all of the tokens in the context. Don’t throw away things like punctuation, numbers, etc. If your classifier finds those tokens useful to distinguish senses, you'll want to make sure they're available to use. If they are just noise, your classifier should be able to ignore them without you needing to remove them.
For example, for instance id activate.v.bnc.00044852, the bag of words features would show that the token ‘receptors’ appeared 4 times in the context, the token ‘(’ appeared 2 times, and the token ‘dimensions’ appeared 1 time.
Collocational features are the n-grams that include the head word and their counts. For this lab, you will extract the bigrams and trigrams that include the head word. If there are multiple head words in a single context, you will extract bigrams and trigrams for each one. The instance above, activate.v.bnc.00044852, contains the following text (where activated is the head word):
… upon which parts of the sensory system are activated : stimulation of the retinal receptors …
The collocational features in this instance are the bigrams “are activated” and “activated :”, and the trigrams “system are activated”, “are activated :”, and “activated : stimulation”. You can represent these n-grams as a single “word” by joining the tokens in the n-gram together using underscores, such as “system_are_activated”. (Seriously, this is what NLP researchers often do.) Just like with bag-of-words features, you’ll want to keep track of the count of each of these n-grams. In this case, the collocation “system_are_activated” appears 1 time.
Add the following member functions to the LexEltInstance class:
bow_features(self) should return a Counter object over all of the words in the instance.
colloc_features(self) should return a Counter object over all of the collocational features (bigrams and trigrams) for the instance.
make_features(self) should combine the output of bow_features and colloc_features into a single Counter, and should store that Counter as a features attribute of the LexEltInstance object.
get_feature_labels(self) should return all of the keys of the object’s features attribute.
You are welcome to write additional helper member functions; perhaps having separate functions to find bigrams and trigrams would be helpful here?
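If it helps, here is one rough sketch of what these could look like; it assumes the tokens are in self.words and the head indexes are in self.heads, as in the interpreter session earlier, and you are free to structure your helpers differently.

from collections import Counter

# Inside class LexEltInstance (a sketch):

def bow_features(self):
    # Bag of words: count every token in the context.
    return Counter(self.words)

def colloc_features(self):
    # Bigrams and trigrams that include each head word, joined with underscores.
    feats = Counter()
    for h in self.heads:
        for size in (2, 3):
            # Every window of length `size` that contains position h.
            for start in range(h - size + 1, h + 1):
                if 0 <= start and start + size <= len(self.words):
                    feats['_'.join(self.words[start:start + size])] += 1
    return feats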
Check-in. Now you should be able to interact with a LexEltInstance like this:
>>> this_instance = lexelt.get('activate.v.bnc.00044852')
>>> this_instance.make_features()
>>> this_instance.features
Counter({'the': 20, 'of': 14, ',': 10, 'to': 7, 'in': 6, 'are': 6, 'and': 5, 'experience': 4, 'system': 4, '.': 4, 'receptors': 4,...})
>>> this_instance.get_feature_labels()
dict_keys(['For', 'neurophysiologists', 'and', 'neuropsychologists', ',', 'the', 'way', 'forward', 'in', 'understanding', 'perception', 'has', 'been', 'to', 'correlate', 'these', 'dimensions', 'of', 'experience', 'with', 'firstly', 'material', 'properties', 'experienced', 'object', 'or', 'event', '(', 'usually', 'regarded', 'as', 'stimulus', ')', 'secondly', 'patterns', 'discharges', 'sensory', 'system', '.', 'Qualitative', 'Aspects', 'Experience', 'The', 'quality', 'modality', 'depends', 'less', 'upon', 'energy', 'reaching', 'nervous', 'than', 'which', 'parts', 'are', 'activated', ':', 'stimulation', 'retinal', 'receptors', 'causes', 'an', 'light', ';', 'inner', 'ear', 'gives', 'rise', 'sound', 'so', 'on', 'Muller', "'s", 'nineteenth', '-', 'century', 'doctrine', 'specific', 'energies', 'formalized', 'ordinary', 'observation', 'that', 'different', 'sense', 'organs', 'sensitive', 'physical', 'world', 'when', 'they', 'stimulated', 'sensations', 'those', 'It', 'was', 'proposed', 'there', 'endings', 'within', 'attuned', 'types', 'example', 'eye', 'respond', 'cochlear', 'vibrations', 'air', 'are_activated', 'activated_:', 'system_are_activated', 'are_activated_:', 'activated_:_stimulation'])
Now that you know what features to extract, you need to store them in a numpy array. To do that, you first need to know the complete set of features that you have (for one lexelt!). The first step, therefore, is to go through each instance of a lexelt and collect the features from each one.
Write a member function of the LexElt class called get_features(). This function should do the following:
Call make_features for all of the LexElt object’s instances
Build a list (sorted in lexicographic, i.e., alphabetical order) of all of the features that show up at least once in at least one instance for the LexElt object.
Create a numpy array the right size to hold a count for each of the features for each of the instances.
Pass the list of feature names to a new member function of the LexEltInstance class, to_vector, that will return a 1-dimensional numpy array.
Store the array from to_vector in the right row of the feature array.
get_features should then return both the list of feature labels and the numpy array with all of the features for all of the instances.
The to_vector(self, feature_list) function in the LexEltInstance class should return a 1-dimensional array with the count of each feature for that instance. The features must be returned in the same order they appear in feature_list, so that we know the values in one column correspond to the same feature for all instances!
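A minimal sketch of to_vector, assuming make_features has already been called so that self.features is a Counter (which returns 0 for any feature the instance doesn't have):

import numpy as np

# Inside class LexEltInstance (a sketch):

def to_vector(self, feature_list):
    # One count per feature, in exactly the order given by feature_list.
    return np.array([self.features[f] for f in feature_list])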
Check-in. If everything is working correctly, you should now be able to do the following:
>>> this_instance.to_vector(['and', 'types', 'system_are_activated'])
array([5, 1, 1])
>>> labels, X = lexelt.get_features()
>>> labels
['!', '%', "'", "'d", "'ll", "'m", "'re", "'s", "'ve", '(', '(_directly_activated', '(_injunctions_activated', '(_tonically_activating', ')', ')_is_activated', ..., 'zero', 'zipper', 'zoom']
>>> X
array([[1., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.]])
>>> X.shape
(228, 7074)
The data that we are working with occasionally has instances labeled with multiple correct answers. That’s a challenge to work with, so we’re going to simplify the problem and keep only the first listed answer for each instance. As with the X array you created above, you’re going to need the complete set of possible answers first. Once you have that, you’re going to map each answer to an integer and then store those integers in a one-dimensional numpy array y, such that the first entry in y is the target for the first vector in X.
Write a get_targets function that returns a (sorted) list of answer labels along with a y array with the correct label for each instance:
>>> answer_labels, y_arr = lexelt.get_targets()
>>> answer_labels
['38201', '38202', '38203', 'U']
>>> y_arr # matey
array([0, 0, 2, 0, 2, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0,
0, 2, 0, 3, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
1, 2, 2, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 2, 0, 0,
0, 0, 0, 3, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2,
0, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0,
0, 1, 1, 0, 0, 0, 0, 1])
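If you get stuck, here is one possible shape for get_targets; it is a sketch that assumes the instances are visited in the same order that get_features used to build the rows of X.

import numpy as np

# Inside class LexElt (a sketch):

def get_targets(self):
    # Keep only the first listed answer for each instance.
    first_answers = [self.get(iid).answers[0] for iid in self.keys()]
    labels = sorted(set(first_answers))
    y = np.array([labels.index(a) for a in first_answers])
    return labels, y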
The two functions you just wrote above (get_features and get_targets) were designed to take new training data and make the X and y arrays needed for training.
Once your classifier is trained, you’ll need to pass test data into the classifier, which means converting the test data into feature vectors, too. Fortunately, you’ve written almost all of the code needed to handle the test data. The one important thing is that when you read in the test data, you have to use the feature index that you already made. For any test feature that you haven’t seen in the training data, just throw it away. Your classifier doesn’t know anything about how to use that new feature anyway, so just ignore it.
You will need to modify your get_features and get_targets functions to each take an optional labels argument so that they handle unseen data properly, but it should be a quick fix! For get_features, you should ignore features that aren't in your feature labels, and in get_targets, you should return -1 in your test array if the target sense wasn't observed in the training data.
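The change can be small. Here is a sketch of the get_features side (get_targets is analogous, returning -1 for senses not in the passed-in labels); note that if to_vector only looks up the features it is given, unseen test features are dropped automatically.

import numpy as np

# Inside class LexElt (a sketch):

def get_features(self, feature_labels=None):
    for iid in self.keys():
        self.get(iid).make_features()
    if feature_labels is None:
        # Training: collect every feature seen in any instance, sorted.
        feature_labels = sorted({f for iid in self.keys()
                                 for f in self.get(iid).get_feature_labels()})
    # Test: any feature not in feature_labels is simply never looked up.
    X = np.array([self.get(iid).to_vector(feature_labels) for iid in self.keys()])
    return feature_labels, X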
Check-in. When you’re done, you should be able to do the following:
>>> test_fp = open('/cs/cs159/data/senseval3/test/EnglishLS.test', 'r')
>>> test_data = get_data(test_fp)
>>> testkey_fp = open('/cs/cs159/data/senseval3/test/EnglishLS.test.key', 'r')
>>> get_key(testkey_fp, test_data)
>>> test_lexelt = test_data['activate.v']
>>> _, Xtest = test_lexelt.get_features(labels)
>>> Xtest
array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 3., ..., 0., 0., 0.]])
>>> _, Ytest = test_lexelt.get_targets(answer_labels)
>>> Ytest
array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1,
0, 0, 1, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, -1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0])
Supervised decision list classifiers are trained to recognize the features that are high-quality indicators of a particular sense. Once trained, the classifier can be used to label unknown senses in test data.
At this point, you should have already extracted the features from training data and stored them in X, and extracted the targets and stored them in y as described above.
First, let’s look at a simplified version of the problem where you are trying to decide between only two senses. When a classifier only has to decide between two choices, it is called a binary classifier. In a binary classifier, you need to decide whether an unknown head word should be labeled as sense_1 or as sense_2.
For each feature f, which could be a word in the bag of words or a collocation, you compute the log-odds ratio of the probabilities of each sense given the feature to see how good that feature is at distinguishing between the two senses:

$$\text{score}(f) = \log \frac{P(\text{sense}_1 \mid f)}{P(\text{sense}_2 \mid f)}$$
A high magnitude score (e.g. +10 or -10 vs +.02 or -.02) for feature f indicates that the feature is good at distinguishing the senses. A low magnitude score indicates that the feature is not good at distinguishing the senses. A positive score means that sense_1 is more likely; a negative score means that sense_2 is more likely.
Probabilities can be estimated using maximum likelihood estimation, which estimates each probability from the counts of observed outcomes. In this case, we're going to sum the number of times we see each feature in the contexts labeled with a given sense:

$$P(\text{sense}_i \mid f) = \frac{\text{count}(f, \text{sense}_i)}{\text{count}(f)}$$

Here count(f, sense_i) is the total number of times feature f appears in instances labeled sense_i, and count(f) is the total number of times f appears across all instances.
Since you’re going to divide the two probabilities by one another and their denominators are the same, the denominators cancel out, leaving us with:

$$\text{score}(f) = \log \frac{\text{count}(f, \text{sense}_1)}{\text{count}(f, \text{sense}_2)}$$
Note that these counts are looking at each time a feature shows up: if a feature is present twice in a given sense_1 instance, that will add 2 to our count in the numerator.
In some cases, the numerator or denominator might be zero, so you will use Laplace smoothing (it’s easy!). Instead of adding 1, though, you will use +α smoothing, adding α = 0.1 to both the numerator and the denominator, in order to avoid division-by-zero errors and to avoid taking the log of 0.
In the lexical sample data you have, you can’t directly use the formulation above because you don’t have a binary classification: you have to decide which sense label to assign when there might be many more than 2 choices. Rather than the formulation shown above, you use the following modification for each sense (written here for sense_1, which is compared against all of the other senses):

$$\text{score}(f) = \log \frac{P(\text{sense}_1 \mid f)}{P(\neg\,\text{sense}_1 \mid f)}$$
Like before, high positive values will be strong indicators of sense_1. Low positive values will be weak indicators of sense_1. Negative values will be indicators that it is not sense_1, but since we don’t classify words as not being a particular sense, your decision list classifier won’t make use of negative scores.
Similar to the simplification above, you can rewrite the ratio of probabilities as a ratio of counts:

$$\text{score}(f) = \log \frac{\text{count}(f, \text{sense}_1)}{\sum_{j \neq 1} \text{count}(f, \text{sense}_j)}$$
Finally, you will use Laplace smoothing to help handle possible zeros:

$$\text{score}(f) = \log \frac{\text{count}(f, \text{sense}_1) + \alpha}{\sum_{j \neq 1} \text{count}(f, \text{sense}_j) + \alpha}$$
You will fill in the implementation of a DecisionList class in DecisionList.py that conforms to the scikit-learn API.
To match scikit-learn, the class has two methods: fit(X, y), which will be used to train the classifier, and predict(X), which will be used to predict senses in the test data. The constructor (the __init__ method) for the DecisionList will take three optional parameters:
default_target gives the fall-back class the classifier will predict if it doesn’t have any information matching an instance,
alpha determines the value to be used for Laplace smoothing. Set alpha to 0.1 if it is not specified.
min_score will be used later to say something about how confident the classifier should be about a rule it learns before it applies that rule. Set min_score to 0 if it is not specified.
Your fit(X,y) method will take two parameters: X, the numpy array of feature vectors for the training data, and y, the numpy array of targets from the training data.
To train a decision list from the features (of a single lexelt!), you will first calculate a score for each (feature f, sense i) pair. For some senses, a particular feature might be a good indicator; but for another sense of the same word, it will not be a good indicator. Sort these pairs so that the (feature f, sense i) pair with the highest positive score is first. Throw away any scores that are less than min_score. Save the resulting list of pairs as a rules attribute of the DecisionList object.
To make this code easier to read (and debug), write a helper member function get_score(self, feature, label, X, y) that takes a feature, a label, a feature array X, and a label array y, and returns a score for that combination of feature and label. You should use the Laplace smoothed score from above in this function.
The fit method will fit your model without returning anything. All of the state of the decision list will be stored in the object.
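To make the connection between the formula and the X and y arrays concrete, here is one way get_score could look. This is a sketch; note that the base of the logarithm (natural log vs. log base 2) scales all scores uniformly, so choose whichever matches the check-in values below.

import numpy as np

# Inside class DecisionList (a sketch):

def get_score(self, feature, label, X, y):
    # Count of this feature in instances with this label vs. all other labels,
    # with +alpha smoothing on both the numerator and the denominator.
    count_with = X[y == label, feature].sum()
    count_without = X[y != label, feature].sum()
    return np.log((count_with + self.alpha) / (count_without + self.alpha))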
Check-in. Verify that your fit function is working:
>>> from DecisionList import DecisionList
>>> d = DecisionList(alpha=0.1, min_score=5, default_target=5)
>>> d.get_score(0, 0, X, Y)
2.471305718925589
>>> d.get_score(0, 1, X, Y)
-6.149747119504682
>>> d.get_score(42, 0, X, Y)
3.4594316186372978
>>> d.fit(X, Y)
>>> d.rules[:10]
[(7067, 0), (4077, 0), (4047, 0), (1218, 0), (878, 1), (4482, 0), (3326, 0), (5218, 0), (1800, 0), (4980, 0)]
Now that your classifier is trained, you’re ready to use it to predict labels for new instances.
Your predict(X) method will take an array of feature vectors representing the test data.
For each instance in the test data (a single row in the matrix X), you walk down the .rules decision list until you find a matching (feature f, sense i) pair. Since the decision list is sorted from most predictive to least predictive, you will end up classifying your instance based on the presence of the best (highest-scoring) matching feature. As you walk down the decision list, if an instance contains feature f, you will label it as sense i. If it does not, you will proceed to the next (feature, sense) pair. Continue this process until you have exhausted all of the rules in your decision list.
If none of the features in the decision list match, the classifier still needs to label the instance with some default classification. We will pass that value in to the DecisionList constructor through the default_target parameter. Since we have already seen that the most frequent sense is a good baseline classifier, we will use the MFS for each word as the default sense when we build the DecisionList classifier. (Once again, we're making a different classifier for each lexelt, so you should pass in that lexelt's single MFS, using the tie-breaking scheme you chose earlier.)
Your predict(X) method will return a one-dimensional numpy array containing the predicted labels. The resulting numpy array should have the same format as the array y that you passed into the fit method.
To break the work of predict down into smaller (more testable!) pieces, write a helper method predict_one that takes a feature vector for a single instance as input, and returns a predicted label for that instance. Then predict can call predict_one on each row of the X array:
>>> d.predict_one(Xtest[0])
0
>>> d.predict_one(Xtest[101])
1
>>> d.predict(Xtest)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
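For reference, one possible shape for these two methods (a sketch; it assumes self.rules is the sorted list of (feature, sense) pairs that fit built, and that self.default_target was set in the constructor):

import numpy as np

# Inside class DecisionList (a sketch):

def predict_one(self, vector):
    for feature, sense in self.rules:
        if vector[feature] > 0:        # the instance contains this feature
            return sense
    return self.default_target         # no rule matched

def predict(self, X):
    return np.array([self.predict_one(row) for row in X])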
A final reminder: you need to create a separate decision list for each lexelt (e.g. one decision list for bank.n, another for activate.v, etc.)
Continue to put your answers to these questions in analysis.md.
Implement the decision list described above, apply it to the test data, and report your accuracy across all instances of all lexelts. (The DecisionList.py main function already has code to help you do this!) Does your decision list outperform the MFS baseline?
Examine the rules in the decision list for activate.v. What problems do you notice in the rules you have generated? Describe and comment on several examples.
Choose two (or more) of the following and report your results using clear data presentation (i.e., tables and plots with a text summary of what they show). Provide some discussion: were you surprised by the result? Why do you think you got that result?
A. Use Tfidf to downweight common features. Assuming your training matrix is called X_train and your test matrix is called X_test:
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> transformer = TfidfTransformer(smooth_idf=False)
>>> transformer.fit(X_train)
>>> X_train = transformer.transform(X_train).toarray()
>>> X_test = transformer.transform(X_test).toarray()
Comment on the impact of the transformation on your results. Does it help? Why do you think that is?
B. How does your choice of min_score affect your results? Try a range of possible values, and report trends that you find.
C. What if you used another classifier instead of your decision list classifier? If you followed the rules of the scikit-learn API, this question is a piece of cake! This example uses the GaussianNB classifier but you can try any classifier in scikit-learn instead. To do so, simply import the classifier you want (e.g. from sklearn.naive_bayes import GaussianNB). Now find the line where you created your DecisionList object. It probably looks something like this in your code:
my_classifier = DecisionList(alpha=alpha, default_target=mfs)
Simply replace that line with one of the classifiers from scikit-learn:
my_classifier = GaussianNB()
You shouldn’t have to change anything else! Your code will now run using the classifier you chose from scikit-learn. Play around and let us know what you discover!
D. Is there a correlation between the score a rule has and the accuracy of the prediction? Do high scoring rules work well? Do low scoring rules work equally well? A nice visualization could show a bar chart of the accuracy of the rules (y-axis) plotted against the score of the rule (x-axis). You’re welcome to come up with whatever you’d like here – we’re interested to see what you find!