Lab 5: Senseval

Starter Code

Get the starter code on GitHub Classroom.

FREEING UP COMPUTER SPACE

Assuming you're done with Lab 4, you should probably delete the word embedding files from last week (the .npy and glove/ directories) to free up space. If you're finding you're still low on space, you can clear out all of your Docker images (using docker system prune -a) and then re-pull the Docker image using the Docker instructions. Note that this will clear all of the images, including those for other classes, so please tidy with care. You can check the documentation for docker system prune if you want to do something fancier.

Introduction

In this lab you will investigate the English Lexical Sample Task (Task 6) data taken from Senseval-3 in 2004.

History: Senseval used to be a regular workshop for a set of shared tasks, in which a group of organizers would help prepare and release a shared dataset and groups from different institutions would compete to see who could design the system with the highest performance for that task. Conference-affiliated shared tasks and government-sponsored "bakeoffs" (where you could receive a federal grant for computing resources and research travel in exchange for participating in a shared task like this) have been a significant anchor for the NLP community over the past several decades. Benefits of these tasks include shared high-quality datasets that can continue to be used as a benchmark, increased interest and hype around challenging problems that might otherwise be intimidating to approach, clear channels of funding for work on hard-to-tackle problems, and an invitation for cross-institution interaction and collaboration with other researchers in your specific subfield. (If you want to see a very meta paper about this history, here's a 2012 analysis of the history of NLP using topic models.)

For today, you'll build a supervised decision list classifier, trained on the provided training data and tested on the provided test data.

Answers to written questions should be added to analysis.md, which should be converted into a pdf (analysis.pdf) for your final submission using pandoc. If you need help doing this because you don't have pandoc installed, you can use Docker to help.

  1. Examining the data

The data provided by the Senseval-3 competition is in /cs/cs159/data/senseval3/. Training and testing data are in separate subdirectories. Before you begin work on the lab, you need to spend a few minutes looking at the source data. In particular, examine the structure of the files /cs/cs159/data/senseval3/train/EnglishLS.train and /cs/cs159/data/senseval3/test/EnglishLS.test. (If you are SSH'ed into knuth and not sure which application to use to look at these files, you can open these files using less.)

When you look through the training data file, you will see the text <lexelt item="activate.v">. This indicates that until the next lexelt tag, all of the training examples are for the ambiguous verb “activate”. There are 228 training examples provided for the lexelt activate.v. Each training example, called an instance, is given an instance id. For example, the second instance has the id activate.v.bnc.00044852:

<instance id="activate.v.bnc.00044852" docsrc="BNC">

<answer instance="activate.v.bnc.00044852" senseid="38201"/>

<answer instance="activate.v.bnc.00044852" senseid="38202"/>

<context>

For neurophysiologists and neuropsychologists , the way forward in understanding perception has been to correlate these dimensions of experience with , firstly , the material properties of the experienced object or event ( usually regarded as the stimulus ) and , secondly , the patterns of discharges in the sensory system . Qualitative Aspects of Experience The quality or modality of the experience depends less upon the quality of energy reaching the nervous system than upon which parts of the sensory system are <head>activated</head> : stimulation of the retinal receptors causes an experience of light ; stimulation of the receptors in the inner ear gives rise to the experience of sound ; and so on . Muller 's nineteenth - century doctrine of specific energies formalized the ordinary observation that different sense organs are sensitive to different physical properties of the world and that when they are stimulated , sensations specific to those organs are experienced . It was proposed that there are endings ( or receptors ) within the nervous system which are attuned to specific types of energy , For example , retinal receptors in the eye respond to light energy , cochlear endings in the ear to vibrations in the air , and so on .

</context>

</instance>

We can see from the instance id that the lexelt we're interested in for instance 00044852 (which came from the BNC corpus) is the verb "activate". It has two valid senses here, 38201 and 38202, indicating the annotators either disagreed on the correct sense or agreed that more than one sense applied. After the instance id and answers are provided, there is a short piece of text labeled with the tag <context>. This text contains the word “activate” in some form (here, "activated") and provides a few sentences worth of context around the word to help you determine which sense is the correct sense. In addition, the context indicates the word that you are trying to disambiguate, as highlighted above: <head>activated</head>. It might seem strange to label the “head” word like this, but there are a few good reasons to do so:

  1. Sometimes the word in the context doesn't have the same surface form as the lexelt. In the above instance, the head word is “activated”, not “activate”; in other cases, it could be "activate" or "activates".

  2. It is possible that the word you are trying to disambiguate appears multiple times with different senses in the same piece of text (think "I'd bank on my bank being open"), so marking the head word lets you know which of the words you should be determining the sense of.

  3. Finally, sometimes there are multiple head words marked in the same context, indicating that those words all share the same sense.

Spend some time looking through the training data file to see if you can find examples with differing surface forms, multiple head words, multiple senses, and so on. In addition, some instances are labeled with the sense “U”, meaning “Unclear”. This means that the annotators weren’t clear which sense to assign to the word. There are even some cases where the word was given two sense labels but one of them was “U”, meaning that some of the annotators agreed on one sense but other annotators were unclear on how to assign a sense to it.

Accompanying this file is a separate file, /cs/cs159/data/senseval3/train/EnglishLS.train.key, that serves as the answer "key" for the data. It has one line per training instance with the corresponding valid sense label(s). Matching the second instance above, the second line of the file is:

activate.v activate.v.bnc.00044852 38201 38202

This might seem a little redundant with the training data (which contains annotations of which senses are valid), but when we start using the test data, we won't include the <answer> tags in the instances, so we'll need the file to be separate. To keep things consistent between train and test data processing, our code will extract the correct senses from the file above, not from the original training file.
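As a point of reference, each line of the key file is whitespace-delimited: the lexelt, then the instance id, then one or more sense labels. Here is a quick sketch of how one such line breaks apart; the variable names are purely illustrative and are not part of the starter code.

line = 'activate.v activate.v.bnc.00044852 38201 38202'
fields = line.split()
lexelt_name = fields[0]   # 'activate.v'
instance_id = fields[1]   # 'activate.v.bnc.00044852'
senses = fields[2:]       # ['38201', '38202'] -- one or more valid sense labels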

1.1 Parsing the data

The source data for the Senseval task is tricky to parse correctly, so the starter code includes a file, Lexelt.py, that will do this parsing for you. Before you add to the file, make sure that you understand it, and that you can use it directly in the Python 3 interpreter. Here is a sample of what commands you can use with these data:

Loading the data

First, we will load all of the training data. This will give us a dictionary of LexElt objects (each of which acts like a dictionary of its instances), indexed by lexelt name:

$ python3

>>> from Lexelt import get_data, get_key

>>> train_fp = open('/cs/cs159/data/senseval3/train/EnglishLS.train', 'r')

>>> train_data = get_data(train_fp) # This will take a few seconds to load in

>>> train_data.keys() # Find all the unique lexelts

dict_keys(['activate.v', 'add.v', 'appear.v', ..., 'write.v'])

>>> train_data['smell.v'].keys() # Find all training example ids for the verb 'smell'

dict_keys(['smell.v.bnc.00006854', 'smell.v.bnc.00006855', 'smell.v.bnc.00018122', 'smell.v.bnc.00027848', 'smell.v.bnc.00029806', 'smell.v.bnc.00032140', 'smell.v.bnc.00033773', 'smell.v.bnc.00037068', ..., 'smell.v.bnc.00687246'])

Pulling information for an instance

The earlier code will have made an object of class LexEltInstance for each of the instances in the training data. This will allow us to access different information about the instance:

>>> this_instance = train_data['smell.v'].get('smell.v.bnc.00018122') # Select a specific instance

>>> this_instance

<Lexelt.LexEltInstance object at 0x7f47a2624fd0>

>>> print(' '.join(this_instance.words)) # See the text of the instance

Used to speed the germination of seeds and potatoes , said the inspector , and as a cleaning solvent . And it 's odourless , chimed in Constable Bewman helpfully . So there was no need to have anything highly scented or smelling strongly on the table , said Henry at once . I hope you never take it into your head to commit a murder , sir , said the Inspector . You do seem to have an eye for essentials .


>>> heads = this_instance.heads # Extract which token(s) in the example are the head word

>>> heads

[42]

>>> this_instance.words[heads[0]]

'smelling'

>>> this_instance = train_data['smell.v'].get('smell.v.bnc.00006855')

>>> this_instance.heads # Unusually, this one has two!

[19, 25]

Finding sense labels

As we mentioned before, we're pretending we don't have access to the labels in the training data when we first load it (so in the code above, the "answers" attribute of each LexEltInstance is None). If we want to add the correct senses to our data, we'll pull from our "answer key" using the get_key function, which will fill in the answers as a list for each instance.

>>> trainkey_fp = open('/cs/cs159/data/senseval3/train/EnglishLS.train.key', 'r')

>>> get_key(trainkey_fp, train_data)

>>> train_data['smell.v'].get('smell.v.bnc.00018122').answers

['3893505']

>>> train_data['smell.v'].get('smell.v.bnc.00006855').answers

['3893503', 'U']


1.2 Exploring the training data

This section asks you to write some helper functions that will make it easier to answer the questions in the following section. All of these functions should be new member functions of the LexElt class.

  • pos(self): Returns the part of speech of a LexElt object.

  • num_headwords(self): Returns a list with the number of headwords in each instance of a LexElt object.

  • num_answers(self): Returns a list with the number of answers in each instance of a LexElt object.

  • get_all_senses(self): Returns a single list containing all of the senses of all of the instances of a LexElt object. Excludes U senses. Note that these can be listed either as numerical IDs or longer codes, depending on the data source (Wordsmyth uses just an integer ID like 38201, while WordNet expresses metadata in the sense ID like organization%1:04:02::)

  • count_unique_senses(self): Returns the number of unique senses used by at least one of the instances of a LexElt object. Excludes U.

  • most_frequent_sense(self): Returns the sense that is associated with the largest number of instances of a LexElt object (again, excluding U). Ties may be broken arbitrarily (that is, any way you like).
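To give a sense of the expected shape of these member functions, here is a rough sketch of two of them. It assumes the keys() and get() behavior shown in Section 1.1, that each instance stores its sense labels in its answers attribute, and that Counter is imported from collections; your implementation may reasonably differ.

from collections import Counter

def num_answers(self):
    # One count per instance, in the order the instance ids are stored
    return [len(self.get(iid).answers) for iid in self.keys()]

def most_frequent_sense(self):
    # Count every listed sense (excluding 'U') and take the most common one;
    # Counter.most_common breaks ties arbitrarily, which is allowed here
    senses = [s for iid in self.keys()
              for s in self.get(iid).answers if s != 'U']
    return Counter(senses).most_common(1)[0][0]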

Check in. Confirm that your functions work before moving on:

>>> lexelt = train_data['activate.v']

>>> lexelt.pos()

'v'

>>> lexelt.num_headwords()

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

>>> lexelt.num_answers()

[1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 1]

>>> lexelt.get_all_senses()

['38201', '38201', '38202', '38203', '38201', '38203', '38203', '38201', '38201', '38201', '38202', '38202', '38201', '38203', '38201', '38203', '38201', '38203', '38201', '38201', '38201', '38203', '38201', '38203', '38201', '38202', '38201', '38201', '38201', '38201', '38203', '38201', '38201', '38203', '38201', '38203', '38201', '38203', '38203', '38201', '38201', '38202', '38201', '38201', '38201', '38202', '38201', '38201', '38202', '38201', '38201', '38203', '38201', '38201', '38201', '38201', '38201', '38202', '38201', '38201', '38201', '38201', '38201', '38201', '38203', '38201', '38201', '38201', '38203', '38201', '38203', '38201', '38201', '38201', '38201', '38201', '38203', '38201', '38203', '38201', '38201', '38201', '38203', '38201', '38201', '38201', '38201', '38201', '38202', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38202', '38201', '38202', '38202', '38201', '38202', '38201', '38201', '38201', '38201', '38202', '38203', '38203', '38202', '38204', '38202', '38202', '38201', '38202', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38203', '38202', '38203', '38201', '38201', '38201', '38201', '38203', '38201', '38203', '38201', '38201', '38201', '38202', '38204', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38203', '38201', '38201', '38201', '38201', '38203', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38203', '38201', '38201', '38201', '38201', '38201', '38201', '38203', '38201', '38203', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38203', '38201', '38201', '38201', '38203', '38201', '38201', '38201', '38201', '38201', '38201', '38203', '38201', '38201', '38201', '38201', '38201', '38201', '38202', '38201', '38201', '38201', '38201', '38203', '38201', '38203', '38203', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38202', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38201', '38202', '38201', '38202', '38202', '38201', '38202', '38202', '38202', '38202', '38202', '38202', '38202', '38201', '38202', '38201', '38202', '38201', '38202', '38201', '38202', '38202', '38202', '38202', '38202', '38201', '38202', '38201', '38202', '38202', '38201', '38201', '38201', '38202', '38201', '38202', '38202']

>>> lexelt.count_unique_senses()

4

>>> lexelt.most_frequent_sense()

'38201'


1.3 Questions

You should write code to print answers to the following questions in the main function of Lexelt.py, and include your answers in your analysis.md file. The first question has been answered for you in the starter code. Note that all of these questions should only require the training data; don't touch the test data yet!

  1. How many different “lexelts” (lexical elements) are there? The lexelts represent the polysemous word types that you are being asked to disambiguate.

  2. What is the breakdown of part of speech (verbs, nouns and adjectives) of these lexelts? Remember, you can determine the part of speech by looking at the suffix attached to the lexelt: "activate.v" is a verb, "shelter.n" is a noun, and "hot.a" is an adjective.

  3. How many instances (training examples) are there for the lexelt "organization.n"?

  4. The introduction to the lab mentioned that an instance can have multiple head words and that an instance can have multiple answers. Make a small table showing the breakdown of the number of head words per instance. Make another table showing the breakdown of the number of answers per instance. Don’t break this down per lexelt - each table should summarize all of the data.

  5. How many senses of "activate.v" are represented in the training data? (Do not count U as a sense for this question.)

  6. One common baseline for this task is to assume that you guess uniformly at random among the possible senses (that is, with equal probability for each sense). However, rather than actually making random guesses and seeing how well you do, you can just assume that you will get 1/n of the instances for a particular lexelt correct, where n is the number of senses for that lexelt. For example, if there are 5 senses for a lexelt w, the random baseline will get 1/5=20% of the instances of the lexelt w correct. It doesn’t matter if the number of instances for w is not a multiple of n: if there are 37 instances for a lexelt with 5 senses, you will say that a random guess will give you 37/5=7.4 of them correct.
    The question for you is as follows: what percentage of the instances in the training data would you be able to label correctly if you just guessed randomly? (One way to structure this computation is sketched in code after this list.) Two important notes:

    • You should ignore U when counting how many senses a lexelt has for this question.

    • This question should not require you to actually look at the correct sense labels in answers for any examples. It doesn’t matter what the actual answers are: you are just trying to get an estimate for how many you would expect to get right if you just guessed randomly.

  7. Another common baseline is the most frequent sense (or MFS) baseline. In the MFS baseline, you find the most frequent sense for each lexelt and then label every instance with that most frequent sense. What percentage of the instances in the training data would you label correctly using the most frequent sense baseline? For this question, when you are computing which sense is most frequent, you should use all of the listed senses for all of the instances of a lexelt. However, when you test your accuracy, you should assume that only the first sense listed in answers is the correct label for an instance.
    Is the MFS baseline higher or lower than you expected? Discuss why you think that might be: what does that imply about the number of senses and the distribution over senses for words?
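Here is the sketch of the random-baseline computation promised in question 6. It assumes the Section 1.2 helper functions are finished and that train_data has been loaded as in Section 1.1; treat it as one possible way to organize the arithmetic, not as the required solution.

expected_correct = 0
total_instances = 0
for lexelt_name in train_data.keys():
    lexelt = train_data[lexelt_name]
    n_instances = len(lexelt.keys())
    n_senses = lexelt.count_unique_senses()      # excludes 'U'
    expected_correct += n_instances / n_senses   # expect 1/n of these to be right
    total_instances += n_instances
print(100 * expected_correct / total_instances)  # percentage over all instances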

1.4 Exploring the test data

Now, we'll start to use our functions to inspect the test data. Note that we're breaking one of the cardinal rules, which is "don't touch the test data until your model is ready to go". In this case, though, we want to get some intuition about ways that our training and test data might be different.

Answer the following questions in your writeup:

  1. How many instances of lexelt activate.v are there in the test data?

  2. There are two similar questions that are not repeats of one another:

    1. Are there any senses of organization.n that are represented in the training data but not in the test data? Assuming the answer is yes, would this cause predictions based on the training data to perform poorly? Discuss briefly.

    2. Are there any senses of organization.n that are represented in the test data but not in the training data? Assuming the answer is yes, would this cause predictions based on the training data to perform poorly? Discuss briefly.

  3. For each lexelt in the training data, find the most frequent sense (MFS). Do the same for the test data. How many (and what percentage of) lexelts have the same most frequent sense in the training data and the test data? You can choose how to handle ties, but you must explain in your answer what method you used. Discuss the implications of any disparities you found between the MFS in the training data and the MFS in the test data.

  4. Use the MFS computations from the training data to predict the labels for the test data. What is your accuracy labeling the test data using the MFS baseline? Compare to how the random baseline and the MFS baseline performed when applied to the training data. How does this match your intuition? Discuss.

Note that if an instance is labeled with U, then U is the correct labeling for the sense. The fact that MFS doesn't ever generate U doesn't change the fact that the label U is the correct one, so we won't give it sympathy points: if you predict an actual sense label (e.g., '38201' or 'organization%1:04:02::') but the true label is U, that counts as an incorrect label.

2. Feature extraction

In this week’s lab, you will be writing a decision list classifier. Later in the semester, you will use scikit-learn, which has many prebuilt classifiers for you to use. One of your goals this week is to build your decision list classifier with an interface similar to that of the classifiers in scikit-learn so that you can compare the performance of your classifier against other classifiers you could have used, and so that you start to become familiar with the scikit-learn API.

One very important thing to keep in mind as you work through this lab is that your classifier is not going to learn all of the lexelts at once. Rather, you will train a classifier on each lexelt (e.g. activate.v), then use that classifier to label the test instances for that lexelt. You can then throw that classifier away and train a new classifier on the next lexelt (e.g. add.v) and use that classifier to label the test instances for that lexelt, looping through the list of lexelts.

2.1 Feature vectors and labels

In order to train a supervised classifier, you will use your training data from before. For this task, as you saw in looking through the data earlier, you are provided with multiple instances for each lexelt. You will represent each distinct instance as a feature vector. A feature vector is a numerical representation for a particular training instance. Each feature vector in the training data will have an accompanying target, a numerical representation of the sense that the instance was labeled with from the answer key.

In order to conform with the scikit-learn interface, you will put all of your feature vectors (for a single lexelt, e.g. 'activate.v') into a numpy array, where each row of the array is an individual feature vector and each column refers to a distinct feature. In the scikit-learn documentation, this array of feature vectors is denoted by the variable X. If there are n training examples and m features, X will be an n × m array, with one row for each of the n examples and one column for each of the m features. Similarly, you will load the targets (for a single lexelt) into a 1-dimensional numpy array. The scikit-learn documentation refers to this array of targets as y. Each element y[i] is the target (often called the class label) for the ith training example in X, indexed from 0.

Feature extraction

For the word sense disambiguation problem, you will extract two types of features:

  • bag of words features

  • collocational features

Bag of words features are simply the count for each word within some window size of the head word (that is, within a range of some number of tokens away from the head word). For this lab, we will set the window size large enough to include all of the tokens in the context. Don’t throw away tokens like punctuation, numbers, and symbols: if your classifier finds those tokens useful to distinguish senses, you'll want to make sure they're available to use. If they are just noise, your classifier should be able to ignore them without you needing to remove them from the list of features.

For example, for instance id activate.v.bnc.00044852, the bag of words features would show that the token "receptors" appeared 4 times in the context, the token "(" appeared 2 times, and the token "dimensions" appeared 1 time.

Collocational features are the n-grams that include the head word and their counts. For this lab, you will extract the bigrams and trigrams that include the head word. If there are multiple head words in a single context, you will extract bigrams and trigrams for each one. The instance above, activate.v.bnc.00044852, contains the following text (where activated is the head word):

… upon which parts of the sensory system are activated : stimulation of the retinal receptors …

The collocational features in this instance are the bigrams “are activated” and “activated :”, and the trigrams “system are activated”, “are activated :”, and “activated : stimulation”. (Again, you can see we are not removing punctuation or non-alphabet symbols!)

You can represent these n-grams as a single “word” by joining the tokens in the n-gram together using underscores, such as "system_are_activated". (This might seem hacky, but seriously, this is what NLP researchers often do.) Just like with bag-of-words features, you’ll want to keep track of the count of each of these n-grams. In this case, the collocation “system_are_activated” appears 1 time.

Add the following member functions to the LexEltInstance class:

  • bow_features(self) should return a Counter object over all of the words in the instance.

  • colloc_features(self) should return a Counter object over all of the collocational features (bigrams and trigrams) for the instance.

  • make_features(self) should combine the output of bow_features and colloc_features into a single Counter, and should store that Counter as a features attribute of the LexEltInstance object.

  • get_feature_labels(self) should return all of the keys of the object’s features attribute.

You are welcome to write additional helper member functions; perhaps having separate functions to find bigrams and trigrams would be helpful here?
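If you want a starting point, here is one hedged sketch of these member functions. It assumes the words and heads attributes shown in Section 1.1 and is certainly not the only reasonable way to write them.

from collections import Counter

def bow_features(self):
    # The window is the whole context, so just count every token
    return Counter(self.words)

def colloc_features(self):
    # Bigrams and trigrams containing a head word, joined with underscores
    feats = Counter()
    for head in self.heads:
        for n in (2, 3):
            for start in range(head - n + 1, head + 1):
                if start >= 0 and start + n <= len(self.words):
                    feats['_'.join(self.words[start:start + n])] += 1
    return feats

def make_features(self):
    # Cache the combined feature counts on the instance
    self.features = self.bow_features() + self.colloc_features()

def get_feature_labels(self):
    return self.features.keys()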

Check-in. Now you should be able to interact with a LexEltInstance like this:

>>> lexelt = train_data['activate.v']

>>> this_instance = lexelt.get('activate.v.bnc.00044852')

>>> this_instance.make_features()

>>> this_instance.features

Counter({'the': 20, 'of': 14, ',': 10, 'to': 7, 'in': 6, 'are': 6, 'and': 5, 'experience': 4, 'system': 4, '.': 4, 'receptors': 4,...})

>>> this_instance.get_feature_labels()

dict_keys(['For', 'neurophysiologists', 'and', 'neuropsychologists', ',', 'the', 'way', 'forward', 'in', 'understanding', 'perception', 'has', 'been', 'to', 'correlate', 'these', 'dimensions', 'of', 'experience', 'with', 'firstly', 'material', 'properties', 'experienced', 'object', 'or', 'event', '(', 'usually', 'regarded', 'as', 'stimulus', ')', 'secondly', 'patterns', 'discharges', 'sensory', 'system', '.', 'Qualitative', 'Aspects', 'Experience', 'The', 'quality', 'modality', 'depends', 'less', 'upon', 'energy', 'reaching', 'nervous', 'than', 'which', 'parts', 'are', 'activated', ':', 'stimulation', 'retinal', 'receptors', 'causes', 'an', 'light', ';', 'inner', 'ear', 'gives', 'rise', 'sound', 'so', 'on', 'Muller', "'s", 'nineteenth', '-', 'century', 'doctrine', 'specific', 'energies', 'formalized', 'ordinary', 'observation', 'that', 'different', 'sense', 'organs', 'sensitive', 'physical', 'world', 'when', 'they', 'stimulated', 'sensations', 'those', 'It', 'was', 'proposed', 'there', 'endings', 'within', 'attuned', 'types', 'example', 'eye', 'respond', 'cochlear', 'vibrations', 'air', 'are_activated', 'activated_:', 'system_are_activated', 'are_activated_:', 'activated_:_stimulation'])


2.2 Storing features in a numpy array

Now that you know what features to extract, you need to store them in a numpy array. To do that, you first need to know the complete set of features that you have (for one lexelt!). The first step, therefore, is to go through each instance of a lexelt and collect the features from each of them.

Write a member function of the LexElt class called get_features(). This function should do the following:

  • Call make_features for all of the LexElt object’s instances

  • Build a list (sorted in lexicographic, i.e., alphabetical order) of all of the features that show up at least once in at least one instance for the LexElt object.

  • Create a numpy array the right size to hold a count for each of the features (columns) for each of the instances (rows).

  • Pass the list of feature names to a new member function of the LexEltInstance class, to_vector, that will return a 1-dimensional numpy array.

  • Store the array from to_vector in the right row of the feature array.

get_features should then return both the list of feature labels and the numpy array with all of the features for all of the instances.

The to_vector(self, feature_list) function in the LexEltInstance class should return a 1-dimensional array with the count of each feature for that instance. The features must be returned in the same order they appear in feature_list, as we want to make sure all of our instances are consistent about which column refers to which feature!
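For example, to_vector can be quite short. The following sketch assumes make_features has already been called on the instance and relies on the fact that a Counter returns 0 for keys it has never seen:

import numpy as np

def to_vector(self, feature_list):
    # One entry per feature label, in exactly the order given by feature_list
    return np.array([self.features[feature] for feature in feature_list])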

Check-in. If everything is working correctly, you should now be able to do the following after running the code from 2.1:

>>> this_instance.to_vector(['and', 'types', 'system_are_activated'])

array([5, 1, 1])

>>> labels, X_train = lexelt.get_features()

>>> labels

['!', '%', "'", "'d", "'ll", "'m", "'re", "'s", "'ve", '(', '(_directly_activated', '(_injunctions_activated', '(_tonically_activating', ')', ')_is_activated', ..., 'zero', 'zipper', 'zoom']

>>> X_train

array([[1., 0., 0., ..., 0., 0., 0.],

[0., 0., 0., ..., 0., 0., 0.],

[0., 0., 0., ..., 0., 0., 0.],

...,

[0., 0., 0., ..., 0., 0., 0.],

[0., 0., 0., ..., 0., 0., 0.],

[0., 0., 1., ..., 0., 0., 0.]])

>>> X_train.shape

(228, 7074)


2.3 Storing answers in a numpy array

The data that we are working with occasionally has instances labeled with multiple correct answers. That’s a challenge to work with, so we’re going to simplify the problem and keep only the first listed answer for each instance. As with the X array you created above, you’re going to need the complete set of possible answers first. Once you have that, you’re going to map each answer to an integer and then store those integers in a one-dimensional numpy array y, such that the first entry in the array y is the target for the first vector in X.
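In miniature, the mapping might look like this (a toy illustration with a handful of invented instances, not the starter code):

import numpy as np

first_answers = ['38201', '38203', '38201', 'U']   # first listed answer per instance
answer_labels = sorted(set(first_answers))         # ['38201', '38203', 'U']
y = np.array([answer_labels.index(a) for a in first_answers])
print(y)                                           # [0 1 0 2]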

Write a get_targets function that returns a (sorted) list of answer labels along with a y array with the correct label for each instance:

>>> answer_labels, y_train = lexelt.get_targets()

>>> answer_labels

['38201', '38202', '38203', 'U']

>>> y_train

array([0, 0, 2, 0, 2, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0,

0, 2, 0, 3, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,

1, 2, 2, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 2, 0, 0,

0, 0, 0, 3, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2,

0, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0,

0, 1, 1, 0, 0, 0, 0, 1])


2.4 Prepping your test data

The two functions you just wrote above (get_features and get_targets) were designed to take new training data and make the X and y arrays needed for training.

Once your classifier is trained, you’ll need to pass test data into the classifier. And we’ll need to convert the test data into feature vectors, too. Fortunately, you’ve written almost all of the code needed to handle the test data. The one important thing is that when you read in the test data, you have to use the feature index that you already made. This is to make sure that if we use 3 as the index to a particular sense in the training data, or 14 as the column for a particular feature in X, we know that these indices will mean exactly the same thing in the test data. For this classifier, we will throw away any feature in the test data that we haven’t seen in the training data. Your classifier doesn’t know any information about how to use that new feature anyway, so there's no harm done if you ignore it. We're also going to keep features and senses from the training data even if they don't appear in the test data at all.

Modify your get_features and get_targets functions to each take an optional labels argument so that they handle unseen data properly. This should be a quick fix:

  • For get_features, you should ignore features that aren't in your passed-in feature labels list. However, you should still retain features in the passed-in labels even if they're not present in any of your instances.

  • In get_targets, you should return -1 in your test array if the target sense wasn't observed in the training data. Otherwise, the index of the sense should be the index of that label in the passed-in labels list (whether or not all of those labels are used).
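As a sketch of this pattern (not the required implementation), get_targets with the optional argument might look roughly like the following, assuming instances are visited in the same order as in get_features:

import numpy as np

def get_targets(self, labels=None):
    # First listed answer for each instance, in a fixed instance order
    first_answers = [self.get(iid).answers[0] for iid in self.keys()]
    if labels is None:
        labels = sorted(set(first_answers))
    # Senses that never appeared in the passed-in labels map to -1
    y = np.array([labels.index(a) if a in labels else -1
                  for a in first_answers])
    return labels, y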

Check-in. When you’re done, you should be able to do the following:

>>> test_fp = open('/cs/cs159/data/senseval3/test/EnglishLS.test', 'r')

>>> test_data = get_data(test_fp)

>>> testkey_fp = open('/cs/cs159/data/senseval3/test/EnglishLS.test.key', 'r')

>>> get_key(testkey_fp, test_data)

>>> test_lexelt = test_data['activate.v']

>>> _, X_test = test_lexelt.get_features(labels)

>>> X_test

array([[0., 0., 0., ..., 0., 0., 0.],

[0., 0., 0., ..., 0., 0., 0.],

[1., 0., 0., ..., 0., 0., 0.],

...,

[0., 0., 0., ..., 0., 0., 0.],

[0., 0., 0., ..., 0., 0., 0.],

[0., 0., 3., ..., 0., 0., 0.]])

>>> _, y_test = test_lexelt.get_targets(answer_labels)

>>> y_test

array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1,

0, 0, 1, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, -1, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,

1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0])



3. How supervised decision list classifiers work

Supervised decision list classifiers are trained to recognize the features that are high-quality indicators of a particular sense. Once trained, the classifier can be used to label unknown senses in test data.

3.1 Making decisions

At this point, you should have already extracted the features from training data and stored them in X, and extracted the targets and stored them in y as described above.

First, let’s look at a simplified version of the problem where you are trying to decide between only two senses. When a classifier only has to decide between two choices, it is called a binary classifier. In a binary classifier, you need to decide whether an unknown head word should be labeled as sense_1 or as sense_2.


For each feature f, which could be a word in the bag of words or a collocation, you compute the log-odds ratio of the probabilities of each sense given the feature to see how good that feature is at distinguishing between the two senses:

$$\text{score}(f) = \log_2\left(\frac{P(\text{sense}_1 \mid f)}{P(\text{sense}_2 \mid f)}\right)$$

A high magnitude score (e.g. +10 or -10 vs +.02 or -.02) for feature f indicates that the feature is good at distinguishing the senses. A low magnitude score indicates that the feature is not good at distinguishing the senses. A positive score means that sense_1 is more likely; a negative score means that sense_2 is more likely.

Probabilities can be estimated using maximum likelihood estimation, which estimates probabilities directly from counts of observed outcomes. In this case, we're going to sum over the number of times we see each feature in the context of a given sense:

$$P(\text{sense}_i \mid f) = \frac{\text{count}(f, \text{sense}_i)}{\text{count}(f)}$$

Note that these counts are over instances of features, not instances of senses! In other words, when we count features present in a particular sense, we should think of our count as looping over feature tokens for f instead of looping over instances. If a feature is present twice in a given sense_1 instance, that will add 2 to our count in the numerator (one for each time that feature was present in that sense).

Since you’re going to divide the two probabilities by one another and their denominators are the same count of how many times feature f appeared, the denominators cancel each other out, leaving us with:

$$\text{score}(f) = \log_2\left(\frac{\text{count}(f, \text{sense}_1)}{\text{count}(f, \text{sense}_2)}\right)$$

This is the base intuition for our scoring function: the ratio of the count of one sense to the count of another. However, since we may have more than two senses and since features may be absent, we're going to think back to Lab 3 and modify this a bit to make it effective for our situation.

Smoothing

In some cases, the numerator or denominator might be zero, so you will use Laplace smoothing (it’s easy!). But instead of +1 smoothing, you will use +α smoothing, adding α=0.1 to both the numerator and the denominator in order to avoid division-by-zero errors and to avoid taking the log of 0.

Classification with more than two senses

In the lexical sample data you have, you can’t directly use the formulation above because you don’t have a binary classification: you have to decide which sense label to assign when there might be many more than 2 choices. Instead, you use the following modification for each sense_i:

$$\text{score}(f, \text{sense}_i) = \log_2\left(\frac{P(\text{sense}_i \mid f)}{P(\lnot\,\text{sense}_i \mid f)}\right)$$

Like before, high positive values will be strong indicators of sense_i. Low positive values will be weak indicators of sense_i. Negative values will be indicators that it is not sense_i, but since we don’t classify words as not being a particular sense, your decision list classifier won’t make use of negative scores.

Similar to the binary case above, you can rewrite the ratio of probabilities as a ratio of counts, where the denominator sums the counts over all of the other senses:

$$\text{score}(f, \text{sense}_i) = \log_2\left(\frac{\text{count}(f, \text{sense}_i)}{\sum_{j \neq i}\text{count}(f, \text{sense}_j)}\right)$$

Finally, you will use Laplace smoothing to help handle possible zeros:

$$\text{score}(f, \text{sense}_i) = \log_2\left(\frac{\text{count}(f, \text{sense}_i) + \alpha}{\sum_{j \neq i}\text{count}(f, \text{sense}_j) + \alpha}\right)$$

4. Writing the decision list classifier

You will fill in the implementation of a DecisionList class in DecisionList.py that conforms to the scikit-learn API.

To match scikit-learn, the class has two methods: fit(X,y), which will be used to train the classifier, and predict(X) which will be used to predict senses in the test data. The constructor (the __init__ method) for the DecisionList will take three optional parameters:

  • default_target gives the fall-back class the classifier will predict if it doesn’t have any information matching an instance,

  • alpha determines the value to be used for Laplace smoothing. Set alpha to 0.1 if it is not specified.

  • min_score will be used later to say something about how confident the classifier should be about a rule it learns before it applies that rule. Set min_score to 0 if it is not specified.

4.1 Training the decision list classifier

Your fit(X,y) method will take two parameters: X, the numpy array of feature vectors for the training data, and y, the numpy array of targets from the training data.

To train a decision list from the features (of a single lexelt!), you will first calculate a score for each (feature f, sense i) pair. For some senses, a particular feature might be a good indicator; but for another sense of the same word, it will not be a good indicator. Sort these pairs so that the (feature f, sense i) pair with the highest positive score is first. Throw away any scores that are less than min_score. Save the resulting list of pairs as a rules attribute of the DecisionList object.

To make this code easier to read (and debug), write a helper member function get_score(self, feature, label, X, y) that takes a feature, a label, a feature array X, and a label array y, and returns a score for that combination of feature and label. You should use the Laplace smoothed score from above in this function.
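Here is one way get_score might be structured, shown as a hedged sketch: it assumes the constructor stored alpha as self.alpha, and it uses a base-2 logarithm, which is consistent with the check-in values shown below.

import numpy as np

def get_score(self, feature, label, X, y):
    # Feature-token counts in instances with this label vs. all other labels
    count_label = X[y == label, feature].sum()
    count_other = X[y != label, feature].sum()
    # +alpha smoothing keeps the ratio finite and the log defined
    return np.log2((count_label + self.alpha) / (count_other + self.alpha))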

The fit method will fit your model without returning anything. All of the state of the decision list will be stored in the object.

Check-in. Verify that your fit function is working using the data from activate.v you loaded in Section 2:

>>> from DecisionList import DecisionList

>>> d = DecisionList(alpha=0.1, min_score=5, default_target=5)

>>> d.get_score(0, 0, X_train, y_train)

2.471305718925589

>>> d.get_score(0, 1, X_train, y_train)

-6.149747119504682

>>> d.get_score(42, 0, X_train, y_train)

3.4594316186372978

>>> d.fit(X_train, y_train) # may take several seconds, but shouldn't take minutes

>>> d.rules[:10]

[(7067, 0), (4077, 0), (4047, 0), (1218, 0), (878, 1), (4482, 0), (3326, 0), (5218, 0), (1800, 0), (4980, 0)]

If your code is running slowly, it may be because your fit function is doing something inefficient, e.g. sorting or repeatedly searching long lists. Although the rules should be stored as a list, it's probably a good idea to use data structures that are fast to navigate (e.g. dictionaries or fixed-size numpy arrays) to store intermediate information while you're aggregating your feature counts.

If your values don't match, make sure to check through Section 3 to confirm you're following the instructions given for how to compute the final score function. In particular, make sure your sums include the count of each feature (that is, the string "I thought I tried that" includes two instances of the feature "I", each of which would add 1 to the count of that feature for the identified sense of try.v).

4.2 Using your decision list on test data

Now that your classifier is trained, you’re ready to use it to predict labels for new instances.

Your predict(X) method will take an array of feature vectors representing the test data.

For each instance in the test data (a single row in the matrix X), you walk down the .rules decision list until you find a matching (feature f, sense i) pair. Since the decision list is sorted from most predictive to least predictive, you will end up classifying your instance based on the presence of the best (highest-scoring) matching feature. As you walk down the decision list, if an instance contains feature f, you will label it as sense i. If it does not, you will proceed to the next (feature f, sense i) pair. Continue this process until you have exhausted all features in your decision list.

If none of the features in the decision list match, the classifier still needs to label the instance with some default classification. We will pass that value to the DecisionList constructor through the default_target parameter. Since we have already seen that the most frequent sense is a good baseline classifier, we will use the MFS for each word as the default sense when we build the DecisionList classifier. (Once again, we're making a different classifier for each lexelt, so you should only be passing in the single MFS for that lexelt, based on the tie-breaking scheme you used earlier.)

Your predict(X) method will return a one-dimensional numpy array containing the classification. The resulting numpy array should have the same format as the array y that you passed into the fit method.

To break the work of predict down into smaller (more testable!) pieces, write a helper method predict_one that takes a feature vector for a single instance as input, and returns a predicted label for that instance. Then predict can call predict_one on each row of the X array:

>>> d.predict_one(X_test[0])

0

>>> d.predict_one(X_test[101])

1

>>> d.predict(X_test)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
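If it helps to see the shape of the prediction step, here is a hedged sketch of predict_one and predict. It assumes rules holds (feature, sense) pairs sorted by score, as in Section 4.1, and that the constructor stored default_target as self.default_target:

import numpy as np

def predict_one(self, vector):
    # Walk the sorted rules; the first rule whose feature is present decides the sense
    for feature, sense in self.rules:
        if vector[feature] > 0:
            return sense
    # No rule matched, so fall back to the default (the MFS for this lexelt)
    return self.default_target

def predict(self, X):
    return np.array([self.predict_one(row) for row in X])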


A final reminder: you need to create a separate decision list for each lexelt (e.g. one decision list for bank.n, another for activate.v, etc.)

4.3 DecisionList Questions

Continue to put your answers to these questions in analysis.md.

  1. Implement the decision list described above, apply it to the test data, and report your accuracy across all instances of all lexelts. (The DecisionList.py main function already has code to help you do this!) Does your decision list outperform the MFS baseline?

  2. Examine the rules in the decision list for activate.v. What problems do you notice in the rules you have generated? Describe and comment on several examples.

  3. Choose two (or more) of the following and report your results using clear data presentation (i.e., tables and plots with a text summary of what they show). Provide some discussion: were you surprised by the result? Why do you think you got that result?

A. Use tf-idf to downweight common features. Assuming your training matrix is called X_train and your test matrix is called X_test:

>>> from sklearn.feature_extraction.text import TfidfTransformer

>>> transformer = TfidfTransformer(smooth_idf=False)

>>> transformer.fit(X_train)

>>> X_train = transformer.transform(X_train).toarray()

>>> X_test = transformer.transform(X_test).toarray()

Comment on the impact of the transformation on your results. Does it help? Why do you think that is?

B. How does your choice of min_score affect your results? Try a range of possible values, and report trends that you find.

C. What if you used another classifier instead of your decision list classifier? If you followed the rules of the scikit-learn API, this question is a piece of cake! This example uses the GaussianNB classifier but you can try any classifier in scikit-learn instead. To do so, simply import the classifier you want (e.g. from sklearn.naive_bayes import GaussianNB). Now find the line where you created your DecisionList object. It probably looks something like this in your code:

my_classifier = DecisionList(alpha=alpha, default_target=mfs)

Simply replace that line with one of the classifiers from scikit-learn:
my_classifier = GaussianNB()

You shouldn’t have to change anything else! Your code will now run using the classifier you chose from scikit-learn. Play around and let us know what you discover!

D. Is there a correlation between the score a rule has and the accuracy of the prediction? Do high scoring rules work well? Do low scoring rules work equally well? A nice visualization could show a bar chart of the accuracy of the rules (y-axis) plotted against the score of the rule (x-axis). You’re welcome to come up with whatever you’d like here – we’re interested to see what you find!