Lab 6: POS Tagging

Starter Code

Get the starter code on GitHub Classroom.

Introduction

This week, you’ll spend some time working with a Hidden Markov Model Part of Speech tagger (an HMM POS tagger if you like your TLAs, or Three Letter Acronyms). Rather than implementing it from scratch, you’ll make a series of modifications to the tagger. Along the way, you’ll analyze the impact of different configuration settings on the tagger’s performance.

As usual, you'll write the answers to the questions that appear periodically throughout the lab in analysis.md, which you'll render as a PDF using pandoc.

As before, spaCy isn't working on knuth, so if you use knuth instead of Docker to complete labs, you might run into some issues here. As with Lab 3, we'll use a virtualenv to get around this. You can activate it by using the command

source /cs/cs159/tmp/python-virtualenvs/nlp/bin/activate

At this point, you should see (nlp) in your terminal prompt, which shows you're using the virtualenv. When you're done, you can turn the virtualenv off again with the command:

deactivate

Understanding the Starter Code

The starter code includes three files:

  • HmmTagger.py defines a class HMMTagger that implements a hidden Markov model (HMM) part of speech tagger. You can find most of the Viterbi implementation in the predict function, with one piece missing for you to fill in soon.

  • evaluate.py runs a part of speech tagger (like an HMMTagger) on a directory of data and calculates the tagger’s accuracy.

  • read_tags.py contains helper functions for loading directories full of texts labeled for part of speech.

The HMMTagger Class

The two main externally-facing functions of an HMMTagger object are train and predict. You can use an HMMTagger like this:

import spacy
from HmmTagger import HMMTagger

nlp = spacy.load("en_core_web_sm")

train_dir = "/cs/cs159/data/pos/wsj/train"
tagger = HMMTagger(nlp, alpha=0.1)
tagger.train(train_dir)

test_sentence = nlp("This is test input to the Part of Speech Tagger.")
tagger.predict(test_sentence)
print([token.tag_ for token in test_sentence])


To be consistent with the spaCy interface for tagger objects, you will also be able to access predict by calling an HMMTagger with a sequence of tokens:

nlp = spacy.load("en_core_web_sm")

train_dir = "/cs/cs159/data/pos/wsj/train"
tagger = HMMTagger(nlp, alpha=0.1, vocab_size=20000)
tagger.train(train_dir)

test_sentence = nlp("This is test input to the Part of Speech Tagger.")
tagger(test_sentence)
print([token.tag_ for token in test_sentence])


The HmmTagger.py file can also be run as a script from the command line. When you do that, it has the following interface:

usage: HmmTagger.py [-h] --dir DIR --output FILE [--alpha ALPHA]

Train (and save) hmm models for POS tagging

optional arguments:
  -h, --help            show this help message and exit
  --dir DIR, -d DIR     Read training data from DIR
  --output FILE, -o FILE
                        Save output to FILE
  --alpha ALPHA, -a ALPHA
                        Alpha value for add-alpha smoothing


Running HmmTagger.py will train an HMMTagger object on all of the files in dir, then save the (binary) model to output in pickled form:

python3 HmmTagger.py --dir /cs/cs159/data/pos/wsj/train --output model.pkl
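
If you want to poke at a saved model interactively, you can load the pickled file back in. A minimal sketch (independent of however evaluate.py loads models; it assumes the model.pkl produced by the command above):

import pickle

# Load a previously trained HMMTagger back from disk.
with open("model.pkl", "rb") as f:
    tagger = pickle.load(f)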

Setting up HmmTagger.py

You will want to familiarize yourself with the HMMTagger class this week so you can modify it. If you open HmmTagger.py, you will see that seven of the member functions have TODOs for their docstrings. Start by adding docstrings to each of these functions to clarify for yourself what each one is doing. You may also need to look at some of the earlier functions to make sense of them.

In addition, you'll see two matrices populated in the train function. One is the transition matrix, which describes the probability of moving from one hidden state to another. The other is the emission matrix, which describes the probability of a particular observation given a specific hidden state. (If this feels fuzzy, Sections 8.4.4-8.4.6 of the textbook would be good to review and keep open for this lab.) You should replace the TODO comment with information on which matrix is which.
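
To make the distinction concrete, here is a toy illustration (these are not the starter code's actual matrices, tags, or vocabulary; it just shows what each matrix encodes):

import numpy as np

# Toy example: two tags and a two-word vocabulary.
tags = ["DT", "NN"]
vocab = ["the", "dog"]

# transitions[i, j] = P(next tag is tags[j] | current tag is tags[i])
transitions = np.array([[0.1, 0.9],   # from DT: P(DT), P(NN)
                        [0.6, 0.4]])  # from NN: P(DT), P(NN)

# emissions[i, k] = P(observe vocab[k] | hidden state is tags[i])
emissions = np.array([[0.95, 0.05],   # DT emits "the" vs. "dog"
                      [0.05, 0.95]])  # NN emits "the" vs. "dog"

# Each row is a probability distribution, so it sums to 1.
assert np.allclose(transitions.sum(axis=1), 1.0)
assert np.allclose(emissions.sum(axis=1), 1.0)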

Next, we're going to fill in one missing piece of code in the Viterbi algorithm: there's a TODO item in the middle of predict where you'll write the code that builds the costs matrix, which holds the log probabilities associated with transitions to each possible tag. You'll first see code that sets up the transition probabilities at the start of the sequence, when no tags or tokens have been observed yet. You're going to focus on the case where the sequence has already started.

In the textbook, the probability of transitioning to state j at time t (that is, seeing tag j for the tth element of the sequence) is computed as

$$v_t(j) = \max_{i=1}^{N} v_{t-1}(i) \, a_{ij} \, b_j(o_t)$$

where i ranges over each possible choice of preceding tag at timestep t-1 (with N total tags in the tag set), v_{t-1}(i) represents the probability of that tag in the previous timestep, a_{ij} represents the probability of transitioning from tag i to tag j, and b_j(o_t) represents the probability that tag j would emit the observed token o_t seen at time t.

As usual, we'd prefer to work in log space, so we're going to populate the matrix costs with the specific log probability for each transition from i to j. (The code that follows will take care of computing the max and populating a table of "breadcrumbs" to help us find our solution.) Each entry is therefore going to have the following value:

$$\mathrm{costs}[i, j] = \log v_{t-1}(i) + \log a_{ij} + \log b_j(o_t)$$

Your goal is to translate the expression above into code that populates costs for every possible pair of tags: costs[i, j] should be the log probability of transitioning from state i to state j and emitting the current token. You should avoid using for loops in this part: instead, use numpy operations to compute the values for all of the tags at once.

A few hints: first, all the existing code stores log probabilities instead of raw probabilities, so you shouldn't need to call the log function. Additionally, it's worth remembering that numpy can "broadcast" operations, so you can e.g. add a vector to every row:

>>> import numpy
>>> a = numpy.ones((3, 4))
>>> b = numpy.random.random(4)
>>> b
array([0.36427943, 0.49914536, 0.1735815 , 0.35321266])
>>> a + b
array([[1.36427943, 1.49914536, 1.1735815 , 1.35321266],
       [1.36427943, 1.49914536, 1.1735815 , 1.35321266],
       [1.36427943, 1.49914536, 1.1735815 , 1.35321266]])

You might find that you want to add something to each column instead of each row, in which case, a quick solution may be to transpose the matrix. Finally, keep in mind that your matrix is square (number of tags x number of tags), so it's possible it won't throw errors if you swap your rows and columns. It might be good to use a Python interpreter to play around with rectangular matrices to make sure you're doing operations the way you want to. When you think you're ready, you can test that your code works by submitting to Gradescope.
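
To make the transpose hint concrete, here is a generic numpy sketch (unrelated to the tagger's actual variable names) of adding a length-3 vector "down the columns" of a 3x4 matrix, so that one value gets added to every entry in each row:

import numpy as np

a = np.ones((3, 4))
c = np.array([10.0, 20.0, 30.0])  # one value per row

# a + c would fail: shapes (3, 4) and (3,) don't line up on the last axis.
# Transposing first makes the shapes (4, 3) and (3,), which broadcast fine;
# transposing back restores the original orientation.
result = (a.T + c).T

# Equivalently, reshape c into a column vector so it broadcasts across columns.
same = a + c[:, np.newaxis]
assert np.allclose(result, same)
print(result)
# [[11. 11. 11. 11.]
#  [21. 21. 21. 21.]
#  [31. 31. 31. 31.]]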

Exploring Tag Set Collapsing

Right now, our code is set up to use the Penn Treebank (PTB) tag set. We're going to need to make some modifications to our tagger so it can use the Universal Tag Set, which is a much smaller tag set than PTB. In this section of the lab, you will analyze the impact of tag set granularity on tagging performance.

To start, open up read_tags.py. At the top of the file, there is a dictionary called universal_to_ptb. The keys of this dictionary are Universal tags, while the values are tuples of PTB tags that correspond to that tag. For instance, the tag "VERB" in the Universal Tag Set encompasses the tag "MD" (for modal verbs) as well as a number of tags starting with V representing different verb forms (e.g. non-3rd person, gerunds, etc.). You can refer to our textbook reading for a reference of PTB tags and the Universal Dependencies page for Universal POS tags.

  • Add code to create a dictionary ptb_to_universal (see the sketch after this list). This dictionary will have a key for each PTB tag, whose value will be the corresponding Universal tag. Values in this dictionary should just be strings, not lists of strings.

  • Using this dictionary, modify the parse_file function to convert tags from PTB tags to Universal tags if do_universal is True.
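
One possible way to build the inverted dictionary from the first bullet (a sketch: the universal_to_ptb shown here is only a toy excerpt, and your code should use the full dictionary already defined in read_tags.py):

# Toy excerpt of the mapping at the top of read_tags.py (the real dictionary
# covers every Universal tag).
universal_to_ptb = {
    "VERB": ("MD", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"),
}

# Invert it: each PTB tag maps to the single Universal tag it belongs to.
ptb_to_universal = {
    ptb_tag: universal_tag
    for universal_tag, ptb_tags in universal_to_ptb.items()
    for ptb_tag in ptb_tags
}

assert ptb_to_universal["MD"] == "VERB"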

Now, we'll test out several different configurations of how we use our tag set.

  1. First, train an HMMTagger on the Brown Corpus data set (/cs/cs159/data/pos/brown) and test it on the Wall Street Journal (wsj) training data set (/cs/cs159/data/pos/wsj/train) using the Penn Treebank tags that are recorded in the data. Record your accuracy in a table in analysis.md. This will take a few minutes.

  2. Next, repeat the process in Step 1, but convert all of the tags from PTB to the Universal Tag Set before training/testing. To do that:

    • Add a new --universal flag to HmmTagger.py's argument parser. You should keep the existing default behavior when this flag isn't used, so have universal default to False by giving it action="store_true". This way, calling the code with --universal will enable the conversion, and leaving the flag off will default to the Penn Treebank tags.

    • Add a new named parameter do_universal to the HMMTagger class’s initializer method. do_universal should default to False. Inside the HMMTagger object, save the new parameter as a data member called self.do_universal.

    • Ensure that the self.tags list initialized in HMMTagger contains the Universal tag set instead of the PTB tag set when do_universal is True.

    • Update the main() function in HmmTagger.py to pass the new command-line argument args.universal to the HMMTagger initializer.

    • Modify the train() function in HmmTagger.py to pass the value of self.do_universal to the parse_dir function. This will use the code you wrote for parse_file.

    • Add a universal flag to evaluate.py just like you did for HmmTagger.py.

    • Modify the main function of evaluate.py to pass the value of args.universal to parse_dir, just like you did for the training method of HMMTagger.

At the end of this process, report out the same information as you did for Step 1 in analysis.md.

  3. Finally, try one more configuration. This time, you'll train on the full PTB tag set, but at evaluation time, you'll map all of the tags to the Universal tag set before evaluating them. To do that:

    • Modify the main function of evaluate.py so that after it calls the tagger, but before it compares the tags, it changes each token's tag_ attribute to the corresponding Universal Tag Set tag (see the sketch below). If you leverage what you have imported from read_tags.py, this should only require a couple of lines of code. Report your evaluation results from this configuration as well.
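
Here is a hypothetical helper showing the shape of that remapping (the name to_universal_tags and its arguments are assumptions, not part of the starter code; you may prefer to inline the loop in main):

def to_universal_tags(doc, ptb_to_universal):
    """Overwrite each token's PTB tag_ with its Universal equivalent."""
    for token in doc:
        token.tag_ = ptb_to_universal[token.tag_]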

In analysis.md, describe how the performance of your tagger changed for each of the three configurations. What do the results say about tag granularity? Are the results surprising to you? Why or why not?

We might wonder if the accuracy is affected by a mismatch in the style of the data: maybe the Brown corpus is just very different from the WSJ corpus! Repeat the above three experiments, but using /cs/cs159/data/pos/wsj/train as your training data and /cs/cs159/data/pos/wsj/test as your testing data.

In analysis.md, comment on whether the results are consistent after the change in dataset. What, if anything, do you conclude from your results?

For the remainder of the lab, you should train using the full PTB Tag Set and test using the Universal Tag Set.

Exploring Vocabulary Size Effects

The starter HMM model adds all of the words in the training set to its vocabulary. For the brown data set, that means it has a vocabulary of 47,703 words, plus the <<OOV>> or "Out Of Vocabulary" token. (Replacing tokens that aren't in the vocabulary of a corpus with a placeholder is a common and useful convention in NLP!)
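
As a toy illustration of that convention (the variable names here are made up, not the starter code's):

vocab = {"the", "dog", "barked"}
tokens = ["the", "cat", "barked"]

# Any token we never saw in training gets collapsed to the <<OOV>> placeholder.
replaced = [tok if tok in vocab else "<<OOV>>" for tok in tokens]
assert replaced == ["the", "<<OOV>>", "barked"]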

In this section of the lab, you will explore some of the time, space, and performance trade-offs that come from varying the size of the vocabulary.

First, modify the HmmTagger.py interface so that it can take a vocabulary size as a command-line argument. To do that, you should:

  • Add a --vocabsize, -v argument to the HmmTagger.py interface. The default value should be None, which will correspond to keeping all of the words in the vocabulary (in other words, the default behavior will be the same as the starter code behavior).

  • Modify update_vocab so that it only keeps the vocabsize most frequent words in the vocabulary, or all words if vocabsize is None (see the sketch after this list).
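
A minimal sketch of the idea behind that change (the function and argument names here are hypothetical; adapt it to however the starter code's update_vocab stores its counts):

from collections import Counter

def most_frequent_words(word_counts, vocabsize=None):
    """Keep the vocabsize most frequent words, or all of them if vocabsize is None."""
    counts = Counter(word_counts)
    if vocabsize is None:
        return set(counts)
    return {word for word, _ in counts.most_common(vocabsize)}

# most_frequent_words({"the": 50, "dog": 3, "cat": 2}, vocabsize=2) -> {"the", "dog"}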

Next, build six models. All of them should be trained on the Brown corpus using the PTB tag set. They should vary in their vocabulary size: 1000, 2000, 5000, 10000, 20000, or 50000. Note that since 50000 is greater than the number of different word types in the Brown corpus, it should keep all of the tokens in the vocabulary.

In your analysis.md, add a table that reports the following for each of the six models:

  • How long it takes to train the model on /cs/cs159/data/pos/brown (in seconds)

  • How long it takes to test the model on the /cs/cs159/data/pos/wsj data (in seconds)

  • How big the model is (in kilobytes or megabytes, as appropriate), which you can find using ls -lh

  • The model’s accuracy

Note: The command-line time command can be used to time how long another program takes to run. For example, to see how long it takes to build model.pkl (note that $ is the prompt symbol, and the typed command starts at time):

$ time python3 HmmTagger.py -d /cs/cs159/data/pos/brown -o model.pkl

real    1m13.679s
user    1m12.865s
sys     0m0.487s

This output says that it took 1:13 in real (clock) time, 1:12 in processor time, and 0.487 seconds of system time to run HmmTagger.py on my laptop. You should report the user time (or processor time) for this lab.

Hint: To repeat the same command on several values, you can use a bash for loop: for <var> in <sequence>; do <command>; done. For example, if I have files named file1.txt, file2.txt, file3.txt, and file4.txt in my directory, but meant them to be saved as Markdown files instead, I could run:

$ for i in 1 2 3 4; do mv file${i}.txt file${i}.md; done

…to rename the files to file1.md, file2.md, file3.md, and file4.md.

After you build your table, comment on the patterns you observe in analysis.md.

Exploring Document Size Effects

In this last section, you will explore whether the length of test documents affects the POS tagger’s performance.

For this part, you should train on the Brown data with a vocabulary size of 20,000. You should test on the WSJ data. Train with the PTB tag set and evaluate with the Universal set.

Update evaluate.py so that it generates a scatter plot of document size (in tokens, on the x-axis) against tagger accuracy (as a percent, on the y-axis). To do that, you should:

  • Update evaluate.py’s argparse interface to take a new argument --output, -o that will take the name of a file to write an image to. This argument should default to None so that if it’s not provided, the behavior of the script will stay the same as it was in previous steps.

  • Update main() so that if args.output is not None, it generates a scatter plot saved as a .png file to that path, with document size on the x-axis and accuracy on the y-axis (see the sketch after this list).
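
A minimal sketch of the plotting piece (it assumes you have already collected parallel lists of per-document token counts and accuracies; the names here are hypothetical):

import matplotlib.pyplot as plt

def save_scatter(doc_sizes, accuracies, output_file):
    """Plot per-document accuracy against document size and save it to an image file."""
    plt.scatter(doc_sizes, accuracies)
    plt.xlabel("Document size (tokens)")
    plt.ylabel("Accuracy (%)")
    plt.savefig(output_file)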

Note: To get a point for every file instead of all of the data together, you won’t be able to use parse_dir directly. Instead, look at the parse_dir code in read_tags.py for a model of some of the code you’ll want to add to your main function.

Include your plot in analysis.md. Comment on any patterns you notice, and discuss why you think you might see those patterns.

EXTRA CREDIT: Exploring "Gender" Effects on POS Tag Results

One of the readings this week was this paper from Garimella et al., which examines the relationship between author gender and POS tagger performance. For extra credit this week, you can also do a reproducibility study to see if you see the same patterns in our HMMTagger that Garimella et al. reported in their paper.

Continue to use a vocabulary size of 20,000, training on the PTB tag set and testing on the Universal tag set. You will try a variety of training data sets. You will test on WSJ, since that's the data set for which we have author gender information (automatically predicted from the authors' first names).

The /cs/cs159/data/pos/wsj/train and /cs/cs159/data/pos/wsj/test directories are both split into two subdirectories, female/ and male/, corresponding to the article labels from Garimella et al.’s data release.

Note: As noted in our discussion this week, these are not actual self-reported gender labels, but were instead gathered using a heuristic applied to the authors' first names. Though there are many reasons we'd expect this heuristic to have high accuracy on this particular dataset and others, I generally consider it to be a problematic practice in terms of the types of bias it can produce no matter what. Nonetheless, it still shows up from time to time, and I'm certainly guilty of having used this heuristic as a grad student; my paper now has a fairly extensive note on the front explaining the many ways this is messed up, if you want to learn more.

Set the action of the argparse dir argument to "append". That way, you’ll be able to use the -d (or --dir) argument more than once to list all of the directories you want to compare.

Update main to loop through all of the directories passed through --dir, processing them each in turn. Your script should report the accuracy for each category in addition to reporting the overall accuracy.
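
A small standalone sketch of how action="append" behaves (the list passed to parse_args just simulates command-line arguments; your evaluate.py will parse the real ones):

import argparse

parser = argparse.ArgumentParser(description="Evaluate a POS tagger")
parser.add_argument("--dir", "-d", action="append", required=True,
                    help="Read test data from DIR (may be given more than once)")

# Simulates: evaluate.py -d wsj/test/female -d wsj/test/male
args = parser.parse_args(["-d", "wsj/test/female", "-d", "wsj/test/male"])
print(args.dir)  # ['wsj/test/female', 'wsj/test/male']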

Experiments

Design an experiment using at least four unique configurations drawn from the following train and test sets.

  • Train on wsj/train/male, wsj/train/female, wsj/train, or brown/

  • Test on wsj/test/male, wsj/test/female, or wsj/test

First, write down your hypothesis for what you expect to see when you compare your runs. Then, generate a table to report accuracy for each pair you test.

In analysis.md, comment on patterns you see. Are they consistent with the findings of Garimella et al.? How do you think the imbalance in data set size in our data affects the results you’re seeing? What other effects do you think could muddy the results you are seeing?