Lab 6: POS Tagging

Starter Code

Get the starter code on GitHub Classroom.

Introduction

This week, you’ll spend some time working with a Hidden Markov Model Part of Speech tagger (an HMM POS tagger if you like your TLAs, or Three Letter Acronyms). Rather than implementing it from scratch, you’ll make a series of modifications to the tagger. Along the way, you’ll analyze the impact of different configuration settings on the tagger’s performance.

As usual, you'll be writing the answers to periodic questions in the lab writeup in analysis.md, which you'll render as a PDF using pandoc. If you don't have pandoc running on your computer, you can use Docker to manage this.

Understanding the Starter Code

The starter code includes three files:

  • HmmTagger.py defines a class HMMTagger that implements a hidden Markov model (HMM) part of speech tagger. You can find the Viterbi implementation in the predict function.

  • evaluate.py runs a part of speech tagger (like an HMMTagger) on a directory of data and calculates the tagger’s accuracy.

  • read_tags.py contains helper functions for loading directories full of texts labeled for part of speech.

The HMMTagger Class

The two main externally-facing functions of an HMMTagger object are train and predict. You can use an HMMTagger like this:

import spacy
from HmmTagger import HMMTagger

nlp = spacy.load("en_core_web_sm")

train_dir = "/cs/cs159/data/pos/wsj/train"
tagger = HMMTagger(nlp, alpha=0.1)
tagger.train(train_dir)

test_sentence = nlp("This is test input to the Part of Speech Tagger.")
tagger.predict(test_sentence)
print([token.tag_ for token in test_sentence])


To be consistent with the spacy interface for tagger objects, you can also invoke predict by calling an HMMTagger object directly on a sequence of tokens:

nlp = spacy.load("en_core_web_sm")

train_dir = "/cs/cs159/data/pos/wsj/train"
tagger = HMMTagger(nlp, alpha=0.1, vocab_size=20000)
tagger.train(train_dir)

test_sentence = nlp("This is test input to the Part of Speech Tagger.")
tagger(test_sentence)
print([token.tag_ for token in test_sentence])
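
Under the hood, this calling convention is usually provided by a __call__ method that simply delegates to predict. A minimal sketch of what such a method can look like inside the HMMTagger class (not necessarily the starter code's exact implementation):

def __call__(self, doc):
    # Tag doc in place by delegating to predict, mirroring the way spacy
    # applies pipeline components to a Doc.
    return self.predict(doc)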


The HmmTagger.py file can also be run as a script from the command line. When you do that, it has the following interface:

usage: HmmTagger.py [-h] --dir DIR --output FILE [--alpha ALPHA]

Train (and save) hmm models for POS tagging

optional arguments:
  -h, --help            show this help message and exit
  --dir DIR, -d DIR     Read training data from DIR
  --output FILE, -o FILE
                        Save output to FILE
  --alpha ALPHA, -a ALPHA
                        Alpha value for add-alpha smoothing

Running HmmTagger.py will train an HMMTagger object on all of the files in dir, then save the (binary) model to output in pickled form:

python3 HmmTagger.py --dir /cs/cs159/data/pos/wsj/train --output model.pkl
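
Once saved, the model can be loaded back into memory with the standard pickle module. A minimal sketch (evaluate.py may already handle this step for you):

import pickle

with open("model.pkl", "rb") as f:
    tagger = pickle.load(f)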


Document HMMTagger.py!

You will not write an HMM tagger this week, but you will still want to familiarize yourself with how it works so you can modify it. To start, take the time to carefully document the class. Every member function should have a doc string. The following functions should have block- or line-level comments that show you understand the computations that are being made:

  • do_train_sent()

  • train()

  • normalize_probabilities() (and, by extension, normalize())

  • predict()

  • backtrace()

When in doubt, err on the side of over-commenting, since we want to make sure that you understand all of the little details that go into the tagger.
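
As a rough example of the level of detail we're looking for, a docstring for predict might read something like the following. The parameter name doc is an assumption; adapt the wording to what the code actually does.

def predict(self, doc):
    """Assign a part-of-speech tag to every token in doc.

    Runs the Viterbi algorithm over the trained transition and emission
    probabilities, then writes the best-scoring tag sequence back onto
    each token's tag_ attribute (doc is modified in place).
    """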

Exploring Tag Set Collapsing

Right now, our code is set up to use the Penn Treebank (PTB) tag set. We're going to need to make some modifications to our tagger so it can use the Universal Tag Set, which is a much smaller tag set than PTB. In this section of the lab, you will analyze the impact of tag set granularity on tagging performance.

To start, open up read_tags.py. At the top of the file, there is a dictionary called universal_to_ptb. The keys of this dictionary are Universal Tags, while the values are tuples of PTB tags that correspond to that tag. For instance, the tag "VERB" in the Universal Tag Set encompasses the tags "MD" (for modal verbs) as well as a number of tags starting with a V representing different verb forms (e.g. non-3rd person, gerunds, etc). You can refer to our textbook reading for a reference of PTB tags and the Universal Dependencies page for Universal POS tags.

  • Add code to create a dictionary ptb_to_universal. This dictionary will have a key for each PTB tag, whose value will be the corresponding Universal tag. Values in this dictionary should just be strings, not tuples or lists of strings. (A sketch of this inversion appears after this list.)

  • Using this dictionary, modify the parse_file function to convert tags from PTB tags to Universal tags if do_universal is True.
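
One way to build the inverted dictionary is to loop over the one that already exists. A minimal sketch, assuming universal_to_ptb is structured as described above:

ptb_to_universal = {}
for universal_tag, ptb_tags in universal_to_ptb.items():
    for ptb_tag in ptb_tags:
        # Each PTB tag corresponds to exactly one Universal tag.
        ptb_to_universal[ptb_tag] = universal_tag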


Now, we'll test out several different configurations of how we use our tag set.

  1. First, train an HMMTagger on the Brown corpus (brown) data set (/cs/cs159/data/pos/brown) and test it on the Wall Street Journal (wsj) training data set (/cs/cs159/data/pos/wsj/train) using the Penn Treebank tags that are recorded in the data. Record your accuracy in a table in analysis.md. This will take a few minutes.

  2. Next, repeat the process in Step 1, but convert all of the tags from PTB to the Universal Tag Set before training/testing. To do that:

    • Add a new flag universal to the interface for the HmmTagger.py file in the Argument Parser. To keep the existing default behavior when the flag isn't used, give it action="store_true" so that universal defaults to False. This way, calling the code with --universal will enable the conversion, and leaving the flag off will keep the Penn Treebank tags. (A sketch of this flag appears after this list.)

    • Add a new named parameter do_universal to the HMMTagger class’s initializer method. do_universal should default to False. Inside the HMMTagger object, save the new parameter as a data member called self.do_universal.

    • Ensure that the self.tags list initialized in HMMTagger now contains the Universal tag set instead of the PTB tag set.

    • Update the main() function in HmmTagger.py to pass the new command-line argument args.universal to the HMMTagger initializer.

    • Modify the train() function in HmmTagger.py to pass the value of self.do_universal to the parse_dir function. This will use the code you wrote for parse_file.

    • Add a universal flag to evaluate.py just like you did for HmmTagger.py.

    • Modify the main function of evaluate.py to pass the value of args.universal to parse_dir, just like you did for the training method of HMMTagger.

At the end of this process, report out the same information as you did for Step 1 in analysis.md.

  3. Finally, try one more configuration. This time, you’ll train on the full PTB tag set, but at evaluation time, you’ll map all of the tags to the universal tag set before evaluating them. To do that:

    • Modify the main function of evaluate.py so that after it calls the tagger, but before it compares the tags, it changes each token's tag_ attribute to the corresponding Universal Tag Set tag. If you leverage what you have imported from read_tags.py, this should only require a couple of lines of code. Report out your evaluation results from this as well.
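
To make the pieces above concrete, here is a rough sketch of the two new bits of code: the argparse flag (added to both HmmTagger.py and evaluate.py) and the evaluation-time remapping in evaluate.py. The variable names parser and doc and the help text are assumptions about the existing code, so adapt them to what is actually there.

# In the argument parser of HmmTagger.py and evaluate.py:
parser.add_argument("--universal", action="store_true",
                    help="Convert PTB tags to the Universal Tag Set")
# args.universal is False unless --universal is given on the command line,
# so the default behavior matches the starter code.

# In evaluate.py, for this third configuration: after running the tagger
# but before comparing tags, map each predicted PTB tag onto its Universal
# equivalent using the dictionary you added to read_tags.py.
for token in doc:
    token.tag_ = ptb_to_universal[token.tag_]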

In analysis.md, describe how the performance of your tagger changed for each of the three configurations. What do the results say about tag granularity? Are the results surprising to you? Why or why not?

We might wonder if the accuracy is affected by a mismatch in the style of the data: maybe the Brown corpus is just very different from the WSJ corpus! Repeat the above three experiments, but using /cs/cs159/data/pos/wsj/train as your training data and /cs/cs159/data/pos/wsj/test as your testing data.

In analysis.md, comment on whether the results are consistent after the change in dataset. What, if anything, do you conclude from your results?

For the remainder of the lab, you should train using the full PTB Tag Set and test using the Universal Tag Set.

Exploring Vocabulary Size Effects

The starter HMM model adds all of the words in the training set to its vocabulary. For the brown data set, that means it has a vocabulary of 47,703 words, plus the <<OOV>> or "Out Of Vocabulary" token. (Replacing tokens that aren't in the vocabulary of a corpus with a placeholder is a common and useful convention in NLP!)
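
As a tiny illustration of the convention (not necessarily how the starter code implements it):

vocab = {"the", "dog", "barked"}                 # a toy three-word vocabulary
tokens = ["the", "aardvark", "barked"]
replaced = [t if t in vocab else "<<OOV>>" for t in tokens]
print(replaced)                                  # ['the', '<<OOV>>', 'barked']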

In this section of the lab, you will explore some of the time, space, and performance trade-offs that come from varying the size of the vocabulary.

First, modify the HmmTagger.py interface so that it can take a vocabulary size as a command-line argument. To do that, you should:

  • Add a --vocabsize, -v argument to the HmmTagger.py interface. The default value should be None, which will correspond to keeping all of the words in the vocabulary (in other words, the default behavior will be the same as the starter code behavior).

  • Modify update_vocab so that it only keeps the vocabsize most frequent words in the vocabulary, or all words if vocabsize is None.
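
One way the truncation might look, assuming the tagger has word frequencies available in a collections.Counter (the helper name and signature below are hypothetical, not the starter code's actual update_vocab):

from collections import Counter

def keep_top_words(word_counts: Counter, vocabsize=None):
    # Return the vocabsize most frequent words, or every word if
    # vocabsize is None (matching the starter code's default behavior).
    if vocabsize is None:
        return set(word_counts)
    return {word for word, count in word_counts.most_common(vocabsize)}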

Next, build six models. All of them should be trained on the Brown corpus using the PTB tag set. They should vary in their vocabulary size: 1000, 2000, 5000, 10000, 20000, or 50000. Note that since 50000 is greater than the number of different word types in the Brown corpus, it should keep all of the tokens in the vocabulary.

In your analysis.md, add a table that reports the following for each of the six models:

  • How long it takes to train the model on /cs/cs159/data/pos/brown (in seconds)

  • How long it takes to test the model on the /cs/cs159/data/pos/wsj data (in seconds)

  • How big the model is (in kilobytes or megabytes, as appropriate), which you can find using ls -lh

  • The model’s accuracy

Hint: The command-line time command can be used to time how long another program takes to run. For example, to see how long it takes to build model.pkl (note that $ is the prompt symbol, and the typed command starts at time):

$ time python3 HmmTagger.py -d /cs/cs159/data/pos/brown -o model.pkl

real    1m13.679s
user    1m12.865s
sys     0m0.487s

This output says that it took 1:13 in real (clock) time, 1:12 in processor time, and 0.487 seconds of system time to run HmmTagger.py on my laptop. You should report the user time (or processor time) for this lab.

Hint: To repeat the same command on several values, you can use a bash for loop: for <var> in <sequence>; do <command>; done. For example, if I have files named file1.txt, file2.txt, file3.txt, and file4.txt in my directory but meant them to be saved as Markdown files instead, I could run:

$ for i in 1 2 3 4; do mv file${i}.txt file${i}.md; done

…to rename the files to file1.md, file2.md, file3.md, and file4.md.

After you build your table, comment on the patterns you observe in analysis.md.

Exploring Document Size Effects

In this last section, you will explore whether the length of test documents affects the POS tagger’s performance.

For this part, you should train on the Brown data with a vocabulary size of 20,000. You should test on the WSJ data. Train with the PTB tag set and evaluate with the Universal set.

Update evaluate.py so that it generates a scatter plot of document size (in tokens, on the x-axis) and tagger accuracy (as a percent, on the y-axis). To do that, you should:

  • Update evaluate.py’s argparse interface to take a new argument --output, -o that will take the name of a file to write an image to. This argument should default to None so that if it’s not provided, the behavior of the script will stay the same as it was in previous steps.

  • Update main() so that if args.output is not None, it generates a scatter plot saved as a .png file to args.output, with document size on the x-axis and accuracy on the y-axis.

Note: To get a point for every file instead of all of the data together, you won’t be able to use parse_dir directly. Instead, look at the parse_dir code in read_tags.py for a model of some of the code you’ll want to add to your main function.
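
A sketch of the plotting step, assuming matplotlib is available and that you have already collected per-document sizes and accuracies into two parallel lists:

import matplotlib
matplotlib.use("Agg")   # render to a file without needing a display
import matplotlib.pyplot as plt

def save_scatter(doc_sizes, accuracies, output_path):
    # One point per test document: length in tokens vs. tagging accuracy.
    plt.scatter(doc_sizes, accuracies)
    plt.xlabel("Document size (tokens)")
    plt.ylabel("Tagging accuracy (%)")
    plt.savefig(output_path)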

Include your plot in analysis.md. Comment on any patterns you notice, and discuss why you think you might see those patterns.

EXTRA CREDIT: Exploring "Gender" Effects on POS Tag Results

One of the readings this week was this paper from Garimella et al., which examines the relationship between author gender and POS tagger performance. For extra credit this week, you can also do a reproducibility study to see if you observe the same patterns in our HMMTagger that Garimella et al. reported in their paper.

Continue to use a vocabulary size of 20,000, training on the PTB tag set and testing on the Universal tag set. You will try a variety of training data sets. You will test on WSJ, since that’s the data set for which we have author gender information (automatically predicted from the authors' first names).

The /cs/cs159/data/pos/wsj/train and /cs/cs159/data/pos/wsj/test directories are both split into two subdirectories, female/ and male/, corresponding to the article labels from Garimella et al.’s data release. As noted in our discussion this week, these are not actual self-reported gender labels, but were instead gathered using a heuristic applied to the authors' first names. Though there are many reasons we'd expect this heuristic to have high accuracy on this dataset, I generally consider this approach to be a very bad idea because of the types of bias it can produce.

Set the action of the argparse dir argument to "append". That way, you’ll be able to use the -d (or --dir) argument more than once to list all of the directories you want to compare.

Update main to loop through all of the directories passed through --dir, processing them each in turn. Your script should report the accuracy for each category in addition to reporting the overall accuracy.
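
A sketch of the relevant argparse change and the evaluation loop; the description string, the required=True setting, and the surrounding structure of evaluate.py are assumptions:

import argparse

parser = argparse.ArgumentParser(description="Evaluate POS taggers")
parser.add_argument("--dir", "-d", action="append", required=True,
                    help="Evaluate on DIR (may be given more than once)")
args = parser.parse_args()

for data_dir in args.dir:   # one list entry per -d/--dir on the command line
    print(f"Evaluating on {data_dir}")
    # ... run the tagger on data_dir and report its accuracy here ...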

Experiments

Design an experiment using at least four unique configurations from the following train and test set options.

  • Train on wsj/train/male, wsj/train/female, wsj/train, or brown/

  • Test on wsj/test/male, wsj/test/female, or wsj/test

First, write down your hypothesis for what you expect to see when you compare your runs. Then, generate a table to report accuracy for each pair you test.

In analysis.md, comment on patterns you see. Are they consistent with the findings of Garimella et al.? How do you think the imbalance in data set size in our data affects the results you’re seeing? What other effects do you think could muddy the results you are seeing?