NEWS: Please fill in the evaluation for this lab; it's in GUL: https://gul.gu.se/courseId/38351/content.do?id=18419202.
In this assignment you will experiment with NLTK corpora, and create and test different statistical part-of-speech taggers. You will use the Brown corpus throughout the lab.
This lab will be corrected by Grégoire, so you should submit by sending the file(s) in a mail to gdetrez@crans.org.
Important note! Before submitting, check with a lab supervisor that the output of your program(s) is correct.
As usual, the lab should be submitted as one single Python program lab2_your_name.py. Each part below should define a function partN() that first prints a simple header and then one or more result tables. When the file is run from the command line it should call all functions part1() to part6(). Like this:
$ python lab2_peter_ljunglof.py
== Part 1: Brown genre statistics
Brown genre Tags Sents Words Sentlen Wordlen
--------------------------------------------------------------------------------
fiction 34 ... ... ... ...
...
reviews ... ... 40704 ... ...
== Part 2: N-gram statistics
1-gram Frequency Accum.freq.
--------------------------------------------------------------------------------
N 22.10% 22.10%
... ... ...
It should also be possible to test the different parts on their own from inside Python:
$ python
>>> import lab2_peter_ljunglof as lab2
>>> lab2.print_common_tag_ngrams("news", n=2, rows=4)
2-gram Frequency Accum.freq.
--------------------------------------------------------------------------------
DET N 6.08% 6.08%
... ... ...
... ... ...
. $ 4.04% 19.82%
Finally, don't forget to document all functions with docstrings!
The first part is to print out some statistics of selected categories of the Brown corpus with the simplified tagset (see section 5.2 for more details). Since most of this lab will be using the tagged sentences of the Brown subcorpora, the following function could prove useful:
def brown_tagged_sents(genre, simplify_tags=True):
    """Returns the tagged sentences of the given Brown category."""
    return nltk.corpus.brown.tagged_sents(categories=genre, simplify_tags=simplify_tags)
Define a function print_brown_statistics(list_of_genres, simplify_tags=True) that prints statistics about the given genres.
>>> lab2.print_brown_statistics(["fiction", "government", "news", "reviews"])
The function part1() should print exactly the table above (with the blanks filled in of course).
In this part you should create a table for the most common tag ngrams (n=1, 2, 3, i.e., uni-/bi-/trigrams) in a given corpus. List the frequency and the accumulated frequency.
def print_common_tag_ngrams(genre, n, rows, simplify_tags=True):
(...)
Hint 1: Create an nltk.FreqDist of the tag ngrams (n=1, 2, 3); from this distribution you can use the methods .keys() and .freq().
Hint 2: You have to set the named arguments pad_left=True, pad_right=True, pad_symbol="$" when calling nltk.ngrams(). Otherwise you will not get the ngrams at the start and end of sentences.
Hint 3: You cannot flatten the list of sentences into one long list of words, because then you will lose the beginnings and ends of sentences. Instead you have to loop over one sentence at a time and update the frequency distribution inside the loop. Here is some pseudo-code for calculating the frequency distribution of the ngrams:
create an empty frequency distribution
for each tagged sentence in the corpus:
    create a list of tags from the sentence (which is a list of (word, tag) pairs)
    create a list of the tag ngrams from the list of tags
    for each ngram in the list of ngrams:
        increase the count of the ngram in the frequency distribution
After this you should have a frequency distribution of the ngrams in the corpus. It is this distribution that contains the information you need to print the rows in the statistics table. The final function should be callable like this:
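The pseudo-code above can be sketched in plain Python. Here collections.Counter stands in for nltk.FreqDist, and the function name tag_ngram_freqs is my own, not part of the assignment; the "$" padding mimics what nltk.ngrams does with pad_symbol="$":

```python
from collections import Counter

def tag_ngram_freqs(tagged_sents, n):
    """Count tag ngrams sentence by sentence, padding each sentence
    with n-1 "$" symbols on both sides so that ngrams at sentence
    boundaries are included."""
    freqs = Counter()  # stands in for nltk.FreqDist
    for sent in tagged_sents:
        tags = [tag for (word, tag) in sent]
        padded = ["$"] * (n - 1) + tags + ["$"] * (n - 1)
        for i in range(len(padded) - n + 1):
            freqs[tuple(padded[i:i + n])] += 1
    return freqs

# Tiny example corpus: two tagged sentences.
sents = [[("the", "DET"), ("dog", "N")], [("run", "V")]]
freqs = tag_ngram_freqs(sents, 2)
```

The relative frequency of an ngram (what .freq() gives you on a real FreqDist) is then its count divided by the total of all counts.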
>>> lab2.print_common_tag_ngrams("news", n=2, rows=10)
The function part2() should print three 10-row tables, for the unigrams (n=1), bigrams (n=2) and trigrams (n=3) of the Brown news corpus (with simplified tags).
Here you will create a sequence of part-of-speech taggers for a given Brown genre, using NLTK's built-in tagger classes.
First, divide the corpus into training and test sentences. To get consistent results for everyone, use the first 500 sentences for testing and the rest for training. Define a function split_sents(sents) that splits the sentences into a tuple (train, test).
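Since the split is fixed (first 500 sentences for testing), the function is a simple slice; a minimal sketch (the test_size parameter is my own addition):

```python
def split_sents(sents, test_size=500):
    """Split tagged sentences into (train, test): the first
    test_size sentences are the test set, the rest the training set."""
    return (sents[test_size:], sents[:test_size])
```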
The default tagger should select the most common POS tag by inspecting the training corpus. Define a function most_common_tag(tagged_sents) for that:
>>> news_train, news_test = lab2.split_sents(lab2.brown_tagged_sents("news"))
>>> lab2.most_common_tag(news_train)
"NN"
Now define a function train_nltk_taggers(train_sents) that returns a default, an affix, a unigram, a bigram, and a trigram tagger. The corresponding NLTK classes are the following (each trainable tagger also takes the training sentences as its first argument):
DefaultTagger(default_tag)
AffixTagger(train_sents, backoff=None)
UnigramTagger(train_sents, backoff=None)
BigramTagger(train_sents, backoff=None)
TrigramTagger(train_sents, backoff=None)
Every new tagger you create (apart from the default tagger) should take the previous as a backoff. The function should return a tuple of all the five taggers:
>>> taggers = lab2.train_nltk_taggers(news_train)
>>> len(taggers)
5
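One way to build the backoff chain, as a sketch: each tagger is trained on the same sentences and falls back to the previous one. Note that "NN" is hard-coded here only for illustration; in the lab you should compute the default tag with most_common_tag:

```python
import nltk

def train_nltk_taggers(train_sents):
    """Train a chain of NLTK taggers, each one using the
    previous tagger as its backoff."""
    default = nltk.DefaultTagger("NN")  # lab: use most_common_tag(train_sents)
    affix = nltk.AffixTagger(train_sents, backoff=default)
    uni = nltk.UnigramTagger(train_sents, backoff=affix)
    bi = nltk.BigramTagger(train_sents, backoff=uni)
    tri = nltk.TrigramTagger(train_sents, backoff=bi)
    return (default, affix, uni, bi, tri)
```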
Define a function print_nltk_taggers_table(genre). It should create the taggers for the given Brown genre and evaluate them, displaying the results in a nice table:
The following (undocumented) functions might be of help when printing the table (and the tables later on). But if you use them you have to write docstrings!
def print_report_header(title, comment=""):
    """I'm in desperate need of a docstring!"""
    print ("%-20s Accuracy Errors %s" % (title, comment))
    print (80 * "-")

def print_report_line(title, accuracy, comment=""):
    """Me too!"""
    errors = 1.0 / (1.0 - accuracy)
    print ("%-20s%7.2f%% %4.1f words/error %s" % (title, 100.0 * accuracy, errors, comment))
The function part3() should print the tagger table for the news genre.
Now you will use the bigram tagger from above as a baseline for evaluating different test sets and training corpora. Define a function train_bigram_tagger(train_sents) that calls train_nltk_taggers and returns only the bigram tagger.
The final function part4() should call the functions from parts 4a–4e below, with the arguments given in the examples.
Part 4a: Test on the train sentences
Test on the training sentences and see how this affects the accuracy:
>>> lab2.test_on_training_set("news")
Part 4b: Test on different genres
Evaluate your tagger on several different genres:
>>> lab2.test_different_genres("news", ["fiction", "government", "news", "reviews"])
Note: Remember to split all genres into training and test sentences, and only use the test set for evaluation. Otherwise you will test on the training set (for the news genre).
Part 4c: Train on different corpora sizes
Train a new tagger on 100%, 75%, 50% and 25% of the training sentences, and evaluate each of them on the same test sentences:
>>> lab2.train_different_sizes("news", [100, 75, 50, 25])
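The percentages can be turned into prefixes of the training data with simple slicing; a minimal sketch (the helper name size_slices is my own, not required by the lab):

```python
def size_slices(train_sents, percentages):
    """For each percentage, return that share of the training
    sentences, counted from the start of the list."""
    return [train_sents[: len(train_sents) * p // 100] for p in percentages]
```

Each slice can then be used to train a fresh bigram tagger, which is evaluated on the same test set.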
Part 4d: Compare different train/test partitions
Now compare the effect of different partitions. First test the partition you have used before (the first 500 sentences for testing, the rest for training). Then use the last 500 sentences for testing, and the rest for training.
>>> lab2.compare_train_test_partitions("news")
Part 4e: Compare different tagsets
Compare how the number of POS tags affects the accuracy. First you need a baseline. This will be the same corpus as always, i.e., the Brown news corpus with the simplified tagset. Divide the corpus into training data and test data as usual. Train the bigram tagger and evaluate.
Second, compare the baseline with a larger tagset. Use the non-simplified tagset (simplify_tags=False) for the Brown news corpus. Divide the corpus into training and test as usual. Train the bigram tagger and evaluate.
Third, compare the baseline with an even smaller tagset. Translate the simplified Brown news corpus into the (even more simplified) tagset: N, NP, V, AUX, DELIM. Divide into train and test set, train and evaluate as before.
To translate the corpus you need a function that converts a simplified Brown tag into one of the five super-simple tags:
>>> supersimple("PRO")
"NP"
Use the following translation table:
N, NUM ==> N
NP, PRO, EX, ADJ, DET, WH ==> NP
MOD, all tags starting with V ==> V
CNJ, TO, ADV, P, FW, UH, * ==> AUX
NIL, all non-alpha tags (except *), including the empty string tag ==> DELIM
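The translation table maps directly onto a chain of conditions; a sketch (note that "*" must be checked before the non-alpha fallback, since it maps to AUX):

```python
def supersimple(tag):
    """Map a simplified Brown tag onto one of the five
    super-simple tags: N, NP, V, AUX, DELIM."""
    if tag in ("N", "NUM"):
        return "N"
    elif tag in ("NP", "PRO", "EX", "ADJ", "DET", "WH"):
        return "NP"
    elif tag == "MOD" or tag.startswith("V"):
        return "V"
    elif tag in ("CNJ", "TO", "ADV", "P", "FW", "UH", "*"):
        return "AUX"
    else:
        # NIL, punctuation and other non-alpha tags, and the empty string
        return "DELIM"
```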
The result should be like this:
>>> lab2.compare_different_tagsets("news")
In the VG assignment you will experiment with some advanced POS taggers, and then write a report about the results from the whole lab. The programming part should be submitted as one single file lab2vg_your_name.py, which should be runnable from the command line. The report should be called lab2report_your_name.{txt/pdf/doc}.
In this part you will train a Brill tagger using NLTK's FastBrillTaggerTrainer. If you haven't already done so, read about Brill tagging in J&M section 5.6.
It needs a baseline tagger, and you should use the unigram tagger from part 3 above. First you create a tagger trainer from the baseline tagger and a set of rule templates. Then you call the method train with the training data and the maximum number of rules you want to learn:
>>> (_default, _affix, unitagger, _bi, _tri) = lab2.train_nltk_taggers(news_train)
>>> baseline_tagger = unitagger
>>> trainer = nltk.FastBrillTaggerTrainer(baseline_tagger, rule_templates)
>>> tagger = trainer.train(news_train, max_rules=1000)
>>> result = tagger.evaluate(news_test)
>>> print ("Brill tagger with %d rules, evaluated on the %s genre: %.2f%% accuracy" %
... (len(tagger.rules()), "news", 100.0 * result))
The result from this part should be something similar to the output of the print statement above.
Rule templates
The difficult part of Brill tagging is coming up with good rule templates. The more templates, the better the result, but training can also take much longer. Here are some suggestions for templates, but feel free to come up with your own:
change the POS of a word, depending on the POS of the previous word
change the POS of a word, depending on the POS of any of the two previous words
change the POS of a word, depending on the POS of any of the three previous words
change the POS of a word, depending on the POS of the previous word and the POS of the next word
change the POS of a word, depending on the previous word
change the POS of a word, depending on any of the two previous words
change the POS of a word, depending on any of the three previous words
change the POS of a word, depending on the previous word and the next word
The following two functions can be used to create tag and word templates, respectively:
def tag_template(*boundaries):
    return nltk.tag.ProximateTokensTemplate(nltk.tag.ProximateTagsRule, *boundaries)

def word_template(*boundaries):
    return nltk.tag.ProximateTokensTemplate(nltk.tag.ProximateWordsRule, *boundaries)
The functions take any number of arguments, where each argument is a pair (start, end) that specifies a range for which a condition is created in each rule. E.g., rule template 2 above is specified by tag_template((-2,-1)), while template 4 is specified by tag_template((-1,-1),(1,1)).
The NLTK book doesn't have any information about the Brill tagger, so you have to use Python's help system to learn more. (Or ask the supervisors:)
VG assignment, part 2: Create your own bigram HMM tagger with smoothing
In this part you will create a HMM bigram tagger using NLTK's HiddenMarkovModelTagger class. Again, this is not covered by the NLTK book, but read about HMM tagging in J&M section 5.5.
The HMM class is instantiated like this:
>>> words = list(set(...all words in the training data...))
>>> tags = list(set(...all POS tags in the training data...))
>>> transitionsCPD = nltk.ConditionalProbDist(...)
>>> outputsCPD = nltk.ConditionalProbDist(...)
>>> priorsPD = (...some subclass of nltk.ProbDistI...)
>>> tagger = nltk.HiddenMarkovModelTagger(words, tags, transitionsCPD, outputsCPD, priorsPD)
The reason for using list(set(...)) is that there is a minor bug in the HMM class. We need the following probability distributions:
priorsPD: the initial state distribution; P(ti) is the probability of starting in state (tag) ti
outputsCPD: the output probabilities; P(wk | ti) is the probability of emitting symbol (word) wk when entering state (tag) ti
transitionsCPD: the transition probabilities; P(ti | tj) is the probability of a transition to state (tag) ti given that the model is in state (tag) tj
outputsCPD corresponds to the word likelihood probability in equation 5.31 in J&M, whereas transitionsCPD corresponds to the tag transition probabilities in the same equation. priorsPD is NLTK's way of handling the sentence initial probabilities P(ti | <s>) that are shown in the table in figure 5.15 in J&M.
Define a function train_hmm_tagger(train_sents, probdist) that creates an HMM tagger by calculating the probability distributions from the training corpus:
priorsPD is created from a suitable FreqDist and the number of states
outputsCPD is created from a ConditionalFreqDist, the probdist factory and the number of symbols
transitionsCPD is created from a ConditionalFreqDist, the probdist factory and the number of states.
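The three distributions can be computed from the training corpus roughly like this. This is a sketch, assuming a recent NLTK where FreqDist behaves like a Counter, and using the list(set(...)) workaround described above; the exact bins arguments and smoothing are for you to experiment with:

```python
import nltk

def train_hmm_tagger(train_sents, probdist):
    """Sketch of an HMM bigram tagger trained from scratch.
    probdist is a factory (freqdist, bins) -> ProbDist that
    decides which kind of smoothing is used."""
    words = list(set(w for sent in train_sents for (w, t) in sent))
    tags = list(set(t for sent in train_sents for (w, t) in sent))
    transitions_cfd = nltk.ConditionalFreqDist()  # P(tag | previous tag)
    outputs_cfd = nltk.ConditionalFreqDist()      # P(word | tag)
    priors_fd = nltk.FreqDist()                   # P(first tag of a sentence)
    for sent in train_sents:
        priors_fd[sent[0][1]] += 1
        for i, (word, tag) in enumerate(sent):
            outputs_cfd[tag][word] += 1
            if i > 0:
                transitions_cfd[sent[i - 1][1]][tag] += 1
    transitionsCPD = nltk.ConditionalProbDist(transitions_cfd, probdist, len(tags))
    outputsCPD = nltk.ConditionalProbDist(outputs_cfd, probdist, len(words))
    priorsPD = probdist(priors_fd, len(tags))
    return nltk.HiddenMarkovModelTagger(words, tags, transitionsCPD, outputsCPD, priorsPD)
```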
The probdist factory specifies which kind of smoothing to use. It is a function that takes two arguments (a FreqDist and the number of bins) and creates an nltk.ProbDist. Here are some example probdists:
probdist = lambda fd, bins: nltk.LaplaceProbDist(fd, bins)
probdist = lambda fd, bins: nltk.LidstoneProbDist(fd, 0.1, bins)
probdist = lambda fd, bins: nltk.WittenBellProbDist(fd, bins)
probdist = lambda fd, bins: nltk.SimpleGoodTuringProbDist(fd, bins)
The final output from this part should be a table where you train and evaluate several HMM taggers on the news genre. Use the probdist factories above. Try at least three Lidstone gamma values, 0.1, 0.01 and 0.001.
Finally, write a report (max 2 pages) where you discuss the results from this lab, parts 1–4 and the VG parts. Try to explain why you get the results you get – why are some evaluations better and some worse?