NEWS: Please fill in the evaluation for this lab; it's in GUL: https://gul.gu.se/courseId/38351/content.do?id=18547189.
In this assignment you will create a NLTK feature structure grammar for a fragment of English sentences. In the VG part you will implement a CKY parser which returns parse trees.
You should submit the grammar as a text file called lab3_your_name.fcfg and the main program as the file lab3_your_name.py. You don't have to submit the corpus file testcorpus.txt, since I already have it. If you made the VG assignment you have to submit the file lab3vg_your_name.py too.
The main program should read in the grammar and create a chart parser. Then it should read the test corpus line by line, parse each line as a sentence, and print the number of parse trees plus the sentence itself. Like this:
$ python lab3_your_name.py
Nr trees Sentence
----------------------------------------------------------
1 a young woman sees two men
0 two young women sees a man
... ...
Note: you should use a FeatureChartParser, which you can create like this:
>>> parser = nltk.load_parser("file:lab3_your_name.fcfg")
Note that some of the sentences (the starred ones below) should be ungrammatical and will therefore return zero trees. Make sure that the grammatical sentences gets at least one tree, and that the ungrammatical don't get a tree at all.
Note: Don't forget to write good, readable and documented code! You can use Python-style comments in your feature grammar too, so use them!
First you create a grammar which handles the following sentences:
a young woman sees two men, but not *two young women sees a man
they leave her, but not *her leave they
you steal an old telescope, but not *you steal a old telescope
two angry men chase me, but not *two angry men chase I
I own an elevator, but not *I own a elevator
Your grammar should handle congruence within NPs, between NP/VP, and nominative/accusative pronouns. Furthermore it should handle the a/an distinction. For simplicity you can assume that all verbs are transitive.
Think of which categories/nonterminals you need, and which features each nonterminal need. E.g., for the a/an distinction, nouns and adjectives need a feature saying whether they start with a vowel or not.
Augment the grammar with prepositional phrases, to handle the following sentences:
a young woman sees two old men in the park with a telescope
the men with the telescope leave the park
the woman steals the telescope from the men in the park
two angry men chase a woman with a telescope
The grammar should handle ambiguities. E.g., for the last sentence there shold be at least two different readings.
Finally you augment the grammar to handle gapped relative clauses, both subject and object gaps:
a woman I know owns a telescope, but not *a woman I know a telescope
I know a woman who owns a telescope, but not *I know a woman owns a telescope
I see a telescope that a woman I know owns, but not *I see a woman I know owns
I see a telescope that I own, but not *I see a telescope that I owns
There are three kinds of clauses you should handle:
subject gap with relative pronoun: a woman [ who __ knows me ]
object gap with relative pronoun: a woman [ who I know __ ]
object gap without relative pronoun: a woman [ I know __ ]
Hint 1: Subject gaps are just verb phrases, so they should be easy. Object gaps are in general solved by slash categories (explained in section 9.3), but in this specific case we can do something simpler. You can create a new nonterminal, GappedVP, which has the same grammar rules as ordinary VPs, but where some of the arguments is missing. (You are of course welcome to solve it in a more general way if you like).
Hint 2: If you want to distinguish cases when "who" is ok to use, you should have a parameter saying if an NP is human or not. But you are allowed to skip this if you want, which will make your grammar slightly overgenerating. This means that you decide yourself if the following two sentences should be grammatical or not:
?I see a young woman that I know
?I see a young telescope who I know
Finally, combining congruence, PP attachment and relative clauses we can come up with the following sentence:
an old woman who owns a telescope in the park a young man owns sees them with the telescope she owns
The sentences above are collected in a text file that you can download from here. The file contains one sentence per line. The file also contains the following ungrammatical sentences:
*you owns an elevator
*I see two man with a telescope
*she see two men with a telescope
*I own a elevator with a telescope
*woman sees a young man
*a woman sees young man
*them see her
*they see she
*the you own a me
*I see the telescope in the the elevator
In the VG assignment you will augment the WFST recognizer from the NLTK book so that it 1) can recognize ambiguous sentences, and 2) builds the parse trees. The assignment should be in a file lab3vg_your_name.py, as usual.
Test grammars
The groucho_grammar in section 8.1 is not in Chomsky Normal Form (CNF), since there is a rule NP -> Det Noun PP. But if you change that rule into NP -> NP PP, you get a CNF grammar:
groucho_grammar = nltk.parse_cfg('''
S -> NP VP
PP -> Prep NP
NP -> Det Noun | NP PP | "I"
VP -> Verb NP | VP PP
Det -> "an" | "my"
Noun -> "elephant" | "pajamas"
Verb -> "shot"
Prep -> "in"
''')
You will also use another grammar, which is lexically ambiguous:
timeflies_grammar = nltk.parse_cfg("""
S -> NP VP
VP -> Verb PP | Verb NP
NP -> Det Noun | NP PP | Noun Noun
PP -> Prep NP
Verb -> "flies" | "like" | "time"
Prep -> "like"
Det -> "a" | "an"
Noun -> "arrow" | "flies" | "fruit" | "time" | "banana"
NP -> "arrow" | "flies" | "fruit" | "time" | "banana"
""")
(Note the last line, which you need since the rule NP -> Noun is not CNF).
There are three test sentences:
i_shot_an_elephant = "I shot an elephant in my pajamas".split()
time_flies = "time flies like an arrow".split()
fruit_flies = "fruit flies like a banana".split()
Put all the above in the beginning of your file.
The base CKY Parser
Create a class CKYParser which as a subclass of NLTK's ParserI interface:
class CKYParser(nltk.ParserI):
def __init__(self, grammar, trace=False):
assert grammar.is_chomsky_normal_form(), "The grammar must be in Chomsky Normal Form"
self.grammar = grammar
self.trace = trace
def create_wfst(self, tokens):
...
Start with taking the contents of the functions init_wfst() and complete_wfst() in example 8.9, and put it in the method .create_wfst(). You have to do some minor modifications, such as use self.grammar instead of grammar, etc. The method should not return the WFST, but instead store it in the attribute self.wfst. Now you should be able to test your parser class:
>>> import lab3vg_your_name as lab3
>>> groucho_parser = lab3.CKYParser(lab3.groucho_grammar, trace=True)
>>> groucho_parser.create_wfst(lab3.i_shot_an_elephant)
[2] Det [3] Noun [4] ==> [2] NP [4]
[5] Det [6] Noun [7] ==> [5] NP [7]
[1] Verb [2] NP [4] ==> [1] VP [4]
[4] Prep [5] NP [7] ==> [4] PP [7]
[0] NP [1] VP [4] ==> [0] S [4]
[2] NP [4] PP [7] ==> [2] NP [7]
[1] Verb [2] NP [7] ==> [1] VP [7]
[1] VP [4] PP [7] ==> [1] VP [7]
[0] NP [1] VP [7] ==> [0] S [7]
>>> groucho_parser.wfst[0]
[None, NP, None, None, S, None, None, S]
(Note that the final list is the top row of the pretty-printed table in example 8.9).
Now, if you try the other grammar you see that .create_wfst() has problems:
>>> timeflies_parser = lab3.CKYParser(lab3.timeflies_grammar, trace=True)
>>> timeflies_parser.create_wfst(lab3.time_flies)
[3] Det [4] Noun [5] ==> [3] NP [5]
[2] Verb [3] NP [5] ==> [2] VP [5]
The reason why the parser doesn't find anything is that it doesn't store sets of nonterminals in the cells.
Add sets of nonterminals
Modify the method .create_wfst() so that the contents of each cell in the WFST is a set of non-terminal symbols rather than a single non-terminal. (This is exercise 26 in chapter 8).
When initializing the WFST, use empty sets instead of None. And instead of assigning wfst[start][end] to the left-hand side, use the set .add() method. Furthermore, you have to loop over all nt1 in wfst[start][mid] and nt2 in wfst[mid][end].
>>> timeflies_parser = lab3.CKYParser(lab3.timeflies_grammar, trace=True)
>>> timeflies_parser.create_wfst(lab3.time_flies)
[0] Verb [1] NP [2] ==> [0] VP [2]
[0] Noun [1] Noun [2] ==> [0] NP [2]
[3] Det [4] Noun [5] ==> [3] NP [5]
[2] Verb [3] NP [5] ==> [2] VP [5]
[2] Prep [3] NP [5] ==> [2] PP [5]
[1] NP [2] VP [5] ==> [1] S [5]
[1] NP [2] PP [5] ==> [1] NP [5]
[1] Verb [2] PP [5] ==> [1] VP [5]
[0] NP [1] VP [5] ==> [0] S [5]
[0] Verb [1] NP [5] ==> [0] VP [5]
[0] NP [2] VP [5] ==> [0] S [5]
[0] NP [2] PP [5] ==> [0] NP [5]
>>> timeflies_parser.wfst[0]
[set([]), set([NP, Verb, Noun]), set([VP, NP]), set([]), set([]), set([VP, NP, S])]
This says that "time" is an NP, a Verb and a Noun; "time flies" is a VP and an NP, and that the show sentence is a VP, an NP and an S.
Add backpointers
Further modify .create_wfst() so that when a non-terminal symbol is added to a cell in the WFST, it includes a record of the cells from which it was derived. (This is part of exercise 35 in chapter 8).
Instead of implementing cells as sets of nonterminals, you can use a default dictionary from nonterminals to a list of backpointers. Initialize the WFST with defaultdict(list) for every cell.
A backpointer contains all the information you need to find the children of the given nonterminal. This extra information is the breaking point k and the two children nonterminals. Using the variable names in example 8.9, you see that:
wfst[start][end] is a default dictionary
wfst[start][end][lhs] is a list of backpointers
So the only thing you have to do is too .append() the backpointer to the backpointer list. Using the variable names from example 8.9, the backpointer itself can be a tuple (nt1,mid,nt2), or a simple string "word" if the phrase is derived from a lexical rule.
>>> timeflies_parser = lab3.CKYParser(lab3.timeflies_grammar)
>>> timeflies_parser.create_wfst(lab3.time_flies)
>>> timeflies_parser.wfst[0][1]
defaultdict(<type 'list'>, {NP: ["time"], Verb: ["time"], Noun: ["time"]})
>>> timeflies_parser.wfst[0][5]
defaultdict(<type 'list'>, {VP: [(Verb, 1, NP)], NP: [(NP, 2, PP)], S: [(NP, 1, VP), (NP, 2, VP)]})
The last item says that there are two ways of forming a sentence: One combines the NP "time" [0,1] with the VP "flies like an arrow" [1,5], and the other combines "time flies" [0,2] with "like an arrow" [2,5].
The backpointers in the cell [0,1] are lexical, so they are not tuples but instead pure strings.
Build parse trees
Implement a method .build_trees() that will generate all parse trees from the final WFST. The method should take three arguments, cat, start, end, representing the phrase for which you want to build your trees.
def build_trees(self, cat, start, end):
...
First extract the backpointers for the phrase. If start + 1 == end, then it is a lexical cell, so the backpointers are the word at the given position. You can then build a Tree(cat,[word]) from this.
If start + 1 > end, you have to loop through all (n1,mid,n2) in the backpointers. Then recursively build the smaller trees for the phrases (n1,start,mid) and (n2,mid,end). For each of these tree1 and tree2, you can build the Tree(cat,[tree1,trees]).
Easiest is if you yield each tree whenever it is created. This is more efficient, and the code becomes easier to read, than creating a list of trees. Ask the lab supervisor if you have any problems.
Wrapping things up
Finally, create the parser method .nbest_parse() which takes a list of tokens as input and returns a list of parse trees:
def nbest_parse(self, tokens):
self.create_wfst(tokens)
return list(self.build_trees(self.grammar.start(), 0, len(tokens)))
Note that you have to convert the generator returned by .build_trees() into a list. Here is a final example run:
>>> timeflies_parser = lab3.CKYParser(lab3.timeflies_grammar)
>>> for tree in timeflies_parser.nbest_parse(lab3.time_flies): print tree
(S
(NP time)
(VP (Verb flies) (PP (Prep like) (NP (Det an) (Noun arrow)))))
(S
(NP (Noun time) (Noun flies))
(VP (Verb like) (NP (Det an) (Noun arrow))))