(readings: Chapter 5 of J&M)
The idea of parts of speech is pretty old. It seems like a pretty good idea -- there do seem to be some lexical categories that words fall into. And words in the same category can often replace one another, and we've still got a valid sentence.
Think about playing Mad Libs: you're often asked for something fairly specific ("means of transportation" or whatever), but when you're not, what you're asked for is basically a part of speech! Because any word of that category could drop into the blank, and the sentence is still syntactically valid.
OK, well, what are the parts of speech that you were taught in school?
School grammar: noun, verb, adjective, adverb, preposition, conjunction, pronoun, interjection.
(and that seems... pretty explanatory for English.)
Slide from Chris Manning: this is a really good slide. Thanks, Chris!
OK, so what do we mean by open and closed classes?
Clearly the closed classes aren't *really* closed -- English and German have different function words, after all, so these things do change over time -- but new open-class words are getting coined all the time, whereas new closed-class words show up only rarely.
The idea that there have to be eight of them seems to come from DIONYSIUS THRAX of ancient Greece, who used the tagset:
noun, verb, article, adverb, preposition, conjunction, participle, pronoun.
So it might turn out that the grammar they taught you in middle school is not entirely sufficient to explain all of language.
In particular, consider Thrax's eight parts of speech: there's no adjective category. Maybe they didn't have adjectives in Ancient Greek? Not every language seems to have adjectives; in Korean, for example, they're apparently somewhat verb-like.
But moreover: not all adjectives in English seem to be in the same class. I think the adjective/article distinction is pretty useful, because articles in English behave very differently from other adjectives.
And on top of that: we're pretty comfortable saying that not all languages have articles!
We could also imagine that we want to distinguish between mass nouns and count nouns, or based on noun case, or maybe verbs based on tense, or grammatical gender or something.
The point we want to make here is that there are lots of distinctions that we could choose to make, depending on the task and the language that we're working with.
And here's the Brown tagset:
http://www.comp.leeds.ac.uk/ccalas/tagsets/brown.html
Slav Petrov has a list that seems to work well for a lot of tasks, for a lot of languages.
. ADJ ADP ADV CONJ DET NOUN NUM PRON PRT VERB X
What do we notice about this?
Well, it's got punctuation as a tag. And it doesn't have a "preposition" category -- instead it has ADP, for adpositions, which covers prepositions and postpositions.
So why do POS tagging at all?
Because it's a really useful step to do before other tasks!
POS tags are a useful abstraction for many other tasks: when you're trying to parse a sentence, it sure helps to know which words are nouns and which ones are verbs.
Because it helps us pronounce words right, if we're a text-to-speech system. What's the difference between "content" and "content"? Or "lead" and "lead"? Or "wind" and "wind"?
You could also imagine having a POS-tagged corpus and, as a linguist, wanting to do searches over it. Show me the words that come after such-and-such a verb, but only the nouns.
Many words are ambiguous!
Many words aren't ambiguous, of course, but lots of them are. What about "back" (noun, verb, adverb, adjective) or "around" (preposition, adverb) ...
If you always guess the most common tag for a given word, in English, for small tagsets, you can get about 90% accuracy, just because you get points for the easy words and the punctuation.
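Here's a minimal sketch of that baseline, assuming the training data is just a list of (word, tag) pairs from some POS-tagged corpus (the function names here are made up for illustration):

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_corpus):
    """tagged_corpus: iterable of (word, tag) pairs."""
    per_word = defaultdict(Counter)
    overall = Counter()
    for word, tag in tagged_corpus:
        per_word[word][tag] += 1
        overall[tag] += 1
    # For each word we've seen, remember its single most common tag.
    best_tag = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    # Unseen words fall back to the most common tag overall.
    default_tag = overall.most_common(1)[0][0]
    return best_tag, default_tag

def baseline_tag(words, best_tag, default_tag):
    """Tag each word with its single most frequent tag from training."""
    return [best_tag.get(w, default_tag) for w in words]
```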
You could imagine writing a bunch of rules to get the disambiguation right! This has been done, but you need kind of a lot of rules! And then somebody has to write and maintain all of those rules!
You could also build what's called a "feature-based classifier".
What kinds of features would you give to the classifier?
You might look at the word itself. You might do some stemming, or look at prefixes and suffixes (this is especially important in morphologically rich languages: the affixes might explicitly tell you what the POS tag is, even if you've never seen this word before).
You could use the words in the nearby context! For English, this works pretty well!
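As a sketch of what those features might look like for one position in a sentence (the particular feature names and choices here are just illustrative, not a recipe from the book):

```python
def features(words, i):
    """Features for the word at position i, for a feature-based POS classifier."""
    w = words[i]
    return {
        "word": w.lower(),
        "suffix3": w[-3:],          # crude stand-in for stemming / affix info
        "prefix2": w[:2],
        "is_capitalized": w[0].isupper(),
        "has_digit": any(ch.isdigit() for ch in w),
        "prev_word": words[i - 1].lower() if i > 0 else "<S>",
        "next_word": words[i + 1].lower() if i + 1 < len(words) else "</S>",
    }
```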
We want to find the tag sequence T that maximizes P(T | W), where T is a sequence of tags and W is a sequence of words.
Let's do some Bayes Rule.
P(T|W) = P(W|T) * P(T) / P(W)
... and like before, P(W) doesn't depend on T -- it's just the probability of the observed words, like the probability of our features in the classifiers we saw before. So we'll drop that from our argmax.
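Spelled out, that step looks like this (same notation as above):

argmax over T of P(T | W)  =  argmax over T of P(W | T) * P(T)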
OK, so now we just need a way to estimate the probability of the words, given the tags, and then the probability of the tags!
That sounds pretty hard too, so let's make some...
We need to come up with approximations for both the LIKELIHOOD term and the PRIOR term. Let's talk about the prior term first.
Like we did before, with n-gram models for words, let's assume that the probability of a given tag only depends on the tag before it.
P(t_i | t_1, ..., t_(i-1)) is approximated as P(t_i | t_(i-1))
Q: these are, like, what -- bigrams, but over tags? What would we call this model if we were doing it over words?
That's a pretty drastic simplifying assumption, but let's run with it.
OK, now we're going to make another simplifying assumption, and imagine that the probability of a word only depends on the probability of the tag for that spot. We're imagining language as this stochastic process that generates tags, and then from those tags, generates words.
Just to make sure we're clear about this, let's write down the expression for the probability of a SEQUENCE OF TAGS for a given sentence.
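Under those two assumptions, the expression comes out roughly like this (this is the same factorization the book derives in chapter 5; t_0 is a start-of-sentence tag):

best tag sequence = argmax over t_1 ... t_n of  [ P(w_1 | t_1) * P(t_1 | t_0) ] * [ P(w_2 | t_2) * P(t_2 | t_1) ] * ... * [ P(w_n | t_n) * P(t_n | t_(n-1)) ]

i.e., one likelihood term P(w_i | t_i) and one transition term P(t_i | t_(i-1)) per word.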
How do we train these probabilities? What kind of resource would we need to do it?
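Here's a minimal sketch of that training step, assuming we have a POS-tagged corpus as a list of sentences, each a list of (word, tag) pairs; the probabilities are just maximum-likelihood relative counts (no smoothing, which a real tagger would want):

```python
from collections import Counter

def train_hmm(tagged_sentences):
    """Estimate transition P(tag | prev_tag) and emission P(word | tag) by counting."""
    transition = Counter()   # counts of (prev_tag, tag)
    emission = Counter()     # counts of (tag, word)
    tag_count = Counter()    # how often each tag occurs (the denominators)
    for sentence in tagged_sentences:
        prev = "<S>"                      # start-of-sentence pseudo-tag
        tag_count[prev] += 1
        for word, tag in sentence:
            transition[(prev, tag)] += 1
            emission[(tag, word)] += 1
            tag_count[tag] += 1
            prev = tag

    # Maximum-likelihood estimates are just relative frequencies.
    def p_transition(prev_tag, tag):
        return transition[(prev_tag, tag)] / tag_count[prev_tag]

    def p_emission(tag, word):
        return emission[(tag, word)] / tag_count[tag]

    return p_transition, p_emission
```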
(see example on page 143 of the book)
Fluttershy is expected to race tomorrow.
Consider the possibilities here for "race". It could be NN (noun), or it could be VB (verb).
possibility 1: NNP VBZ VBN TO VB NR
possibility 2: NNP VBZ VBN TO NN NR
OK, so what are the probabilities that we would have to take into account to make that one decision, VB vs NN ?
We'd need:
P( race | NN) (how do we compute that again? ...)
P(race | VB)
These numbers are pretty comparable in the Brown Corpus -- "race" is moderately common as both a noun and a verb. But then. BUT THEN.
What's P( NN | TO) and P(VB| TO) ? Very different.
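(To answer the "how do we compute that again?" question: under maximum likelihood these are just relative counts from the tagged corpus, e.g.

P(race | NN) = Count(NN tokens that are the word "race") / Count(NN)
P(VB | TO) = Count(TO immediately followed by VB) / Count(TO)

I won't quote the actual Brown Corpus numbers here, but the punchline is that P(VB | TO) comes out much, much bigger than P(NN | TO): infinitival "to" is usually followed by a verb.)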
Brown tagset, in case you need it: http://www.comp.leeds.ac.uk/ccalas/tagsets/brown.html
OK, but that's just one decision. We have a bunch of decisions to make. We need to get the most probable tag sequence OVER ALL POSSIBLE TAG SEQUENCES.
Unfortunately, now we have combinatorially many possibilities to try. Say you have a 40-word sentence, 10 of the words are ambiguous, and they each have two possible POS tags: that's already 2^10 = 1024 candidate tag sequences. How do we choose the most probable sequence? (Because there are clearly relationships between the choices we make, right? We want to do better than just greedily taking the most likely tag for each word.)
There are some really good algorithms for handling this.
We're going to do what's called dynamic programming, which is a thing we'll see a lot in the rest of this class.
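The standard dynamic-programming algorithm for this problem is Viterbi, which is the one the book works through. Here's a minimal sketch, assuming the p_transition / p_emission functions from the counting sketch above (including its "<S>" start-of-sentence pseudo-tag) and a fixed list of possible tags; a real implementation would work in log space and handle unseen words:

```python
def viterbi(words, tagset, p_transition, p_emission):
    """Return the most probable tag sequence for `words` under the bigram HMM."""
    # best[i][tag] = (prob of the best path ending in `tag` at position i, backpointer)
    best = [{} for _ in words]
    for tag in tagset:
        best[0][tag] = (p_transition("<S>", tag) * p_emission(tag, words[0]), None)
    for i in range(1, len(words)):
        for tag in tagset:
            prob, prev = max(
                (best[i - 1][pt][0] * p_transition(pt, tag) * p_emission(tag, words[i]), pt)
                for pt in tagset
            )
            best[i][tag] = (prob, prev)
    # Trace back from the best final tag.
    tag = max(tagset, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path))
```

The nice part: this runs in time proportional to (number of words) times (number of tags) squared, instead of enumerating every possible tag sequence.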
References:
Jurafsky and Martin, chapter 5.
Coursera NLP: http://spark-public.s3.amazonaws.com/nlp/slides/Maxent_PosTagging.pdf
Slides from Markus: http://jones.ling.indiana.edu/~mdickinson/12/645/slides/06-markov/06-markov.pdf
Slides from Markus: http://jones.ling.indiana.edu/~mdickinson/12/645/slides/07-tagging/07a-tagging.pdf
Mad Libs! http://www.madlibs.com/
Petrov et al's Universal POS tags: http://code.google.com/p/universal-pos-tags/