language-models

(just to be clear, most of this is from Dan Jurafsky in one form or another: it's either from the Jurafsky and Martin textbook (chapter 4), or it's from his NLP course on Coursera.)

What about language models?

Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful. - George E.P. Box

But it must be recognized that the notion "probability of a sentence" is an entirely useless one, under any known interpretation of this term. - Noam Chomsky (1969)

So it may turn out that this is an instance where we disagree with Chomsky. We need, for any number of applications, to be able to say that some sentences are just likelier than others. We'll talk about applications in a minute!

Let's decide which of these sentences is more likely!

the eight Center the recorded June, summer and the The first Climatic States ever period and Monday. hottest National July months in August Data the the reported continental of were was hottest of 2012 third United ever,

The first eight months of 2012 were the hottest ever recorded in the continental United States and the summer period of June, July and August was the third hottest ever, the National Climatic Data Center reported Monday.

(sentence from http://news.blogs.cnn.com/2012/09/18/2012-hottest-year-on-record-federal-agency-says/?hpt=hp_t3 )

Let's play a word guessing game!

"Please turn in your ..."

What words might come next?

What words probably don't come next?

How about these? These are perfectly good words:

scarves

cleat's

pertinence

Brillo

expendables

Why do we want to say that these are less likely choices?

So, what I'll claim we want is some kind of model that can tell us how good a sentence is. Or (equivalently?) how good a word is in a given context.

We want to imagine that a language is this statistical process that's emitting words. And we want to ask the question "OK, what's the probability that the next word is going to be hyena?"

You might think that hyena is a pretty uncommon word, and you'd be right! But maybe not in some contexts. Maybe not near the word "aardwolf". OK, that's kind of ridiculous.

Goal: do the Chomsky thing and compute the probability of a sentence or a sequence of words:

P(W) = P(w1, w2, w3, ... wn)

Or alternatively, we could compute the probability of an upcoming word:

P(w5 | w1, w2, w3, w4)

We call a model that computes either of these a LANGUAGE MODEL or LM. That's just what it's called. Jurafsky says that maybe there could be better terms for this, like a "grammar". Although I think of a grammar as (typically) a more structured thing, I guess this is sort of a very flat grammar.

Let's talk about situations where we'd want such a model.

Machine Translation

Which of these sentences is better in English?

- "high winds tonight"

- "large winds tonight"

(but you could easily imagine a language where winds are described as "large". "high" doesn't inherently make more sense; it's just how English says it)

spelling correction (contextual spelling correction!!)

"... about fifteen minuets from ..."

When do you want such a thing? Any time a user is inputting text. Especially on mobile devices. Your device probably does this already. (does it do a good job? why, or why not? ...)

speech recognition

"I saw a van" vs. "eyes aw of Anne"

OCR: ocr is basically speech recognition, but with image input instead of audio! (many of the same techniques apply)

You could imagine using it for word segmentation...

What does the "probability of a sentence" mean, anyway? ...

You should understand what Chomsky is getting at here. He's got a point; he was just wrong that there's no use for this kind of model.

Thanks to center-embedding, or just conjunctions, there's an effectively infinite number of sentences in any language, and we'd want to assign some probability mass to each one of them.

Most sentences that you read, or hear, or say have never been composed before, at any point in history, ever. (write a sentence! any sentence! now google for it!)

On the other hand, eventually the stars will burn out, so there's a finite number of sentences that will be uttered.

But we have this really strong intuition that some strings of words are more likely than others, and that's the intuition that we're working with. Often we don't need a strict probability distribution anyway, just a score to compare two things and say that one's better than the other. We'll see later that methods that don't bother being a strict probability distribution can work just fine in practice.

OK, so how do we compute it?

Use the chain rule!

Recall the chain rule. Or perhaps come to know the chain rule for the first time.

P(A, B) = P(A) * P(B | A)

P(A, B, C) = P(A) * P(B | A) * P(C | A, B)

We're doing a slight notational abuse here. By P(C | A, B) we mean not just "A and B happened", but also "A and B happened in that order".

So, OK. Using this, we can generally say that the probability of a sentence is just:

PRODUCT { P( each word | history for that word) }.

Or to be mathier about it...

P(W) = PI { P(wi | w1, w2, ... w_{i-1} ) }

Probability of the sentence "its water is so transparent" is just... (write that down)
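
Written out with the chain rule, that comes to:

P(its water is so transparent) = P(its) * P(water | its) * P(is | its water) * P(so | its water is) * P(transparent | its water is so)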

So we just need conditional probabilities.

How do we get those conditional probabilities?

We could count examples up from a corpus!

Say the beginning of a sentence is "I ate so many hot dogs" ... what are the most likely next words? Let's go to the web to find out!

Side note: Adam Kilgarriff, "Googleology is Bad Science" (2007). Says we shouldn't do exactly this.

But most contexts have never happened before.

We just said this earlier. Most sentences, and in fact most beginnings of sentences, are brand new in the history of the world. So even if you have an extraordinarily large corpus, you're not going to see most possible sentences!

How do we deal with that?

We approximate! AI is taking tasks that are impossible and then finding ways to approximate them!

Let's just make the simplifying assumption that only so much context matters.

P(that | I ate so many hotdogs) is approximated by P(that | hotdogs) .

Or maybe P(that | I ate so many hotdogs) is approximated by P(that | many hotdogs) .

So this is called the Markov assumption, after Andrey Markov senior. It's the assumption that, in some stochastic process where you want to know the probability of an event, you approximate by using less context. Only some fixed number of steps into the past matter. And this lets you get enough training data, often, to sensibly train these things.
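
Just to make that concrete, here's a minimal sketch of the bigram version in NLTK, counting over the Brown corpus (this assumes you've already downloaded Brown via nltk.download('brown'); the particular words are just for illustration):

import nltk
from nltk.corpus import brown

# Bigram counts over the Brown corpus: for each word, how often each
# possible next word follows it.
cfd = nltk.ConditionalFreqDist(nltk.bigrams(brown.words()))

# Maximum-likelihood estimate of P(next | previous): the bigram count
# divided by the count of the previous word.
def bigram_prob(previous, nxt):
    return cfd[previous][nxt] / cfd[previous].N()

# Under the Markov assumption we ignore everything before the last word.
print(bigram_prob("many", "people"))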

HISTORICAL SIDE NOTE: there were a bunch of Markovs. There was Andrey Markov (senior), his brother Vladimir Markov, and his son Andrey Markov (jr). All important mathematicians. But when we talk about Markov, at least in main-line NLP, we mean this one, born in 1856.

http://en.wikipedia.org/wiki/Andrey_Markov

http://en.wikipedia.org/wiki/Vladimir_Andreevich_Markov

http://en.wikipedia.org/wiki/Andrey_Markov_(Soviet_mathematician)

(there were also a whole lot of Bernoullis, for what it's worth: http://en.wikipedia.org/wiki/Bernoulli_family)

This is what we mean by n-grams.

An n-gram is just a sequence of words of length n.

For a 1-long sequence, we call it a unigram.

2-long is bigram.

3-long is trigram.

And if we're estimating probabilities with these, we use all but the last word of the n-gram as context.

So when you're using bigrams to estimate probabilities, the second word is conditioned on the first word.
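
A quick demo, since NLTK will slice a token sequence into n-grams for you (nltk.ngrams just produces overlapping tuples):

import nltk

tokens = "the sun did not shine".split()

print(list(nltk.ngrams(tokens, 1)))  # unigrams
print(list(nltk.ngrams(tokens, 2)))  # bigrams
print(list(nltk.ngrams(tokens, 3)))  # trigrams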

What are some problems with this?

Just to be clear, this sort of thing is used in practice all the time. But is it perfect?

"The computer which I had just put into the machine room on the fifth floor ..."

Would you expect P(crashed | floor) to capture that? What about P(crashed | fifth floor)?

This is called a long-distance dependency. There's contextual information here that's relevant, but it's not within a small number of tokens.

How do you train these things?

From corpora! You need a big ol' corpus to do it.

What kinds of knowledge are encoded in n-gram probabilities?

World knowledge?

P( English | I want) ??

Grammatical knowledge?

In practice, how do we do these?

Remember: we use logprobs. Do we remember that? Why do we do that?
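
The short answer: multiplying lots of small probabilities underflows floating point to zero surprisingly fast, while summing log probabilities keeps everything in a comfortable range. A tiny illustration:

import math

probs = [0.0001] * 100  # a hundred smallish word probabilities

# Multiplying them all together underflows to exactly 0.0 ...
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0

# ... but the sum of the log probabilities is perfectly representable.
print(sum(math.log(p) for p in probs))  # about -921.03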

How do we do that?

We can use software that other people wrote toooo!

SRILM. The source is available, although the license is kind of dorky.

There's one from CMU.

The newer one seems to be KenLM!

MARKOV CHAINS: let's generate some language!!!

OK, what we're really here for. Let's generate some terrible text!

Apparently we have Claude Shannon to thank for beautiful, terrible Markov Chain text! Jurafsky tells us this is the "Shannon Visualization Method".

How does this work? Build an n-gram model. Let's be clear about what we're storing: conditional probabilities for each possible word, for each possible history. So it's just a big table of numbers.

Let's say we're doing bigrams, so context is one word.

We'll have to start with some context, say the start-of-sentence token. Now we look up which words can come after start-of-sentence token, and select one...

Let's build a bigram model ourselves!!

The sun did not shine

It was too wet to play

So we sat in the house

All that cold cold wet day

I sat there with Sally

We sat there we two

And I said How I wish

We had something to do

Too wet to go out

And too cold to play ball

So we sat in the house

We did nothing at all

(from the Cat in the Hat, by Dr. Seuss!)

OK, now let's sample from that, one word at a time.
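
Before reaching for NLTK, here's a minimal hand-rolled sketch of the whole procedure (the lowercasing and the <s> start-of-sentence marker are my own choices, nothing official):

import random
from collections import defaultdict

lines = [
    "the sun did not shine",
    "it was too wet to play",
    "so we sat in the house",
    "all that cold cold wet day",
    "i sat there with sally",
    "we sat there we two",
    "and i said how i wish",
    "we had something to do",
    "too wet to go out",
    "and too cold to play ball",
    "so we sat in the house",
    "we did nothing at all",
]

# Bigram "model" as a table: for each word (plus a start-of-sentence
# marker), the list of words that followed it, repeats included.
followers = defaultdict(list)
for line in lines:
    words = ["<s>"] + line.split()
    for prev, nxt in zip(words, words[1:]):
        followers[prev].append(nxt)

# Sample one word at a time, conditioning only on the previous word.
# random.choice over the raw follower list (repeats and all) samples in
# proportion to the bigram counts.
word = "<s>"
output = []
for _ in range(8):  # generate up to 8 words
    word = random.choice(followers[word])
    output.append(word)
    if word not in followers:  # dead end: this word never had a follower
        break
print(" ".join(output))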

(Quick Interlude with NLTK and corpus readers and the Text object ... )

>>> import nltk

>>> from nltk.corpus import brown

>>> text = nltk.Text(brown.words())

>>> text.concordance("committee")

How do we do that with NLTK?

import nltk

from nltk.corpus import brown

two = nltk.NgramModel(2, brown.sents())

# now we can sample from it!

two.generate(5)

Note, when we're sampling from these things, NLTK is likely to give us some warnings about the probability distributions not summing to 1. We'll talk about that in a bit, but for now, let's turn off warnings.

import warnings

warnings.simplefilter('ignore')

OK, but Brown corpus isn't the only one that comes with NLTK. In fact, NLTK comes with a bunch of corpora! It actually comes with a bunch of corpora just from Project Gutenberg.

nltk.corpus.gutenberg.fileids()

Neeeaaaat!

Of course, we can use whatever text we want. These are just provided for convenience. Still, let's generate some new Shakespeare.

from nltk.corpus import gutenberg

shake_plays = [fn for fn in gutenberg.fileids() if fn.startswith("shake")]

shake_words = []

for play in shake_plays: shake_words.extend(gutenberg.words(play))

Now we can make a Text out of that, and do a concordance...

import nltk

text = nltk.Text(shake_words)

text.concordance("wherefore")

text.concordance("art")

(hey, what do we notice about "art" ?)

NB: Shakespeare is not "Old English". You can't read Old English. Shakespeare isn't even Middle English; it's just Early Modern English. But it did have the you/thou distinction, which we've since lost.

(this is basically the same as the T-V distinction that you see in Spanish, French, German, and many other languages...)

OK, cool! Let's train an ngram model.

sentences = []

for play in shake_plays: sentences.extend(gutenberg.sents(play))

three = nltk.NgramModel(3, sentences)

And then of course you can do any old text. Such as, for example, Machine of Death! (most of which is Creative Commons)

>>> import nltk

>>> from machineofdeathcorpus import corpus

>>> corpus.sents()

... and so on.

How do we decide how good a language model is?

REMEMBER extrinsic vs. intrinsic evaluation...

We can measure the perplexity of the model on a test set (or a dev set, or any held-out text).

Perplexity was introduced by Shannon: how well can we predict the next word? How surprised are we, on average?

PP(W) = P(w1 w2 ... wN) ^ (-1/N), where N is the number of words.

This comes out to be the average branching factor. A lower perplexity is like "there were fewer probable things in the next spot, most of the time".
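
To make the formula concrete, here's perplexity computed directly from some per-word probabilities (the numbers are made up purely for illustration), done in log space so nothing underflows:

import math

# Hypothetical per-word probabilities a model assigned to a 5-word test
# sentence (invented numbers, purely for illustration).
word_probs = [0.1, 0.2, 0.05, 0.1, 0.25]

N = len(word_probs)
log_prob = sum(math.log(p) for p in word_probs)

# PP(W) = P(w1 ... wN) ^ (-1/N), computed in log space to avoid underflow.
perplexity = math.exp(-log_prob / N)
print(perplexity)  # roughly 8.3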

Zeros and smoothing and so on

OK, so we know how to calculate the MLE for the probability of a word given a context: just count and divide. For bigrams, P(w2 | w1) = C(w1, w2) / C(w1).

What's the problem here? Zeros: any n-gram we never saw in training gets probability zero, and a single zero wipes out the probability of the whole sentence.

OK, what happens if we do add-one smoothing? It's really ham-fisted, even for small vocabularies! It changes our probability estimates kind of a lot!

Add-one smoothing, maybe not a great idea for n-grams?

(can we generate tables out of code on our own? yeahh, probably...)

Let's do the example with the Cat and the Hat text.

Then let's compute the estimated counts back again.

Reconstituted count:

smoothed_count(w1,w2) = (C(w1, w2) + 1) * (C(w1)) / (C(w1) + V)

These are really quite different. Let's print out all the words with their actual counts and then their smoothed counts.
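
Here's one way to sketch that table, using just the first couple of Seuss lines from above to keep the output small; the reconstituted-count formula is the one right above:

from collections import Counter

# Just the first couple of lines from the Seuss text, to keep things small.
text = "the sun did not shine it was too wet to play so we sat in the house"
tokens = text.split()

unigram = Counter(tokens)
bigram = Counter(zip(tokens, tokens[1:]))
V = len(unigram)  # vocabulary size

# For each observed bigram, compare the real count with the reconstituted
# (add-one smoothed) count: c* = (C(w1, w2) + 1) * C(w1) / (C(w1) + V)
print(f"{'bigram':<16}{'count':>6}{'smoothed':>10}")
for (w1, w2), c in sorted(bigram.items()):
    smoothed = (c + 1) * unigram[w1] / (unigram[w1] + V)
    print(f"{w1 + ' ' + w2:<16}{c:>6}{smoothed:>10.3f}")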

So there are some other techniques for smoothing n-grams -- there's Good-Turing, Kneser-Ney, and Witten-Bell. What we've learned from this is that smoothing methods need to be invented by two people!

Except in the case of...

Stupid Backoff

From Thorsten Brants et al (2007). They basically toss the restriction that we have a proper probability distribution.

... and this works pretty well in practice. At least with really enormous corpora.
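
The whole recipe fits in a few lines: use the relative frequency of the full n-gram when its count is nonzero; otherwise back off to the shorter context and multiply by a fixed constant (Brants et al. found around 0.4 to work well). A bigram/unigram sketch, with the caveat that these scores deliberately don't sum to 1:

from collections import Counter

def make_stupid_backoff(tokens, alpha=0.4):
    """Bigram/unigram stupid-backoff scorer over a list of tokens."""
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    def score(prev, word):
        if bigram[(prev, word)] > 0:
            # Seen the bigram: plain relative frequency, no smoothing.
            return bigram[(prev, word)] / unigram[prev]
        # Unseen bigram: back off to the unigram, discounted by alpha.
        # These scores are deliberately NOT probabilities.
        return alpha * unigram[word] / total

    return score

score = make_stupid_backoff("the sun did not shine it was too wet to play".split())
print(score("too", "wet"))   # seen bigram: 1.0
print(score("sun", "play"))  # unseen: backs off to 0.4 * the unigram frequency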

References

Coursera NLP class! http://spark-public.s3.amazonaws.com/nlp/slides/languagemodeling.pdf

Jurafsky and Martin, chapter 4