Probability
Sample space: Ω
Outcome: a ∈ Ω
Event: A ⊆ Ω
Probability of an event: 0 ≤ P(A) ≤ 1
P(Ω) = 1
P(∅) = 0
P(B – A) = P(B) - P(A ∩ B)
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
i.e., P(A ∪ B) = P(A) + P(B), if A and B are disjoint
etc.
Independent events: P(A∩B) = P(A) P(B)
Conditional probability
P(A | B) = P(A ∩ B) / P(B)
e.g., P(A | A) = 1 and P(A | ¬A) = 0
Independent events: P(A | B) = P(A) iff A and B are independent
Bayes' theorem: P(A | B) = P(B | A) P(A) / P(B)
since: P(B) P(A|B) = P(A∩B) = P(A) P(B|A)
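A minimal sketch of Bayes' theorem in Python, with made-up numbers: assume P(A) = 0.01, P(B | A) = 0.9 and P(B) = 0.0585 are known.

    # Hypothetical values, just for illustration
    p_a, p_b_given_a, p_b = 0.01, 0.9, 0.0585
    # Bayes' theorem: P(A | B) = P(B | A) P(A) / P(B)
    p_a_given_b = p_b_given_a * p_a / p_b
    print(p_a_given_b)  # ≈ 0.154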
Partitions
A partition A1 … An is a division of the sample space Ω, where:
all Ai, Aj (whenever i≠j) are disjoint: Ai ∩ Aj = ∅
A1 … An cover the whole sample space: A1 ∪ … ∪ An = Ω
P(B) = P(B|A1) P(A1) + … + P(B|An) P(An)
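A minimal sketch of the formula above (the law of total probability), using the two-part partition {A, ¬A} and the same made-up numbers as in the Bayes sketch:

    # Hypothetical values, just for illustration
    p_a = 0.01
    p_b_given_a, p_b_given_not_a = 0.9, 0.05
    # P(B) = P(B | A) P(A) + P(B | ¬A) P(¬A)
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    print(p_b)  # ≈ 0.0585, the P(B) used in the Bayes sketch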
Formal languages and language models
An alphabet is a finite set Σ:
in morphology it is often letters
in phonology it is often phonemes
in tagging/parsing it is often words
sometimes morphemes, e.g. the girl's can be represented as the + girl + 's
A language L(Σ) is a set of strings over Σ:
i.e., L(Σ) ⊆ Σ*
A language model (LM) is a probabilistic function over strings:
i.e., a probability measure P(w1…wn)
The chain rule: P(w1…wn) = P(w1) P(w2 | w1) P(w3 | w1w2) … P(wn | w1…wn-1)
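A minimal sketch of the chain rule in code, assuming a hypothetical function cond_prob(word, history) that returns P(word | history):

    def string_prob(words, cond_prob):
        # P(w1…wn) = P(w1) P(w2 | w1) … P(wn | w1…wn-1)
        prob = 1.0
        for i, w in enumerate(words):
            prob *= cond_prob(w, words[:i])  # history = all preceding words
        return prob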
Corpora [J&M, 4.1, 4.3]
You should know the difference between (see the sketch after this list):
wordform and lemma
type and token
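A minimal sketch of the type/token distinction, on a toy whitespace-tokenised sentence:

    tokens = "the cat sat on the mat".split()
    print(len(tokens))       # 6 tokens (running words)
    print(len(set(tokens)))  # 5 types ("the" occurs twice but is one type)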
A sensible corpus for POS tagging should have at least 1 million words. The more the merrier. But it is important that the corpus is manually tagged (or at least manually corrected); otherwise you are just learning the output of another automatic tagger, and will not gain any new information.
The training and test corpus must not overlap! You are not allowed to use the test corpus in any way to make the evaluation better!
N-grams [J&M, 4.2]
An N-gram is a sequence of N words (see the sketch after this list):
unigram: wn
bigram: wn-1 wn
trigram: wn-2 wn-1 wn
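A minimal sketch of n-gram extraction from a list of tokens (plain Python, no external library assumed):

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = ["the", "cat", "sat", "on", "the", "mat"]
    print(ngrams(tokens, 2))  # bigrams: ('the', 'cat'), ('cat', 'sat'), …
    print(ngrams(tokens, 3))  # trigrams: ('the', 'cat', 'sat'), …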
The Markov assumption is that (the probability of) a word only depends on the previous word(s).
An N-gram model makes use of the Markov assumption:
a word only depends on the N-1 previous words
P(wn | w1…wn-1) ≈ P(wn | wn-N+1…wn-1)
Applied on unigrams, bigrams and trigrams:
N=1: P(wn | w1…wn-1) ≈ P(wn)
N=2: P(wn | w1…wn-1) ≈ P(wn | wn-1)
N=3: P(wn | w1…wn-1) ≈ P(wn | wn-2wn-1)
Unigram, bigram and trigram language models:
N=1: P(w1…wn) ≈ P(w1) P(w2) … P(wn) = ∏i P(wi)
N=2: P(w1…wn) ≈ P(w1 | w0) P(w2 | w1) … P(wn | wn-1) = ∏i P(wi | wi-1)
N=3: P(w1…wn) ≈ P(w1 | w-1w0) P(w2 | w0w1) … P(wn | wn-2wn-1) = ∏i P(wi | wi-2wi-1)
(where w-1 and w0 are “dummy” words, which we put in to get simpler calculations)
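A minimal sketch of the bigram case, assuming a hypothetical function p_bi(w, prev) = P(w | prev) and using "<s>" as the dummy start word w0:

    def bigram_string_prob(words, p_bi, start="<s>"):
        padded = [start] + words           # w0 = <s> plays the role of the dummy word
        prob = 1.0
        for prev, w in zip(padded, words):
            prob *= p_bi(w, prev)          # P(wi | wi-1)
        return prob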
Maximum Likelihood Estimation (MLE)
To normalise is to multiply all values by a constant, so that you get a probability distribution. I.e., the sum of all normalised values is 1.
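A minimal sketch of normalisation on made-up counts (dividing by the total, i.e. multiplying by the constant 1/total):

    counts = {"a": 2, "b": 3, "c": 5}
    total = sum(counts.values())
    probs = {w: c / total for w, c in counts.items()}  # {'a': 0.2, 'b': 0.3, 'c': 0.5}
    print(sum(probs.values()))  # 1.0 (up to floating-point rounding)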
Maximum likelihood estimation (MLE) means assuming that the probabilities in the real world are exactly the same as the distribution in the training corpus:
P(wn | wn-1) = C(wn-1wn) / C(wn-1)
P(wn | wn-2wn-1) = C(wn-2wn-1wn) / C(wn-2wn-1)
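A minimal sketch of MLE bigram estimation on a toy corpus, using collections.Counter:

    from collections import Counter

    corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
    unigram_counts = Counter(w for sent in corpus for w in sent)
    bigram_counts = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))

    def p_mle(w, prev):
        # P(wn | wn-1) = C(wn-1 wn) / C(wn-1)
        return bigram_counts[(prev, w)] / unigram_counts[prev]

    print(p_mle("cat", "the"))  # C(the cat) / C(the) = 1/2 = 0.5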
Smoothing, interpolation and backoff [J&M, 4.5-4.7]
The most important thing you need to know is why smoothing, interpolation and backoff are necessary!
Smoothing
Laplace smoothing is good to know about. The idea is to increase the number of occurrences by 1 for every possible unigram/bigram/trigram, even the ones that are not in the corpus.
For unigrams: P*(wn) = (C(wn)+1) / (N+V)
For bigrams: P*(wn | wn-1) = (C(wn-1wn)+1) / (C(wn-1)+V)
(where N = the number of words in the corpus, and V = the number of words in the lexicon)
Usually you get even better results if you add something less than 1, which is called Lidstone smoothing in NLTK. Or if you use smoothing à la Good-Turing, Witten-Bell or Kneser-Ney.
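A minimal sketch of Laplace bigram smoothing; unigram_counts and bigram_counts are assumed to be Counters as in the MLE sketch above, and setting k below 1 gives Lidstone-style add-k smoothing:

    def p_smoothed(w, prev, bigram_counts, unigram_counts, V, k=1):
        # P*(wn | wn-1) = (C(wn-1 wn) + k) / (C(wn-1) + k*V), with V = lexicon size
        return (bigram_counts[(prev, w)] + k) / (unigram_counts[prev] + k * V)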
Interpolation and backoff
If we want to calculate the trigram probability P(wn | wn-2wn-1), but there is not enough information in the corpus, we can use the bigram probability P(wn | wn-1) as an approximation. And if we don't have enough information to calculate the bigram, we can use the unigram probability P(wn).
Interpolation means that you calculate the trigram probability as a weighted sum of the actual trigram, bigram and unigram probabilities (a small sketch follows below).
Backoff means that you choose one or the other: if you have enough information about the trigram, choose the trigram probability; otherwise choose the bigram probability, or even the unigram probability. (And then you have to normalise the probabilities.)
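A minimal sketch of linear interpolation; p_uni, p_bi and p_tri are assumed probability functions (e.g. the MLE estimates above), and the lambda weights are made-up values that must sum to 1:

    def p_interpolated(w, prev2, prev1, p_uni, p_bi, p_tri,
                       l1=0.1, l2=0.3, l3=0.6):
        # weighted sum of trigram, bigram and unigram estimates (l1 + l2 + l3 = 1)
        return l3 * p_tri(w, prev2, prev1) + l2 * p_bi(w, prev1) + l1 * p_uni(w)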