Probability
Sample space: Ω
Outcome: a ∈ Ω
Event: A ⊆ Ω
Probability of an event: 0 ≤ P(A) ≤ 1
P(Ω) = 1
P(∅) = 0
P(B – A) = P(B) - P(A ∩ B)
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
i.e., P(A ∪ B) = P(A) + P(B), if A and B are disjoint
etc.
Independent events: P(A∩B) = P(A) P(B)
Conditional probability
P(A | B) = P(A ∩ B) / P(B)
e.g., P(A | A) = 1 and P(A | ¬A) = 0
Independent events: P(A | B) = P(A) iff A and B are independent
Bayes' theorem: P(A | B) = P(B | A) P(A) / P(B)
since: P(B) P(A|B) = P(A∩B) = P(A) P(B|A)
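A minimal sketch of Bayes' theorem in Python, with made-up numbers: assume P(A) = 0.01, P(B | A) = 0.9 and P(B) = 0.0585 are known.

    # Hypothetical values, just for illustration
    p_a, p_b_given_a, p_b = 0.01, 0.9, 0.0585
    # Bayes' theorem: P(A | B) = P(B | A) P(A) / P(B)
    p_a_given_b = p_b_given_a * p_a / p_b
    print(p_a_given_b)  # ≈ 0.154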
Partitions
A partition A1 … An is a division of the sample space Ω, where:
all Ai, Aj (whenever i≠j) are disjoint: Ai ∩ Aj = ∅
A1 … An cover the whole sample space: A1 ∪ … ∪ An = Ω
P(B) = P(B|A1) P(A1) + … + P(B|An) P(An)
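A minimal sketch of the formula above (the law of total probability), using the two-part partition {A, ¬A} and the same made-up numbers as in the Bayes sketch:

    # Hypothetical values, just for illustration
    p_a = 0.01
    p_b_given_a, p_b_given_not_a = 0.9, 0.05
    # P(B) = P(B | A) P(A) + P(B | ¬A) P(¬A)
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    print(p_b)  # ≈ 0.0585, the P(B) used in the Bayes sketch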
Formal languages and language models
An alphabet is a finite set Σ:
in morphology it is often letters
in phonology it is often phonemes
in tagging/parsing it is often words
sometimes morphemes, e.g. the girl's can be represented as the + girl + 's
A language L(Σ) is a set of strings over Σ:
i.e., L(Σ) ⊆ Σ*
A language model (LM) is a probabilistic function over strings:
i.e., a probability measure P(w1…wn)
The chain rule: P(w1…wn) = P(w1) P(w2 | w1) P(w3 | w1w2) … P(wn | w1…wn-1)
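A minimal sketch of the chain rule in code, assuming a hypothetical function cond_prob(word, history) that returns P(word | history):

    def string_prob(words, cond_prob):
        # P(w1…wn) = P(w1) P(w2 | w1) … P(wn | w1…wn-1)
        prob = 1.0
        for i, w in enumerate(words):
            prob *= cond_prob(w, words[:i])  # history = all preceding words
        return prob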
Corpora [J&M, 4.1, 4.3]
You should know the difference between (see the sketch after this list):
wordform and lemma
type and token
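A minimal sketch of the type/token distinction, on a toy whitespace-tokenised sentence:

    tokens = "the cat sat on the mat".split()
    print(len(tokens))       # 6 tokens (running words)
    print(len(set(tokens)))  # 5 types ("the" occurs twice but is one type)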
A sensible corpus for POS tagging should have at least 1 million words. The more the merrier. But it is important that the corpus is manually tagged (or at least manually corrected); otherwise you are just learning the output of another automatic tagger, and will not gain any new information.
The training and test corpus must not overlap! You are not allowed to use the test corpus in any way to make the evaluation better!
N-grams [J&M, 4.2]
An N-gram is a sequence of N words (see the sketch after this list):
unigram: wn
bigram: wn-1 wn
trigram: wn-2 wn-1 wn
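A minimal sketch of n-gram extraction from a list of tokens (plain Python, no external library assumed):

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = ["the", "cat", "sat", "on", "the", "mat"]
    print(ngrams(tokens, 2))  # bigrams: ('the', 'cat'), ('cat', 'sat'), …
    print(ngrams(tokens, 3))  # trigrams: ('the', 'cat', 'sat'), …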
The Markov assumption is that (the probability of) a word only depends on the previous word(s).
An N-gram model makes use of the Markov assumption:
a word only depends on the N-1 previous words
P(wn | w1…wn-1) ≈ P(wn | wn-N+1…wn-1)
Applied on unigrams, bigrams and trigrams:
N=1: P(wn | w1…wn-1) ≈ P(wn)
N=2: P(wn | w1…wn-1) ≈ P(wn | wn-1)
N=3: P(wn | w1…wn-1) ≈ P(wn | wn-2wn-1)
Unigram, bigram and trigram language models:
N=1: P(w1…wn) ≈ P(w1) P(w2) … P(wn) = ∏i P(wi)
N=2: P(w1…wn) ≈ P(w1 | w0) P(w2 | w1) … P(wn | wn-1) = ∏i P(wi | wi-1)
N=3: P(w1…wn) ≈ P(w1 | w-1w0) P(w2 | w0w1) … P(wn | wn-2wn-1) = ∏i P(wi | wi-2wi-1)
(where w-1 and w0 are “dummy” words, which we put in to get simpler calculations)
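A minimal sketch of the bigram case, assuming a hypothetical function p_bi(w, prev) = P(w | prev) and using "<s>" as the dummy start word w0:

    def bigram_string_prob(words, p_bi, start="<s>"):
        padded = [start] + words           # w0 = <s> plays the role of the dummy word
        prob = 1.0
        for prev, w in zip(padded, words):
            prob *= p_bi(w, prev)          # P(wi | wi-1)
        return prob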
Maximum Likelihood Estimation (MLE)
To normalise is to multiply all values by a constant, so that you get a probability distribution. I.e., the sum of all normalised values is 1.
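A minimal sketch of normalisation on made-up counts (dividing by the total, i.e. multiplying by the constant 1/total):

    counts = {"a": 2, "b": 3, "c": 5}
    total = sum(counts.values())
    probs = {w: c / total for w, c in counts.items()}  # {'a': 0.2, 'b': 0.3, 'c': 0.5}
    print(sum(probs.values()))  # 1.0 (up to floating-point rounding)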
Maximum likelihood estimation (MLE) means assuming that the probabilities in the real world are exactly the same as the distribution in the training corpus:
P(wn | wn-1) = C(wn-1wn) / C(wn-1)
P(wn | wn-2wn-1) = C(wn-2wn-1wn) / C(wn-2wn-1)
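A minimal sketch of MLE bigram estimation on a toy corpus, using collections.Counter:

    from collections import Counter

    corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
    unigram_counts = Counter(w for sent in corpus for w in sent)
    bigram_counts = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))

    def p_mle(w, prev):
        # P(wn | wn-1) = C(wn-1 wn) / C(wn-1)
        return bigram_counts[(prev, w)] / unigram_counts[prev]

    print(p_mle("cat", "the"))  # C(the cat) / C(the) = 1/2 = 0.5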
Smoothing, interpolation and backoff [J&M, 4.5-4.7]
The most important thing you need to know is why smoothing, interpolation and backoff are necessary!
Smoothing
Laplace smoothing is good to know about. The idea is to increase the number of occurrences by 1 for every possible unigram/bigram/trigram, even the ones that are not in the corpus.
For unigrams: P*(wn) = (C(wn)+1) / (N+V)
For bigrams: P*(wn | wn-1) = (C(wn-1wn)+1) / (C(wn-1)+V)
(where N = the number of words in the corpus, and V = the number of words in the lexicon)
Usually you get even better results if you add something less than 1, which is called Lidstone smoothing in NLTK. Or if you use smoothing à la Good-Turing, Witten-Bell or Kneser-Ney.
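A minimal sketch of Laplace bigram smoothing; unigram_counts and bigram_counts are assumed to be Counters as in the MLE sketch above, and setting k below 1 gives Lidstone-style add-k smoothing:

    def p_smoothed(w, prev, bigram_counts, unigram_counts, V, k=1):
        # P*(wn | wn-1) = (C(wn-1 wn) + k) / (C(wn-1) + k*V), with V = lexicon size
        return (bigram_counts[(prev, w)] + k) / (unigram_counts[prev] + k * V)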
Interpolation and backoff
If we want to calculate the trigram probability P(wn | wn-2wn-1), but there is not enough information in the corpus, we can use the bigram probability P(wn | wn-1) as an approximation. And if we don't have enough information to calculate the bigram, we can use the unigram probability P(wn).
Interpolation means that you calculate the trigram probability as a weighted sum of the actual trigram, bigram and unigram probabilities (a small sketch follows below).
Backoff means that you choose one or the other: if you have enough information about the trigram, choose the trigram probability; otherwise choose the bigram probability, or even the unigram probability. (And then you have to normalise the probabilities.)
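A minimal sketch of linear interpolation; p_uni, p_bi and p_tri are assumed probability functions (e.g. the MLE estimates above), and the lambda weights are made-up values that must sum to 1:

    def p_interpolated(w, prev2, prev1, p_uni, p_bi, p_tri,
                       l1=0.1, l2=0.3, l3=0.6):
        # weighted sum of trigram, bigram and unigram estimates (l1 + l2 + l3 = 1)
        return l3 * p_tri(w, prev2, prev1) + l2 * p_bi(w, prev1) + l1 * p_uni(w)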