Important Terms
Chapter Notes
I. Introduction
A. you can use the probability (#1) of certain words appearing after others to correct spelling (#2), suggest likely next words (as in autocomplete), and provide more accurate translations
B. the simplest language model (#3) is the N-gram
1) 2-gram (bigram)
a two-word sequence such as "please turn" or "your homework"; a bigram model predicts each word from the one word before it
2) 3-gram (trigram)
a three-word sequence such as "please turn your" or "turn your homework"; a trigram model predicts each word from the two words before it (see the sketch below)
II. N-grams
A. P(w|h) is the probability of a certain word w appearing after some history h (the preceding words); a counting sketch follows the example below
ex) <s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
P(I|<s>) = 2/3
P(Sam|<s>) = 1/3
P(am|I) = 2/3
P(</s>|Sam) = 1/2
P(Sam|am) = 1/2
P(do|I) = 1/3
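a minimal sketch of how those estimates fall out of counting, assuming whitespace tokenization and maximum-likelihood estimation (the corpus is the one above; the function name is just illustrative):

```python
from collections import Counter

# toy corpus from the example above, with sentence-boundary markers
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(word, prev):
    """Maximum-likelihood estimate: P(word | prev) = count(prev word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("I", "<s>"))   # 2/3
print(bigram_prob("am", "I"))    # 2/3
print(bigram_prob("Sam", "am"))  # 1/2
```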
ex) P(i|<s>) = 0.25, P(want|I) = 0.33, P(english|want) = 0.0011, P(food|english) = 0.5, P(</s>|food) = 0.68
P(i want english food) = 0.25 * 0.33 * 0.0011 * 0.5 * 0.68 = 0.000031
or, use log probabilities: adding logs avoids numerical underflow when multiplying many small probabilities (see the sketch below)
def: P1 * P2 * P3 = exp(logP1 + logP2 + logP3)
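a minimal sketch of that sentence probability computed in log space (the five probabilities are the ones quoted above, assumed to come from a bigram model trained on a larger corpus):

```python
import math

# P(i|<s>), P(want|i), P(english|want), P(food|english), P(</s>|food)
probs = [0.25, 0.33, 0.0011, 0.5, 0.68]

# a direct product can underflow for long sentences; summing logs is numerically safer
log_prob = sum(math.log(p) for p in probs)
print(math.exp(log_prob))  # ~0.000031, same result as multiplying directly
```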
III. Evaluating Language Models
A. you can do it extrinsically (#5)
B. or you can do it intrinsically (#6), using a training set (#7) to estimate the model's probabilities and then measuring how well the model predicts a held-out test set (#8)
C. once you've tuned on a test set enough times it stops being a fair test; that set becomes the development set (or devset), and you use a fresh test set going forward
D. NEVER let the model see a sentence from the test set while training it
E. perplexity (#9)
1) you're looking to minimize it throughout the experiment; lower perplexity means the model assigns higher probability to the test set (see the sketch below)
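a minimal sketch of the perplexity computation for a bigram model: perplexity is the inverse probability of the test set, normalized by the number of words, i.e. exp of the average negative log probability per word (the `bigram_prob` argument and the test sentence here are just illustrative):

```python
import math

def perplexity(test_tokens, bigram_prob):
    """PP(W) = P(w1..wN)^(-1/N), computed in log space to avoid underflow."""
    log_prob = 0.0
    n = 0
    for prev, word in zip(test_tokens, test_tokens[1:]):
        log_prob += math.log(bigram_prob(word, prev))
        n += 1
    return math.exp(-log_prob / n)

# the bigram probabilities quoted earlier, treated as a tiny lookup-table model
quoted = {
    ("<s>", "i"): 0.25, ("i", "want"): 0.33, ("want", "english"): 0.0011,
    ("english", "food"): 0.5, ("food", "</s>"): 0.68,
}
tokens = ["<s>", "i", "want", "english", "food", "</s>"]
print(perplexity(tokens, lambda w, prev: quoted[(prev, w)]))  # ~8.0; lower is better
```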