Here is an NLTK dry-run of the teeny-weeny corpus exercise (the corpus is Swedish wordplay: "en såg såg en såg" reads roughly as "a saw saw a saw"):
>>> import nltk
>>> from functools import reduce
>>> corpus = "<s> en såg såg en såg en såg såg , en annan sågade sågen sågen såg . </s>".split()
>>> sentence = "<s> en sågade en såg </s>".split()
>>> vocabulary = set(corpus)
>>> len(vocabulary)
9
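# For the record, the nine vocabulary types:
>>> sorted(vocabulary)
[',', '.', '</s>', '<s>', 'annan', 'en', 'såg', 'sågade', 'sågen']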
>>> cfd = nltk.ConditionalFreqDist(nltk.bigrams(corpus))
# The corpus counts of each bigram in the sentence:
>>> [cfd[a][b] for (a,b) in nltk.bigrams(sentence)]
[1, 0, 0, 3, 0]
# The corpus count of each bigram's first word (the context):
>>> [cfd[a].N() for (a,b) in nltk.bigrams(sentence)]
[1, 4, 1, 4, 6]
# The MLE probability for each bigram: P(w2|w1) = C(w1,w2) / C(w1)
>>> [cfd[a][b] / cfd[a].N() for (a,b) in nltk.bigrams(sentence)]
[1.0, 0.0, 0.0, 0.75, 0.0]
# There is already a FreqDist method for MLE probability:
>>> [cfd[a].freq(b) for (a,b) in nltk.bigrams(sentence)]
[1.0, 0.0, 0.0, 0.75, 0.0]
# The probability of the sentence is the product of all bigram probabilities:
>>> reduce(lambda x, y: x * y, _)
0.0
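# The zero comes from bigrams that never occur in the corpus:
>>> [(a, b) for (a, b) in nltk.bigrams(sentence) if cfd[a][b] == 0]
[('en', 'sågade'), ('sågade', 'en'), ('såg', '</s>')]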
# A single unseen bigram zeroes out the whole product, hence the need for
# smoothing. Laplace (add-one) smoothing of each bigram count:
>>> [1 + cfd[a][b] for (a,b) in nltk.bigrams(sentence)]
[2, 1, 1, 4, 1]
# Normalise by the word count plus the vocabulary size V = 9 (each of the V
# possible successors gains one extra count):
>>> [len(vocabulary) + cfd[a].N() for (a,b) in nltk.bigrams(sentence)]
[10, 13, 10, 13, 15]
# The Laplace-smoothed probability for each bigram: P(w2|w1) = (C(w1,w2) + 1) / (C(w1) + V)
>>> [(1 + cfd[a][b]) / (len(vocabulary) + cfd[a].N()) for (a,b) in nltk.bigrams(sentence)]
[0.2, 0.07692307692307693, 0.1, 0.3076923076923077, 0.06666666666666667]
# The smoothed probability of the sentence:
>>> reduce(lambda x, y: x * y, _)
3.155818540433926e-05
# Or in human-readable form:
>>> print("%.10f" % _)
0.0000315582
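# For longer sentences this product quickly underflows; the usual fix is to
# sum log probabilities instead (a quick sketch on the same data):
>>> import math
>>> logprobs = [math.log((1 + cfd[a][b]) / (len(vocabulary) + cfd[a].N()), 2)
...             for (a,b) in nltk.bigrams(sentence)]
>>> print("%.4f" % sum(logprobs))
-14.9516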
Here is a more compact dry-run using NLTK's built-in ConditionalProbDist, MLEProbDist, and LaplaceProbDist classes:
# MLEProbDist is the unsmoothed probability distribution:
>>> cpd_mle = nltk.ConditionalProbDist(cfd, nltk.MLEProbDist, bins=len(vocabulary))
# Now we can get the MLE probabilities by using the .prob method:
>>> [cpd_mle[a].prob(b) for (a,b) in nltk.bigrams(sentence)]
[1.0, 0.0, 0.0, 0.75, 0.0]
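# As before, one unseen bigram drives the unsmoothed sentence probability to zero:
>>> reduce(lambda x, y: x * y, _)
0.0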
# LaplaceProbDist is the add-one smoothed distribution; bins tells it the
# vocabulary size V used in the denominator:
>>> cpd_laplace = nltk.ConditionalProbDist(cfd, nltk.LaplaceProbDist, bins=len(vocabulary))
# Getting the Laplace probabilities is the same as for MLE:
>>> [cpd_laplace[a].prob(b) for (a,b) in nltk.bigrams(sentence)]
[0.2, 0.07692307692307693, 0.1, 0.3076923076923077, 0.06666666666666667]
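# Sanity check: multiplying these reproduces the smoothed sentence probability
# we computed by hand above:
>>> print("%.10f" % reduce(lambda x, y: x * y, _))
0.0000315582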