Here are some suggested exercises from the NLTK book.
Easy: 6, 7, 12, 13, 16
Intermediate: 19, 20, 24, 26, 27, 28
Easy: 4
Intermediate: 8, 15, 16, 17, 18, 22
Difficult: 23, 25
Random trigram generation
In the Nov. 7 lecture, we modified example 2.5 to do random bigram generation, like this:
import random

def generate_model(cfdist, word, num=15):
    for i in range(num):
        print(word, end=' ')
        word = generate_next(cfdist[word])

def generate_next(fdist):
    # Pick a word at random, weighted by its frequency in fdist.
    rnd = random.randrange(fdist.N())
    ctr = 0
    for word in fdist:
        ctr += fdist[word]
        if ctr > rnd:
            return word

>>> text = nltk.corpus.genesis.words('english-kjv.txt')
>>> bigrams = nltk.bigrams(text)
>>> cfd = nltk.ConditionalFreqDist(bigrams)
>>> generate_model(cfd, 'living')
The exercise is to modify the two functions to do trigram generation instead. You then need a CFD created from ((word1, word2), word3) tuples, and the generate_model function needs to remember the last two words.
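One possible shape for the trigram version is sketched below. This is only an illustration, not the expected solution: it uses plain Counter/defaultdict instead of NLTK's FreqDist/ConditionalFreqDist so it runs standalone, and the names (build_trigram_cfd, context) are my own.

```python
import random
from collections import Counter, defaultdict

def build_trigram_cfd(words):
    # Condition on the last two words: one ((w1, w2), w3) pair per trigram.
    cfd = defaultdict(Counter)
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        cfd[(w1, w2)][w3] += 1
    return cfd

def generate_next(fdist):
    # Weighted random choice, same idea as in the bigram version.
    rnd = random.randrange(sum(fdist.values()))
    ctr = 0
    for word, count in fdist.items():
        ctr += count
        if ctr > rnd:
            return word

def generate_model(cfd, context, num=15):
    # context is the (word1, word2) pair; shift it one word after each step.
    for i in range(num):
        print(context[0], end=' ')
        context = (context[1], generate_next(cfd[context]))
```

With NLTK you would instead build the CFD with nltk.ConditionalFreqDist over the ((w1, w2), w3) pairs and keep fdist.N() in generate_next.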
Easy: 6, 7, 8, 10, 14, 15
Intermediate: 19, 23, 25, 27, 29
Difficult: 39, 41
Word segmentation
Try out the word segmentation implementation in section 3.8 on a bigger example corpus, e.g. the first N words of the Brown corpus. (Start with a relatively small value of N, such as 100, then increase it successively until the search takes too long.)
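One way to build the input for the section 3.8 segmenter from a word list is sketched below. The helper name make_segmentation_problem is my own; the boundary-string convention follows the book, where segs[i] == '1' marks a word break after text[i], so segs is one character shorter than text.

```python
def make_segmentation_problem(words):
    # Join the words into an unsegmented character stream, and build the
    # gold-standard boundary string: '1' after the last character of each
    # word except the final one, '0' elsewhere.
    text = ''.join(words)
    segs = ''.join('0' * (len(w) - 1) + '1' for w in words)[:-1]
    return text, segs

# For example, with the first N Brown words (requires the NLTK data download):
#   import nltk
#   text, segs = make_segmentation_problem(nltk.corpus.brown.words()[:100])
```

The returned segs string gives you the gold segmentation to compare against, while text is what you hand to the annealing search from the book.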
Easy: 2, 3, 7, 9, 10
Intermediate: 12, 13, 15, 16, 17, 21
Difficult: 29
Intermediate: 14, 15, 17, 19, 20, 29
Easy: 2, 4, 5
Easy: 1, 2
Intermediate: 4, 5
Difficult: 11, 12
Easy: 4, 5, 6, 7, 9, 13
Intermediate: 24
Difficult: 30, 35
Easy: 1, 2
Intermediate: 6, 8, 9, 12