Here are interesting functions and classes with their most important methods.
nltk.clean_html(htmlstring) => a string
nltk.tokenwrap(tokens, separator=' ', width=70)
nltk.re_show(regexp, string)
    prints the string with all substrings matching regexp marked; returns nothing
nltk.data.find("path to a NLTK resource") => a file system path
nltk.data.load("path to a NLTK resource", format="auto") => an NLTK object
these objects can be different kinds of grammars, logic formulas, pickled objects, YAML objects, etc.
nltk.bigrams(sequence, pad_left=False, pad_right=False, pad_symbol=None) => list of bigram tuples
nltk.trigrams(sequence, pad_left=False, pad_right=False, pad_symbol=None) => list of trigram tuples
nltk.ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None) => list of ngram tuples
nltk.bigrams(seq) == nltk.ngrams(seq, 2)
nltk.trigrams(seq) == nltk.ngrams(seq, 3)
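For example (in the NLTK version documented here these functions return plain lists):
    >>> import nltk
    >>> nltk.bigrams(['a', 'rose', 'is', 'a', 'rose'])
    [('a', 'rose'), ('rose', 'is'), ('is', 'a'), ('a', 'rose')]
    >>> nltk.ngrams(['a', 'rose', 'is', 'a', 'rose'], 3)
    [('a', 'rose', 'is'), ('rose', 'is', 'a'), ('is', 'a', 'rose')]
    >>> nltk.bigrams(['a', 'rose'], pad_left=True, pad_symbol='<s>')
    [('<s>', 'a'), ('a', 'rose')]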
nltk.word_tokenize(string) => a list of strings
this is NLTK's default word tokenizer, only useful for English text
nltk.sent_tokenize(string) => a list of strings
this is NLTK's default sentence tokenizer, only useful for English text
nltk.regexp_tokenize(text, pattern) => a list of strings
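For example (made-up sentences):
    >>> import nltk
    >>> nltk.sent_tokenize("Hello there. How are you?")
    ['Hello there.', 'How are you?']
    >>> nltk.word_tokenize("How are you?")
    ['How', 'are', 'you', '?']
    >>> nltk.regexp_tokenize("A sentence, with commas.", r"\w+")
    ['A', 'sentence', 'with', 'commas']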
nltk.parse_cfg(grammar-string) => a grammar object
parses the string as a CFG and returns a nltk.ContextFreeGrammar
nltk.parse_fcfg(grammar-string) => a grammar object
parses the string as a FCFG and returns a nltk.FeatureGrammar
nltk.load_parser("path to a NLTK grammar", trace=0) => a parser object
loads the grammar and creates a parser for it, either a nltk.ChartParser or nltk.FeatureChartParser
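A small end-to-end sketch with a made-up toy grammar (Python 2 style, matching the NLTK version documented here):
    >>> import nltk
    >>> grammar = nltk.parse_cfg("""
    ... S -> NP VP
    ... NP -> 'John' | 'Mary'
    ... VP -> V NP
    ... V -> 'loves'
    ... """)
    >>> parser = nltk.ChartParser(grammar)
    >>> for tree in parser.nbest_parse('John loves Mary'.split()):
    ...     print tree
    (S (NP John) (VP (V loves) (NP Mary)))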
nltk.Text(tokens)
.concordance(word, width=79, lines=25)
uses nltk.ConcordanceIndex(tokens)
.findall(regexp)
uses nltk.TokenSearcher(tokens)
.similar(word, num=20)
.common_contexts(words, num=20)
both use nltk.ContextIndex(tokens)
.collocations(num=20, window_size=2)
uses nltk.BigramCollocationFinder.from_words(tokens, window_size)
with methods .apply_freq_filter, .apply_word_filter and .nbest
and nltk.BigramAssocMeasures().likelihood_ratio
.dispersion_plot(words)
uses nltk.draw.dispersion_plot(words)
.generate(length=100)
uses nltk.NgramModel and nltk.LidstoneProbDist
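For example, with the Gutenberg corpus (these methods print their results rather than return them):
    >>> import nltk
    >>> text = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
    >>> text.concordance('surprize', width=60, lines=3)   # prints 3 concordance lines
    >>> text.similar('happy', num=5)                      # prints distributionally similar words
    >>> text.collocations(num=10)                         # prints the 10 top-scoring collocations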
nltk.Index(sequence of (key, index) pairs)
an Index is a dictionary with list values: each key maps to a list of indices
the methods are the same as for an ordinary dictionary, except that key lookup always returns a list
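For example:
    >>> import nltk
    >>> words = ['the', 'cat', 'sat', 'on', 'the', 'mat']
    >>> index = nltk.Index((w, i) for (i, w) in enumerate(words))
    >>> index['the']
    [0, 4]
    >>> index['cat']
    [1]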
nltk.FreqDist(samples)
fdist[sample] => an integer: the number of occurrences
fdist1 < fdist2 => True if all frequencies in fdist1 are less than the corresponding frequencies in fdist2
for sample in fdist
loop through all samples in fdist, in frequency order
.keys() => a list of all samples: sorted in frequency order
.items() => a list of (sample, frequency) pairs: sorted in frequency order
.max() => the most common sample
fd.max() == fd.keys()[0]
.B() => an integer: the total number of unique sample values (or bins)
fdist.B() == len(fdist) == len(set(samples))
.N() => an integer: the total number of samples
fdist.N() == len(samples)
.Nr(r) => an integer: the number of samples with count r
fd.Nr(r) == sum(r==fd[sample] for sample in fd)
.hapaxes() => a list of all samples that occur only once
fdist.Nr(1) == len(fdist.hapaxes())
.freq(sample) => a float between 0.0 and 1.0: the relative frequency of the sample
fdist.freq(sample) == fdist[sample] / fdist.N()
.inc(sample)
increase the count for the given sample
.update(samples)
increases the counts for all given samples
.plot(title="some title", cumulative=False)
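For example (the samples are single characters here):
    >>> import nltk
    >>> fdist = nltk.FreqDist('abracadabra')
    >>> fdist['a'], fdist.N(), fdist.B()
    (5, 11, 5)
    >>> fdist.max()
    'a'
    >>> round(fdist.freq('a'), 3)
    0.455
    >>> sorted(fdist.hapaxes())
    ['c', 'd']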
nltk.ConditionalFreqDist(sequence of (condition, sample) pairs)
Note: Due to a bug in NLTK, you cannot loop like this: (for c in cfd). Instead you must loop like this: (for c in cfd.conditions()).
cfd[condition] => the FreqDist for the given condition
cfd[condition].inc(sample) increases the count for the given condition and sample
.conditions() => a list of all conditions
len(cfd) => an integer: the number of conditions
len(cfd) == len(cfd.conditions())
.N() => an integer: the total number of occurrences for all contained FreqDists
cfd.N() == sum(cfd[c].N() for c in cfd.conditions())
.tabulate(title="some title", conditions=[list,of,conditions], samples=[list,of,samples])
prints a table for the given conditions and samples
default is to use all conditions/samples, which can give a very large table
.plot(title="some title", conditions=[list,of,conditions], samples=[list,of,samples], cumulative=False)
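For example (made-up pairs):
    >>> import nltk
    >>> pairs = [('news', 'the'), ('news', 'dog'), ('romance', 'the')]
    >>> cfd = nltk.ConditionalFreqDist(pairs)
    >>> cfd.conditions()
    ['news', 'romance']
    >>> cfd['news']['the']
    1
    >>> cfd.N()
    3
    >>> cfd.tabulate(samples=['the', 'dog'])     # prints a small table of counts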
nltk.ConditionalProbDist(cfdist, probdist, bins=nr-of-bins)
Note: This is NLTK's way of creating conditional probabilities, P(A | B). You start with a cond.freq.dist cfdist from which you can get the counts C(B,A) = cfdist[B][A], and C(B) = cfdist[B].N(). The probdist then decides how to use these counts when calculating P(A | B):
MLEProbDist uses P(A | B) = C(B,A) / C(B) = cfdist[B][A] / cfdist[B].N()
LaplaceProbDist uses P(A | B) = (C(B,A)+1) / (C(B)+V) = (cfdist[B][A]+1) / (cfdist[B].N()+cfdist[B].B())
LidstoneProbDist, WittenBellProbDist and SimpleGoodTuringProbDist are other more advanced examples
the bins parameter is necessary whenever we want to calculate probabilities for events that are never seen before
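A minimal MLE sketch, using made-up word/tag counts so that P(tag | word) = C(word,tag) / C(word):
    >>> import nltk
    >>> tagged = [('the', 'DT'), ('dog', 'NN'), ('the', 'DT'), ('barks', 'VBZ')]
    >>> cfd = nltk.ConditionalFreqDist(tagged)
    >>> cpd = nltk.ConditionalProbDist(cfd, nltk.MLEProbDist)
    >>> cpd['the'].prob('DT')                    # C('the','DT') / C('the') = 2/2
    1.0
    >>> cpd['dog'].prob('NN')
    1.0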
nltk.MLEProbDist(freqdist, bins=None)
nltk.LaplaceProbDist(freqdist, bins=None)
nltk.LidstoneProbDist(freqdist, gamma, bins=None)
nltk.WittenBellProbDist(freqdist, bins=None)
nltk.SimpleGoodTuringProbDist(freqdist, bins=None)
.max() => the sample with the greatest probability
probdist.max() == freqdist.max(), if probdist is created from freqdist
.prob(sample) => a float between 0.0 and 1.0
note: this is not necessarily equal to freqdist.freq(sample); the two coincide only for MLEProbDist
.logprob(sample) => a float below 0.0, the base-2 logarithm of the probability
pd.logprob(sample) == math.log(pd.prob(sample), 2)
.samples() => a list of the samples in the underlying freqdist
probdist.samples() == freqdist.keys(), but perhaps in another order
.generate() => a randomly generated sample
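For example, Lidstone smoothing with gamma=0.5 and 26 bins (one per letter of the alphabet):
    >>> import nltk
    >>> fd = nltk.FreqDist('abracadabra')
    >>> pd = nltk.LidstoneProbDist(fd, 0.5, bins=26)
    >>> pd.max()
    'a'
    >>> round(pd.prob('a'), 3)                   # (5 + 0.5) / (11 + 0.5*26)
    0.229
    >>> round(pd.prob('z'), 3)                   # unseen samples still get some probability
    0.021
    >>> round(pd.logprob('a'), 3)
    -2.126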
nltk.BlanklineTokenizer()
nltk.WordPunctTokenizer()
nltk.RegexpTokenizer(pattern, ...)
nltk.PunktSentenceTokenizer(train_text=None, ...)
nltk.TreebankWordTokenizer()
.tokenize(string) => list of strings
.batch_tokenize(list-of-strings) => list of (lists of strings)
.span_tokenize(string) => iterator of (start-position,end-position) pairs
.batch_span_tokenize(list-of-strings) => iterator of (lists of (start-position,end-position) pairs)
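For example:
    >>> import nltk
    >>> tokenizer = nltk.WordPunctTokenizer()
    >>> tokenizer.tokenize("Good muffins cost $3.88.")
    ['Good', 'muffins', 'cost', '$', '3', '.', '88', '.']
    >>> list(tokenizer.span_tokenize("Good muffins"))
    [(0, 4), (5, 12)]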
nltk.DefaultTagger(tag)
nltk.AffixTagger(list-of-train-sentences, affix_length=-3, min_stem_length=2, backoff=None, cutoff=0, verbose=False)
nltk.RegexpTagger(list-of-regexps, backoff=None)
nltk.UnigramTagger(list-of-train-sentences, backoff=None, cutoff=0, verbose=False)
nltk.BigramTagger(list-of-train-sentences, backoff=None, cutoff=0, verbose=False)
nltk.TrigramTagger(list-of-train-sentences, backoff=None, cutoff=0, verbose=False)
nltk.NgramTagger(n, list-of-train-sentences, backoff=None, cutoff=0, verbose=False)
the list-of-regexps for the RegexpTagger is a list of (regexp, tag) pairs
.tag(list-of-tokens) => list of (token, tag) pairs
.batch_tag(list-of-sentences) => list of (lists of (token, tag) pairs)
.evaluate(list-of-gold-sentences) => float between 0.0 and 1.0
the input is a list of (lists of (token, tag) pairs)
.backoff => the given backoff tagger
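A typical backoff chain, trained on the Brown news category (the tags shown in the comment are only indicative):
    >>> import nltk
    >>> train = nltk.corpus.brown.tagged_sents(categories='news')
    >>> t0 = nltk.DefaultTagger('NN')
    >>> t1 = nltk.UnigramTagger(train, backoff=t0)
    >>> t2 = nltk.BigramTagger(train, backoff=t1)
    >>> tagged = t2.tag('the jury said nothing'.split())
    >>> # tagged is now a list of (token, tag) pairs, e.g. [('the', 'AT'), ...]
    >>> acc = t2.evaluate(train)                 # a float between 0.0 and 1.0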
nltk.Tree(description) or nltk.Tree(node, children)
NLTK trees are lists with a dedicated .node attribute.
all list methods work, such as .append(child), .sort(), .reverse()
list indexing works, tree[nr], tree[nr]=newtree
you can use a sequence of ints if you want to look up nested children: tree[i,j,k,l]
.node => string: the node value
the node can also be assigned to: tree.node = "whatever"
.copy(deep=False) => a (shallow or deep) copy of the tree
.height() => an integer: the height of the tree, i.e. the number of nodes on the longest path from the root down to a leaf (a tree whose children are all leaves has height 2)
.leaves() => a list containing the leaves of the tree
.leaf_treeposition(leafnr) => a sequence: the tree position of the nth leaf
tree[tree.leaf_treeposition(k)] == tree.leaves()[k]
.productions() => a list of grammar productions corresponding to the non-terminal nodes in the tree
.subtrees(filter=None) => generate all subtrees, optionally restricted to trees matching the filter function
.treepositions(order='preorder') => a list of all occupied positions (as sequences of ints) in the tree
order can be 'preorder', 'postorder', 'bothorder' or 'leaves'
.pprint(margin=70, indent=0, nodesep='', parens='()', quotes=False) => string
.draw()
if you want to draw several trees at once, use nltk.draw.tree.draw_trees(tree1, tree2, ...)
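For example:
    >>> import nltk
    >>> tree = nltk.Tree("(S (NP (DT the) (NN dog)) (VP (VBD barked)))")
    >>> tree.node
    'S'
    >>> tree.leaves()
    ['the', 'dog', 'barked']
    >>> tree.height()
    4
    >>> print tree[0, 1]
    (NN dog)
    >>> tree.leaf_treeposition(2)
    (1, 0, 0)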
nltk.ContextFreeGrammar(start, list-of-productions)
nltk.grammar.FeatureGrammar(start, list-of-productions)
the common ways of creating CFGs and FCFGs are by any of the following functions:
nltk.parse_cfg(grammar-string)
nltk.parse_fcfg(grammar-string)
nltk.data.load(path-to-grammar-file)
.start() => the starting symbol
.productions(lhs=None, rhs=None, empty=False) => a list of nltk.Production
if you specify the lhs, rhs, empty arguments, they act as filters that only return the productions that match.
.is_binarised() .is_chomsky_normal_form() .is_flexible_chomsky_normal_form() .is_lexical() .is_nonlexical() .is_nonempty() => boolean
nltk.Production(lhs, rhs)
.lhs() => a nltk.Nonterminal: the left-hand side
.rhs() => a list of (nltk.Nonterminal or string): the right-hand side
len(prod) => integer: the length of the right-hand side
.is_lexical() .is_nonlexical() => boolean
nltk.Nonterminal(string)
nltk.FeatStructNonterminal(string)
These are thin wrapper classes whose only purpose is to distinguish non-terminals from terminals (which are plain strings).
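For example, inspecting a made-up toy grammar:
    >>> import nltk
    >>> g = nltk.parse_cfg("""
    ... S -> NP VP
    ... NP -> 'John'
    ... VP -> 'sleeps'
    ... """)
    >>> print g.start()
    S
    >>> for prod in g.productions(lhs=nltk.Nonterminal('VP')):
    ...     print prod, len(prod), prod.is_lexical()
    VP -> 'sleeps' 1 True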
nltk.FeatStruct(feature-structure-description)
.copy(deep=True) => a (deep) copy of the feature structure
.cyclic() => boolean
.unify(other-feature-structure, trace=False, ...) => nltk.FeatStruct, or None if the feature structures are not unifiable
fs1.unify(fs2) is the same as nltk.unify(fs1, fs2)
for more information, try help(nltk.unify)
.subsumes(feature-structure) => boolean
fs1.subsumes(fs2) == (fs1.unify(fs2) == fs2)
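For example:
    >>> import nltk
    >>> fs1 = nltk.FeatStruct("[NUM='sg', PER=3]")
    >>> fs2 = nltk.FeatStruct("[PER=3, GND='fem']")
    >>> fs3 = fs1.unify(fs2)
    >>> fs3['GND'], fs3['NUM']
    ('fem', 'sg')
    >>> fs1.unify(nltk.FeatStruct("[NUM='pl']")) is None
    True
    >>> fs1.subsumes(fs3)
    True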
nltk.RegexpParser(regexp-grammar-string)
.parse(sentence, trace=None) => nltk.Tree, or None
sentence is a list of strings, or a nltk.Tree
.batch_parse(list-of-sentences) => list of (nltk.Tree or None)
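For example, a simple NP chunker over an already tagged (made-up) sentence:
    >>> import nltk
    >>> chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
    >>> sentence = [('the', 'DT'), ('little', 'JJ'), ('dog', 'NN'), ('barked', 'VBD')]
    >>> print chunker.parse(sentence)
    (S (NP the/DT little/JJ dog/NN) barked/VBD)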
nltk.ChartParser()
nltk.IncrementalChartParser()
nltk.FeatureChartParser()
(plus lots of other chart parser classes)
.grammar() => nltk.ContextFreeGrammar
.nbest_parse(sentence, n=None) => list of nltk.Tree
the default is to return all parse trees, but if n>0, at most n trees are returned
.parse(sentence) => nltk.Tree, or None
.chart_parse(sentence, trace=None) => nltk.Chart, or nltk.IncrementalChart
.batch_parse(list-of-sentences) => list of (nltk.Tree or None)
.batch_nbest_parse(list-of-sentences, n=None) => list of (lists of nltk.Tree)
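For example, with one of the feature grammars shipped in the NLTK data package:
    >>> import nltk
    >>> parser = nltk.load_parser('grammars/book_grammars/feat0.fcfg', trace=0)
    >>> trees = parser.nbest_parse('Kim likes children'.split())
    >>> print trees[0]                           # prints the feature-annotated parse tree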
nltk.corpus.xxx
For possible values of xxx, see table 2.2 in section 2.1, http://www.nltk.org/data, or below.
These are the common corpus methods, but note that they are not implemented by all corpora:
.readme() => a string: the contents of the corpus readme file
.root => the directory where the corpus is stored
.fileids() => the list of file ids (documents) that the corpus consists of
most methods below take an optional argument fileids=…, taking a single file id or a list of file ids
.categories() => the list of categories that the corpus consists of
not all corpora are categorized
for some corpora (e.g. reuters), each document can have several categories
for other corpora (e.g. brown), each document has exactly one category
for categorized corpora, most methods below take an optional argument categories=…, taking a single category or a list of categories
.raw() => a string: the whole corpus as a single string
Methods for plaintext corpora; e.g. abc, genesis, gutenberg, inaugural, machado, movie_reviews, reuters, shakespeare, treebank_raw, udhr, webtext.
.words() => a list of words
.sents() => a list of (list of words)
.paras() => a list of (list of (list of words))
Additional methods for tagged corpora; e.g. brown, indian, jeita, mac_morpho, nps_chat, pl196x, timit_tagged.
.tagged_words() => a list of (word, tag) pairs
.tagged_sents() => a list of (list of (word, tag) pairs)
.tagged_paras() => a list of (list of (list of (word, tag) pairs))
Further methods for parsed and/or chunked corpora. Parsed corpora include alpino, cess_cat, cess_esp, floresta, sinica_treebank, treebank. Dependency parsed corpora include conll2007, dependency_treebank. Chunked corpora include conll2000, conll2002, treebank_chunk.
.parsed_sents() => a list of nltk.Tree or nltk.DependencyGraph, depending on corpus
.chunked_sents() => a list of nltk.Tree
note that these trees are flat (not nested); read more in section 7.2
.iob_sents() => a list of (list of (word, tag, IOB) tuples)
read more about the IOB format in section 7.2
.chunked_paras() => a list of (list of nltk.Tree)
Methods for dictionaries and wordlists; e.g. cmudict, gazetteers, names, stopwords, swadesh.
.entries() => a list of tuples, depending on the corpus
.words() => a list of words
note that in this case, the word list is not ordered
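A few examples of the standard access methods (assuming the corresponding corpora have been installed with nltk.download()):
    >>> import nltk
    >>> nltk.corpus.gutenberg.fileids()[:2]
    ['austen-emma.txt', 'austen-persuasion.txt']
    >>> nltk.corpus.brown.words(categories='news')[:3]
    ['The', 'Fulton', 'County']
    >>> nltk.corpus.brown.tagged_words(categories='news')[:2]
    [('The', 'AT'), ('Fulton', 'NP-TL')]
    >>> nltk.corpus.treebank.parsed_sents()[0].node
    'S'
    >>> nltk.corpus.stopwords.words('english')[:3]
    ['i', 'me', 'my']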
Finally, there are more specialized corpora with "non-standard" methods, e.g.
comtrans contains word-aligned sentences; useful for machine translation
ieer contains chunked documents; useful for information extraction
ppattach is a PP attachment corpus
qc is a corpus for question classification
rte is a corpus for recognizing textual entailment (RTE)
senseval is a corpus for word sense disambiguation (WSD)
switchboard contains telephone dialogues
timit is a corpus of read speech containing audio data
wordnet, wordnet_ic and verbnet are semantic lexicons