Here are interesting functions and classes with their most important methods.
nltk.clean_html(htmlstring) => a string
nltk.tokenwrap(tokens, separator=' ', width=70)
nltk.re_show(regexp, string)
    prints the string with all substrings matching regexp marked; returns nothing
nltk.data.find("path to a NLTK resource") => a file system path
nltk.data.load("path to a NLTK resource", format="auto") => an NLTK object
these objects can be different kinds of grammars, logic formulas, pickled objects, YAML objects, etc.
nltk.bigrams(sequence, pad_left=False, pad_right=False, pad_symbol=None) => list of bigram tuples
nltk.trigrams(sequence, pad_left=False, pad_right=False, pad_symbol=None) => list of trigram tuples
nltk.ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None) => list of ngram tuples
nltk.bigrams(seq) == nltk.ngrams(seq, 2)
nltk.trigrams(seq) == nltk.ngrams(seq, 3)
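For example (in the NLTK version documented here these functions return plain lists):
    >>> import nltk
    >>> nltk.bigrams(['a', 'rose', 'is', 'a', 'rose'])
    [('a', 'rose'), ('rose', 'is'), ('is', 'a'), ('a', 'rose')]
    >>> nltk.ngrams(['a', 'rose', 'is', 'a', 'rose'], 3)
    [('a', 'rose', 'is'), ('rose', 'is', 'a'), ('is', 'a', 'rose')]
    >>> nltk.bigrams(['a', 'rose'], pad_left=True, pad_symbol='<s>')
    [('<s>', 'a'), ('a', 'rose')]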
nltk.word_tokenize(string) => a list of strings
this is NLTK's default word tokenizer, only useful for English text
nltk.sent_tokenize(string) => a list of strings
this is NLTK's default sentence tokenizer, only useful for English text
nltk.regexp_tokenize(text, pattern) => a list of strings
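For example (made-up sentences):
    >>> import nltk
    >>> nltk.sent_tokenize("Hello there. How are you?")
    ['Hello there.', 'How are you?']
    >>> nltk.word_tokenize("How are you?")
    ['How', 'are', 'you', '?']
    >>> nltk.regexp_tokenize("A sentence, with commas.", r"\w+")
    ['A', 'sentence', 'with', 'commas']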
nltk.parse_cfg(grammar-string) => a grammar object
parses the string as a CFG and returns a nltk.ContextFreeGrammar
nltk.parse_fcfg(grammar-string) => a grammar object
parses the string as a FCFG and returns a nltk.FeatureGrammar
nltk.load_parser("path to a NLTK grammar", trace=0) => a parser object
loads the grammar and creates a parser for it, either a nltk.ChartParser or nltk.FeatureChartParser
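A small end-to-end sketch with a made-up toy grammar (Python 2 style, matching the NLTK version documented here):
    >>> import nltk
    >>> grammar = nltk.parse_cfg("""
    ... S -> NP VP
    ... NP -> 'John' | 'Mary'
    ... VP -> V NP
    ... V -> 'loves'
    ... """)
    >>> parser = nltk.ChartParser(grammar)
    >>> for tree in parser.nbest_parse('John loves Mary'.split()):
    ...     print tree
    (S (NP John) (VP (V loves) (NP Mary)))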
nltk.Text(tokens)
.concordance(word, width=79, lines=25)
uses nltk.ConcordanceIndex(tokens)
.findall(regexp)
uses nltk.TokenSearcher(tokens)
.similar(word, num=20)
.common_contexts(words, num=20)
both use nltk.ContextIndex(tokens)
.collocations(num=20, window_size=2)
uses nltk.BigramCollocationFinder.from_words(tokens, window_size)
with methods .apply_freq_filter, .apply_word_filter and .nbest
and nltk.BigramAssocMeasures().likelihood_ratio
.dispersion_plot(words)
uses nltk.draw.dispersion_plot(words)
.generate(length=100)
uses nltk.NgramModel and nltk.LidstoneProbDist
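For example, with the Gutenberg corpus (these methods print their results rather than return them):
    >>> import nltk
    >>> text = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
    >>> text.concordance('surprize', width=60, lines=3)   # prints 3 concordance lines
    >>> text.similar('happy', num=5)                      # prints distributionally similar words
    >>> text.collocations(num=10)                         # prints the 10 top-scoring collocations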
nltk.Index(sequence of (key, index) pairs)
an Index is a dictionary with list values: each key maps to a list of indices
the methods are the same as for an ordinary dictionary, except that key lookup always returns a list
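For example:
    >>> import nltk
    >>> words = ['the', 'cat', 'sat', 'on', 'the', 'mat']
    >>> index = nltk.Index((w, i) for (i, w) in enumerate(words))
    >>> index['the']
    [0, 4]
    >>> index['cat']
    [1]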
nltk.FreqDist(samples)
fdist[sample] => an integer: the number of occurrences
fdist1 < fdist2 => True if all frequencies in fdist1 are less than the corresponding frequencies in fdist2
for sample in fdist
loop through all samples in fdist, in frequency order
.keys() => a list of all samples: sorted in frequency order
.items() => a list of (sample, frequency) pairs: sorted in frequency order
.max() => the most common sample
fd.max() == fd.keys()[0]
.B() => an integer: the total number of unique sample values (or bins)
fdist.B() == len(fdist) == len(set(samples))
.N() => an integer: the total number of samples
fdist.N() == len(samples)
.Nr(r) => an integer: the number of samples with count r
fd.Nr(r) == sum(r==fd[sample] for sample in fd)
.hapaxes() => a list of all samples that occur only once
fdist.Nr(1) == len(fdist.hapaxes())
.freq(sample) => a float between 0.0 and 1.0: the relative frequency of the sample
fdist.freq(sample) == fdist[sample] / fdist.N()
.inc(sample)
increase the count for the given sample
.update(samples)
increases the counts for all given samples
.plot(title="some title", cumulative=False)
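For example (the samples are single characters here):
    >>> import nltk
    >>> fdist = nltk.FreqDist('abracadabra')
    >>> fdist['a'], fdist.N(), fdist.B()
    (5, 11, 5)
    >>> fdist.max()
    'a'
    >>> round(fdist.freq('a'), 3)
    0.455
    >>> sorted(fdist.hapaxes())
    ['c', 'd']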
nltk.ConditionalFreqDist(sequence of (condition, sample) pairs)
Note: Due to a bug in NLTK, you cannot loop like this: (for c in cfd). Instead you must loop like this: (for c in cfd.conditions()).
cfd[condition] => the FreqDist for the given condition
cfd[condition].inc(sample) increases the count for the given condition and sample
.conditions() => a list of all conditions
len(cfd) => an integer: the number of conditions
len(cfd) == len(cfd.conditions())
.N() => an integer: the total number of occurrences for all contained FreqDists
cfd.N() == sum(cfd[c].N() for c in cfd.conditions())
.tabulate(title="some title", conditions=[list,of,conditions], samples=[list,of,samples])
prints a table for the given conditions and samples
default is to use all conditions/samples, which can give a very large table
.plot(title="some title", conditions=[list,of,conditions], samples=[list,of,samples], cumulative=False)
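For example (made-up pairs):
    >>> import nltk
    >>> pairs = [('news', 'the'), ('news', 'dog'), ('romance', 'the')]
    >>> cfd = nltk.ConditionalFreqDist(pairs)
    >>> cfd.conditions()
    ['news', 'romance']
    >>> cfd['news']['the']
    1
    >>> cfd.N()
    3
    >>> cfd.tabulate(samples=['the', 'dog'])     # prints a small table of counts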
nltk.ConditionalProbDist(cfdist, probdist, bins=nr-of-bins)
Note: This is NLTK's way of creating conditional probabilities, P(A | B). You start with a cond.freq.dist cfdist from which you can get the counts C(B,A) = cfdist[B][A], and C(B) = cfdist[B].N(). The probdist then decides how to use these counts when calculating P(A | B):
MLEProbDist uses P(A | B) = C(B,A) / C(B) = cfdist[B][A] / cfdist[B].N()
LaplaceProbDist uses P(A | B) = (C(B,A)+1) / (C(B)+V) = (cfdist[B][A]+1) / (cfdist[B].N()+cfdist[B].B())
LidstoneProbDist, WittenBellProbDist and SimpleGoodTuringProbDist are other more advanced examples
the bins parameter is necessary whenever we want to calculate probabilities for events that are never seen before
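A minimal MLE sketch, using made-up word/tag counts so that P(tag | word) = C(word,tag) / C(word):
    >>> import nltk
    >>> tagged = [('the', 'DT'), ('dog', 'NN'), ('the', 'DT'), ('barks', 'VBZ')]
    >>> cfd = nltk.ConditionalFreqDist(tagged)
    >>> cpd = nltk.ConditionalProbDist(cfd, nltk.MLEProbDist)
    >>> cpd['the'].prob('DT')                    # C('the','DT') / C('the') = 2/2
    1.0
    >>> cpd['dog'].prob('NN')
    1.0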
nltk.MLEProbDist(freqdist, bins=None)
nltk.LaplaceProbDist(freqdist, bins=None)
nltk.LidstoneProbDist(freqdist, gamma, bins=None)
nltk.WittenBellProbDist(freqdist, bins=None)
nltk.SimpleGoodTuringProbDist(freqdist, bins=None)
.max() => the sample with the greatest probability
probdist.max() == freqdist.max(), if probdist is created from freqdist
.prob(sample) => a float between 0.0 and 1.0
note: this is not necessarily equal to freqdist.freq(sample); the two coincide only for MLEProbDist
.logprob(sample) => a float below 0.0, the base-2 logarithm of the probability
pd.logprob(sample) == math.log(pd.prob(sample), 2)
.samples() => a list of the samples in the underlying freqdist
probdist.samples() == freqdist.keys(), but perhaps in another order
.generate() => a randomly generated sample
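For example, Lidstone smoothing with gamma=0.5 and 26 bins (one per letter of the alphabet):
    >>> import nltk
    >>> fd = nltk.FreqDist('abracadabra')
    >>> pd = nltk.LidstoneProbDist(fd, 0.5, bins=26)
    >>> pd.max()
    'a'
    >>> round(pd.prob('a'), 3)                   # (5 + 0.5) / (11 + 0.5*26)
    0.229
    >>> round(pd.prob('z'), 3)                   # unseen samples still get some probability
    0.021
    >>> round(pd.logprob('a'), 3)
    -2.126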
nltk.BlanklineTokenizer()
nltk.WordPunctTokenizer()
nltk.RegexpTokenizer(pattern, ...)
nltk.PunktSentenceTokenizer(train_text=None, ...)
nltk.TreebankWordTokenizer()
.tokenize(string) => list of strings
.batch_tokenize(list-of-strings) => list of (lists of strings)
.span_tokenize(string) => iterator of (start-position,end-position) pairs
.batch_span_tokenize(list-of-strings) => iterator of (lists of (start-position,end-position) pairs)
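For example:
    >>> import nltk
    >>> tokenizer = nltk.WordPunctTokenizer()
    >>> tokenizer.tokenize("Good muffins cost $3.88.")
    ['Good', 'muffins', 'cost', '$', '3', '.', '88', '.']
    >>> list(tokenizer.span_tokenize("Good muffins"))
    [(0, 4), (5, 12)]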
nltk.DefaultTagger(tag)
nltk.AffixTagger(list-of-train-sentences, affix_length=-3, min_stem_length=2, backoff=None, cutoff=0, verbose=False)
nltk.RegexpTagger(list-of-regexps, backoff=None)
nltk.UnigramTagger(list-of-train-sentences, backoff=None, cutoff=0, verbose=False)
nltk.BigramTagger(list-of-train-sentences, backoff=None, cutoff=0, verbose=False)
nltk.TrigramTagger(list-of-train-sentences, backoff=None, cutoff=0, verbose=False)
nltk.NgramTagger(n, list-of-train-sentences, backoff=None, cutoff=0, verbose=False)
the list-of-regexps for the RegexpTagger is a list of (regexp, tag) pairs
.tag(list-of-tokens) => list of (token, tag) pairs
.batch_tag(list-of-sentences) => list of (lists of (token, tag) pairs)
.evaluate(list-of-gold-sentences) => float between 0.0 and 1.0
the input is a list of (lists of (token, tag) pairs)
.backoff => the given backoff tagger
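A typical backoff chain, trained on the Brown news category (the tags shown in the comment are only indicative):
    >>> import nltk
    >>> train = nltk.corpus.brown.tagged_sents(categories='news')
    >>> t0 = nltk.DefaultTagger('NN')
    >>> t1 = nltk.UnigramTagger(train, backoff=t0)
    >>> t2 = nltk.BigramTagger(train, backoff=t1)
    >>> tagged = t2.tag('the jury said nothing'.split())
    >>> # tagged is now a list of (token, tag) pairs, e.g. [('the', 'AT'), ...]
    >>> acc = t2.evaluate(train)                 # a float between 0.0 and 1.0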
nltk.Tree(description) or nltk.Tree(node, children)
NLTK trees are lists with a dedicated .node attribute.
all list methods work, such as .append(child), .sort(), .reverse()
list indexing works, tree[nr], tree[nr]=newtree
you can use a sequence of ints if you want to look up nested children: tree[i,j,k,l]
.node => string: the node value
the node can also be assigned to: tree.node = "whatever"
.copy(deep=False) => a (shallow or deep) copy of the tree
.height() => an integer: the height of the tree, i.e. the number of nodes on the longest path from the root down to a leaf (a tree whose children are all leaves has height 2)
.leaves() => a list containing the leaves of the tree
.leaf_treeposition(leafnr) => a sequence: the tree position of the nth leaf
tree[tree.leaf_treeposition(k)] == tree.leaves()[k]
.productions() => a list of grammar productions corresponding to the non-terminal nodes in the tree
.subtrees(filter=None) => generate all subtrees, optionally restricted to trees matching the filter function
.treepositions(order='preorder') => a list of all occupied positions (as sequences of ints) in the tree
order can be 'preorder', 'postorder', 'bothorder' or 'leaves'
.pprint(margin=70, indent=0, nodesep='', parens='()', quotes=False) => string
.draw()
if you want to draw several trees at once, use nltk.draw.tree.draw_trees(tree1, tree2, ...)
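For example:
    >>> import nltk
    >>> tree = nltk.Tree("(S (NP (DT the) (NN dog)) (VP (VBD barked)))")
    >>> tree.node
    'S'
    >>> tree.leaves()
    ['the', 'dog', 'barked']
    >>> tree.height()
    4
    >>> print tree[0, 1]
    (NN dog)
    >>> tree.leaf_treeposition(2)
    (1, 0, 0)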
nltk.ContextFreeGrammar(start, list-of-productions)
nltk.grammar.FeatureGrammar(start, list-of-productions)
the common ways of creating CFGs and FCFGs are by any of the following functions:
nltk.parse_cfg(grammar-string)
nltk.parse_fcfg(grammar-string)
nltk.data.load(path-to-grammar-file)
.start() => the starting symbol
.productions(lhs=None, rhs=None, empty=False) => a list of nltk.Production
if you specify the lhs, rhs, empty arguments, they act as filters that only return the productions that match.
.is_binarised() .is_chomsky_normal_form() .is_flexible_chomsky_normal_form() .is_lexical() .is_nonlexical() .is_nonempty() => boolean
nltk.Production(lhs, rhs)
.lhs() => a nltk.Nonterminal: the left-hand side
.rhs() => a list of (nltk.Nonterminal or string): the right-hand side
len(prod) => integer: the length of the right-hand side
.is_lexical() .is_nonlexical() => boolean
nltk.Nonterminal(string)
nltk.FeatStructNonterminal(string)
These are thin wrapper classes whose only purpose is to distinguish non-terminals from terminals (which are plain strings).
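For example, inspecting a made-up toy grammar:
    >>> import nltk
    >>> g = nltk.parse_cfg("""
    ... S -> NP VP
    ... NP -> 'John'
    ... VP -> 'sleeps'
    ... """)
    >>> print g.start()
    S
    >>> for prod in g.productions(lhs=nltk.Nonterminal('VP')):
    ...     print prod, len(prod), prod.is_lexical()
    VP -> 'sleeps' 1 True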
nltk.FeatStruct(feature-structure-description)
.copy(deep=True) => a (deep) copy of the feature structure
.cyclic() => boolean
.unify(other-feature-structure, trace=False, ...) => nltk.FeatStruct, or None if the feature structures are not unifiable
fs1.unify(fs2) is the same as nltk.unify(fs1, fs2)
for more information, try help(nltk.unify)
.subsumes(feature-structure) => boolean
fs1.subsumes(fs2) == (fs1.unify(fs2) == fs2)
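For example:
    >>> import nltk
    >>> fs1 = nltk.FeatStruct("[NUM='sg', PER=3]")
    >>> fs2 = nltk.FeatStruct("[PER=3, GND='fem']")
    >>> fs3 = fs1.unify(fs2)
    >>> fs3['GND'], fs3['NUM']
    ('fem', 'sg')
    >>> fs1.unify(nltk.FeatStruct("[NUM='pl']")) is None
    True
    >>> fs1.subsumes(fs3)
    True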
nltk.RegexpParser(regexp-grammar-string)
.parse(sentence, trace=None) => nltk.Tree, or None
sentence is a list of strings, or a nltk.Tree
.batch_parse(list-of-sentences) => list of (nltk.Tree or None)
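For example, a simple NP chunker over an already tagged (made-up) sentence:
    >>> import nltk
    >>> chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
    >>> sentence = [('the', 'DT'), ('little', 'JJ'), ('dog', 'NN'), ('barked', 'VBD')]
    >>> print chunker.parse(sentence)
    (S (NP the/DT little/JJ dog/NN) barked/VBD)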
nltk.ChartParser()
nltk.IncrementalChartParser()
nltk.FeatureChartParser()
(plus lots of other chart parser classes)
.grammar() => nltk.ContextFreeGrammar
.nbest_parse(sentence, n=None) => list of nltk.Tree
the default is to return all parse trees, but if n>0, at most n trees are returned
.parse(sentence) => nltk.Tree, or None
.chart_parse(sentence, trace=None) => nltk.Chart, or nltk.IncrementalChart
.batch_parse(list-of-sentences) => list of (nltk.Tree or None)
.batch_nbest_parse(list-of-sentences, n=None) => list of (lists of nltk.Tree)
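For example, with one of the feature grammars shipped in the NLTK data package:
    >>> import nltk
    >>> parser = nltk.load_parser('grammars/book_grammars/feat0.fcfg', trace=0)
    >>> trees = parser.nbest_parse('Kim likes children'.split())
    >>> print trees[0]                           # prints the feature-annotated parse tree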
nltk.corpus.xxx
For possible values of xxx, see table 2.2 in section 2.1, http://www.nltk.org/data, or below.
These are the common corpus methods, but note that they are not implemented by all corpora:
.readme() => a string: the contents of the corpus readme file
.root => the directory where the corpus is stored
.fileids() => the list of file ids (documents) that the corpus consists of
most methods below take an optional argument fileids=…, taking a single file id or a list of file ids
.categories() => the list of categories that the corpus consists of
not all corpora are categorized
for some corpora (e.g. reuters), each document can have several categories
for other corpora (e.g. brown), each document has exactly one category
for categorized corpora, most methods below take an optional argument categories=…, taking a single category or a list of categories
.raw() => a string: the whole corpus as a single string
Methods for plaintext corpora; e.g. abc, genesis, gutenberg, inaugural, machado, movie_reviews, reuters, shakespeare, treebank_raw, udhr, webtext.
.words() => a list of words
.sents() => a list of (list of words)
.paras() => a list of (list of (list of words))
Additional methods for tagged corpora; e.g. brown, indian, jeita, mac_morpho, nps_chat, pl196x, timit_tagged.
.tagged_words() => a list of (word, tag) pairs
.tagged_sents() => a list of (list of (word, tag) pairs)
.tagged_paras() => a list of (list of (list of (word, tag) pairs))
Further methods for parsed and/or chunked corpora. Parsed corpora include alpino, cess_cat, cess_esp, floresta, sinica_treebank, treebank. Dependency parsed corpora include conll2007, dependency_treebank. Chunked corpora include conll2000, conll2002, treebank_chunk.
.parsed_sents() => a list of nltk.Tree or nltk.DependencyGraph, depending on corpus
.chunked_sents() => a list of nltk.Tree
note that these trees are flat (not nested); read more in section 7.2
.iob_sents() => a list of (list of (word, tag, IOB) tuples)
read more about the IOB format in section 7.2
.chunked_paras() => a list of (list of nltk.Tree)
Methods for dictionaries and wordlists; e.g. cmudict, gazetteers, names, stopwords, swadesh.
.entries() => a list of tuples, depending on the corpus
.words() => a list of words
note that in this case, the word list is not ordered
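A few examples of the standard access methods (assuming the corresponding corpora have been installed with nltk.download()):
    >>> import nltk
    >>> nltk.corpus.gutenberg.fileids()[:2]
    ['austen-emma.txt', 'austen-persuasion.txt']
    >>> nltk.corpus.brown.words(categories='news')[:3]
    ['The', 'Fulton', 'County']
    >>> nltk.corpus.brown.tagged_words(categories='news')[:2]
    [('The', 'AT'), ('Fulton', 'NP-TL')]
    >>> nltk.corpus.treebank.parsed_sents()[0].node
    'S'
    >>> nltk.corpus.stopwords.words('english')[:3]
    ['i', 'me', 'my']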
Finally, there are more specialized corpora with "non-standard" methods, e.g.
comtrans contains word-aligned sentences; useful for machine translation
ieer contains chunked documents; useful for information extraction
ppattach is a PP attachment corpus
qc is a corpus for question classification
rte is a corpus for recognizing textual entailment (RTE)
senseval is a corpus for word sense disambiguation (WSD)
switchboard contains telephone dialogues
timit is a corpus of read speech containing audio data
wordnet, wordnet_ic and verbnet are semantic lexicons