
Dive into NLTK Part I

NLTK is the best-known Python Natural Language Processing toolkit, and here I will give a detailed tutorial about it. This is the first article in a series where I will write about everything NLTK can do with Python, especially for text mining and text analysis.

This is the first article in the series “Dive Into NLTK”; here is an index of all the articles in the series that have been published to date:

Part I: Getting Started with NLTK (this article)
Part II: Sentence Tokenize and Word Tokenize
Part III: Part-Of-Speech Tagging and POS Tagger
Part IV: Stemming and Lemmatization
Part V: Using Stanford Text Analysis Tools in Python
Part VI: Add Stanford Word Segmenter Interface for Python NLTK
Part VII: A Preliminary Study on Text Classification
Part VIII: Using External Maximum Entropy Modeling Libraries for Text Classification
Part IX: From Text Classification to Sentiment Analysis
Part X: Play With Word2Vec Models based on NLTK Corpus

About NLTK

Here is a description from the NLTK official site:

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

Installing NLTK

The following steps were tested on my Mac OS machine and on a VPS running Ubuntu 12.04; they only require Python 2.6 or Python 2.7, but I didn’t test them on a Windows computer. I also assume you can write some Python code, and familiarity with Python modules and packages is recommended. Here are the steps to install NLTK on Mac/Unix:

Install Setuptools: http://pypi.python.org/pypi/setuptools
Install Pip: run sudo easy_install pip
Install Numpy (optional): run sudo pip install -U numpy
Install PyYAML and NLTK: run sudo pip install -U pyyaml nltk
Test installation: run python then type import nltk

Installing NLTK Data
After installing NLTK, you need to install NLTK Data, which includes a lot of corpora, grammars, models, and so on. Without NLTK Data, NLTK is not much use. You can find the complete NLTK Data list here: http://nltk.org/nltk_data/

The simplest way to install NLTK Data is to run the Python interpreter and type the following commands (this example was run on Mac OS):
[GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
Type “help”, “copyright”, “credits” or “license” for more information.
>>> import nltk
>>> nltk.download()

A new window should open, showing the NLTK Downloader; this is on Mac, and it should look much the same on Windows:

[Screenshot: the NLTK Downloader window on Mac OS]

Click on the File menu and select Change Download Directory if needed, then select the packages or collections you want to download; we suggest you select “all” and download everything NLTK needs.
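
If you prefer to skip the GUI, nltk.download() also accepts a package or collection identifier directly, so you can fetch everything or just a single package from code (a quick sketch; the 'punkt' sentence tokenizer models are used later in this series):

>>> import nltk
>>> nltk.download('all')      # same as selecting "all" in the downloader window
>>> nltk.download('punkt')    # or download a single package by its identifier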

If you install NLTK Data on a Linux VPS with no graphical interface, no window will open, but you can still use the nltk.download() command above; just follow these steps in the text-mode downloader to fetch all of nltk_data:

Python 2.7.3 (default, Sep 26 2013, 20:03:06)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download()
NLTK Downloader
---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------
Download which package (l=list; x=cancel)?
Downloader> l

Packages:
[*] maxent_ne_chunker... ACE Named Entity Chunker (Maximum entropy)
[*] abc................. Australian Broadcasting Commission 2006
[*] alpino.............. Alpino Dutch Treebank
[*] biocreative_ppi..... BioCreAtIvE (Critical Assessment of Information
Extraction Systems in Biology)
[*] brown............... Brown Corpus
[*] brown_tei........... Brown Corpus (TEI XML Version)
[*] cess_cat............ CESS-CAT Treebank
[*] cess_esp............ CESS-ESP Treebank
[*] chat80.............. Chat-80 Data Files
[*] city_database....... City Database
[*] cmudict............. The Carnegie Mellon Pronouncing Dictionary (0.6)
[*] comtrans............ ComTrans Corpus Sample
[*] conll2000........... CONLL 2000 Chunking Corpus
[*] conll2002........... CONLL 2002 Named Entity Recognition Corpus
[*] conll2007........... Dependency Treebanks from CoNLL 2007 (Catalan
and Basque Subset)
[*] dependency_treebank. Dependency Parsed Treebank
[*] europarl_raw........ Sample European Parliament Proceedings Parallel
Corpus
Hit Enter to continue:
....
Downloader> d

Download which package (l=list; x=cancel)?
Identifier> all

If you have downloaded everything (corpora, models, grammars) NLTK needs, you can test it by running:

Downloader> u

If it shows “Nothing to update”, everything is OK.

Another way to install NLTK Data is from the command line. I didn’t test this way; the following is from the official site:

Python 2.5-2.7: Run the command python -m nltk.downloader all. To ensure central installation, run the command sudo python -m nltk.downloader -d /usr/share/nltk_data all.

If you run into problems when downloading NLTK Data, such as timeouts or other strange errors, I suggest you download the NLTK data directly from the nltk_data GitHub page:

https://github.com/nltk/nltk_data

The README says that “NLTK Data lives in the gh-pages branch of this repository”, so you can browse that branch:

https://github.com/nltk/nltk_data/tree/master

Download the zip file and unzip it, then copy the six sub-directories under packages into your nltk_data directory: chunkers, corpora, help, stemmers, taggers, tokenizers.

This may be the best unofficial way to install NLTK Data.
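
If you unpack the data to a non-standard location, you can tell NLTK where to find it, either by setting the NLTK_DATA environment variable or by appending the directory to nltk.data.path at runtime. A minimal sketch (the path below is just an example; use the directory that contains corpora, tokenizers, taggers and so on):

>>> import nltk.data
>>> nltk.data.path.append('/home/textminer/nltk_data')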

Test NLTK

1) Test Brown Corpus:

>>> from nltk.corpus import brown
>>> brown.words()[0:10]
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']
>>> brown.tagged_words()[0:10]
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN')]
>>> len(brown.words())
1161192
>>> dir(brown)
['__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_add', '_c2f', '_delimiter', '_encoding', '_f2c', '_file', '_fileids', '_get_root', '_init', '_map', '_para_block_reader', '_pattern', '_resolve', '_root', '_sent_tokenizer', '_sep', '_tag_mapping_function', '_word_tokenizer', 'abspath', 'abspaths', 'categories', 'encoding', 'fileids', 'open', 'paras', 'raw', 'readme', 'root', 'sents', 'tagged_paras', 'tagged_sents', 'tagged_words', 'words']
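
The Brown Corpus is also organized into categories (genres), which the same corpus reader exposes; here is a quick sketch of the category API (output omitted):

>>> brown.categories()                            # the list of Brown genres, e.g. 'news', 'fiction', ...
>>> brown.words(categories='news')[0:10]          # words restricted to a single genre
>>> brown.sents(categories=['news', 'editorial']) # sentences from several genres at once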

2) Test NLTK Book Resources:

>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

>>> dir(text1)
['_CONTEXT_RE', '_COPY_TOKENS', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__len__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_collocations', '_context', '_num', '_vocab', '_window_size', 'collocations', 'common_contexts', 'concordance', 'count', 'dispersion_plot', 'findall', 'generate', 'index', 'name', 'plot', 'readability', 'similar', 'tokens', 'vocab']
>>> len(text1)
260819
>>> text1.collocations()
Sperm Whale; Moby Dick; White Whale; old man; Captain Ahab; sperm
whale; Right Whale; Captain Peleg; New Bedford; Cape Horn; cried Ahab;
years ago; lower jaw; never mind; Father Mapple; cried Stubb; chief
mate; white whale; ivory leg; one hand
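
The Text objects loaded by nltk.book offer several other exploration methods listed in dir(text1) above; for example, concordance and similar print their results directly (a small sketch, output omitted):

>>> text1.concordance('whale')   # every occurrence of 'whale' with its surrounding context
>>> text1.similar('whale')       # words that appear in similar contexts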

3) Sentence Tokenize (sentence boundary detection, sentence segmentation), Word Tokenize and POS Tagging:

>>> from nltk import sent_tokenize, word_tokenize, pos_tag
>>> text = "Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress towards human-level AI. In this class, you will learn about the most effective machine learning techniques, and gain practice implementing them and getting them to work for yourself. More importantly, you'll learn about not only the theoretical underpinnings of learning, but also gain the practical know-how needed to quickly and powerfully apply these techniques to new problems. Finally, you'll learn about some of Silicon Valley's best practices in innovation as it pertains to machine learning and AI."
>>> sents = sent_tokenize(text)
>>> sents
['Machine learning is the science of getting computers to act without being explicitly programmed.', 'In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome.', 'Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it.', 'Many researchers also think it is the best way to make progress towards human-level AI.', 'In this class, you will learn about the most effective machine learning techniques, and gain practice implementing them and getting them to work for yourself.', "More importantly, you'll learn about not only the theoretical underpinnings of learning, but also gain the practical know-how needed to quickly and powerfully apply these techniques to new problems.", "Finally, you'll learn about some of Silicon Valley's best practices in innovation as it pertains to machine learning and AI."]
>>> len(sents)
7
>>> tokens = word_tokenize(text)
>>> tokens
['Machine', 'learning', 'is', 'the', 'science', 'of', 'getting', 'computers', 'to', 'act', 'without', 'being', 'explicitly', 'programmed.', 'In', 'the', 'past', 'decade', ',', 'machine', 'learning', 'has', 'given', 'us', 'self-driving', 'cars', ',', 'practical', 'speech', 'recognition', ',', 'effective', 'web', 'search', ',', 'and', 'a', 'vastly', 'improved', 'understanding', 'of', 'the', 'human', 'genome.', 'Machine', 'learning', 'is', 'so', 'pervasive', 'today', 'that', 'you', 'probably', 'use', 'it', 'dozens', 'of', 'times', 'a', 'day', 'without', 'knowing', 'it.', 'Many', 'researchers', 'also', 'think', 'it', 'is', 'the', 'best', 'way', 'to', 'make', 'progress', 'towards', 'human-level', 'AI.', 'In', 'this', 'class', ',', 'you', 'will', 'learn', 'about', 'the', 'most', 'effective', 'machine', 'learning', 'techniques', ',', 'and', 'gain', 'practice', 'implementing', 'them', 'and', 'getting', 'them', 'to', 'work', 'for', 'yourself.', 'More', 'importantly', ',', 'you', "'ll", 'learn', 'about', 'not', 'only', 'the', 'theoretical', 'underpinnings', 'of', 'learning', ',', 'but', 'also', 'gain', 'the', 'practical', 'know-how', 'needed', 'to', 'quickly', 'and', 'powerfully', 'apply', 'these', 'techniques', 'to', 'new', 'problems.', 'Finally', ',', 'you', "'ll", 'learn', 'about', 'some', 'of', 'Silicon', 'Valley', "'s", 'best', 'practices', 'in', 'innovation', 'as', 'it', 'pertains', 'to', 'machine', 'learning', 'and', 'AI', '.']
>>> len(tokens)
161
>>> tagged_tokens = pos_tag(tokens)
>>> tagged_tokens
[('Machine', 'NN'), ('learning', 'NN'), ('is', 'VBZ'), ('the', 'DT'), ('science', 'NN'), ('of', 'IN'), ('getting', 'VBG'), ('computers', 'NNS'), ('to', 'TO'), ('act', 'VB'), ('without', 'IN'), ('being', 'VBG'), ('explicitly', 'RB'), ('programmed.', 'NNP'), ('In', 'NNP'), ('the', 'DT'), ('past', 'JJ'), ('decade', 'NN'), (',', ','), ('machine', 'NN'), ('learning', 'NN'), ('has', 'VBZ'), ('given', 'VBN'), ('us', 'PRP'), ('self-driving', 'JJ'), ('cars', 'NNS'), (',', ','), ('practical', 'JJ'), ('speech', 'NN'), ('recognition', 'NN'), (',', ','), ('effective', 'JJ'), ('web', 'NN'), ('search', 'NN'), (',', ','), ('and', 'CC'), ('a', 'DT'), ('vastly', 'RB'), ('improved', 'VBN'), ('understanding', 'NN'), ('of', 'IN'), ('the', 'DT'), ('human', 'JJ'), ('genome.', 'NNP'), ('Machine', 'NNP'), ('learning', 'NN'), ('is', 'VBZ'), ('so', 'RB'), ('pervasive', 'JJ'), ('today', 'NN'), ('that', 'WDT'), ('you', 'PRP'), ('probably', 'RB'), ('use', 'VBP'), ('it', 'PRP'), ('dozens', 'VBZ'), ('of', 'IN'), ('times', 'NNS'), ('a', 'DT'), ('day', 'NN'), ('without', 'IN'), ('knowing', 'NN'), ('it.', 'NNP'), ('Many', 'NNP'), ('researchers', 'NNS'), ('also', 'RB'), ('think', 'VBP'), ('it', 'PRP'), ('is', 'VBZ'), ('the', 'DT'), ('best', 'JJS'), ('way', 'NN'), ('to', 'TO'), ('make', 'VB'), ('progress', 'NN'), ('towards', 'NNS'), ('human-level', 'JJ'), ('AI.', 'NNP'), ('In', 'NNP'), ('this', 'DT'), ('class', 'NN'), (',', ','), ('you', 'PRP'), ('will', 'MD'), ('learn', 'VB'), ('about', 'IN'), ('the', 'DT'), ('most', 'RBS'), ('effective', 'JJ'), ('machine', 'NN'), ('learning', 'NN'), ('techniques', 'NNS'), (',', ','), ('and', 'CC'), ('gain', 'NN'), ('practice', 'NN'), ('implementing', 'VBG'), ('them', 'PRP'), ('and', 'CC'), ('getting', 'VBG'), ('them', 'PRP'), ('to', 'TO'), ('work', 'VB'), ('for', 'IN'), ('yourself.', 'NNP'), ('More', 'NNP'), ('importantly', 'RB'), (',', ','), ('you', 'PRP'), ("'ll", 'MD'), ('learn', 'VB'), ('about', 'IN'), ('not', 'RB'), ('only', 'RB'), ('the', 'DT'), ('theoretical', 'JJ'), ('underpinnings', 'NNS'), ('of', 'IN'), ('learning', 'VBG'), (',', ','), ('but', 'CC'), ('also', 'RB'), ('gain', 'VBP'), ('the', 'DT'), ('practical', 'JJ'), ('know-how', 'NN'), ('needed', 'VBN'), ('to', 'TO'), ('quickly', 'RB'), ('and', 'CC'), ('powerfully', 'RB'), ('apply', 'RB'), ('these', 'DT'), ('techniques', 'NNS'), ('to', 'TO'), ('new', 'JJ'), ('problems.', 'NNP'), ('Finally', 'NNP'), (',', ','), ('you', 'PRP'), ("'ll", 'MD'), ('learn', 'VB'), ('about', 'IN'), ('some', 'DT'), ('of', 'IN'), ('Silicon', 'NNP'), ('Valley', 'NNP'), ("'s", 'POS'), ('best', 'JJS'), ('practices', 'NNS'), ('in', 'IN'), ('innovation', 'NN'), ('as', 'IN'), ('it', 'PRP'), ('pertains', 'VBZ'), ('to', 'TO'), ('machine', 'NN'), ('learning', 'NN'), ('and', 'CC'), ('AI', 'NNP'), ('.', '.')]

NLTK can do a lot of text mining and text analysis tasks; we will introduce them in the following articles.


Dive Into NLTK, Part II: Sentence Tokenize and Word Tokenize

This is the second article in the series “Dive Into NLTK”; the index of all the articles published to date is at the top of this page.

Tokenizers are used to divide strings into lists of substrings. For example, a sentence tokenizer can be used to find the list of sentences in a text, and a word tokenizer can be used to find the list of words in a string.

Tokenizing text into sentences

Sentence tokenization is also known as sentence boundary disambiguation, sentence boundary detection, or sentence segmentation; here is the definition from Wikipedia:

Sentence boundary disambiguation (SBD), also known as sentence breaking, is the problem in natural language processing of deciding where sentences begin and end. Often natural language processing tools require their input to be divided into sentences for a number of reasons. However sentence boundary identification is challenging because punctuation marks are often ambiguous. For example, a period may denote an abbreviation, decimal point, an ellipsis, or an email address – not the end of a sentence. About 47% of the periods in the Wall Street Journal corpus denote abbreviations. As well, question marks and exclamation marks may appear in embedded quotations, emoticons, computer code, and slang. Languages like Japanese and Chinese have unambiguous sentence-ending markers.

Many NLP tools include a sentence tokenization function, such as OpenNLP, NLTK, TextBlob, MBSP and others. Here we will go into the details of sentence segmentation with NLTK.

How to use sentence tokenize in NLTK?

After installing NLTK and NLTK Data, you can launch Python and import the sent_tokenize tool from NLTK:

>>> text = "this's a sent tokenize test. this is sent two. is this sent three? sent 4 is cool! Now it's your turn."
>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize_list = sent_tokenize(text)
>>> len(sent_tokenize_list)
5
>>> sent_tokenize_list
["this's a sent tokenize test.", 'this is sent two.', 'is this sent three?', 'sent 4 is cool!', "Now it's your turn."]
>>>

sent_tokenize uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module. This instance has already been trained and works well for many European languages, so it knows which punctuation marks and characters signal the end of one sentence and the beginning of the next.

The Punkt module ships with pre-trained tokenization models for many European languages; here is the list from the nltk_data/tokenizers/punkt/README file:

Pretrained Punkt Models — Jan Strunk (New version trained after issues 313 and 514 had been corrected)

Most models were prepared using the test corpora from Kiss and Strunk (2006). Additional models have
been contributed by various people using NLTK for sentence boundary detection.

For information about how to use these models, please confer the tokenization HOWTO:
http://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html
and chapter 3.8 of the NLTK book:
http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html#sec-segmentation

There are pretrained tokenizers for the following languages:

File | Language | Source | Contents | Size of training corpus (in tokens) | Model contributed by
---- | -------- | ------ | -------- | ----------------------------------- | --------------------
czech.pickle | Czech | Multilingual Corpus 1 (ECI) | Lidove Noviny, Literarni Noviny | ~345,000 | Jan Strunk / Tibor Kiss
danish.pickle | Danish | Avisdata CD-Rom Ver. 1.1. 1995 (Berlingske Avisdata, Copenhagen) | Berlingske Tidende, Weekend Avisen | ~550,000 | Jan Strunk / Tibor Kiss
dutch.pickle | Dutch | Multilingual Corpus 1 (ECI) | De Limburger | ~340,000 | Jan Strunk / Tibor Kiss
english.pickle | English (American) | Penn Treebank (LDC) | Wall Street Journal | ~469,000 | Jan Strunk / Tibor Kiss
estonian.pickle | Estonian | University of Tartu, Estonia | Eesti Ekspress | ~359,000 | Jan Strunk / Tibor Kiss
finnish.pickle | Finnish | Finnish Parole Corpus, Finnish Text Bank (Suomen Kielen Tekstipankki), Finnish Center for IT Science (CSC) | Books and major national newspapers | ~364,000 | Jan Strunk / Tibor Kiss
french.pickle | French (European) | Multilingual Corpus 1 (ECI) | Le Monde | ~370,000 | Jan Strunk / Tibor Kiss
german.pickle | German | Neue Zürcher Zeitung AG (Switzerland) | Neue Zürcher Zeitung CD-ROM (uses "ss" instead of "ß") | ~847,000 | Jan Strunk / Tibor Kiss
greek.pickle | Greek | Efstathios Stamatatos | To Vima (TO BHMA) | ~227,000 | Jan Strunk / Tibor Kiss
italian.pickle | Italian | Multilingual Corpus 1 (ECI) | La Stampa, Il Mattino | ~312,000 | Jan Strunk / Tibor Kiss
norwegian.pickle | Norwegian (Bokmål and Nynorsk) | Centre for Humanities Information Technologies, Bergen | Bergens Tidende | ~479,000 | Jan Strunk / Tibor Kiss
polish.pickle | Polish | Polish National Corpus (http://www.nkjp.pl/) | Literature, newspapers, etc. | ~1,000,000 | Krzysztof Langner
portuguese.pickle | Portuguese (Brazilian) | CETENFolha Corpus (Linguateca) | Folha de São Paulo | ~321,000 | Jan Strunk / Tibor Kiss
slovene.pickle | Slovene | TRACTOR, Slovene Academy for Arts and Sciences | Delo | ~354,000 | Jan Strunk / Tibor Kiss
spanish.pickle | Spanish (European) | Multilingual Corpus 1 (ECI) | Sur | ~353,000 | Jan Strunk / Tibor Kiss
swedish.pickle | Swedish | Multilingual Corpus 1 (ECI) | Dagens Nyheter (and some other texts) | ~339,000 | Jan Strunk / Tibor Kiss
turkish.pickle | Turkish | METU Turkish Corpus (Türkçe Derlem Projesi), University of Ankara | Milliyet | ~333,000 | Jan Strunk / Tibor Kiss

The corpora contained about 400,000 tokens on average and mostly consisted of newspaper text converted to
Unicode using the codecs module.

Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection.
Computational Linguistics 32: 485-525.

---- Training Code ----

# import punkt
import nltk.tokenize.punkt

# Make a new Tokenizer
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()

# Read in training corpus (one example: Slovene)
import codecs
text = codecs.open("slovene.plain", "Ur", "iso-8859-2").read()

# Train tokenizer
tokenizer.train(text)

# Dump pickled tokenizer
import pickle
out = open("slovene.pickle", "wb")
pickle.dump(tokenizer, out)
out.close()

---------

There are 17 European languages in total that NLTK supports for sentence tokenization, and you can use them as follows:

>>> import nltk.data
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> tokenizer.tokenize(text)
["this's a sent tokenize test.", 'this is sent two.', 'is this sent three?', 'sent 4 is cool!', "Now it's your turn."]

Here is a Spanish sentence tokenization example:
>>> spanish_tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')
>>> spanish_tokenizer.tokenize('Hola amigo. Estoy bien.')
['Hola amigo.', 'Estoy bien.']
>>>

Tokenizing text into words

Tokenizing text into words in NLTK is very simple: just call word_tokenize from the nltk.tokenize module:

>>> from nltk.tokenize import word_tokenize
>>> word_tokenize('Hello World.')
['Hello', 'World', '.']
>>> word_tokenize("this's a test")
['this', "'s", 'a', 'test']

Actually, word_tokenize is a wrapper function that calls the tokenize method of a TreebankWordTokenizer instance; here is the code in NLTK:

# Standard word tokenizer.
_word_tokenize = TreebankWordTokenizer().tokenize
def word_tokenize(text):
  """
  Return a tokenized copy of *text*,
  using NLTK's recommended word tokenizer
  (currently :class:`.TreebankWordTokenizer`).
  This tokenizer is designed to work on a sentence at a time.
  """
  return _word_tokenize(text)
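
Since the Treebank tokenizer is designed to work on one sentence at a time, a common pattern for longer text is to split into sentences first and then word-tokenize each sentence. A small sketch reusing the sample text defined earlier:

>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> tokens_per_sentence = [word_tokenize(sentence) for sentence in sent_tokenize(text)]
>>> len(tokens_per_sentence)
5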

An equivalent way to call it is through the TreebankWordTokenizer directly:
>>> from nltk.tokenize import TreebankWordTokenizer
>>> tokenizer = TreebankWordTokenizer()
>>> tokenizer.tokenize("this's a test")
['this', "'s", 'a', 'test']

Besides the TreebankWordTokenizer, there are other alternative word tokenizers, such as PunktWordTokenizer and WordPunctTokenizer.

PunktWordTokenizer splits on punctuation, but keeps it with the word:

>>> from nltk.tokenize import PunktWordTokenizer
>>> punkt_word_tokenizer = PunktWordTokenizer()
>>> punkt_word_tokenizer.tokenize("this's a test")
['this', "'s", 'a', 'test']

WordPunctTokenizer splits all punctuation into separate tokens:

>>> from nltk.tokenize import WordPunctTokenizer
>>> word_punct_tokenizer = WordPunctTokenizer()
>>> word_punct_tokenizer.tokenize("This's a test")
['This', "'", 's', 'a', 'test']

You can choose whichever word tokenizer in NLTK fits your purpose.
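
If none of the built-in tokenizers fits your needs, nltk.tokenize also provides RegexpTokenizer, which lets you define the token pattern yourself; in this sketch the pattern simply keeps runs of word characters and drops the punctuation:

>>> from nltk.tokenize import RegexpTokenizer
>>> regexp_tokenizer = RegexpTokenizer(r'\w+')
>>> regexp_tokenizer.tokenize("this's a test")
['this', 's', 'a', 'test']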


Dive Into NLTK, Part III: Part-Of-Speech Tagging and POS Tagger

This is the third article in the series “Dive Into NLTK”; the index of all the articles published to date is at the top of this page.

Part-of-speech tagging is one of the most important text analysis tasks: it classifies words into their parts of speech and labels them according to a tagset, which is the collection of tags used for POS tagging. Parts of speech are also known as word classes or lexical categories. Here is the definition from Wikipedia:

In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition, as well as its context—i.e. relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of speech, in accordance with a set of descriptive tags. POS-tagging algorithms fall into two distinctive groups: rule-based and stochastic. E. Brill’s tagger, one of the first and most widely used English POS-taggers, employs rule-based algorithms.

How to use POS Tagging in NLTK
After importing NLTK in the Python interpreter, you should run word_tokenize before POS tagging, which is done with the pos_tag method:

>>> import nltk
>>> text = nltk.word_tokenize("Dive into NLTK: Part-of-speech tagging and POS Tagger")
>>> text
['Dive', 'into', 'NLTK', ':', 'Part-of-speech', 'tagging', 'and', 'POS', 'Tagger']
>>> nltk.pos_tag(text)
[('Dive', 'JJ'), ('into', 'IN'), ('NLTK', 'NNP'), (':', ':'), ('Part-of-speech', 'JJ'), ('tagging', 'NN'), ('and', 'CC'), ('POS', 'NNP'), ('Tagger', 'NNP')]

NLTK provides documentation for each tag, which can be queried using the tag, e.g., nltk.help.upenn_tagset('RB'), or a regular expression, e.g., nltk.help.upenn_tagset('NN.*'):

>>> nltk.help.upenn_tagset('JJ')
JJ: adjective or numeral, ordinal
third ill-mannered pre-war regrettable oiled calamitous first separable
ectoplasmic battery-powered participatory fourth still-to-be-named
multilingual multi-disciplinary …
>>> nltk.help.upenn_tagset('IN')
IN: preposition or conjunction, subordinating
astride among uppon whether out inside pro despite on by throughout
below within for towards near behind atop around if like until below
next into if beside …
>>> nltk.help.upenn_tagset('NNP')
NNP: noun, proper, singular
Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
Shannon A.K.C. Meltex Liverpool …
>>>

NLTK also provides a batch POS tagging method for tagging multiple sentences at once, the batch_pos_tag method:

>>> nltk.batch_pos_tag([['this', 'is', 'batch', 'tag', 'test'], ['nltk', 'is', 'text', 'analysis', 'tool']])
[[('this', 'DT'), ('is', 'VBZ'), ('batch', 'NN'), ('tag', 'NN'), ('test', 'NN')], [('nltk', 'NN'), ('is', 'VBZ'), ('text', 'JJ'), ('analysis', 'NN'), ('tool', 'NN')]]
>>>

The pre-trained POS Tagger Model in NLTK

You can find the pre-trained POS Tagging Model in nltk_data/taggers:

YangtekiMacBook-Pro:taggers textminer$ pwd
/Users/textminer/nltk_data/taggers
YangtekiMacBook-Pro:taggers textminer$ ls
total 11304
drwxr-xr-x 3 textminer staff 102 7 9 14:06 hmm_treebank_pos_tagger
-rw-r--r-- 1 textminer staff 750857 5 26 2013 hmm_treebank_pos_tagger.zip
drwxr-xr-x 3 textminer staff 102 7 24 2013 maxent_treebank_pos_tagger
-rw-r--r-- 1 textminer staff 5031883 5 26 2013 maxent_treebank_pos_tagger.zip

The default POS tagger model used in NLTK is the maxent_treebank_pos_tagger model; you can find the code in nltk-master/nltk/tag/__init__.py:

# Standard treebank POS tagger
_POS_TAGGER = 'taggers/maxent_treebank_pos_tagger/english.pickle'
def pos_tag(tokens):
    """
    Use NLTK's currently recommended part of speech tagger to
    tag the given list of tokens.
 
        >>> from nltk.tag import pos_tag # doctest: +SKIP
        >>> from nltk.tokenize import word_tokenize # doctest: +SKIP
        >>> pos_tag(word_tokenize("John's big idea isn't all that bad.")) # doctest: +SKIP
        [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is',
        'VBZ'), ("n't", 'RB'), ('all', 'DT'), ('that', 'DT'), ('bad', 'JJ'),
        ('.', '.')]
 
    :param tokens: Sequence of tokens to be tagged
    :type tokens: list(str)
    :return: The tagged tokens
    :rtype: list(tuple(str, str))
    """
    tagger = load(_POS_TAGGER)
    return tagger.tag(tokens)
 
def batch_pos_tag(sentences):
    """
    Use NLTK's currently recommended part of speech tagger to tag the
    given list of sentences, each consisting of a list of tokens.
    """
    tagger = load(_POS_TAGGER)
    return tagger.batch_tag(sentences)

How to train a POS Tagging Model or POS Tagger in NLTK
So far you have used the default maxent treebank POS tagging model, but NLTK provides not only the maxent POS tagger: it also ships CRF, HMM, Brill and TnT taggers, as well as interfaces to the Stanford POS tagger, the HunPos tagger and the SENNA POS tagger:

-rwxr-xr-x@ 1 textminer staff 4.4K 7 22 2013 __init__.py
-rwxr-xr-x@ 1 textminer staff 2.9K 7 22 2013 api.py
-rwxr-xr-x@ 1 textminer staff 56K 7 22 2013 brill.py
-rwxr-xr-x@ 1 textminer staff 31K 7 22 2013 crf.py
-rwxr-xr-x@ 1 textminer staff 48K 7 22 2013 hmm.py
-rwxr-xr-x@ 1 textminer staff 5.1K 7 22 2013 hunpos.py
-rwxr-xr-x@ 1 textminer staff 11K 7 22 2013 senna.py
-rwxr-xr-x@ 1 textminer staff 26K 7 22 2013 sequential.py
-rwxr-xr-x@ 1 textminer staff 3.3K 7 22 2013 simplify.py
-rwxr-xr-x@ 1 textminer staff 6.4K 7 22 2013 stanford.py
-rwxr-xr-x@ 1 textminer staff 18K 7 22 2013 tnt.py
-rwxr-xr-x@ 1 textminer staff 2.3K 7 22 2013 util.py

Here we will show you how to train a TnT POS tagger model; you can find the details about TnT in tnt.py:

# Natural Language Toolkit: TnT Tagger
#
# Copyright (C) 2001-2013 NLTK Project
# Author: Sam Huston <sjh900@gmail.com>
#
# URL: <http://www.nltk.org/>
# For license information, see LICENSE.TXT
 
'''
Implementation of 'TnT - A Statisical Part of Speech Tagger'
by Thorsten Brants
 
http://acl.ldc.upenn.edu/A/A00/A00-1031.pdf
'''
from __future__ import print_function
from math import log
 
from operator import itemgetter
 
from nltk.probability import FreqDist, ConditionalFreqDist
from nltk.tag.api import TaggerI
 
class TnT(TaggerI):
    '''
    TnT - Statistical POS tagger
 
    IMPORTANT NOTES:
 
    * DOES NOT AUTOMATICALLY DEAL WITH UNSEEN WORDS
 
      - It is possible to provide an untrained POS tagger to
        create tags for unknown words, see __init__ function
 
    * SHOULD BE USED WITH SENTENCE-DELIMITED INPUT
 
      - Due to the nature of this tagger, it works best when
        trained over sentence delimited input.
      - However it still produces good results if the training
        data and testing data are separated on all punctuation eg: [,.?!]
      - Input for training is expected to be a list of sentences
        where each sentence is a list of (word, tag) tuples
      - Input for tag function is a single sentence
        Input for tagdata function is a list of sentences
        Output is of a similar form
 
    * Function provided to process text that is unsegmented
 
      - Please see basic_sent_chop()
 
 
    TnT uses a second order Markov model to produce tags for
    a sequence of input, specifically:
 
      argmax [Proj(P(t_i|t_i-1,t_i-2)P(w_i|t_i))] P(t_T+1 | t_T)
 
    IE: the maximum projection of a set of probabilities
 
    The set of possible tags for a given word is derived
    from the training data. It is the set of all tags
    that exact word has been assigned.
 
    To speed up and get more precision, we can use log addition
    to instead multiplication, specifically:
 
      argmax [Sigma(log(P(t_i|t_i-1,t_i-2))+log(P(w_i|t_i)))] +
             log(P(t_T+1|t_T))
 
    The probability of a tag for a given word is the linear
    interpolation of 3 markov models; a zero-order, first-order,
    and a second order model.
 
      P(t_i| t_i-1, t_i-2) = l1*P(t_i) + l2*P(t_i| t_i-1) +
                             l3*P(t_i| t_i-1, t_i-2)
 
    A beam search is used to limit the memory usage of the algorithm.
    The degree of the beam can be changed using N in the initialization.
    N represents the maximum number of possible solutions to maintain
    while tagging.
 
    It is possible to differentiate the tags which are assigned to
    capitalized words. However this does not result in a significant
    gain in the accuracy of the results.
    '''

First you need the training data and test data; we use the treebank data from nltk.corpus:

>>> from nltk.corpus import treebank
>>> len(treebank.tagged_sents())
3914
>>> train_data = treebank.tagged_sents()[:3000]
>>> test_data = treebank.tagged_sents()[3000:]
>>> train_data[0]
[(u'Pierre', u'NNP'), (u'Vinken', u'NNP'), (u',', u','), (u'61', u'CD'), (u'years', u'NNS'), (u'old', u'JJ'), (u',', u','), (u'will', u'MD'), (u'join', u'VB'), (u'the', u'DT'), (u'board', u'NN'), (u'as', u'IN'), (u'a', u'DT'), (u'nonexecutive', u'JJ'), (u'director', u'NN'), (u'Nov.', u'NNP'), (u'29', u'CD'), (u'.', u'.')]
>>> test_data[0]
[(u'At', u'IN'), (u'Tokyo', u'NNP'), (u',', u','), (u'the', u'DT'), (u'Nikkei', u'NNP'), (u'index', u'NN'), (u'of', u'IN'), (u'225', u'CD'), (u'selected', u'VBN'), (u'issues', u'NNS'), (u',', u','), (u'which', u'WDT'), (u'*T*-1', u'-NONE-'), (u'gained', u'VBD'), (u'132', u'CD'), (u'points', u'NNS'), (u'Tuesday', u'NNP'), (u',', u','), (u'added', u'VBD'), (u'14.99', u'CD'), (u'points', u'NNS'), (u'to', u'TO'), (u'35564.43', u'CD'), (u'.', u'.')]
>>>

We use the first 3000 treebank tagged sentences as train_data and the last 914 tagged sentences as test_data. Now we train the TnT POS tagger on train_data and evaluate it on test_data:

>>> from nltk.tag import tnt
>>> tnt_pos_tagger = tnt.TnT()
>>> tnt_pos_tagger.train(train_data)
>>> tnt_pos_tagger.evaluate(test_data)
0.8755881718109216

You can save this pos tagger model as a pickle file:

>>> import pickle
>>> f = open('tnt_treebank_pos_tagger.pickle', 'w')
>>> pickle.dump(tnt_pos_tagger, f)
>>> f.close()

And you can load it back and use it any time you want.
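
For example, in a new session you can load the pickled model back before tagging (a minimal sketch; the file name matches the one saved above):

>>> import nltk
>>> import pickle
>>> f = open('tnt_treebank_pos_tagger.pickle', 'r')
>>> tnt_tagger = pickle.load(f)
>>> f.close()

The loaded tagger can then tag new sentences: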

>>> tnt_tagger.tag(nltk.word_tokenize("this is a tnt treebank tnt tagger"))
[('this', u'DT'), ('is', u'VBZ'), ('a', u'DT'), ('tnt', 'Unk'), ('treebank', 'Unk'), ('tnt', 'Unk'), ('tagger', 'Unk')]
>>>

That’s it: now you can train a POS tagging model by yourself, just do it.
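
If you want a quick baseline to compare against, the simpler sequential taggers (see sequential.py in the listing above) can be trained on the same treebank split. Here is a hedged sketch using UnigramTagger, which just assigns each word its most frequent tag from the training data; train_data and test_data are the variables defined above:

>>> from nltk.tag import UnigramTagger
>>> unigram_tagger = UnigramTagger(train_data)
>>> unigram_tagger.evaluate(test_data)   # accuracy on the held-out sentences, for comparison with TnT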


Dive Into NLTK, Part IV: Stemming and Lemmatization

Stemming and Lemmatization are the basic text processing methods for English text. The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. Here is the definition from wikipedia for stemming and lemmatization:

Stemming:

In linguistic morphology and information retrieval, stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.

Stemming programs are commonly referred to as stemming algorithms or stemmers.

Lemmatization:

Lemmatisation (or lemmatization) in linguistics, is the process of grouping together the different inflected forms of a word so they can be analysed as a single item.

In computational linguistics, lemmatisation is the algorithmic process of determining the lemma for a given word. Since the process may involve complex tasks such as understanding context and determining the part of speech of a word in a sentence (requiring, for example, knowledge of the grammar of a language) it can be a hard task to implement a lemmatiser for a new language.

In many languages, words appear in several inflected forms. For example, in English, the verb ‘to walk’ may appear as ‘walk’, ‘walked’, ‘walks’, ‘walking’. The base form, ‘walk’, that one might look up in a dictionary, is called the lemma for the word. The combination of the base form with the part of speech is often called the lexeme of the word.

Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.

How to use Stemmer in NLTK

NLTK provides interfaces to several well-known stemmers, such as the Porter stemmer, the Lancaster stemmer and the Snowball stemmer. In NLTK, using those stemmers is very simple.

The Porter stemmer, which is based on the Porter stemming algorithm, can be used like this:

>>> from nltk.stem.porter import PorterStemmer
>>> porter_stemmer = PorterStemmer()
>>> porter_stemmer.stem('maximum')
u'maximum'
>>> porter_stemmer.stem('presumably')
u'presum'
>>> porter_stemmer.stem('multiply')
u'multipli'
>>> porter_stemmer.stem('provision')
u'provis'
>>> porter_stemmer.stem('owed')
u'owe'
>>> porter_stemmer.stem('ear')
u'ear'
>>> porter_stemmer.stem('saying')
u'say'
>>> porter_stemmer.stem('crying')
u'cri'
>>> porter_stemmer.stem('string')
u'string'
>>> porter_stemmer.stem('meant')
u'meant'
>>> porter_stemmer.stem('cement')
u'cement'
>>>

The Lancaster stemmer, which is based on the Lancaster stemming algorithm, can be used in NLTK like this:

>>> from nltk.stem.lancaster import LancasterStemmer
>>> lancaster_stemmer = LancasterStemmer()
>>> lancaster_stemmer.stem('maximum')
'maxim'
>>> lancaster_stemmer.stem('presumably')
'presum'
>>> lancaster_stemmer.stem('presumably')
'presum'
>>> lancaster_stemmer.stem('multiply')
'multiply'
>>> lancaster_stemmer.stem('provision')
u'provid'
>>> lancaster_stemmer.stem('owed')
'ow'
>>> lancaster_stemmer.stem('ear')
'ear'
>>> lancaster_stemmer.stem('saying')
'say'
>>> lancaster_stemmer.stem('crying')
'cry'
>>> lancaster_stemmer.stem('string')
'string'
>>> lancaster_stemmer.stem('meant')
'meant'
>>> lancaster_stemmer.stem('cement')
'cem'
>>>

The Snowball stemmer, which is based on the Snowball stemming algorithm, can be used in NLTK like this:

>>> from nltk.stem import SnowballStemmer
>>> snowball_stemmer = SnowballStemmer("english")
>>> snowball_stemmer.stem('maximum')
u'maximum'
>>> snowball_stemmer.stem('presumably')
u'presum'
>>> snowball_stemmer.stem('multiply')
u'multipli'
>>> snowball_stemmer.stem('provision')
u'provis'
>>> snowball_stemmer.stem('owed')
u'owe'
>>> snowball_stemmer.stem('ear')
u'ear'
>>> snowball_stemmer.stem('saying')
u'say'
>>> snowball_stemmer.stem('crying')
u'cri'
>>> snowball_stemmer.stem('string')
u'string'
>>> snowball_stemmer.stem('meant')
u'meant'
>>> snowball_stemmer.stem('cement')
u'cement'
>>>

How to use Lemmatizer in NLTK

The NLTK Lemmatization method is based on WordNet’s built-in morphy function. Here is the introduction from WordNet official website:

WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet’s structure makes it a useful tool for computational linguistics and natural language processing.

WordNet superficially resembles a thesaurus, in that it groups words together based on their meanings. However, there are some important distinctions. First, WordNet interlinks not just word forms—strings of letters—but specific senses of words. As a result, words that are found in close proximity to one another in the network are semantically disambiguated. Second, WordNet labels the semantic relations among words, whereas the groupings of words in a thesaurus does not follow any explicit pattern other than meaning similarity

In NLTK, you can use it as follows:

>>> from nltk.stem import WordNetLemmatizer
>>> wordnet_lemmatizer = WordNetLemmatizer()
>>> wordnet_lemmatizer.lemmatize('dogs')
u'dog'
>>> wordnet_lemmatizer.lemmatize('churches')
u'church'
>>> wordnet_lemmatizer.lemmatize('aardwolves')
u'aardwolf'
>>> wordnet_lemmatizer.lemmatize('abaci')
u'abacus'
>>> wordnet_lemmatizer.lemmatize('hardrock')
'hardrock'
>>> wordnet_lemmatizer.lemmatize('are')
'are'
>>> wordnet_lemmatizer.lemmatize('is')
'is'

You may notice that “are” and “is” are not lemmatized to “be”; that’s because the lemmatize method’s default pos argument is “n” (noun):

lemmatize(word, pos='n')

So you need to specify the pos for the word, like this:

>>> wordnet_lemmatizer.lemmatize('is', pos='v')
u'be'
>>> wordnet_lemmatizer.lemmatize('are', pos='v')
u'be'
>>>

We run POS tagging before word lemmatization and have implemented this in our Text Analysis API, so you can test and use lemmatization there without specifying the POS tag yourself.
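
Putting the two steps together, you can run pos_tag first and map the Penn Treebank tags onto WordNet's pos codes before lemmatizing. The helper below (get_wordnet_pos is just an illustrative name, not part of NLTK) is one way to sketch it:

from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(treebank_tag):
    # illustrative mapping from Penn Treebank tag prefixes to WordNet pos codes
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN   # default, the same as lemmatize()'s pos='n'

wordnet_lemmatizer = WordNetLemmatizer()
tagged = pos_tag(word_tokenize("the cats are running"))
lemmas = [wordnet_lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged]
# lemmas should come out roughly as ['the', 'cat', 'be', 'run']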

The Stem Module in NLTK
You can find the stem module in nltk-master/nltk/stem; the directory listing and the module’s __init__.py are shown below:

YangtekiMacBook-Pro:stem textminer$ ls
total 456
-rwxr-xr-x@ 1 textminer staff 1270 7 22 2013 __init__.py
-rwxr-xr-x@ 1 textminer staff 798 7 22 2013 api.py
-rwxr-xr-x@ 1 textminer staff 17068 7 22 2013 isri.py
-rwxr-xr-x@ 1 textminer staff 11337 7 22 2013 lancaster.py
-rwxr-xr-x@ 1 textminer staff 24735 7 22 2013 porter.py
-rwxr-xr-x@ 1 textminer staff 1701 7 22 2013 regexp.py
-rwxr-xr-x@ 1 textminer staff 5563 7 22 2013 rslp.py
-rwxr-xr-x@ 1 textminer staff 146857 7 22 2013 snowball.py
-rwxr-xr-x@ 1 textminer staff 1513 7 22 2013 wordnet.py

# Natural Language Toolkit: Stemmers
#
# Copyright (C) 2001-2013 NLTK Project
# Author: Trevor Cohn <tacohn@cs.mu.oz.au>
#         Edward Loper <edloper@gradient.cis.upenn.edu>
#         Steven Bird <stevenbird1@gmail.com>
# URL: <http://www.nltk.org/>
# For license information, see LICENSE.TXT
 
"""
NLTK Stemmers
 
Interfaces used to remove morphological affixes from words, leaving
only the word stem.  Stemming algorithms aim to remove those affixes
required for eg. grammatical role, tense, derivational morphology
leaving only the stem of the word.  This is a difficult problem due to
irregular words (eg. common verbs in English), complicated
morphological rules, and part-of-speech and sense ambiguities
(eg. ``ceil-`` is not the stem of ``ceiling``).
 
StemmerI defines a standard interface for stemmers.
"""
 
from nltk.stem.api import StemmerI
from nltk.stem.regexp import RegexpStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.isri import ISRIStemmer
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.rslp import RSLPStemmer
 
 
if __name__ == "__main__":
    import doctest
    doctest.testmod(optionflags=doctest.NORMALIZE_WHITESPACE)

Read the code, and change the world! Now it’s your turn!


Dive Into NLTK, Part V: Using Stanford Text Analysis Tools in Python

We have already discussed “How to Use Stanford Named Entity Recognizer (NER) in Python NLTK and Other Programming Languages”, and recently we have also tested the Stanford POS Tagger and the Stanford Parser in NLTK and used them in Python. If you want to use these Stanford text analysis tools in other languages, you can use our Text Analysis API, which also integrates the Stanford NLP tools; you can try it on our online text analysis demo: Text Analysis Online. Now we will show you how to use these Java NLP tools in Python with NLTK. You can also follow the NLTK official guide: Installing Third Party Software (How NLTK Discovers Third Party Software). Here we test everything on an Ubuntu 12.04 VPS.

First you need to set up the Java environment for those Java text analysis tools before using them in NLTK:

sudo apt-get install default-jre
This will install the Java Runtime Environment (JRE). If you instead need the Java Development Kit (JDK), which is usually required to compile Java applications (for example Apache Ant, Apache Maven, Eclipse and IntelliJ IDEA), execute the following command:

sudo apt-get install default-jdk
That is everything that is needed to install Java.

NLTK now provides three interfaces: the Stanford Log-linear Part-Of-Speech Tagger, the Stanford Named Entity Recognizer (NER) and the Stanford Parser. Below are the details about how to use each of them in NLTK.

1) Stanford POS Tagger

Following is from the official Stanford POS Tagger website:

A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns parts of speech to each word (and other token), such as noun, verb, adjective, etc., although generally computational applications use more fine-grained POS tags like ‘noun-plural’.

Part-of-speech name abbreviations: The English taggers use the Penn Treebank tag set. Here are some links to documentation of the Penn Treebank English POS tag set: 1993 Computational Linguistics article in PDF, AMALGAM page, Aoife Cahill’s list. See the included README-Models.txt in the models directory for more information about the tagsets for the other languages.

We assume you have installed the latest version of NLTK; here we use IPython, and the NLTK version is 3.0.0b1:

In [1]: import nltk

In [2]: nltk.__version__
Out[2]: '3.0.0b1'

The Stanford POS Tagger official site provides two versions of POS Tagger:

Download basic English Stanford Tagger version 3.4.1 [21 MB]

Download full Stanford Tagger version 3.4.1 [124 MB]

We suggest you download the full version which contains a lot of models.

After downloading the full version, unzip it and copy the related data into a test directory:

mkdir postagger
cd postagger/
cp ../stanford-postagger-full-2014-08-27/stanford-postagger.jar .
cp -r ../stanford-postagger-full-2014-08-27/models .
ipython --pylab

First test the English POS Tagging Result:

In [1]: from nltk.tag.stanford import POSTagger

In [2]: english_postagger = POSTagger('models/english-bidirectional-distsim.tagger', 'stanford-postagger.jar')

In [3]: english_postagger.tag('this is stanford postagger in nltk for python users'.split())
Out[3]:
[(u'this', u'DT'),
(u'is', u'VBZ'),
(u'stanford', u'JJ'),
(u'postagger', u'NN'),
(u'in', u'IN'),
(u'nltk', u'NN'),
(u'for', u'IN'),
(u'python', u'NN'),
(u'users', u'NNS')]

Then test the Chinese POS Tagging result:

In [4]: chinese_postagger = POSTagger('models/chinese-distsim.tagger', 'stanford-postagger.jar', encoding='utf-8')

In [5]: chinese_postagger.tag('这 是 在 Python 环境 中 使用 斯坦福 词性 标 器'.split())
Out[5]:
[('', u'\u8fd9#PN'),
('', u'\u662f#VC'),
('', u'\u5728#P'),
('', u'Python#NN'),
('', u'\u73af\u5883#NN'),
('', u'\u4e2d#LC'),
('', u'\u4f7f\u7528#VV'),
('', u'\u65af\u5766\u798f#NR'),
('', u'\u8bcd\u6027#JJ'),
('', u'\u6807\u6ce8\u5668#NN')]

The models directory contains a lot of POS tagger models; you can find the detailed info in README-Models.txt:

English taggers
—————————
wsj-0-18-bidirectional-distsim.tagger
Trained on WSJ sections 0-18 using a bidirectional architecture and
including word shape and distributional similarity features.
Penn Treebank tagset.
Performance:
97.28% correct on WSJ 19-21
(90.46% correct on unknown words)

wsj-0-18-left3words.tagger
Trained on WSJ sections 0-18 using the left3words architecture and
includes word shape features. Penn tagset.
Performance:
96.97% correct on WSJ 19-21
(88.85% correct on unknown words)

wsj-0-18-left3words-distsim.tagger
Trained on WSJ sections 0-18 using the left3words architecture and
includes word shape and distributional similarity features. Penn tagset.
Performance:
97.01% correct on WSJ 19-21
(89.81% correct on unknown words)

english-left3words-distsim.tagger
Trained on WSJ sections 0-18 and extra parser training data using the
left3words architecture and includes word shape and distributional
similarity features. Penn tagset.

english-bidirectional-distsim.tagger
Trained on WSJ sections 0-18 using a bidirectional architecture and
including word shape and distributional similarity features.
Penn Treebank tagset.

wsj-0-18-caseless-left3words-distsim.tagger
Trained on WSJ sections 0-18 left3words architecture and includes word
shape and distributional similarity features. Penn tagset. Ignores case.

english-caseless-left3words-distsim.tagger
Trained on WSJ sections 0-18 and extra parser training data using the
left3words architecture and includes word shape and distributional
similarity features. Penn tagset. Ignores case.

Chinese tagger
—————————
chinese-nodistsim.tagger
Trained on a combination of CTB7 texts from Chinese and Hong Kong
sources.
LDC Chinese Treebank POS tag set.
Performance:
93.46% on a combination of Chinese and Hong Kong texts
(79.40% on unknown words)

chinese-distsim.tagger
Trained on a combination of CTB7 texts from Chinese and Hong Kong
sources with distributional similarity clusters.
LDC Chinese Treebank POS tag set.
Performance:
93.99% on a combination of Chinese and Hong Kong texts
(84.60% on unknown words)

Arabic tagger
—————————
arabic.tagger
Trained on the *entire* ATB p1-3.
When trained on the train part of the ATB p1-3 split done for the 2005
JHU Summer Workshop (Diab split), using (augmented) Bies tags, it gets
the following performance:
96.26% on test portion according to Diab split
(80.14% on unknown words)

French tagger
—————————
french.tagger
Trained on the French treebank.

German tagger
—————————
german-hgc.tagger
Trained on the first 80% of the Negra corpus, which uses the STTS tagset.
The Stuttgart-Tübingen Tagset (STTS) is a set of 54 tags for annotating
German text corpora with part-of-speech labels, which was jointly
developed by the Institut für maschinelle Sprachverarbeitung of the
University of Stuttgart and the Seminar für Sprachwissenschaft of the
University of Tübingen. See:
http://www.ims.uni-stuttgart.de/projekte/CQPDemos/Bundestag/help-tagset.html
This model uses features from the distributional similarity clusters
built over the HGC.
Performance:
96.90% on the first half of the remaining 20% of the Negra corpus (dev set)
(90.33% on unknown words)

german-dewac.tagger
This model uses features from the distributional similarity clusters
built from the deWac web corpus.

german-fast.tagger
Lacks distributional similarity features, but is several times faster
than the other alternatives.
Performance:
96.61% overall / 86.72% unknown.

2) Stanford Named Entity Recognizer (NER)

Following introduction is from the official Stanford NER website:

Stanford NER is a Java implementation of a Named Entity Recognizer. Named Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names. It comes with well-engineered feature extractors for Named Entity Recognition, and many options for defining feature extractors. Included with the download are good named entity recognizers for English, particularly for the 3 classes (PERSON, ORGANIZATION, LOCATION), and we also make available on this page various other models for different languages and circumstances, including models trained on just the CoNLL 2003 English training data. The distributional similarity features in some models improve performance but the models require considerably more memory.

The website provides a download version of Stanford NER:

Download Stanford Named Entity Recognizer version 3.4.1

It contains stanford-ner.jar and the models in the classifiers directory, and like the Stanford POS Tagger, you can use it in NLTK like this:

In [1]: from nltk.tag.stanford import NERTagger

In [2]: english_nertagger = NERTagger('classifiers/english.all.3class.distsim.crf.ser.gz', 'stanford-ner.jar')

In [3]: english_nertagger.tag('Rami Eid is studying at Stony Brook University in NY'.split())
Out[3]:
[(u'Rami', u'PERSON'),
(u'Eid', u'PERSON'),
(u'is', u'O'),
(u'studying', u'O'),
(u'at', u'O'),
(u'Stony', u'ORGANIZATION'),
(u'Brook', u'ORGANIZATION'),
(u'University', u'ORGANIZATION'),
(u'in', u'O'),
(u'NY', u'O')]

The Models Included with Stanford NER are a 4 class model trained for CoNLL, a 7 class model trained for MUC, and a 3 class model trained on both data sets for the intersection of those class sets.

3 class: Location, Person, Organization
4 class: Location, Person, Organization, Misc
7 class: Time, Location, Organization, Person, Money, Percent, Date

-rw-r--r--@ 1 textminer staff 24732086 9 7 11:43 english.all.3class.distsim.crf.ser.gz
-rw-r--r--@ 1 textminer staff 1274 9 7 11:43 english.all.3class.distsim.prop
-rw-r--r--@ 1 textminer staff 18350357 9 7 11:43 english.conll.4class.distsim.crf.ser.gz
-rw-r--r--@ 1 textminer staff 1421 9 7 11:43 english.conll.4class.distsim.prop
-rw-r--r--@ 1 textminer staff 17824631 9 7 11:43 english.muc.7class.distsim.crf.ser.gz
-rw-r--r--@ 1 textminer staff 1087 9 7 11:43 english.muc.7class.distsim.prop
-rw-r--r--@ 1 textminer staff 18954462 9 7 11:43 english.nowiki.3class.distsim.crf.ser.gz
-rw-r--r--@ 1 textminer staff 1218 9 7 11:43 english.nowiki.3class.distsim.prop
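
For the 7 class MUC model listed above, the call is the same; you only point NERTagger at a different classifier file (a hedged sketch, output omitted, with the paths assumed to match the unpacked Stanford NER directory):

In [4]: english_7class_nertagger = NERTagger('classifiers/english.muc.7class.distsim.crf.ser.gz', 'stanford-ner.jar')

In [5]: english_7class_nertagger.tag('Jim bought 300 shares of Acme Corp. in 2006 for $1000'.split())

When the input contains them, this model additionally labels TIME, MONEY, PERCENT and DATE entities.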

You can test the 7 class Stanford NER on our Text Analysis Online Demo: NLTK Stanford Named Entity Recognizer for 7Class

3) Stanford Parser

From the official Stanford Parser introduction:

A natural language parser is a program that works out the grammatical structure of sentences, for instance, which groups of words go together (as “phrases”) and which words are the subject or object of a verb. Probabilistic parsers use knowledge of language gained from hand-parsed sentences to try to produce the most likely analysis of new sentences. These statistical parsers still make some mistakes, but commonly work rather well. Their development was one of the biggest breakthroughs in natural language processing in the 1990s.

You should download the Stanford Parser first (Download Stanford Parser version 3.4.1), then use it in Python via NLTK:

In [1]: from nltk.parse.stanford import StanfordParser

In [3]: english_parser = StanfordParser('stanford-parser.jar', 'stanford-parser-3.4-models.jar')

In [4]: english_parser.raw_parse_sents(("this is the english parser test", "the parser is from stanford parser"))
Out[4]:
[[u'this/DT is/VBZ the/DT english/JJ parser/NN test/NN'],
[u'(ROOT',
u' (S',
u' (NP (DT this))',
u' (VP (VBZ is)',
u' (NP (DT the) (JJ english) (NN parser) (NN test)))))'],
[u'nsubj(test-6, this-1)',
u'cop(test-6, is-2)',
u'det(test-6, the-3)',
u'amod(test-6, english-4)',
u'nn(test-6, parser-5)',
u'root(ROOT-0, test-6)'],
[u'the/DT parser/NN is/VBZ from/IN stanford/JJ parser/NN'],
[u'(ROOT',
u' (S',
u' (NP (DT the) (NN parser))',
u' (VP (VBZ is)',
u' (PP (IN from)',
u' (NP (JJ stanford) (NN parser))))))'],
[u'det(parser-2, the-1)',
u'nsubj(is-3, parser-2)',
u'root(ROOT-0, is-3)',
u'amod(parser-6, stanford-5)',
u'prep_from(is-3, parser-6)']]

Note that this is different from the default NLTK nltk/parse/stanford.py: we modified some code so that it outputs the tag, parse, and dependency results:

#'-outputFormat', 'penn', # original
'-outputFormat', 'wordsAndTags,penn,typedDependencies', # modified

Now you can use the Stanford NLP tools, such as the POS Tagger, NER, and Parser, in Python via NLTK. Enjoy!
