
NLTK


$ python 
>>> import nltk
>>> nltk.download()

Some NLP Data packages
  [ ] maxent_ne_chunker... ACE Named Entity Chunker (Maximum entropy)
  [ ] abc................. Australian Broadcasting Commission 2006
  [ ] alpino.............. Alpino Dutch Treebank
  [ ] biocreative_ppi..... BioCreAtIvE (Critical Assessment of Information Extraction Systems in Biology)
  [ ] brown_tei........... Brown Corpus (TEI XML Version)
  [ ] cess_esp............ CESS-ESP Treebank
  [ ] brown............... Brown Corpus
  [ ] cmudict............. The Carnegie Mellon Pronouncing Dictionary (0.6)
  [ ] chat80.............. Chat-80 Data Files
  [ ] cess_cat............ CESS-CAT Treebank
  [ ] city_database....... City Database
  [ ] comtrans............ ComTrans Corpus Sample
  [ ] conll2002........... CONLL 2002 Named Entity Recognition Corpus
  [ ] europarl_raw........ Sample European Parliament Proceedings Parallel Corpus
  [ ] conll2007........... Dependency Treebanks from CoNLL 2007 (Catalan and Basque Subset)
  [ ] conll2000........... CONLL 2000 Chunking Corpus
  [ ] dependency_treebank. Dependency Parsed Treebank
  
  [ ] floresta............ Portuguese Treebank
  [ ] mac_morpho.......... MAC-MORPHO: Brazilian Portuguese news text with part-of-speech tags
  [ ] gazetteers.......... Gazeteer Lists
  [ ] genesis............. Genesis Corpus
  [ ] gutenberg........... Project Gutenberg Selections
  [ ] inaugural........... C-Span Inaugural Address Corpus
  [ ] jeita............... JEITA Public Morphologically Tagged Corpus (in ChaSen format)
  [ ] ieer................ NIST IE-ER DATA SAMPLE
  [ ] machado............. Machado de Assis -- Obra Completa
  [ ] indian.............. Indian Language POS-Tagged Corpus
  [ ] movie_reviews....... Sentiment Polarity Dataset Version 2.0
  [ ] kimmo............... PC-KIMMO Data Files
  [ ] knbc................ KNB Corpus (Annotated blog corpus)
  [ ] langid.............. Language Id Corpus
  [ ] lin_thesaurus....... Lin's Dependency Thesaurus
  [ ] names............... Names Corpus, Version 1.3 (1994-03-29)
  [ ] nombank.1.0......... NomBank Corpus 1.0
  [ ] oanc_masc........... Open American National Corpus: Manually Annotated Sub-Corpus

  [ ] pl196x.............. Polish language of the XX century sixties
  [ ] paradigms........... Paradigm Corpus
  [ ] nps_chat............ NPS Chat
  [ ] pe08................ Cross-Framework and Cross-Domain Parser Evaluation Shared Task
  [ ] pil................. The Patient Information Leaflet (PIL) Corpus
  [ ] qc.................. Experimental Data for Question Classification
  [ ] ptb................. Penn Treebank
  [ ] ppattach............ Prepositional Phrase Attachment Corpus
  [ ] propbank............ Proposition Bank Corpus 1.0
  [ ] problem_reports..... Problem Report Corpus
  [ ] rte................. PASCAL RTE Challenges 1, 2, and 3
  [ ] sinica_treebank..... Sinica Treebank Corpus Sample
  [ ] verbnet............. VerbNet Lexicon, Version 2.1
  [ ] reuters............. The Reuters-21578 benchmark corpus, ApteMod version

  [ ] semcor.............. SemCor 3.0
  [ ] senseval............ SENSEVAL 2 Corpus: Sense Tagged Text
  [ ] smultron............ SMULTRON Corpus Sample
  [ ] shakespeare......... Shakespeare XML Corpus Sample
  [ ] state_union......... C-Span State of the Union Address Corpus
  [ ] stopwords........... Stopwords Corpus
  [ ] swadesh............. Swadesh Wordlists
  [ ] switchboard......... Switchboard Corpus Sample
  [ ] toolbox............. Toolbox Sample Files
  [ ] udhr2............... Universal Declaration of Human Rights Corpus (Unicode Version)
  [ ] timit............... TIMIT Corpus Sample
  [ ] wordnet............. WordNet
  [ ] treebank............ Penn Treebank Sample
  [ ] udhr................ Universal Declaration of Human Rights Corpus
  [ ] webtext............. Web Text Corpus
  [ ] unicode_samples..... Unicode Samples
  [ ] ycoe................ York-Toronto-Helsinki Parsed Corpus of Old English Prose
  [ ] sample_grammars..... Sample Grammars
  [ ] book_grammars....... Grammars from NLTK Book
  [ ] wordnet_ic.......... WordNet-InfoContent
  [ ] words............... Word Lists
  [ ] spanish_grammars.... Grammars for Spanish
  [ ] basque_grammars..... Grammars for Basque
  [ ] large_grammars...... Large context-free and feature-based grammars for parser comparison

  [ ] tagsets............. Help on Tagsets
  [ ] maxent_treebank_pos_tagger Treebank Part of Speech Tagger (Maximum entropy)
  [ ] rslp................ RSLP Stemmer (Removedor de Sufixos da Lingua Portuguesa)
  [ ] hmm_treebank_pos_tagger Treebank Part of Speech Tagger (HMM)
  [ ] punkt............... Punkt Tokenizer Models

Collections:
  [ ] all-corpora......... All the corpora
  [ ] all................. All packages
  [ ] book................ Everything used in the NLTK Book

([*] marks installed packages)

---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d
Download which package (l=list; x=cancel)?
  Identifier> brown
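
The interactive menu is optional. As a rough sketch, the same package can be fetched from a script by passing its identifier straight to nltk.download(). The helper names below (resource_path, ensure_package) are hypothetical and not part of NLTK, and the "corpora/" prefix assumes NLTK's usual data layout, where corpora such as brown live under corpora/ and tokenizer models such as punkt live under tokenizers/.

```python
def resource_path(package_id, category="corpora"):
    """Build the lookup path used by nltk.data.find(), e.g. 'corpora/brown'."""
    return f"{category}/{package_id}"


def ensure_package(package_id, category="corpora"):
    """Download an NLTK data package only if it cannot be found locally.

    Returns True if a download was attempted, False if the package
    was already installed. (Hypothetical helper, for illustration.)
    """
    import nltk  # imported lazily so resource_path() works without NLTK

    try:
        nltk.data.find(resource_path(package_id, category))
        return False
    except LookupError:
        # Equivalent to choosing "d" in the menu and typing the identifier.
        nltk.download(package_id, quiet=True)
        return True
```

Calling ensure_package("brown") at the top of a script should then make the import check below work on a fresh machine, assuming network access is available.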

Check that the brown package is installed
>>> from nltk.corpus import brown 
>>> brown.words() 
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
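
Since brown.words() yields plain strings, the result can be handed to ordinary Python code. The sketch below counts the most frequent words with collections.Counter. The function name most_common_words is hypothetical, and the sample list is made up from the tokens shown above (with one extra 'the' added) so that the block runs without the corpus installed.

```python
from collections import Counter


def most_common_words(words, n=5):
    """Return the n most frequent alphabetic tokens, ignoring case."""
    counts = Counter(w.lower() for w in words if w.isalpha())
    return counts.most_common(n)


# With the corpus installed, this could be called as:
#   from nltk.corpus import brown
#   most_common_words(brown.words(categories="news"))
sample = ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'the']
print(most_common_words(sample, n=1))  # [('the', 2)]
```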


Try NLTK
>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
>>> tagged = nltk.pos_tag(tokens)
>>> tagged[0:6]
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
('Thursday', 'NNP'), ('morning', 'NN')]
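
The tagged result is a list of (token, tag) tuples, so it can be filtered or summarized without any further NLTK calls. A minimal sketch, reusing the six pairs printed above as literal data; the helper name tokens_with_tag is hypothetical.

```python
from collections import Counter

# The first six (token, tag) pairs from the session above.
tagged = [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
          ('Thursday', 'NNP'), ('morning', 'NN')]


def tokens_with_tag(tagged_tokens, tag):
    """Return the tokens whose part-of-speech tag equals `tag`."""
    return [token for token, t in tagged_tokens if t == tag]


# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged)

print(tokens_with_tag(tagged, 'IN'))  # ['At', 'on']
print(tag_counts['IN'])               # 2
```

Note that word_tokenize and pos_tag themselves rely on downloaded models (a tokenizer and a tagger, such as punkt and one of the pos_tagger packages in the list above). Which packages are required depends on the NLTK version; if one is missing, NLTK raises a LookupError that names the resource to download.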















