
Dive into NLTK Part II

Dive Into NLTK, Part VI: Add Stanford Word Segmenter Interface for Python NLTK

Stanford Word Segmenter is one of the open source Java text analysis tools provided by the Stanford NLP Group. We have already shown how to use Stanford text analysis tools in NLTK, because NLTK provides interfaces for Stanford NLP tools such as the POS Tagger, Named Entity Recognizer and Parser. But for the Stanford Word Segmenter there is no interface in NLTK, and a Google search turned up no Python interface either. So I decided to write a Stanford Segmenter interface for NLTK, like the existing tagger and parser interfaces.

But before you can use it in Python NLTK, you should first install the latest version of NLTK from source; here we recommend the development version of NLTK on GitHub: https://github.com/nltk/nltk. Second, you need to install a Java environment. The following are the steps on an Ubuntu 12.04 VPS:

sudo apt-get update
Then check whether Java is already installed:

java -version
If it returns “The program java can be found in the following packages”, Java hasn’t been installed yet, so execute the following command:

sudo apt-get install default-jre
This will install the Java Runtime Environment (JRE). If you instead need the Java Development Kit (JDK), which is usually needed to compile Java applications (for example with Apache Ant, Apache Maven, Eclipse or IntelliJ IDEA), execute the following command:

sudo apt-get install default-jdk
That is everything that is needed to install Java.

The last step is to download and unzip the latest Stanford Word Segmenter package: Download Stanford Word Segmenter version 2014-08-27.

In the NLTK code base, the Stanford Tagger interface is in nltk/tag/stanford.py and the Stanford Parser interface is in nltk/parse/stanford.py. We wanted to add the Stanford Segmenter to the nltk/tokenize directory, but found that it already contains a stanford.py which supports the Stanford PTBTokenizer. So we added a stanford_segmenter.py to the nltk/tokenize directory, which serves as the Stanford Word Segmenter interface and is based on Linux pipes and the Python subprocess module:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Natural Language Toolkit: Interface to the Stanford Chinese Segmenter
#
# Copyright (C) 2001-2014 NLTK Project
# Author: 52nlp <52nlpcn@gmail.com>
#
# URL: <http://nltk.org/>
# For license information, see LICENSE.TXT
 
from __future__ import unicode_literals, print_function
 
import tempfile
import os
import json
from subprocess import PIPE
 
from nltk import compat
from nltk.internals import find_jar, config_java, java, _java_options
 
from nltk.tokenize.api import TokenizerI
 
class StanfordSegmenter(TokenizerI):
    r"""
    Interface to the Stanford Segmenter
 
    >>> from nltk.tokenize.stanford_segmenter import StanfordSegmenter
    >>> segmenter = StanfordSegmenter(path_to_jar="stanford-segmenter-3.4.1.jar", path_to_sihan_corpora_dict="./data", path_to_model="./data/pku.gz", path_to_dict="./data/dict-chris6.ser.gz")
    >>> sentence = u"这是斯坦福中文分词器测试"
    >>> segmenter.segment(sentence)
    u'\u8fd9 \u662f \u65af\u5766\u798f \u4e2d\u6587 \u5206\u8bcd\u5668 \u6d4b\u8bd5\n'
    >>> segmenter.segment_file("test.simp.utf8")
    u'\u9762\u5bf9 \u65b0 \u4e16\u7eaa \uff0c \u4e16\u754c \u5404\u56fd ...
    """
 
    _JAR = 'stanford-segmenter.jar'
 
    def __init__(self, path_to_jar=None,
            path_to_sihan_corpora_dict=None,
            path_to_model=None, path_to_dict=None,
            encoding='UTF-8', options=None,
            verbose=False, java_options='-mx2g'):
        self._stanford_jar = find_jar(
            self._JAR, path_to_jar,
            env_vars=('STANFORD_SEGMENTER',),
            searchpath=(),
            verbose=verbose
        )
        self._sihan_corpora_dict = path_to_sihan_corpora_dict
        self._model = path_to_model
        self._dict = path_to_dict
 
        self._encoding = encoding
        self.java_options = java_options
        options = {} if options is None else options
        self._options_cmd = ','.join('{0}={1}'.format(key, json.dumps(val)) for key, val in options.items())
 
    def segment_file(self, input_file_path):
        """
        """
        cmd = [
            'edu.stanford.nlp.ie.crf.CRFClassifier',
            '-sighanCorporaDict', self._sihan_corpora_dict,
            '-textFile', input_file_path,
            '-sighanPostProcessing', 'true',
            '-keepAllWhitespaces', 'false',
            '-loadClassifier', self._model,
            '-serDictionary', self._dict
        ]
 
        stdout = self._execute(cmd)
 
        return stdout
 
    def segment(self, tokens):
        return self.segment_sents([tokens])
 
    def segment_sents(self, sentences):
        """
        """
        encoding = self._encoding
        # Create a temporary input file
        _input_fh, self._input_file_path = tempfile.mkstemp(text=True)
 
        # Write the actual sentences to the temporary input file
        _input_fh = os.fdopen(_input_fh, 'wb')
        _input = '\n'.join((' '.join(x) for x in sentences))
        if isinstance(_input, compat.text_type) and encoding:
            _input = _input.encode(encoding)
        _input_fh.write(_input)
        _input_fh.close()
 
        cmd = [
            'edu.stanford.nlp.ie.crf.CRFClassifier',
            '-sighanCorporaDict', self._sihan_corpora_dict,
            '-textFile', self._input_file_path,
            '-sighanPostProcessing', 'true',
            '-keepAllWhitespaces', 'false',
            '-loadClassifier', self._model,
            '-serDictionary', self._dict
        ]
 
        stdout = self._execute(cmd)
 
        # Delete the temporary file
        os.unlink(self._input_file_path)
 
        return stdout
 
    def _execute(self, cmd, verbose=False):
        encoding = self._encoding
        cmd.extend(['-inputEncoding', encoding])
        _options_cmd = self._options_cmd
        if _options_cmd:
            cmd.extend(['-options', self._options_cmd])
 
        default_options = ' '.join(_java_options)
 
        # Configure java.
        config_java(options=self.java_options, verbose=verbose)
 
        stdout, _stderr = java(cmd, classpath=self._stanford_jar, stdout=PIPE, stderr=PIPE)
        stdout = stdout.decode(encoding)
 
        # Return java configurations to their default values.
        config_java(options=default_options, verbose=False)
 
        return stdout
 
def setup_module(module):
    from nose import SkipTest
 
    try:
        StanfordSegmenter()
    except LookupError:
        raise SkipTest('doctests from nltk.tokenize.stanford_segmenter are skipped because the stanford segmenter jar doesn\'t exist')

We have forked the latest NLTK project and added stanford_segmenter.py to it. You can get this version, or just add stanford_segmenter.py to the nltk/tokenize/ directory of your latest NLTK package and reinstall it. The usage example is shown in the code; to test it, you need to "cd stanford-segmenter-2014-08-27" first, then try it in the Python interpreter:

>>> from nltk.tokenize.stanford_segmenter import StanfordSegmenter
>>> segmenter = StanfordSegmenter(path_to_jar="stanford-segmenter-3.4.1.jar", path_to_sihan_corpora_dict="./data", path_to_model="./data/pku.gz", path_to_dict="./data/dict-chris6.ser.gz")
>>> sentence = u"这是斯坦福中文分词器测试"
>>> segmenter.segment(sentence)
u'\u8fd9 \u662f \u65af\u5766\u798f \u4e2d\u6587 \u5206\u8bcd\u5668 \u6d4b\u8bd5\n'
>>> segmenter.segment_file("test.simp.utf8")
u'\u9762\u5bf9 \u65b0 \u4e16\u7eaa \uff0c \u4e16\u754c \u5404\u56fd ...
>>> outfile = open('outfile', 'w')
>>> result = segmenter.segment(sentence)
>>> outfile.write(result.encode('UTF-8'))
>>> outfile.close()

Opening the outfile, we get: 这 是 斯坦福 中文 分词器 测试.

The problem we met here is that every time the "segment" or "segment_file" method is executed, the interface has to load the model and dictionary again. I tried the "readStdin" option and the communicate method of the subprocess module, but could not resolve this problem. After googling the PIPE and subprocess documentation for a long time, I still can't find a proper way to load the model and dictionary once and then segment sentence by sentence without reloading the data. Can you suggest a method to resolve this problem? A rough sketch of the long-running-process idea is shown below.
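
The following is only a minimal sketch of that idea, not a working solution from this article: it assumes the segmenter's CRFClassifier accepts a -readStdin flag (as mentioned above) and flushes one line of output per line of input. In practice, buffering on the Java side may still prevent this from working reliably, and all paths and jar names below are just the examples used earlier.

# Hypothetical sketch: keep one Java segmenter process alive and feed it
# sentences over stdin, so the model and dictionary are loaded only once.
# Assumes -readStdin behaves as a line-by-line filter, which may not hold.
import subprocess

cmd = [
    'java', '-mx2g', '-cp', 'stanford-segmenter-3.4.1.jar',
    'edu.stanford.nlp.ie.crf.CRFClassifier',
    '-sighanCorporaDict', './data',
    '-loadClassifier', './data/pku.gz',
    '-serDictionary', './data/dict-chris6.ser.gz',
    '-sighanPostProcessing', 'true',
    '-keepAllWhitespaces', 'false',
    '-inputEncoding', 'UTF-8',
    '-readStdin'
]
proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)

def segment_line(line):
    """Send one sentence to the running segmenter and read one line back."""
    proc.stdin.write(line.encode('UTF-8') + b'\n')
    proc.stdin.flush()
    return proc.stdout.readline().decode('UTF-8')

print(segment_line(u'这是斯坦福中文分词器测试'))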


Dive Into NLTK, Part VII: A Preliminary Study on Text Classification

Text classification is a very useful technique in text analysis; for example, it can be used in spam filtering, language identification, sentiment analysis, genre classification and so on. According to Wikipedia, text classification is also referred to as document classification:

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done “manually” (or “intellectually”) or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is used mainly in information science and computer science. The problems are overlapping, however, and there is therefore also interdisciplinary research on document classification.

In this article we focus on automatic text (document) classification. If you are not familiar with this technique, we strongly recommend that you follow the Stanford NLP course on Coursera first, where the week 3 lectures show you what text classification and the Naive Bayes model are, and the week 4 lectures cover the discriminative approach with Maximum Entropy classifiers: Natural Language Processing by Dan Jurafsky and Christopher Manning.

Here we will dive directly into NLTK and cover everything related to text classification in NLTK. You can find the NLTK classifier code in the nltk/nltk/classify directory; from the __init__.py file we can learn something about the NLTK classifier interfaces:

# Natural Language Toolkit: Classifiers
#
# Copyright (C) 2001-2014 NLTK Project
# Author: Edward Loper <edloper@gmail.com>
# URL: <http://nltk.org/>
# For license information, see LICENSE.TXT
 
"""
Classes and interfaces for labeling tokens with category labels (or
"class labels").  Typically, labels are represented with strings
(such as ``'health'`` or ``'sports'``).  Classifiers can be used to
perform a wide range of classification tasks.  For example,
classifiers can be used...
 
- to classify documents by topic
- to classify ambiguous words by which word sense is intended
- to classify acoustic signals by which phoneme they represent
- to classify sentences by their author
 
Features
========
In order to decide which category label is appropriate for a given
token, classifiers examine one or more 'features' of the token.  These
"features" are typically chosen by hand, and indicate which aspects
of the token are relevant to the classification decision.  For
example, a document classifier might use a separate feature for each
word, recording how often that word occurred in the document.
 
Featuresets
===========
The features describing a token are encoded using a "featureset",
which is a dictionary that maps from "feature names" to "feature
values".  Feature names are unique strings that indicate what aspect
of the token is encoded by the feature.  Examples include
``'prevword'``, for a feature whose value is the previous word; and
``'contains-word(library)'`` for a feature that is true when a document
contains the word ``'library'``.  Feature values are typically
booleans, numbers, or strings, depending on which feature they
describe.
 
Featuresets are typically constructed using a "feature detector"
(also known as a "feature extractor").  A feature detector is a
function that takes a token (and sometimes information about its
context) as its input, and returns a featureset describing that token.
For example, the following feature detector converts a document
(stored as a list of words) to a featureset describing the set of
words included in the document:
 
    >>> # Define a feature detector function.
    >>> def document_features(document):
    ...     return dict([('contains-word(%s)' % w, True) for w in document])
 
Feature detectors are typically applied to each token before it is fed
to the classifier:
 
    >>> # Classify each Gutenberg document.
    >>> from nltk.corpus import gutenberg
    >>> for fileid in gutenberg.fileids(): # doctest: +SKIP
    ...     doc = gutenberg.words(fileid) # doctest: +SKIP
    ...     print fileid, classifier.classify(document_features(doc)) # doctest: +SKIP
 
The parameters that a feature detector expects will vary, depending on
the task and the needs of the feature detector.  For example, a
feature detector for word sense disambiguation (WSD) might take as its
input a sentence, and the index of a word that should be classified,
and return a featureset for that word.  The following feature detector
for WSD includes features describing the left and right contexts of
the target word:
 
    >>> def wsd_features(sentence, index):
    ...     featureset = {}
    ...     for i in range(max(0, index-3), index):
    ...         featureset['left-context(%s)' % sentence[i]] = True
    ...     for i in range(index, max(index+3, len(sentence))):
    ...         featureset['right-context(%s)' % sentence[i]] = True
    ...     return featureset
 
Training Classifiers
====================
Most classifiers are built by training them on a list of hand-labeled
examples, known as the "training set".  Training sets are represented
as lists of ``(featuredict, label)`` tuples.
"""
 
from nltk.classify.api import ClassifierI, MultiClassifierI
from nltk.classify.megam import config_megam, call_megam
from nltk.classify.weka import WekaClassifier, config_weka
from nltk.classify.naivebayes import NaiveBayesClassifier
from nltk.classify.positivenaivebayes import PositiveNaiveBayesClassifier
from nltk.classify.decisiontree import DecisionTreeClassifier
from nltk.classify.rte_classify import rte_classifier, rte_features, RTEFeatureExtractor
from nltk.classify.util import accuracy, apply_features, log_likelihood
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.classify.maxent import (MaxentClassifier, BinaryMaxentFeatureEncoding,
                                  TypedMaxentFeatureEncoding,
                                  ConditionalExponentialClassifier)

The most basic thing for a supervised text classifier is labeled category data, which can be used as training data. As an example, we use the NLTK Names corpus to train a gender identification classifier:

In [1]: from nltk.corpus import names
 
In [2]: import random
 
In [3]: names = ([(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')])
 
In [4]: random.shuffle(names)
 
In [5]: len(names)
Out[5]: 7944
 
In [6]: names[0:10]
Out[6]: 
[(u'Marthe', 'female'),
 (u'Elana', 'female'),
 (u'Ernie', 'male'),
 (u'Colleen', 'female'),
 (u'Lynde', 'female'),
 (u'Barclay', 'male'),
 (u'Skippy', 'male'),
 (u'Marcelia', 'female'),
 (u'Charlena', 'female'),
 (u'Ronnica', 'female')]

The most important thing for a text classifier is the features, which can be very flexible and are defined by a human engineer. Here we just use the final letter of a given name as the feature, and build a dictionary containing this relevant information about the name:

In [7]: def gender_features(word):
   ...:     return {'last_letter': word[-1]}
   ...: 
 
In [8]: gender_features('Gary')
Out[8]: {'last_letter': 'y'}

The dictionary returned by this function is called a feature set and maps feature names to their values. Feature sets are a core part of the NLTK classifiers; we can use the feature extractor to extract feature sets for an NLTK classifier and split them into a training set and a testing set:

In [9]: featuresets = [(gender_features(n), g) for (n, g) in names]
 
In [10]: len(featuresets)
Out[10]: 7944
 
In [11]: featuresets[0:10]
Out[11]: 
[({'last_letter': u'e'}, 'female'),
 ({'last_letter': u'a'}, 'female'),
 ({'last_letter': u'e'}, 'male'),
 ({'last_letter': u'n'}, 'female'),
 ({'last_letter': u'e'}, 'female'),
 ({'last_letter': u'y'}, 'male'),
 ({'last_letter': u'y'}, 'male'),
 ({'last_letter': u'a'}, 'female'),
 ({'last_letter': u'a'}, 'female'),
 ({'last_letter': u'a'}, 'female')]
 
In [12]: train_set, test_set = featuresets[500:], featuresets[:500]
 
In [13]: len(train_set)
Out[13]: 7444
 
In [14]: len(test_set)
Out[14]: 500

A learning algorithm is the other half of a classifier. Here we will show you how to use the Naive Bayes and Maximum Entropy models to train a NaiveBayes and a Maxent classifier, where Naive Bayes is a generative model and Maxent is a discriminative model.

Here is how to train a Naive Bayes classifier for Gender Identification:

In [16]: from nltk import NaiveBayesClassifier
 
In [17]: nb_classifier = NaiveBayesClassifier.train(train_set)
 
In [18]: nb_classifier.classify(gender_features('Gary'))
Out[18]: 'female'
 
In [19]: nb_classifier.classify(gender_features('Grace'))
Out[19]: 'female'
 
In [20]: from nltk import classify
 
In [21]: classify.accuracy(nb_classifier, test_set)
Out[21]: 0.73
 
In [22]: nb_classifier.show_most_informative_features(5)
Most Informative Features
             last_letter = u'a'           female : male   =     38.4 : 1.0
             last_letter = u'k'             male : female =     33.4 : 1.0
             last_letter = u'f'             male : female =     16.7 : 1.0
             last_letter = u'p'             male : female =     11.9 : 1.0
             last_letter = u'v'             male : female =     10.6 : 1.0

Here is how to train a Maximum Entropy Classifier for Gender Identification:

In [23]: from nltk import MaxentClassifier
 
In [24]: me_classifier = MaxentClassifier.train(train_set)
  ==> Training (100 iterations)
 
      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.369
             2          -0.37066        0.765
             3          -0.37029        0.765
             4          -0.37007        0.765
             5          -0.36992        0.765
             6          -0.36981        0.765
             7          -0.36973        0.765
             8          -0.36967        0.765
             9          -0.36962        0.765
            10          -0.36958        0.765
            11          -0.36955        0.765
            12          -0.36952        0.765
            13          -0.36949        0.765
            14          -0.36947        0.765
            15          -0.36945        0.765
            16          -0.36944        0.765
            17          -0.36942        0.765
            18          -0.36941        0.765
            ....
In [25]: me_classifier.classify(gender_features('Gary'))
Out[25]: 'female'
 
In [26]: me_classifier.classify(gender_features('Grace'))
Out[26]: 'female'
 
In [27]: classify.accuracy(me_classifier, test_set)
Out[27]: 0.728
 
In [28]: me_classifier.show_most_informative_features(5)
   6.644 last_letter==u'c' and label is 'male'
  -5.082 last_letter==u'a' and label is 'male'
  -3.565 last_letter==u'k' and label is 'female'
  -2.700 last_letter==u'f' and label is 'female'
  -2.248 last_letter==u'p' and label is 'female'

It seems that the Naive Bayes and Maxent models give the same result on this gender task, but that's not quite true. Choosing the right features and deciding how to encode them for the task have a big impact on performance. Here we define a more complex feature extractor function and train the models again:

In [29]: def gender_features2(name):
   ....:     features = {}
   ....:     features["firstletter"] = name[0].lower()
   ....:     features["lastletter"] = name[-1].lower()
   ....:     for letter in 'abcdefghijklmnopqrstuvwxyz':
   ....:         features["count(%s)" % letter] = name.lower().count(letter)
   ....:         features["has(%s)" % letter] = (letter in name.lower())
   ....:     return features
   ....: 
 
In [30]: gender_features2('Gary')
Out[30]: 
{'count(a)': 1,
 'count(b)': 0,
 'count(c)': 0,
 'count(d)': 0,
 'count(e)': 0,
 'count(f)': 0,
 'count(g)': 1,
 'count(h)': 0,
 'count(i)': 0,
 'count(j)': 0,
 'count(k)': 0,
 'count(l)': 0,
 'count(m)': 0,
 'count(n)': 0,
 'count(o)': 0,
 'count(p)': 0,
 'count(q)': 0,
 'count(r)': 1,
 'count(s)': 0,
 'count(t)': 0,
 'count(u)': 0,
 'count(v)': 0,
 'count(w)': 0,
 'count(x)': 0,
 'count(y)': 1,
 'count(z)': 0,
 'firstletter': 'g',
 'has(a)': True,
 'has(b)': False,
 'has(c)': False,
 'has(d)': False,
 'has(e)': False,
 'has(f)': False,
 'has(g)': True,
 'has(h)': False,
 'has(i)': False,
 'has(j)': False,
 'has(k)': False,
 'has(l)': False,
 'has(m)': False,
 'has(n)': False,
 'has(o)': False,
 'has(p)': False,
 'has(q)': False,
 'has(r)': True,
 'has(s)': False,
 'has(t)': False,
 'has(u)': False,
 'has(v)': False,
 'has(w)': False,
 'has(x)': False,
 'has(y)': True,
 'has(z)': False,
 'lastletter': 'y'}
 
In [32]: featuresets = [(gender_features2(n), g) for (n, g) in names]
 
In [34]: train_set, test_set = featuresets[500:], featuresets[:500]
 
In [35]: nb2_classifier = NaiveBayesClassifier.train(train_set)
 
In [36]: classify.accuracy(nb2_classifier, test_set)
Out[36]: 0.774
 
In [37]: me2_classifier = MaxentClassifier.train(train_set)
  ==> Training (100 iterations)
 
      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.369
             2          -0.61051        0.631
             3          -0.59637        0.631
             4          -0.58304        0.631
             5          -0.57050        0.637
             6          -0.55872        0.651
             7          -0.54766        0.672
             8          -0.53728        0.689
             ....
            93          -0.33632        0.805
            94          -0.33588        0.805
            95          -0.33545        0.805
            96          -0.33503        0.805
            97          -0.33462        0.805
            98          -0.33421        0.805
            99          -0.33382        0.805
         Final          -0.33343        0.805
 
In [38]: classify.accuracy(me2_classifier, test_set)
Out[38]: 0.78

It seems that more features make the Maximum Entropy model more accurate, but also much slower to train. We can define a third feature extractor function and train the Naive Bayes and Maxent classifiers again:

In [49]: def gender_features3(name):
    features = {}
    features["fl"] = name[0].lower()
    features["ll"] = name[-1].lower()
    features["fw"] = name[:2].lower()
    features["lw"] = name[-2:].lower()
    return features
 
In [50]: gender_features3('Gary')
Out[50]: {'fl': 'g', 'fw': 'ga', 'll': 'y', 'lw': 'ry'}
 
In [51]: gender_features3('G')
Out[51]: {'fl': 'g', 'fw': 'g', 'll': 'g', 'lw': 'g'}
 
In [52]: gender_features3('Gary')
Out[52]: {'fl': 'g', 'fw': 'ga', 'll': 'y', 'lw': 'ry'}
 
In [53]: featuresets = [(gender_features3(n), g) for (n, g) in names]
 
In [54]: featuresets[0]
Out[54]: ({'fl': u'm', 'fw': u'ma', 'll': u'e', 'lw': u'he'}, 'female')
 
In [55]: len(featuresets)
Out[55]: 7944
 
In [56]: train_set, test_set = featuresets[500:], featuresets[:500]
 
In [57]: nb3_classifier = NaiveBayesClassifier.train(train_set)
 
In [59]: classify.accuracy(nb3_classifier, test_set)
Out[59]: 0.77
In [60]: me3_classifier = MaxentClassifier.train(train_set)
  ==> Training (100 iterations)
 
      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.369
             2          -0.40398        0.800
             3          -0.34739        0.821
             4          -0.32196        0.825
             5          -0.30766        0.827
             ......
            95          -0.25608        0.839
            96          -0.25605        0.839
            97          -0.25601        0.839
            98          -0.25598        0.839
            99          -0.25595        0.839
         Final          -0.25591        0.839
 
In [61]: classify.accuracy(me3_classifier, test_set)
Out[61]: 0.798

It seems that with proper feature extraction, the Maximum Entropy classifier can get better performance on the test set. In fact, selecting the right features is often the most important part of supervised text classification: you need to spend a lot of time choosing features and selecting a good learning algorithm, with parameter tuning, for your text classifier. A small sketch of such a comparison follows.
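
As a quick illustration, here is a minimal sketch (reusing the gender_features, gender_features2 and gender_features3 functions defined above) that evaluates each feature extractor with a Naive Bayes classifier on the same held-out split; the exact numbers will vary with the random shuffle.

# Compare the three feature extractors defined above on the same held-out split.
# Assumes `names` is the shuffled list of (name, gender) pairs built earlier.
from nltk import NaiveBayesClassifier, classify

for extractor in (gender_features, gender_features2, gender_features3):
    featuresets = [(extractor(n), g) for (n, g) in names]
    train_set, test_set = featuresets[500:], featuresets[:500]
    nb = NaiveBayesClassifier.train(train_set)
    print("%s: %.3f" % (extractor.__name__, classify.accuracy(nb, test_set)))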

Our preliminary study on text classification in NLTK ends here; in the next chapter we will dive into text classification with a more useful example.


Dive Into NLTK, Part VIII: Using External Maximum Entropy Modeling Libraries for Text Classification

Maximum entropy modeling, also known as multinomial logistic regression, has been one of the most popular frameworks for text analysis tasks since it was first introduced into the NLP area by Berger and Della Pietra in 1996. Many maximum entropy modeling tools and libraries have been implemented in several programming languages since then, and you can find a fairly complete list of maximum entropy related software on this website maintained by Dr. Le Zhang: Maximum Entropy Modeling, which also contains very useful materials on maximum entropy models.

NLTK provides several learning algorithms for text classification, such as naive Bayes and decision trees, and it also includes maximum entropy models; you can find them all in the nltk/classify module. For maximum entropy modeling, you can find the details in maxent.py:

# Natural Language Toolkit: Maximum Entropy Classifiers
#
# Copyright (C) 2001-2014 NLTK Project
# Author: Edward Loper <edloper@gmail.com>
#         Dmitry Chichkov <dchichkov@gmail.com> (TypedMaxentFeatureEncoding)
# URL: <http://nltk.org/>
# For license information, see LICENSE.TXT
 
"""
A classifier model based on maximum entropy modeling framework.  This
framework considers all of the probability distributions that are
empirically consistent with the training data; and chooses the
distribution with the highest entropy.  A probability distribution is
"empirically consistent" with a set of training data if its estimated
frequency with which a class and a feature vector value co-occur is
equal to the actual frequency in the data.
 
Terminology: 'feature'
======================
The term *feature* is usually used to refer to some property of an
unlabeled token.  For example, when performing word sense
disambiguation, we might define a ``'prevword'`` feature whose value is
the word preceding the target word.  However, in the context of
maxent modeling, the term *feature* is typically used to refer to a
property of a "labeled" token.  In order to prevent confusion, we
will introduce two distinct terms to disambiguate these two different
concepts:
 
  - An "input-feature" is a property of an unlabeled token.
  - A "joint-feature" is a property of a labeled token.
 
In the rest of the ``nltk.classify`` module, the term "features" is
used to refer to what we will call "input-features" in this module.
 
In literature that describes and discusses maximum entropy models,
input-features are typically called "contexts", and joint-features
are simply referred to as "features".
 
Converting Input-Features to Joint-Features
-------------------------------------------
In maximum entropy models, joint-features are required to have numeric
values.  Typically, each input-feature ``input_feat`` is mapped to a
set of joint-features of the form:
 
|   joint_feat(token, label) = { 1 if input_feat(token) == feat_val
|                              {      and label == some_label
|                              {
|                              { 0 otherwise
 
For all values of ``feat_val`` and ``some_label``.  This mapping is
performed by classes that implement the ``MaxentFeatureEncodingI``
interface.
"""
from __future__ import print_function, unicode_literals
__docformat__ = 'epytext en'
 
try:
    import numpy
except ImportError:
    pass
 
import time
import tempfile
import os
import gzip
from collections import defaultdict
 
from nltk import compat
from nltk.data import gzip_open_unicode
from nltk.util import OrderedDict
from nltk.probability import DictionaryProbDist
 
from nltk.classify.api import ClassifierI
from nltk.classify.util import CutoffChecker, accuracy, log_likelihood
from nltk.classify.megam import call_megam, write_megam_file, parse_megam_weights
from nltk.classify.tadm import call_tadm, write_tadm_file, parse_tadm_weights
...

Within NLTK, the Maxent training algorithms include GIS (Generalized Iterative Scaling), IIS (Improved Iterative Scaling), and LM-BFGS via external tools. The first two are implemented in pure Python in NLTK but seem very slow and consume a lot of memory on the same training data. For the LBFGS-style optimization, NLTK supports external libraries such as MEGAM (MEGA Model Optimization Package), which is very elegant. The training algorithm is selected when calling MaxentClassifier.train, as in the sketch below.
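
For reference, a minimal sketch of switching between the built-in and external trainers; it assumes train_set is a list of (featureset, label) pairs as built earlier, and that the megam binary has been configured (see below). The max_iter cutoff is only used here to keep the slow built-in trainers short.

# Choose the Maxent training algorithm by name; 'gis' and 'iis' are the
# built-in (slow) Python trainers, 'megam' calls the external binary.
from nltk import MaxentClassifier

gis_clf = MaxentClassifier.train(train_set, algorithm='gis', max_iter=10)
iis_clf = MaxentClassifier.train(train_set, algorithm='iis', max_iter=10)
megam_clf = MaxentClassifier.train(train_set, algorithm='megam')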

MEGAM is built on the OCaml system, which is the main implementation of the Caml language. Caml is a general-purpose programming language, designed with program safety and reliability in mind. In order to use MEGAM on your system, you need to install OCaml first. On my Ubuntu 12.04 VPS it's very easy to install the latest OCaml version, 4.02:

wget http://caml.inria.fr/pub/distrib/ocaml-4.02/ocaml-4.02.1.tar.gz
tar -zxvf ocaml-4.02.1.tar.gz
cd ocaml-4.02.1
./configure
make world.opt
sudo make install

After installing OCaml, it's time to install MEGAM:

wget http://hal3.name/megam/megam_src.tgz
tar -zxvf megam_src.tgz
cd megam_0.92

According to the README, installing MEGAM is very easy:

To build a safe but slow version, just execute:

make

which will produce an executable megam, unless something goes wrong.

To build a fast but not so safe version, execute

make opt

which will produce an executable megam.opt that will be much much
faster. If you encounter any bugs, please let me know (if something
crashes, it’s probably easiest to switch to the safe but slow version,
run it, and let me know what the error message is).

But when we executed "make" the first time, we got an error like this:


ocamlc -g -custom -o megam str.cma -cclib -lstr bigarray.cma -cclib -lbigarray unix.cma -cclib -lunix -I /usr/lib/ocaml/caml fastdot_c.c fastdot.cmo intHashtbl.cmo arry.cmo util.cmo data.cmo bitvec.cmo cg.cmo wsemlm.cmo bfgs.cmo pa.cmo perceptron.cmo radapt.cmo kernelmap.cmo abffs.cmo main.cmo
fastdot_c.c:4:19: fatal error: alloc.h: No such file or directory

Here you should use "ocamlc -where" to find the right OCaml library path (/usr/local/lib/ocaml), and then edit line 74 of the Makefile (note that this edit is for my Ubuntu 12.04 VPS):

#WITHCLIBS =-I /usr/lib/ocaml/caml
WITHCLIBS =-I /usr/local/lib/ocaml/caml

Then execute "make" again, which hits another problem:


ocamlc -g -custom -o megam str.cma -cclib -lstr bigarray.cma -cclib -lbigarray unix.cma -cclib -lunix -I /usr/local/lib/ocaml/caml fastdot_c.c fastdot.cmo intHashtbl.cmo arry.cmo util.cmo data.cmo bitvec.cmo cg.cmo wsemlm.cmo bfgs.cmo pa.cmo perceptron.cmo radapt.cmo kernelmap.cmo abffs.cmo main.cmo
/usr/bin/ld: cannot find -lstr

Here you should edit the Makefile again, changing -lstr to -lcamlstr on line 62:

#WITHSTR =str.cma -cclib -lstr
WITHSTR =str.cma -cclib -lcamlstr

Then you can type "make" to build the slower executable "megam", and "make opt" to get the faster executable "megam.opt", in the Makefile directory.

But that's not the end: if you want to use it from NLTK, you have to tell NLTK where to find the "megam" or "megam.opt" binary. NLTK uses config_megam to look up the binary:

_megam_bin = None
def config_megam(bin=None):
    """
    Configure NLTK's interface to the ``megam`` maxent optimization
    package.
 
    :param bin: The full path to the ``megam`` binary.  If not specified,
        then nltk will search the system for a ``megam`` binary; and if
        one is not found, it will raise a ``LookupError`` exception.
    :type bin: str
    """
    global _megam_bin
    _megam_bin = find_binary(
        'megam', bin,
        env_vars=['MEGAM'],
        binary_names=['megam.opt', 'megam', 'megam_686', 'megam_i686.opt'],
        url='http://www.umiacs.umd.edu/~hal/megam/index.html')

The easiest way to do this is to copy the megam binary into a system binary path like /usr/bin or /usr/local/bin; then you never need to call config_megam yourself when using MEGAM for maximum entropy modeling in NLTK.

sudo cp megam /usr/local/bin/
sudo cp megam.opt /usr/local/bin/
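
Alternatively, if you keep the binary somewhere else, you can point NLTK at it explicitly with the config_megam function shown above; this is only a sketch, and the path below is just an example of a build location.

# Tell NLTK where the megam binary lives if it is not on the system PATH.
# The path below is only an example; adjust it to your own build location.
from nltk.classify import config_megam

config_megam('/home/yourname/megam_0.92/megam.opt')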

Just use it like this:

 
In [1]: import random
 
In [2]: from nltk.corpus import names
 
In [3]: from nltk import MaxentClassifier
 
In [5]: from nltk import classify
 
In [7]: names = ([(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')])
 
In [8]: random.shuffle(names)
 
In [10]: def gender_features3(name):
        features = {}
        features["fl"] = name[0].lower()
        features["ll"] = name[-1].lower()
        features["fw"] = name[:2].lower()
        features["lw"] = name[-2:].lower()
        return features
   ....: 
 
In [11]: featuresets = [(gender_features3(n), g) for (n, g) in names]
 
In [12]: train_set, test_set = featuresets[500:], featuresets[:500]
 
In [17]: me3_megam_classifier = MaxentClassifier.train(train_set, "megam")
[Found megam: megam]
Scanning file...7444 train, 0 dev, 0 test, reading...done
Warning: there only appear to be two classes, but we're
         optimizing with BFGS...using binary optimization
         with CG would be much faster
optimizing with lambda = 0
it 1   dw 4.640e-01 pp 6.38216e-01 er 0.37413
it 2   dw 2.065e-01 pp 5.74892e-01 er 0.37413
it 3   dw 3.503e-01 pp 5.43226e-01 er 0.24328
it 4   dw 1.209e-01 pp 5.29406e-01 er 0.22394
it 5   dw 4.864e-01 pp 5.27097e-01 er 0.26115
it 6   dw 5.765e-01 pp 4.92409e-01 er 0.23415
it 7   dw 0.000e+00 pp 4.92409e-01 er 0.23415
-------------------------
it 1 dw 1.802e-01 pp 4.74930e-01 er 0.21588
it 2   dw 3.478e-02 pp 4.70876e-01 er 0.21548
it 3   dw 1.963e-01 pp 4.61761e-01 er 0.21709
it 4   dw 9.624e-02 pp 4.56257e-01 er 0.21574
it 5   dw 3.442e-01 pp 4.54401e-01
......
it 10  dw 2.399e-03 pp 3.71020e-01 er 0.16967
it 11  dw 2.202e-02 pp 3.71017e-01 er 0.16980
it 12  dw 0.000e+00 pp 3.71017e-01 er 0.16980
-------------------------
it 1 dw 2.620e-02 pp 3.70816e-01 er 0.17020
it 2   dw 2.285e-02 pp 3.70721e-01 er 0.16953
it 3   dw 1.074e-02 pp 3.70631e-01 er 0.16980
it 4   dw 3.152e-02 pp 3.70580e-01 er 0.16994
it 5   dw 2.263e-02 pp 3.70504e-01 er 0.16940
it 6   dw 1.115e-01 pp 3.70370e-01 er 0.16886
it 7   dw 1.938e-01 pp 3.70318e-01 er 0.16913
it 8   dw 1.365e-01 pp 3.69815e-01 er 0.16940
it 9   dw 2.634e-01 pp 3.69366e-01 er 0.16873
it 10  dw 2.498e-01 pp 3.69290e-01 er 0.17007
it 11  dw 2.515e-01 pp 3.69240e-01 er 0.16994
it 12  dw 3.027e-01 pp 3.69234e-01 er 0.16994
it 13  dw 9.850e-03 pp 3.69233e-01 er 0.16994
it 14  dw 1.152e-01 pp 3.69214e-01 er 0.16994
it 15  dw 0.000e+00 pp 3.69214e-01 er 0.16994
 
In [18]: classify.accuracy(me3_megam_classifier, test_set)
Out[18]: 0.812

The MEGAM integration in NLTK depends on Python's subprocess.Popen, and it's very fast and uses far fewer resources than the original GIS or IIS Maxent trainers in NLTK. I have also compiled it on my local Mac Pro, and met the same problems as in the Linux Ubuntu compile process. You can also refer to this article for compiling MEGAM on Mac OS: Compiling MegaM on MacOS X.

That's the end; just enjoy using MEGAM in your NLTK or Python project.


Dive Into NLTK, Part IX: From Text Classification to Sentiment Analysis

According to Wikipedia, sentiment analysis is defined like this:

Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials. Sentiment analysis is widely applied to reviews and social media for a variety of applications, ranging from marketing to customer service.

Generally speaking, sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document. The attitude may be his or her judgment or evaluation (see appraisal theory), affective state (that is to say, the emotional state of the author when writing), or the intended emotional communication (that is to say, the emotional effect the author wishes to have on the reader).

Generally speaking, sentiment analysis can be seen as one kind of text classification task. Based on the movie review data in NLTK, we can train a basic text classification model for sentiment analysis:

Python 2.7.6 (default, Jun  3 2014, 07:43:23) 
Type "copyright", "credits" or "license" for more information.
 
IPython 3.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
 
In [1]: import nltk
 
In [2]: from nltk.corpus import movie_reviews
 
In [3]: from random import shuffle
 
In [4]: documents = [(list(movie_reviews.words(fileid)), category) 
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)]
 
In [5]: shuffle(documents)
 
In [6]: print documents[0]
([u'the', u'general', u"'", u's', u'daughter', u'will', u'probably', u'be', u'the', u'cleverest', u'stupid', u'film', u'we', u"'", u'll', u'see', u'this', u'year', u'--', u'or', u'perhaps', u'the', u'stupidest', u'clever', u'film', u'.', u'it', u"'", u's', u'confusing', u'to', u'a', u'critic', u'when', u'so', u'much', u'knuckleheaded', u'plotting', u'and', u'ostentatious', u'direction', u'shares', u'the', u'screen', u'with', u'so', u'much', u'snappy', u'dialogue', u'and', u'crisp', u'character', u'interaction', u'.', u'that', u',', u'however', u',', u'is', u'what', u'happens', u'when', u'legendary', u'screenwriter', u'william', u'goldman', u'takes', u'a', u'pass', u'at', u'an', u'otherwise', u'brutally', u'predictable', u'conspiracy', u'thriller', u'.', u'the', u'punched', u'-', u'up', u'punch', u'lines', u'are', u'ever', u'on', u'the', u'verge', u'of', u'convincing', u'you', u'the', u'general', u"'", u's', u'daughter', u'has', u'a', u'brain', u'in', u'its', u'head', u',', u'even', u'as', u'the', u'remaining', u'75', u'%', u'of', u'the', u'narrative', u'punches', u'you', u'in', u'the', u'face', u'with', u'its', u'lack', u'of', u'common', u'sense', u'.', u'our', u'hero', u'is', u'warrant', u'officer', u'paul', u'brenner', u',', u'a', u'brash', u'investigator', u'for', u'the', u'u', u'.', u's', u'.', u'army', u"'", u's', u'criminal', u'investigation', u'division', u'.', u'his', u'latest', u'case', u'is', u'the', u'murder', u'of', u'captain', u'elisabeth', u'campbell', u'(', u'leslie', u'stefanson', u')', u'at', u'a', u'georgia', u'base', u',', u'the', u'victim', u'found', u'tied', u'to', u'the', u'ground', u'after', u'an', u'apparent', u'sexual', u'assault', u'and', u'strangulation', u'.', u'complicating', u'the', u'case', u'is', u'the', u'fact', u'that', u'capt', u'.', u'campbell', u'is', u'the', u'daughter', u'of', u'general', u'joe', u'campbell', u'(', u'james', u'cromwell', u')', u',', u'a', u'war', u'hero', u'and', u'potential', u'vice', u'-', u'presidential', u'nominee', u'.', 
......
u'general', u'campbell', u'wants', u'to', u'keep', u'the', u'case', u'out', u'of', u'the', u'press', u',', u'which', u'gives', u'brenner', u'only', u'the', u'36', u'hours', u'before', u'the', u'fbi', u'steps', u'in', u'.', u'teamed', u'with', u'rape', u'investigator', u'sarah', u'sunhill', u'(', u'madeleine', u'stowe', u')', u'--', u'who', u',', u'coincidentally', u'enough', u',', u'once', u'had', u'a', u'romantic', u'relationship', u'with', u'brenner', u'--', u'brenner', u'begins', u'uncovering', \ u'evidence', u'out', u'of', u'the', u'corner', u'of', u'his', u'eye', u')', u'.', u'by', u'the', u'time', u'the', u'general', u"'", u's', u'daughter', u'wanders', u'towards', u'its', u'over', u'-', u'wrought', u',', u'psycho', u'-', u'in', u'-', u'the', u'-', u'rain', u'finale', u',', u'west', u"'", u's', u'heavy', u'hand', u'has', u'obliterated', u'most', u'of', u'what', u'made', u'the', u'film', u'occasionally', u'fun', u'.', u'it', u"'", u's', u'silly', u'and', u'pretentious', u'film', u'-', u'making', u',', u'but', u'at', u'least', u'it', u'provides', u'a', u'giggle', u'or', u'five', u'.', u'goldman', u'should', u'tear', u'the', u'15', u'decent', u'pages', u'out', u'of', u'this', u'script', u'and', u'turn', u'them', u'into', u'a', u'stand', u'-', u'up', u'routine', u'.'], u'neg')
 
 
# The total number of movie reviews documents in nltk is 2000
In [7]: len(documents)
Out[7]: 2000
 
 
# Construct a list of the 2,000 most frequent words in the overall corpus 
In [8]: all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
 
In [9]: word_features = all_words.keys()[:2000]
 
# Define a feature extractor that simply checks whether each of these words is present in a given document.
In [10]: def document_features(document):
   ....:     document_words = set(document)
   ....:     features = {}
   ....:     for word in word_features:
   ....:         features['contains(%s)' % word] = (word in document_words)
   ....:     return features
   ....: 
 
 
In [11]: print document_features(movie_reviews.words('pos/cv957_8737.txt'))
{u'contains(waste)': False, u'contains(lot)': False, u'contains(*)': True, u'contains(black)': False, u'contains(rated)': False, u'contains(potential)': False, u'contains(m)': False, u'contains(understand)': False, u'contains(drug)': True, u'contains(case)': False, u'contains(created)': False, u'contains(kiss)': False, u'contains(needed)': False, u'contains(c)': False, u'contains(about)': True, u'contains(toy)': False, u'contains(longer)': False, u'contains(ready)': False, u'contains(certainly)': False, 
......
u'contains(good)': False, u'contains(live)': False, u'contains(appropriate)': False, u'contains(towards)': False, u'contains(smile)': False, u'contains(cross)': False}
 
# Generate the feature sets for the movie review documents one by one
In [12]: featuresets = [(document_features(d), c) for (d, c) in documents]
 
# Define the train set (1900 documents) and test set (100 documents)
In [13]: train_set, test_set = featuresets[100:], featuresets[:100]
 
# Train a naive bayes classifier with train set by nltk
In [14]: classifier = nltk.NaiveBayesClassifier.train(train_set)
 
# Get the accuracy of the naive bayes classifier with test set
In [15]: print nltk.classify.accuracy(classifier, test_set)
0.81
 
# Debug info: show top n most informative features
In [16]: classifier.show_most_informative_features(10)
Most Informative Features
   contains(outstanding) = True              pos : neg    =     13.3 : 1.0
         contains(mulan) = True              pos : neg    =      8.8 : 1.0
        contains(seagal) = True              neg : pos    =      8.0 : 1.0
   contains(wonderfully) = True              pos : neg    =      6.5 : 1.0
         contains(damon) = True              pos : neg    =      6.2 : 1.0
         contains(awful) = True              neg : pos    =      6.0 : 1.0
        contains(wasted) = True              neg : pos    =      5.9 : 1.0
          contains(lame) = True              neg : pos    =      5.8 : 1.0
         contains(flynt) = True              pos : neg    =      5.5 : 1.0
        contains(poorly) = True              neg : pos    =      5.1 : 1.0

Based on the top-2000 word features, we can train a Maximum entropy classifier model with NLTK and MEGAM:

In [17]: maxent_classifier = nltk.MaxentClassifier.train(train_set, "megam")
[Found megam: /usr/local/bin/megam]
Scanning file...1900 train, 0 dev, 0 test, reading...done
Warning: there only appear to be two classes, but we're
         optimizing with BFGS...using binary optimization
         with CG would be much faster
optimizing with lambda = 0
it 1   dw 2.415e-03 pp 6.85543e-01 er 0.49895
it 2   dw 1.905e-03 pp 6.72937e-01 er 0.48895
it 3   dw 7.755e-03 pp 6.53779e-01 er 0.19526
it 4   dw 1.583e-02 pp 6.30863e-01 er 0.33526
it 5   dw 4.763e-02 pp 5.89126e-01 er 0.33895
it 6   dw 8.723e-02 pp 5.09921e-01 er 0.21211
it 7   dw 2.223e-01 pp 4.13823e-01 er 0.17000
it 8   dw 2.183e-01 pp 3.81889e-01 er 0.16526
it 9   dw 3.448e-01 pp 3.79054e-01 er 0.17421
it 10  dw 7.749e-02 pp 3.73549e-01 er 0.17105
it 11  dw 1.413e-01 pp 3.61806e-01 er 0.15842
it 12  dw 1.380e-01 pp 3.61716e-01 er 0.16000
it 13  dw 5.230e-02 pp 3.59953e-01 er 0.16053
it 14  dw 1.092e-01 pp 3.58713e-01 er 0.16211
it 15  dw 1.252e-01 pp 3.58669e-01 er 0.16000
it 16  dw 1.370e-01 pp 3.57027e-01 er 0.16105
it 17  dw 2.213e-01 pp 3.56230e-01 er 0.15684
it 18  dw 1.397e-01 pp 3.51368e-01 er 0.15579
it 19  dw 7.718e-01 pp 3.38156e-01 er 0.14947
it 20  dw 6.426e-02 pp 3.36342e-01 er 0.14947
it 21  dw 1.531e-01 pp 3.33402e-01 er 0.15053
it 22  dw 1.047e-01 pp 3.33287e-01 er 0.14895
it 23  dw 1.379e-01 pp 3.30814e-01 er 0.14895
it 24  dw 1.480e+00 pp 3.02938e-01 er 0.12842
it 25  dw 0.000e+00 pp 3.02938e-01 er 0.12842
-------------------------
......
......
-------------------------
it 1 dw 1.981e-05 pp 8.59536e-02 er 0.00684
it 2   dw 4.179e-05 pp 8.58979e-02 er 0.00684
it 3   dw 3.792e-04 pp 8.56536e-02 er 0.00684
it 4   dw 1.076e-03 pp 8.52961e-02 er 0.00737
it 5   dw 2.007e-03 pp 8.49459e-02 er 0.00737
it 6   dw 4.055e-03 pp 8.42942e-02 er 0.00737
it 7   dw 2.664e-02 pp 8.16976e-02 er 0.00526
it 8   dw 1.888e-02 pp 8.12042e-02 er 0.00316
it 9   dw 5.093e-02 pp 8.08672e-02 er 0.00316
it 10  dw 3.968e-03 pp 8.08624e-02 er 0.00316
it 11  dw 0.000e+00 pp 8.08624e-02 er 0.00316
 
In [18]: print nltk.classify.accuracy(maxent_classifier, test_set)
0.89
 
In [19]: maxent_classifier.show_most_informative_features(10)
  -1.843 contains(waste)==False and label is u'neg'
  -1.006 contains(boring)==False and label is u'neg'
  -0.979 contains(worst)==False and label is u'neg'
  -0.973 contains(bad)==False and label is u'neg'
  -0.953 contains(unfortunately)==False and label is u'neg'
  -0.864 contains(lame)==False and label is u'neg'
  -0.850 contains(attempt)==False and label is u'neg'
  -0.833 contains(supposed)==False and label is u'neg'
  -0.815 contains(seen)==True and label is u'neg'
  -0.783 contains(laughable)==False and label is u'neg'

It seems that the Maxent classifier gets the better result on the test set. Let's classify a test text with the Naive Bayes classifier and the Maxent classifier:

In [22]:  test_text = "I love this movie, very interesting"
 
In [23]: test_set = document_features(test_text.split())
 
In [24]: test_set
Out[24]: 
{u'contains(waste)': False,
 u'contains(lot)': False,
 u'contains(*)': False,
 u'contains(black)': False,
 u'contains(rated)': False,
 u'contains(potential)': False,
 u'contains(m)': False,
 u'contains(understand)': False,
 u'contains(drug)': False,
 u'contains(case)': False,
 u'contains(created)': False,
 u'contains(kiss)': False,
 u'contains(needed)': False,
 ......
 u'contains(happens)': False,
 u'contains(suddenly)': False,
 u'contains(almost)': False,
 u'contains(evil)': False,
 u'contains(building)': False,
 u'contains(michael)': False,
 ...}
 
# Naive Bayes classifier gets the wrong result
In [25]: print classifier.classify(test_set)
neg
 
# Maxent classifier gets it right
In [26]: print maxent_classifier.classify(test_set)
pos
 
# Let's see the probability result
In [27]: prob_result = classifier.prob_classify(test_set)
 
In [28]: prob_result
Out[28]: <ProbDist with 2 samples>
 
In [29]: prob_result.max()
Out[29]: u'neg'
 
In [30]: prob_result.prob("neg")
Out[30]: 0.99999917093621
 
In [31]: prob_result.prob("pos")
Out[31]: 8.29063793272753e-07
 
# Maxent classifier probability result
In [32]: print maxent_classifier.classify(test_set)
pos
 
In [33]: prob_result = maxent_classifier.prob_classify(test_set)
 
In [33]: prob_result.prob("pos")
Out[33]: 0.67570114045832497
 
In [34]: prob_result.prob("neg")
Out[34]: 0.32429885954167498

So far we have only used the top-n word features; for this sentiment analysis machine learning problem, adding more features may give better results. So we redesign the word features:

In [40]: def bag_of_words(words):
   ....:     return dict([(word, True) for word in words])
   ....: 
 
In [43]: data_sets = [(bag_of_words(d), c) for (d, c) in documents]
 
In [44]: len(data_sets)
Out[44]: 2000
 
In [45]: train_set, test_set = data_sets[100:], data_sets[:100]
 
In [46]: bayes_classifier = nltk.NaiveBayesClassifier.train(train_set)
 
In [47]: print nltk.classify.accuracy(bayes_classifier, test_set)
0.8
 
In [48]: bayes_classifier.show_most_informative_features(10)
Most Informative Features
             outstanding = True              pos : neg    =     13.9 : 1.0
                  avoids = True              pos : neg    =     13.1 : 1.0
              astounding = True              pos : neg    =     11.7 : 1.0
                 insipid = True              neg : pos    =     11.0 : 1.0
                    3000 = True              neg : pos    =     11.0 : 1.0
               insulting = True              neg : pos    =     10.6 : 1.0
            manipulation = True              pos : neg    =     10.4 : 1.0
             fascination = True              pos : neg    =     10.4 : 1.0
                    slip = True              pos : neg    =     10.4 : 1.0
               ludicrous = True              neg : pos    =     10.1 : 1.0
 
In [49]: maxent_bg_classifier = nltk.MaxentClassifier.train(train_set, "megam")
Scanning file...1900 train, 0 dev, 0 test, reading...done
Warning: there only appear to be two classes, but we're
         optimizing with BFGS...using binary optimization
         with CG would be much faster
optimizing with lambda = 0
it 1   dw 1.255e-01 pp 3.91521e-01 er 0.15368
it 2   dw 1.866e-02 pp 3.82995e-01 er 0.14684
it 3   dw 3.912e-02 pp 3.46794e-01 er 0.13368
it 4   dw 5.916e-02 pp 3.26135e-01 er 0.13684
it 5   dw 2.929e-02 pp 3.23077e-01 er 0.13474
it 6   dw 2.552e-02 pp 3.15917e-01 er 0.13526
it 7   dw 2.765e-02 pp 3.14291e-01 er 0.13526
it 8   dw 8.298e-02 pp 2.35472e-01 er 0.07263
it 9   dw 1.357e-01 pp 2.20265e-01 er 0.08684
it 10  dw 6.186e-02 pp 2.03567e-01 er 0.07158
it 11  dw 2.057e-01 pp 1.69049e-01 er 0.05316
it 12  dw 1.319e-01 pp 1.61575e-01 er 0.05263
it 13  dw 8.872e-02 pp 1.59902e-01 er 0.05526
it 14  dw 5.907e-02 pp 1.59254e-01 er 0.05632
it 15  dw 4.443e-02 pp 1.54540e-01 er 0.05368
it 16  dw 3.677e-01 pp 1.48646e-01 er 0.03842
it 17  dw 2.500e-01 pp 1.47460e-01 er 0.03947
it 18  dw 9.548e-01 pp 1.44516e-01 er 0.03842
it 19  dw 3.466e-01 pp 1.42935e-01 er 0.04211
it 20  dw 1.872e-02 pp 1.42847e-01 er 0.04263
it 21  dw 1.452e-01 pp 1.28344e-01 er 0.02737
it 22  dw 1.248e-01 pp 1.24428e-01 er 0.02526
it 23  dw 4.071e-01 pp 1.18201e-01 er 0.02211
it 24  dw 3.979e-01 pp 1.08352e-01 er 0.01526
it 25  dw 1.871e-01 pp 1.08345e-01 er 0.01632
it 26  dw 8.477e-02 pp 1.07972e-01 er 0.01579
it 27  dw 0.000e+00 pp 1.07972e-01 er 0.01579
-------------------------
.......
-------------------------
it 12  dw 4.018e-02 pp 1.73432e-05 er 0.00000
it 13  dw 3.898e-02 pp 1.62334e-05 er 0.00000
it 14  dw 9.937e-02 pp 1.52647e-05 er 0.00000
it 15  dw 5.558e-02 pp 1.31892e-05 er 0.00000
it 16  dw 5.646e-02 pp 1.30511e-05 er 0.00000
it 17  dw 1.100e-01 pp 1.23914e-05 er 0.00000
it 18  dw 4.541e-02 pp 1.17382e-05 er 0.00000
it 19  dw 1.316e-01 pp 1.04446e-05 er 0.00000
it 20  dw 1.919e-01 pp 9.04729e-06 er 0.00000
it 21  dw 1.039e-02 pp 9.02896e-06 er 0.00000
it 22  dw 2.843e-01 pp 8.92068e-06 er 0.00000
it 23  dw 1.100e-01 pp 8.54637e-06 er 0.00000
it 24  dw 2.199e-01 pp 8.36371e-06 er 0.00000
it 25  dw 2.428e-02 pp 8.24041e-06 er 0.00000
it 26  dw 0.000e+00 pp 8.24041e-06 er 0.00000
 
In [50]: print nltk.classify.accuracy(maxent_bg_classifier, test_set)
0.89
 
In [51]: maxent_bg_classifier.show_most_informative_features(10)
  -4.151 get==True and label is u'neg'
  -2.961 get==True and label is u'pos'
  -2.596 all==True and label is u'neg'
  -2.523 out==True and label is u'pos'
  -2.400 years==True and label is u'neg'
  -2.397 its==True and label is u'pos'
  -2.340 them==True and label is u'neg'
  -2.327 out==True and label is u'neg'
  -2.324 ,==True and label is u'neg'
  -2.259 (==True and label is u'neg'

Now we can test bigram features in the classifier models:

In [52]: from nltk import ngrams
 
In [53]: def bag_of_ngrams(words, n=2):
   ....:     ngs = [ng for ng in iter(ngrams(words, n))]
   ....:     return bag_of_words(ngs)
   ....: 
 
In [54]: data_sets = [(bag_of_ngrams(d), c) for (d, c) in documents]
 
In [55]: train_set, test_set = data_sets[100:], data_sets[:100]
 
In [56]: nb_bi_classifier = nltk.NaiveBayesClassifier.train(train_set)
 
In [57]: print nltk.classify.accuracy(nb_bi_classifier, test_set)
0.83
 
In [59]: nb_bi_classifier.show_most_informative_features(10)
Most Informative Features
    (u'is', u'terrific') = True              pos : neg    =     17.1 : 1.0
      (u'not', u'funny') = True              neg : pos    =     16.9 : 1.0
     (u'boring', u'and') = True              neg : pos    =     13.6 : 1.0
     (u'and', u'boring') = True              neg : pos    =     13.6 : 1.0
        (u'our', u'own') = True              pos : neg    =     13.1 : 1.0
        (u'why', u'did') = True              neg : pos    =     12.9 : 1.0
    (u'enjoyable', u',') = True              pos : neg    =     12.4 : 1.0
     (u'works', u'well') = True              pos : neg    =     12.4 : 1.0
      (u'.', u'cameron') = True              pos : neg    =     12.4 : 1.0
     (u'well', u'worth') = True              pos : neg    =     12.4 : 1.0
 
In [60]: maxent_bi_classifier = nltk.MaxentClassifier.train(train_set, "megam")
Scanning file...1900 train, 0 dev, 0 test, reading...done
Warning: there only appear to be two classes, but we're
         optimizing with BFGS...using binary optimization
         with CG would be much faster
optimizing with lambda = 0
it 1   dw 6.728e-02 pp 4.68710e-01 er 0.25895
it 2   dw 6.127e-02 pp 3.37578e-01 er 0.13789
it 3   dw 1.712e-02 pp 2.94106e-01 er 0.11737
it 4   dw 2.538e-02 pp 2.68465e-01 er 0.11526
it 5   dw 3.965e-02 pp 2.46789e-01 er 0.10684
it 6   dw 1.240e-01 pp 1.98149e-01 er 0.07947
it 7   dw 1.640e-02 pp 1.62956e-01 er 0.05895
it 8   dw 1.320e-01 pp 1.07163e-01 er 0.02789
it 9   dw 1.233e-01 pp 8.79358e-02 er 0.01368
it 10  dw 2.815e-01 pp 5.51191e-02 er 0.00737
it 11  dw 1.127e-01 pp 3.91500e-02 er 0.00421
it 12  dw 3.463e-01 pp 2.95846e-02 er 0.00211
it 13  dw 1.114e-01 pp 2.90701e-02 er 0.00053
it 14  dw 1.453e-01 pp 1.95422e-02 er 0.00053
it 15  dw 1.976e-01 pp 1.54022e-02 er 0.00105
......
it 44  dw 2.544e-01 pp 9.05755e-15 er 0.00000
it 45  dw 4.974e-02 pp 9.02763e-15 er 0.00000
it 46  dw 9.311e-07 pp 9.02483e-15 er 0.00000
it 47  dw 0.000e+00 pp 9.02483e-15 er 0.00000
 
In [61]: print nltk.classify.accuracy(maxent_bi_classifier, test_set)
0.9
 
In [62]: maxent_bi_classifier.show_most_informative_features(10)
 -14.152 (u'a', u'man')==True and label is u'neg'
  12.821 (u'"', u'the')==True and label is u'neg'
 -12.399 (u'of', u'the')==True and label is u'neg'
 -11.881 (u'a', u'man')==True and label is u'pos'
  10.020 (u',', u'which')==True and label is u'neg'
   8.418 (u'and', u'that')==True and label is u'neg'
  -8.022 (u'and', u'the')==True and label is u'neg'
  -7.191 (u'on', u'a')==True and label is u'neg'
  -7.185 (u'on', u'a')==True and label is u'pos'
   7.107 (u',', u'which')==True and label is u'pos'

And again, we can use the word features and the ngram (bigram) features together:

In [63]: def bag_of_all(words, n=2):
   ....:     all_features = bag_of_words(words)
   ....:     ngram_features = bag_of_ngrams(words, n=n)
   ....:     all_features.update(ngram_features)   
   ....:     return all_features
   ....: 
 
In [64]: data_sets = [(bag_of_all(d), c) for (d, c) in documents]
 
In [65]: train_set, test_set = data_sets[100:], data_sets[:100]
 
In [66]: nb_all_classifier = nltk.NaiveBayesClassifier.train(train_set)
 
In [67]: print nltk.classify.accuracy(nb_all_classifier, test_set)
0.83
 
In [68]: nb_all_classifier.show_most_informative_features(10)
Most Informative Features
    (u'is', u'terrific') = True              pos : neg    =     17.1 : 1.0
      (u'not', u'funny') = True              neg : pos    =     16.9 : 1.0
             outstanding = True              pos : neg    =     13.9 : 1.0
     (u'boring', u'and') = True              neg : pos    =     13.6 : 1.0
     (u'and', u'boring') = True              neg : pos    =     13.6 : 1.0
                  avoids = True              pos : neg    =     13.1 : 1.0
        (u'our', u'own') = True              pos : neg    =     13.1 : 1.0
        (u'why', u'did') = True              neg : pos    =     12.9 : 1.0
    (u'enjoyable', u',') = True              pos : neg    =     12.4 : 1.0
     (u'works', u'well') = True              pos : neg    =     12.4 : 1.0
 
 
In [71]: maxent_all_classifier = nltk.MaxentClassifier.train(train_set, "megam") 
Scanning file...1900 train, 0 dev, 0 test, reading...done
Warning: there only appear to be two classes, but we're
         optimizing with BFGS...using binary optimization
         with CG would be much faster
optimizing with lambda = 0
it 1   dw 8.715e-02 pp 3.82841e-01 er 0.17684
it 2   dw 2.846e-02 pp 2.97371e-01 er 0.11632
it 3   dw 1.299e-02 pp 2.79797e-01 er 0.11421
it 4   dw 2.456e-02 pp 2.64735e-01 er 0.11053
it 5   dw 4.200e-02 pp 2.47440e-01 er 0.10789
it 6   dw 1.417e-01 pp 2.04814e-01 er 0.08737
it 7   dw 1.330e-02 pp 2.03060e-01 er 0.08737
it 8   dw 3.177e-02 pp 1.92654e-01 er 0.08421
it 9   dw 5.613e-02 pp 1.38725e-01 er 0.05789
it 10  dw 1.339e-01 pp 7.92844e-02 er 0.02368
it 11  dw 1.734e-01 pp 6.71341e-02 er 0.01316
it 12  dw 1.313e-01 pp 6.55828e-02 er 0.01263
it 13  dw 2.036e-01 pp 6.38482e-02 er 0.01421
it 14  dw 1.230e-02 pp 5.96907e-02 er 0.01368
it 15  dw 9.719e-02 pp 4.03190e-02 er 0.00842
it 16  dw 4.004e-02 pp 3.98276e-02 er 0.00737
it 17  dw 1.598e-01 pp 2.68187e-02 er 0.00316
it 18  dw 1.900e-01 pp 2.57116e-02 er 0.00211
it 19  dw 4.355e-01 pp 2.14572e-02 er 0.00263
it 20  dw 1.029e-01 pp 1.91407e-02 er 0.00211
it 21  dw 1.347e-01 pp 1.46859e-02 er 0.00105
it 22  dw 2.231e-01 pp 1.26997e-02 er 0.00053
it 23  dw 2.942e-01 pp 1.20663e-02 er 0.00000
it 24  dw 3.836e-01 pp 1.14817e-02 er 0.00000
it 25  dw 4.213e-01 pp 9.89037e-03 er 0.00000
it 26  dw 1.875e-01 pp 7.06744e-03 er 0.00000
it 27  dw 2.865e-01 pp 5.61255e-03 er 0.00000
it 28  dw 5.903e-01 pp 4.94776e-03 er 0.00000
it 29  dw 0.000e+00 pp 4.94776e-03 er 0.00000
-------------------------
.......
-------------------------
.......
it 8   dw 2.024e-01 pp 8.14623e-10 er 0.00000
it 9   dw 9.264e-02 pp 7.87683e-10 er 0.00000
it 10  dw 5.845e-02 pp 7.38397e-10 er 0.00000
it 11  dw 2.418e-01 pp 6.34000e-10 er 0.00000
it 12  dw 5.081e-01 pp 6.19061e-10 er 0.00000
it 13  dw 0.000e+00 pp 6.19061e-10 er 0.00000
 
In [72]: print nltk.classify.accuracy(maxent_all_classifier, test_set)
0.91
 
In [73]: maxent_all_classifier.show_most_informative_features(10)
  11.220 to==True and label is u'neg'
   3.415 ,==True and label is u'neg'
   3.360 '==True and label is u'neg'
   3.310 this==True and label is u'neg'
   3.243 a==True and label is u'neg'
  -3.218 (u'does', u'a')==True and label is u'neg'
   3.164 have==True and label is u'neg'
  -3.024 what==True and label is u'neg'
   2.966 more==True and label is u'neg'
  -2.891 (u',', u'which')==True and label is u'neg'

We get the best sentiment analysis performance of this case study here, although there are still some obvious problems: for example, punctuation tokens and stop words were not discarded from the features (a sketch of one possible fix follows below). Since this is only a case study, we encourage you to experiment with more data, more features, or better machine learning models such as deep learning methods.
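As a quick illustration of the punctuation/stop-word issue, here is a minimal sketch of a filtered feature extractor. It assumes the bag_of_words and bag_of_ngrams helpers defined earlier in this part, that the NLTK stopwords corpus has been downloaded, and the name bag_of_filtered_all is ours, not part of the original code:

from string import punctuation
from nltk.corpus import stopwords

english_stopwords = set(stopwords.words('english'))

def bag_of_filtered_all(words, n=2):
    # Drop single-character punctuation tokens and English stop words
    # before building the word and bigram features.
    filtered = [w for w in words
                if w.lower() not in english_stopwords and w not in punctuation]
    features = bag_of_words(filtered)
    features.update(bag_of_ngrams(filtered, n=n))
    return features

# Rebuild the data sets with the filtered features, then retrain as before:
# data_sets = [(bag_of_filtered_all(d), c) for (d, c) in documents]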


Dive Into NLTK, Part X: Play with Word2Vec Models based on NLTK Corpus

NLTK Corpus

Accessing text corpora in NLTK is very easy. NLTK provides a corpus package to read and manage the corpus data. For example, NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, known as the Gutenberg Corpus. About Project Gutenberg:

Project Gutenberg (PG) is a volunteer effort to digitize and archive cultural works, to “encourage the creation and distribution of eBooks”. It was founded in 1971 by Michael S. Hart and is the oldest digital library. Most of the items in its collection are the full texts of public domain books. The project tries to make these as free as possible, in long-lasting, open formats that can be used on almost any computer. As of 3 October 2015, Project Gutenberg reached 50,000 items in its collection.

We can list the ebook file names of the Gutenberg Corpus in NLTK like this:

Python 2.7.6 (default, Jun  3 2014, 07:43:23) 
Type "copyright", "credits" or "license" for more information.
 
IPython 3.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
 
In [1]: from nltk.corpus import gutenberg
 
In [2]: gutenberg.fileids()
Out[2]: 
[u'austen-emma.txt',
 u'austen-persuasion.txt',
 u'austen-sense.txt',
 u'bible-kjv.txt',
 u'blake-poems.txt',
 u'bryant-stories.txt',
 u'burgess-busterbrown.txt',
 u'carroll-alice.txt',
 u'chesterton-ball.txt',
 u'chesterton-brown.txt',
 u'chesterton-thursday.txt',
 u'edgeworth-parents.txt',
 u'melville-moby_dick.txt',
 u'milton-paradise.txt',
 u'shakespeare-caesar.txt',
 u'shakespeare-hamlet.txt',
 u'shakespeare-macbeth.txt',
 u'whitman-leaves.txt']
 
In [3]: austen_emma_words = gutenberg.words('austen-emma.txt')
 
In [4]: len(austen_emma_words)
Out[4]: 192427
 
In [5]: austen_emma_sents = gutenberg.sents('austen-emma.txt')
 
In [6]: len(austen_emma_sents)
Out[6]: 9111
 
In [7]: austen_emma_sents[0]
Out[7]: [u'[', u'Emma', u'by', u'Jane', u'Austen', u'1816', u']']
 
In [8]: austen_emma_sents[5000]
Out[8]: 
[u'I',
 u'know',
 u'the',
 u'danger',
 u'of',
 u'indulging',
 u'such',
 u'speculations',
 u'.']

Word2Vec

Narrowly speaking, the Word2Vec we refer to here is the Google Word2Vec project, first proposed by Tomas Mikolov et al. in 2013. For more Word2Vec-related papers, tutorials, and coding examples, we recommend "Getting started with Word2Vec" by TextProcessing.

Word2Vec in Python

The best-known Python implementation of Word2Vec is the Gensim word2vec module: models.word2vec – Deep learning with word2vec. We have written an article about Word2Vec in Python; if you are not familiar with the gensim word2vec model, you can read it first: Getting Started with Word2Vec and GloVe in Python.

Bible Word2Vec Model

The Gutenberg Corpus in NLTK includes a file named ‘bible-kjv.txt’, which is “The King James Version of the Bible”:

The King James Version (KJV), also known as Authorized [sic] Version (AV) or simply King James Bible (KJB), is an English translation of the Christian Bible for the Church of England begun in 1604 and completed in 1611.[a] The books of the King James Version include the 39 books of the Old Testament, an intertestamental section containing 14 books of the Apocrypha, and the 27 books of the New Testament.

Let’s train a Bible word2vec model on the bible-kjv.txt corpus with the gensim word2vec module:

In [14]: import logging
 
In [15]: from gensim.models import word2vec
 
In [16]: logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
 
In [17]: bible_kjv_words = gutenberg.words('bible-kjv.txt')
 
In [18]: len(bible_kjv_words)
Out[18]: 1010654
 
In [19]: bible_kjv_sents = gutenberg.sents('bible-kjv.txt')  
 
In [20]: len(bible_kjv_sents)
Out[20]: 30103
 
In [21]: bible_kjv_sents[0]
Out[21]: [u'[', u'The', u'King', u'James', u'Bible', u']']
 
In [22]: bible_kjv_sents[1]
Out[22]: [u'The', u'Old', u'Testament', u'of', u'the', u'King', u'James', u'Bible']
 
In [23]: bible_kjv_sents[2]
Out[23]: [u'The', u'First', u'Book', u'of', u'Moses', u':', u'Called', u'Genesis']
 
In [24]: bible_kjv_sents[3]
Out[24]: 
[u'1',
 u':',
 u'1',
 u'In',
 u'the',
 u'beginning',
 u'God',
 u'created',
 u'the',
 u'heaven',
 u'and',
 u'the',
 u'earth',
 u'.']
 
In [25]: from string import punctuation
 
In [27]: punctuation
Out[27]: '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
 
# The words in the NLTK corpus are already word-tokenized, so we just discard the punctuation tokens
# and lowercase the remaining words.
In [29]: discard_punctuation_and_lowercased_sents = [[word.lower() for word in sent if word not in punctuation] for sent in bible_kjv_sents]
 
In [30]: discard_punctuation_and_lowercased_sents[0]
Out[30]: [u'the', u'king', u'james', u'bible']
 
In [31]: discard_punctuation_and_lowercased_sents[1]
Out[31]: [u'the', u'old', u'testament', u'of', u'the', u'king', u'james', u'bible']
 
In [32]: discard_punctuation_and_lowercased_sents[2]
Out[32]: [u'the', u'first', u'book', u'of', u'moses', u'called', u'genesis']
 
In [33]: discard_punctuation_and_lowercased_sents[3]
Out[33]: 
[u'1',
 u'1',
 u'in',
 u'the',
 u'beginning',
 u'god',
 u'created',
 u'the',
 u'heaven',
 u'and',
 u'the',
 u'earth']
 
In [34]: bible_kjv_word2vec_model = word2vec.Word2Vec(discard_punctuation_and_lowercased_sents, min_count=5, size=200)
2017-03-26 21:05:20,811 : INFO : collecting all words and their counts
2017-03-26 21:05:20,811 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-03-26 21:05:20,972 : INFO : PROGRESS: at sentence #10000, processed 315237 words, keeping 7112 word types
2017-03-26 21:05:21,103 : INFO : PROGRESS: at sentence #20000, processed 572536 words, keeping 10326 word types
2017-03-26 21:05:21,247 : INFO : PROGRESS: at sentence #30000, processed 851126 words, keeping 12738 word types
2017-03-26 21:05:21,249 : INFO : collected 12752 word types from a corpus of 854209 raw words and 30103 sentences
2017-03-26 21:05:21,249 : INFO : Loading a fresh vocabulary
2017-03-26 21:05:21,441 : INFO : min_count=5 retains 5428 unique words (42% of original 12752, drops 7324)
2017-03-26 21:05:21,441 : INFO : min_count=5 leaves 841306 word corpus (98% of original 854209, drops 12903)
2017-03-26 21:05:21,484 : INFO : deleting the raw counts dictionary of 12752 items
2017-03-26 21:05:21,485 : INFO : sample=0.001 downsamples 62 most-common words
2017-03-26 21:05:21,485 : INFO : downsampling leaves estimated 583788 word corpus (69.4% of prior 841306)
2017-03-26 21:05:21,485 : INFO : estimated required memory for 5428 words and 200 dimensions: 11398800 bytes
2017-03-26 21:05:21,520 : INFO : resetting layer weights
2017-03-26 21:05:21,708 : INFO : training model with 3 workers on 5428 vocabulary and 200 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2017-03-26 21:05:21,708 : INFO : expecting 30103 sentences, matching count from corpus used for vocabulary survey
2017-03-26 21:05:22,721 : INFO : PROGRESS: at 16.10% examples, 474025 words/s, in_qsize 5, out_qsize 0
2017-03-26 21:05:23,728 : INFO : PROGRESS: at 34.20% examples, 500893 words/s, in_qsize 5, out_qsize 0
2017-03-26 21:05:24,734 : INFO : PROGRESS: at 49.48% examples, 482782 words/s, in_qsize 5, out_qsize 0
2017-03-26 21:05:25,742 : INFO : PROGRESS: at 60.97% examples, 442365 words/s, in_qsize 5, out_qsize 0
2017-03-26 21:05:26,758 : INFO : PROGRESS: at 76.39% examples, 443206 words/s, in_qsize 6, out_qsize 0
2017-03-26 21:05:27,770 : INFO : PROGRESS: at 95.14% examples, 460213 words/s, in_qsize 5, out_qsize 0
2017-03-26 21:05:28,002 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-03-26 21:05:28,007 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-03-26 21:05:28,013 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-03-26 21:05:28,013 : INFO : training on 4271045 raw words (2918320 effective words) took 6.3s, 463711 effective words/s 
 
In [35]: bible_kjv_word2vec_model.save("bible_word2vec_gensim")
2017-03-26 21:06:03,500 : INFO : saving Word2Vec object under bible_word2vec_gensim, separately None
2017-03-26 21:06:03,501 : INFO : not storing attribute syn0norm
2017-03-26 21:06:03,501 : INFO : not storing attribute cum_table
2017-03-26 21:06:03,646 : INFO : saved bible_word2vec_gensim
 
In [36]: bible_kjv_word2vec_model.wv.save_word2vec_format("bible_word2vec_org", "bible_word2vec_vocabulary")
2017-03-26 21:06:51,136 : INFO : storing vocabulary in bible_word2vec_vocabulary
2017-03-26 21:06:51,186 : INFO : storing 5428x200 projection weights into bible_word2vec_org

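The saved files can be reloaded in a later session. A minimal sketch, assuming a gensim version similar to the one used above (which provides gensim.models.KeyedVectors):

from gensim.models import word2vec, KeyedVectors

# Reload the full model (training can be continued on it):
bible_model = word2vec.Word2Vec.load("bible_word2vec_gensim")

# Or reload only the word vectors saved in the plain word2vec format,
# together with the stored vocabulary counts:
bible_vectors = KeyedVectors.load_word2vec_format("bible_word2vec_org",
                                                  fvocab="bible_word2vec_vocabulary")
print(bible_vectors.most_similar(["god"], topn=5))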
We have trained the Bible word2vec model from the NLTK bible-kjv corpus; now we can test it with some words.

First, “God”:

In [37]: bible_kjv_word2vec_model.most_similar(["god"])
2017-03-26 21:14:27,320 : INFO : precomputing L2-norms of word weight vectors
Out[37]: 
[(u'christ', 0.7863791584968567),
 (u'lord', 0.7807695865631104),
 (u'salvation', 0.772181510925293),
 (u'truth', 0.7689207792282104),
 (u'spirit', 0.7437840700149536),
 (u'faith', 0.7283919453620911),
 (u'glory', 0.7281145453453064),
 (u'mercy', 0.7187720537185669),
 (u'hosts', 0.7179254293441772),
 (u'gospel', 0.7167999148368835)]
 
In [38]: bible_kjv_word2vec_model.most_similar(["god"], topn=30)
Out[38]: 
[(u'christ', 0.7863791584968567),
 (u'lord', 0.7807695865631104),
 (u'salvation', 0.772181510925293),
 (u'truth', 0.7689207792282104),
 (u'spirit', 0.7437840700149536),
 (u'faith', 0.7283919453620911),
 (u'glory', 0.7281145453453064),
 (u'mercy', 0.7187720537185669),
 (u'hosts', 0.7179254293441772),
 (u'gospel', 0.7167999148368835),
 (u'grace', 0.6984926462173462),
 (u'kingdom', 0.6883569359779358),
 (u'word', 0.6729788184165955),
 (u'wisdom', 0.6717872023582458),
 (u'righteousness', 0.6678392291069031),
 (u'judgment', 0.6650925874710083),
 (u'hope', 0.6614011526107788),
 (u'fear', 0.6607920527458191),
 (u'power', 0.6554194092750549),
 (u'who', 0.6502907276153564),
 (u'law', 0.6491219401359558),
 (u'name', 0.6448863744735718),
 (u'commandment', 0.6375595331192017),
 (u'covenant', 0.6254858374595642),
 (u'thus', 0.618947446346283),
 (u'servant', 0.6186059713363647),
 (u'supplication', 0.6185135841369629),
 (u'prayer', 0.6138496398925781),
 (u'world', 0.6129024028778076),
 (u'strength', 0.6128018498420715)]

The raw similarity list is not very intuitive in the Python interpreter; we can also explore the results with the Word2vec visualization demo (a rough local plotting sketch is shown below as well):
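If you cannot access the online demo, a rough local alternative is to project a word’s nearest neighbours into 2D and plot them. This sketch is ours, not part of the original tutorial; it assumes scikit-learn and matplotlib are installed:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_neighbours(model, word, topn=20):
    # Collect the query word and its nearest neighbours, then PCA-project
    # their 200-dimensional vectors down to 2D for plotting.
    words = [word] + [w for w, _ in model.most_similar([word], topn=topn)]
    vectors = [model[w] for w in words]   # older gensim versions allow model[word] lookup
    points = PCA(n_components=2).fit_transform(vectors)
    plt.scatter(points[:, 0], points[:, 1])
    for (x, y), w in zip(points, words):
        plt.annotate(w, xy=(x, y))
    plt.show()

plot_neighbours(bible_kjv_word2vec_model, "god")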

Second, “Jesus”:

In [39]: bible_kjv_word2vec_model.most_similar(["jesus"], topn=30)
Out[39]: 
[(u'david', 0.7681761980056763),
 (u'moses', 0.7144877910614014),
 (u'peter', 0.6529564261436462),
 (u'paul', 0.6443019509315491),
 (u'saul', 0.6364974975585938),
 (u'jeremiah', 0.627094030380249),
 (u'who', 0.614865243434906),
 (u'joshua', 0.6089344024658203),
 (u'abraham', 0.6082783937454224),
 (u'esther', 0.6062946319580078),
 (u'john', 0.6049948930740356),
 (u'king', 0.6048218011856079),
 (u'balaam', 0.597400963306427),
 (u'christ', 0.597069501876831),
 (u'word', 0.5932905673980713),
 (u'samuel', 0.5922203063964844),
 (u'mordecai', 0.5911144614219666),
 (u'him', 0.5875434279441833),
 (u'prophet', 0.5794544219970703),
 (u'pharaoh', 0.5698455572128296),
 (u'messengers', 0.5664196014404297),
 (u'jacob', 0.5617039203643799),
 (u'daniel', 0.5585237741470337),
 (u'saying', 0.5573699474334717),
 (u'god', 0.5562971830368042),
 (u'thus', 0.5508617758750916),
 (u'sworn', 0.5429663062095642),
 (u'master', 0.5384684801101685),
 (u'esaias', 0.5353941321372986),
 (u'he', 0.5342475175857544)]

Word2vec visualization demo for “Jesus”:

Third, “Abraham”:

In [40]: bible_kjv_word2vec_model.most_similar(["abraham"], topn=30)
Out[40]: 
[(u'isaac', 0.9007743000984192),
 (u'jacob', 0.8846864700317383),
 (u'esau', 0.7379217743873596),
 (u'joseph', 0.7284103035926819),
 (u'solomon', 0.7238803505897522),
 (u'daniel', 0.7140511274337769),
 (u'david', 0.7065653800964355),
 (u'moses', 0.6977373957633972),
 (u'attend', 0.6717492341995239),
 (u'jonadab', 0.6707518696784973),
 (u'hezekiah', 0.6678087711334229),
 (u'timothy', 0.6653313636779785),
 (u'jesse', 0.6586748361587524),
 (u'joshua', 0.6527853012084961),
 (u'pharaoh', 0.6472733020782471),
 (u'aaron', 0.6444283127784729),
 (u'church', 0.6429852247238159),
 (u'hamor', 0.6401337385177612),
 (u'jeremiah', 0.6318649649620056),
 (u'john', 0.6243793964385986),
 (u'nun', 0.6216053366661072),
 (u'jephunneh', 0.6153846979141235),
 (u'amoz', 0.6135494709014893),
 (u'praises', 0.6104753017425537),
 (u'joab', 0.609726071357727),
 (u'caleb', 0.6083548069000244),
 (u'jesus', 0.6082783341407776),
 (u'belteshazzar', 0.6075813174247742),
 (u'letters', 0.606890082359314),
 (u'confirmed', 0.606576681137085)]

Word2vec visualization demo for “Abraham”:

Finally, “Moses”:

In [41]: bible_kjv_word2vec_model.most_similar(["moses"], topn=30)
Out[41]: 
[(u'joshua', 0.8120298385620117),
 (u'jeremiah', 0.7481369972229004),
 (u'david', 0.7417373657226562),
 (u'jesus', 0.7144876718521118),
 (u'samuel', 0.7133205533027649),
 (u'abraham', 0.6977373957633972),
 (u'daniel', 0.6943730115890503),
 (u'paul', 0.6868774890899658),
 (u'balaam', 0.6845300793647766),
 (u'john', 0.6638209819793701),
 (u'hezekiah', 0.6563856601715088),
 (u'solomon', 0.6481155157089233),
 (u'letters', 0.6409181952476501),
 (u'messengers', 0.6316184997558594),
 (u'joseph', 0.6288775205612183),
 (u'esaias', 0.616001546382904),
 (u'joab', 0.6061952710151672),
 (u'pharaoh', 0.5786489844322205),
 (u'jacob', 0.5751779079437256),
 (u'church', 0.5716862678527832),
 (u'spake', 0.570705771446228),
 (u'balak', 0.5679874420166016),
 (u'peter', 0.5658782720565796),
 (u'nebuchadnezzar', 0.5637866258621216),
 (u'saul', 0.5635569095611572),
 (u'prophesied', 0.5563102960586548),
 (u'esther', 0.5491665601730347),
 (u'prayed', 0.5470476150512695),
 (u'isaac', 0.5445208549499512),
 (u'aaron', 0.5425761938095093)]

Word2vec visualization demo for “Moses”:

You can play with other word2vec models based on the NLTK corpus in the same way (see the sketch below); just enjoy it.
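The same recipe works for any other file in the Gutenberg Corpus. A minimal sketch using shakespeare-hamlet.txt, with the same preprocessing as above (the min_count and size values here are just illustrative choices for a much smaller corpus):

from string import punctuation
from nltk.corpus import gutenberg
from gensim.models import word2vec

# Same preprocessing as for bible-kjv.txt: drop punctuation tokens and lowercase.
hamlet_sents = [[w.lower() for w in sent if w not in punctuation]
                for sent in gutenberg.sents('shakespeare-hamlet.txt')]

# Hamlet is far smaller than the KJV Bible, so use a lower min_count and fewer dimensions.
hamlet_model = word2vec.Word2Vec(hamlet_sents, min_count=2, size=100)
print(hamlet_model.most_similar(["king"], topn=10))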


Dive Into NLTK, Part XI: From Word2Vec to WordNet

About WordNet

WordNet is a lexical database for English:

WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short definitions and usage examples, and records a number of relations among these synonym sets or their members. WordNet can thus be seen as a combination of dictionary and thesaurus. While it is accessible to human users via a web browser, its primary use is in automatic text analysis and artificial intelligence applications. The database and software tools have been released under a BSD style license and are freely available for download from the WordNet website. Both the lexicographic data (lexicographer files) and the compiler (called grind) for producing the distributed database are available.

For more information about installing and testing WordNet, we recommend: Getting started with WordNet

WordNet in NLTK

NLTK provides a fantastic Python interface for working with WordNet: the WordNet Interface; its source code can be found here: Source code for nltk.corpus.reader.wordnet. We can use NLTK to play with WordNet:

Python 2.7.6 (default, Jun  3 2014, 07:43:23) 
Type "copyright", "credits" or "license" for more information.
 
IPython 3.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
 
In [1]: from nltk.corpus import wordnet as wn
 
In [2]: wn.synsets('book')
Out[2]: 
[Synset('book.n.01'),
 Synset('book.n.02'),
 Synset('record.n.05'),
 Synset('script.n.01'),
 Synset('ledger.n.01'),
 Synset('book.n.06'),
 Synset('book.n.07'),
 Synset('koran.n.01'),
 Synset('bible.n.01'),
 Synset('book.n.10'),
 Synset('book.n.11'),
 Synset('book.v.01'),
 Synset('reserve.v.04'),
 Synset('book.v.03'),
 Synset('book.v.04')]
 
In [3]: wn.synsets('book', pos=wn.NOUN)
Out[3]: 
[Synset('book.n.01'),
 Synset('book.n.02'),
 Synset('record.n.05'),
 Synset('script.n.01'),
 Synset('ledger.n.01'),
 Synset('book.n.06'),
 Synset('book.n.07'),
 Synset('koran.n.01'),
 Synset('bible.n.01'),
 Synset('book.n.10'),
 Synset('book.n.11')]
 
In [4]: wn.synsets('book', pos=wn.VERB)
Out[4]: 
[Synset('book.v.01'),
 Synset('reserve.v.04'),
 Synset('book.v.03'),
 Synset('book.v.04')]
 
In [5]: wn.synset('book.n.01')
Out[5]: Synset('book.n.01')
 
In [6]: print(wn.synset('book.n.01').definition())
a written work or composition that has been published (printed on pages bound together)
 
In [7]: print(wn.synset('book.v.01').definition())
engage for a performance
 
In [8]: len(wn.synset('book.n.01').examples())
Out[8]: 1
 
In [9]: print(wn.synset('book.n.01').examples()[0])
I am reading a good book on economics
 
In [10]: len(wn.synset('book.v.01').examples())
Out[10]: 1
 
In [11]: print(wn.synset('book.v.01').examples()[0])
Her agent had booked her for several concerts in Tokyo
 
In [12]: wn.synset('book.n.01').lemmas()
Out[12]: [Lemma('book.n.01.book')]
 
In [13]: wn.synset('book.v.01').lemmas()
Out[13]: [Lemma('book.v.01.book')]
 
In [14]: [str(lemma.name()) for lemma in wn.synset('book.n.01').lemmas()]
Out[14]: ['book']
 
In [15]: [str(lemma.name()) for lemma in wn.synset('book.v.01').lemmas()]
Out[15]: ['book']
 
In [16]: wn.lemma('book.n.01.book').synset()
Out[16]: Synset('book.n.01')
 
In [17]: book = wn.synset('book.n.01')
 
In [18]: book.hypernyms()
Out[18]: [Synset('publication.n.01')]
 
In [19]: book.hyponyms()
Out[19]: 
[Synset('appointment_book.n.01'),
 Synset('authority.n.07'),
 Synset('bestiary.n.01'),
 Synset('booklet.n.01'),
 Synset('catalog.n.01'),
 Synset('catechism.n.02'),
 Synset('copybook.n.01'),
 Synset('curiosa.n.01'),
 Synset('formulary.n.01'),
 Synset('phrase_book.n.01'),
 Synset('playbook.n.02'),
 Synset('pop-up_book.n.01'),
 Synset('prayer_book.n.01'),
 Synset('reference_book.n.01'),
 Synset('review_copy.n.01'),
 Synset('songbook.n.01'),
 Synset('storybook.n.01'),
 Synset('textbook.n.01'),
 Synset('tome.n.01'),
 Synset('trade_book.n.01'),
 Synset('workbook.n.01'),
 Synset('yearbook.n.01')]
 
In [20]: book.member_holonyms()
Out[20]: []
 
In [21]: book.root_hypernyms()
Out[21]: [Synset('entity.n.01')]
 
In [22]: man = wn.synset('man.n.01')
 
In [23]: man.lemmas()
Out[23]: [Lemma('man.n.01.man'), Lemma('man.n.01.adult_male')]
 
In [24]: man.lemmas()[0]
Out[24]: Lemma('man.n.01.man')
 
In [25]: man.lemmas()[0].antonyms()
Out[25]: [Lemma('woman.n.01.woman')]

You can browse “book.n.01” and “man.n.01” on WordNet Online.

Word Similarity Interface by WordNet

In [43]: cat = wn.synset('cat.n.01')
 
In [44]: dog = wn.synset('dog.n.01')
 
In [45]: man = wn.synset('man.n.01')
 
In [46]: woman = wn.synset('woman.n.01')
 
In [47]: hit = wn.synset('hit.v.01')
 
In [48]: kick = wn.synset('kick.v.01')
 
In [49]: cat.path_similarity(cat)
Out[49]: 1.0
 
In [50]: cat.path_similarity(dog)
Out[50]: 0.2
 
In [51]: man.path_similarity(woman)
Out[51]: 0.3333333333333333
 
In [52]: hit.path_similarity(kick)
Out[52]: 0.3333333333333333
 
In [53]: cat.lch_similarity(dog)
Out[53]: 2.0281482472922856
 
In [54]: man.lch_similarity(woman)
Out[54]: 2.538973871058276
 
In [55]: hit.lch_similarity(kick)
Out[55]: 2.159484249353372
 
In [56]: cat.wup_similarity(dog)
Out[56]: 0.8571428571428571
 
In [57]: man.wup_similarity(woman)
Out[57]: 0.6666666666666666
 
In [58]: hit.wup_similarity(kick)
Out[58]: 0.6666666666666666

You can browse “cat.n.01”, “dog.n.01”, “man.n.01”, “woman.n.01”, “hit.v.01” and “kick.v.01” on WordNet Online.
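These scores come straight from the WordNet hypernym hierarchy. For example, per the NLTK documentation, path_similarity is computed as 1 / (shortest_path_distance + 1), which we can check with the synsets defined above:

# cat.path_similarity(dog) returned 0.2 above; the shortest hypernym path
# between the two synsets has length 4, and 1.0 / (4 + 1) = 0.2.
distance = cat.shortest_path_distance(dog)
print(distance)              # 4
print(1.0 / (distance + 1))  # 0.2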

Reference:
Getting started with WordNet by Text Processing
WordNet Interface by NLTK
WordNet and ImageNet
Open Multilingual Wordnet
Wordnet with NLTK
Tutorial: What is WordNet? A Conceptual Introduction Using Python
Dive into WordNet with NLTK
