An Application: Python, Stanford Named Entity Recognizer

How to set up the Stanford Named Entity Recognizer

NER models

Included with Stanford NER are a 4 class model trained for CoNLL, a 7 class model trained for MUC, and a 3 class model trained on both data sets for the intersection of those class sets.

3 class Location, Person, Organization

4 class Location, Person, Organization, Misc

7 class Time, Location, Organization, Person, Money, Percent, Date

Also available, as part of a package of caseless models for several of the Stanford tools, are caseless versions of these same three models.

The caseless models have been installed with the others, so Red Hen has these classifiers available:

tna@cartago:/tvspare/software/java/stanford-NER/stanford-ner-2014-06-16$ l classifiers/*gz
-rw-r--r-- 1 tna tna 27160028 Nov 12  2013 classifiers/english.all.3class.caseless.distsim.crf.ser.gz
-rw-rw-r-- 1 tna tna 24732086 Jun 16 00:46 classifiers/english.all.3class.distsim.crf.ser.gz
-rw-r--r-- 1 tna tna 22115850 Nov 12  2013 classifiers/english.conll.4class.caseless.distsim.crf.ser.gz
-rw-rw-r-- 1 tna tna 18350357 Jun 16 00:46 classifiers/english.conll.4class.distsim.crf.ser.gz
-rw-r--r-- 1 tna tna 20019633 Nov 12  2013 classifiers/english.muc.7class.caseless.distsim.crf.ser.gz
-rw-rw-r-- 1 tna tna 17824631 Jun 16 00:46 classifiers/english.muc.7class.distsim.crf.ser.gz
-rw-r--r-- 1 tna tna 20430544 Nov 12  2013 classifiers/english.nowiki.3class.caseless.distsim.crf.ser.gz
-rw-rw-r-- 1 tna tna 18954462 Jun 16 00:46 classifiers/english.nowiki.3class.distsim.crf.ser.gz

The different classifiers have different levels of accuracy, which should be evaluated.
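A quick way to start that evaluation is to run the same sentence through each classifier and compare the output by eye; here is a minimal sketch, using the NLTK interface shown in the next subsection and the classifier paths listed above:

# Rough comparison sketch -- tag the same text with each mixed-case classifier and eyeball the differences
from nltk.tag.stanford import NERTagger

BASE = '/tvspare/software/java/stanford-NER/stanford-ner-2014-06-16'
JAR = BASE + '/stanford-ner.jar'
classifiers = ['english.all.3class.distsim.crf.ser.gz',
               'english.conll.4class.distsim.crf.ser.gz',
               'english.muc.7class.distsim.crf.ser.gz']

text = "The University of California professor downloaded the Stanford NER tagger in Rome."
for c in classifiers:
    tagger = NERTagger(BASE + '/classifiers/' + c, JAR, 'utf-8')
    print c, tagger.tag(text.split())

A proper evaluation would score each classifier against a hand-labeled sample of Red Hen caption lines rather than a single sentence.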

NER instructions

>>> from nltk.tag.stanford import NERTagger
>>> st = NERTagger(
'/tvspare/software/java/stanford-NER/stanford-ner-2014-06-16/classifiers/english.all.3class.distsim.crf.ser.gz', 
'/tvspare/software/java/stanford-NER/stanford-ner-2014-06-16/stanford-ner.jar', 'utf-8')

The first argument is a classifier, the second the main jar file, and the third the encoding of the training data—always utf-8.

>>> text = "The University of California professor downloaded the Stanford NER tagger in Rome."
>>> st.tag(text.split())
[(u'The', u'O'), (u'University', u'ORGANIZATION'), (u'of', u'ORGANIZATION'), (u'California', u'ORGANIZATION'),
(u'professor', u'O'), (u'downloaded', u'O'), (u'the', u'O'), (u'Stanford', u'O'), (u'NER', u'O'), (u'tagger',
u'O'), (u'in', u'O'), (u'Rome', u'LOCATION'), (u'.', u'O')]

The Red Hen collection has some files that are properly capitalized (e.g., Democracy Now), but mostly files that are in all caps. To handle these, we also define a case-insensitive tagger:

>>> ST = NERTagger(
'/tvspare/software/java/stanford-NER/stanford-ner-2014-06-16/classifiers/english.all.3class.caseless.distsim.crf.ser.gz',
'/tvspare/software/java/stanford-NER/stanford-ner-2014-06-16/stanford-ner.jar', 'utf-8')
>>> text = text.upper()
>>> ST.tag(text.split())
[(u'THE', u'O'), (u'UNIVERSITY', u'ORGANIZATION'), (u'OF', u'ORGANIZATION'), (u'CALIFORNIA', u'ORGANIZATION'),
(u'PROFESSOR', u'O'), (u'DOWNLOADED', u'O'), (u'THE', u'O'), (u'STANFORD', u'ORGANIZATION'), (u'NER', u'O'),
(u'TAGGER', u'O'), (u'IN', u'O'), (u'ROME', u'LOCATION'), (u'.', u'O')]

We can check the first couple of lines of text in a file for case and select the appropriate tagger, or decide per line, which is likely to be extremely slow:

if text.isupper():
    ST.tag(text.split())
else:
    st.tag(text.split())
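A sketch of the per-file variant -- sample the first few lines of a file and pick one tagger for the whole file (pick_tagger is a hypothetical helper, not an existing script):

# Hypothetical helper: decide between the caseless (ST) and mixed-case (st) taggers
# by sampling the first few lines of a file
def pick_tagger(path, sample=5):
    with open(path) as fp:
        head = "".join([fp.readline() for i in range(sample)])
    return ST if head.isupper() else st

tagger = pick_tagger('some-file.seg')   # hypothetical file name
tagger.tag(text.split())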

It's possible that running the Stanford NER in server mode would be faster.

Below is an example of a script, PartsOfSpeech-MBSP-05.py, that reads a file, parses it, and writes the results. An NER script can quickly be derived from it; it would be easy to maintain and would let us combine different tools in one pipeline. The script uses seg files as input, with the text already split into sentences. There are other working scripts in csa@ca:~/Pattern2.6. Modifying this script is the third task.
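As an indication of the direction, here is a rough sketch of what a derived NER tagger could look like, assuming the pipe-delimited seg format used elsewhere on this page (timestamps in the first two fields, a short primary tag, then the caption text) and an NER_03 output tag; the real script should be derived from PartsOfSpeech-MBSP-05.py and hardened like the production scripts below:

# Sketch only -- read a .seg file, tag the caption lines, and emit NER_03 lines.
# The field layout and the NER_03 tag name are assumptions based on other parts of this page;
# st is the NERTagger defined above.
import sys

for line in open(sys.argv[1]):
    field = line.rstrip("\n").split("|")
    print line,                                    # keep the original line
    if len(field) < 4 or len(field[0]) != 18:
        continue                                   # header, footer, or non-caption line
    tagged = st.tag(field[3].split())
    entities = ["%s/%s" % (w, t) for w, t in tagged if t != "O"]
    if entities:
        print "|".join([field[0], field[1], "NER_03"] + entities)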

We could also write a new script for parsing the raw incoming text into sentences. This should be done as a separate script. I expect that NLTK or Pattern2.6 will be adequate for this, and they're fast. I don't think we need Stanford CoreNLP for this, though we could certainly install it too.
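For instance, a minimal sentence-splitting sketch with NLTK's pre-trained Punkt model (Pattern's tokenize() would work much the same way):

# Minimal sentence-splitting sketch using NLTK's Punkt model
import nltk.data

splitter = nltk.data.load('tokenizers/punkt/english.pickle')
text = "The professor downloaded the tagger. She ran it in Rome."
for sentence in splitter.tokenize(text):
    print sentence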

NER installation

wget http://www-nlp.stanford.edu/software/stanford-ner-2014-06-16.zip   # version 3.4
wget http://nlp.stanford.edu/software/stanford-corenlp-caseless-2014-02-25-models.jar
jar -xvf stanford-corenlp-caseless-2014-02-25-models.jar
 
 created: META-INF/
inflated: META-INF/MANIFEST.MF
 created: edu/
 created: edu/stanford/
 created: edu/stanford/nlp/
 created: edu/stanford/nlp/models/
 created: edu/stanford/nlp/models/lexparser/
inflated: edu/stanford/nlp/models/lexparser/englishPCFG.caseless.ser.gz
 created: edu/stanford/nlp/models/pos-tagger/
inflated: edu/stanford/nlp/models/pos-tagger/wsj-0-18-caseless-left3words-distsim.tagger
inflated: edu/stanford/nlp/models/pos-tagger/english-caseless-left3words-distsim.tagger.props
inflated: edu/stanford/nlp/models/pos-tagger/english-caseless-left3words-distsim.tagger
inflated: edu/stanford/nlp/models/pos-tagger/wsj-0-18-caseless-left3words-distsim.tagger.props
 created: edu/stanford/nlp/models/ner/
inflated: edu/stanford/nlp/models/ner/english.conll.4class.caseless.distsim.prop
inflated: edu/stanford/nlp/models/ner/english.muc.7class.caseless.distsim.crf.ser.gz
inflated: edu/stanford/nlp/models/ner/english.all.3class.caseless.distsim.crf.ser.gz
inflated: edu/stanford/nlp/models/ner/english.muc.7class.caseless.distsim.prop
inflated: edu/stanford/nlp/models/ner/english.nowiki.3class.caseless.distsim.prop
inflated: edu/stanford/nlp/models/ner/english.all.3class.caseless.distsim.prop
inflated: edu/stanford/nlp/models/ner/english.conll.4class.caseless.distsim.crf.ser.gz
inflated: edu/stanford/nlp/models/ner/english.nowiki.3class.caseless.distsim.crf.ser.gz

I moved the files in edu/stanford/nlp/models/ner/ into classifiers/.

NER server mode

It does not look like server mode is compatible with NLTK's interface.

Define and start the 7-class mixed case server:

$ cp stanford-ner.jar stanford-ner-with-seven-class-classifier.jar
$ jar -uf stanford-ner-with-seven-class-classifier.jar classifiers/english.muc.7class.distsim.crf.ser.gz
$ java -mx500m -cp stanford-ner-with-seven-class-classifier.jar edu.stanford.nlp.ie.NERServer -loadClassifier classifiers/english.muc.7class.distsim.crf.ser.gz -charset utf8 -port 2020 &

Define the 7-class caseless server:

$ cp stanford-ner.jar stanford-ner-with-caseless-seven-class-classifier.jar
$ jar -uf stanford-ner-with-caseless-seven-class-classifier.jar classifiers/english.muc.7class.caseless.distsim.crf.ser.gz
$ java -mx500m -cp stanford-ner-with-caseless-seven-class-classifier.jar edu.stanford.nlp.ie.NERServer -loadClassifier classifiers/english.muc.7class.caseless.distsim.crf.ser.gz -charset utf8 -port 2021 &

Define the 3-class mixed case server:

$ cp stanford-ner.jar stanford-ner-with-three-class-classifier.jar
$ jar -uf stanford-ner-with-three-class-classifier.jar classifiers/english.all.3class.distsim.crf.ser.gz

Default output format is "-outputFormat slashTags" -- see http://nlp.stanford.edu/software/crf-faq.shtml#j

$ java -mx500m -cp stanford-ner-with-three-class-classifier.jar edu.stanford.nlp.ie.NERServer -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -charset utf8 -port 2020 &

The server is now started; next, separately open a client to it:

$ java -cp stanford-ner-with-classifier.jar edu.stanford.nlp.ie.NERServer -port 2020 -client
Input some text and press RETURN to NER tag it, or just RETURN to finish.
President Barack Obama met Fidel Castro at the United Nations in New York.
President/O Barack/PERSON Obama/PERSON met/O Fidel/PERSON Castro/PERSON at/O the/O United/ORGANIZATION
Nations/ORGANIZATION in/O New/LOCATION York/LOCATION ./O
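The slashTags output is easy to turn back into (token, tag) pairs; a small sketch, assuming the tags themselves never contain a slash:

# Split slashTags output such as "Barack/PERSON Obama/PERSON met/O" into (token, tag) pairs
def parse_slash_tags(tagged):
    return [tok.rsplit("/", 1) for tok in tagged.split()]

print parse_slash_tags("President/O Barack/PERSON Obama/PERSON met/O")
# [['President', 'O'], ['Barack', 'PERSON'], ['Obama', 'PERSON'], ['met', 'O']]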

The best companion for server mode appears to be pyner (see the demo); we built python-ner_0.1-1_all.deb from GitHub:

>>> import ner
>>> st = ner.SocketNER(host='localhost', port=2020, output_format='slashTags')
>>> ST = ner.SocketNER(host='localhost', port=2021, output_format='slashTags')
>>> st = ner.HttpNER(host='localhost', port=2020)   # if using .war files over Tomcat
>>> st.get_entities("University of California is located in California, United States")
{'LOCATION': ['California', 'United States'],
'ORGANIZATION': ['University of California']}
>>> st.json_entities("Alice went to the Museum of Natural History.")
'{"ORGANIZATION": ["Museum of Natural History"], "PERSON": ["Alice"]}'

Note you have to include output_format='slashTags' or you get {} output.

First session:

  • st is using english.all.3class.distsim.crf.ser.gz on port 2020
  • ST is using english.all.3class.caseless.distsim.crf.ser.gz on port 2021

Second session:

  • st is using english.muc.7class.distsim.crf.ser.gz on port 2020
  • ST is using english.muc.7class.caseless.distsim.crf.ser.gz on port 2021

The process may hang when stanford-ner responds with a warning like:

Aug 10, 2014 11:25:39 AM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)

Or that may not be the hang moment -- some or all of the following characters cause the script to hang. The thinking with the first line is that if you detect the Unicode replacement character U+FFFD (the byte sequence \xef\xbf\xbd, a sign of earlier encoding damage), the rest of the line is probably also corrupt, so strip everything that is not ASCII. If not, remove the individual troublesome characters, simply because they stall the text processing engines, but keep them in the line variable so they stay in the file:

 if re.search("(\xef\xbf\xbd)", text): text = ''.join([x for x in text if ord(x) < 128])
 text = str(text).replace('\x00 ','').replace('\xef\xbf\xbd','')
 text = str(text).replace('\xf7','').replace('\xc3\xba','').replace('\xb6','').replace('\xa9','').replace('\xe2\x99\xaa','')
 text = str(text).replace('\xc3\xaf','').replace('\x5c','').replace('\xf1','').replace('\xe1','').replace('\xe7','').replace('\xfa','')
 text = str(text).replace('\xf3','').replace('\xed','').replace('\xe9','').replace('\xe0','')
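If stripping everything non-ASCII is acceptable, a single regex does the same job as the long .replace() chains; this is a sketch, not what the production scripts currently do:

# Remove every non-ASCII byte in one pass (Python 2 byte strings)
import re
text = re.sub(r'[^\x00-\x7f]', '', text)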

To discover the hex escape for a troublesome character, paste it into the Python interpreter as a variable and echo it:

 >>> m="á"
 >>> m
 '\xe1'

It may in addition be possible to add a timeout by editing pyner/ner/util.py -- I added the two settimeout lines below on 2014-08-10 and they don't seem to interfere with normal operation; it's unknown whether they produce real timeouts:

   try:
       s.settimeout(10)        # added: give up if the connection blocks for more than 10 seconds
       s.connect((host, port))
       s.settimeout(None)      # added: restore blocking mode for the query itself
       yield s

See Python socket connection timeout.
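An alternative that avoids patching pyner is to set a process-wide default timeout before the taggers are created, which applies to every socket Python opens afterwards; a sketch we have not load-tested:

# Process-wide socket timeout -- affects the sockets pyner opens internally
import socket, ner

socket.setdefaulttimeout(10)
st = ner.SocketNER(host='localhost', port=2020, output_format='slashTags')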

The python script NER-Stanford-01.py talks to the server, and the bash script ca:/usr/local/bin/seg-NER runs the python script, including a check of the file produced.

It looks like the server cannot handle more than six scripts querying it at once without occasionally tapping out with socket errors, but six processes can run continuously with no problems.
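Until that limit is better understood, a simple retry wrapper keeps an occasional socket error from killing a whole run; tag_with_retry is a hypothetical helper, not part of NER-Stanford-01.py:

# Hypothetical retry wrapper around a pyner query
import socket, time

def tag_with_retry(tagger, text, attempts=3, pause=2):
    for i in range(attempts):
        try:
            return tagger.get_entities(text)
        except socket.error:
            time.sleep(pause)
    return {}

entities = tag_with_retry(st, "The University of California is in California.")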

How to create a socket server

Python interface to the Stanford Named Entity Recognizer: https://github.com/dat/pyner

wget https://github.com/dat/pyner/archive/master.zip

python setup.py --command-packages=stdeb.command bdist_deb (binary only)

Then install the resulting package from root@cartago:~/software/python/pyner/pyner-master/deb_dist, for instance with dpkg -i python-ner_0.1-1_all.deb.


How to start the NLP socket server engines in a GNU screen session

Name the sessions as follows (the numbers refer to the screen session windows):

0 jobs, 1 NER-st, 2 NER-ST, 3 POS-st, 4 POS-ST, 5 Frames, 6 bash, 7 bash

NER-st -- Named Entity Recognition mixed case

cd ~/software/java/stanford-NER/stanford-ner-2014-06-16

java -XX:+UseNUMA -mx500m -cp stanford-ner-with-seven-class-classifier.jar edu.stanford.nlp.ie.NERServer -port 2020 -loadClassifier classifiers/english.muc.7class.distsim.crf.ser.gz &

NER-ST -- Named Entity Recognition uppercase

cd ~/software/java/stanford-NER/stanford-ner-2014-06-16

java -XX:+UseNUMA -mx500m -cp stanford-ner-with-caseless-seven-class-classifier.jar edu.stanford.nlp.ie.NERServer -port 2021 -loadClassifier classifiers/english.muc.7class.caseless.distsim.crf.ser.gz &

POS-st -- Parts of Speech English mixed case

cd ~/software/java/stanford-POS/stanford-postagger-2014-06-16

java -XX:+UseNUMA -mx300m -cp stanford-postagger-with-MixCaseModel.jar edu.stanford.nlp.tagger.maxent.MaxentTaggerServer -model edu/stanford/nlp/models/english-left3words-distsim.tagger -sentenceDelimiter newline -tokenize false -charset utf8 -port 9020 &

POS-ST -- Parts of Speech English uppercase

cd ~/software/java/stanford-POS/stanford-postagger-2014-06-16

java -XX:+UseNUMA -mx300m -cp stanford-postagger-with-caselessModel.jar edu.stanford.nlp.tagger.maxent.MaxentTaggerServer -model edu/stanford/nlp/models/english-caseless-left3words-distsim.tagger -sentenceDelimiter newline -tokenize false -charset utf8 -port 9021 &

Catch up

To catch up on missed tasks, first find out what did not get done. Possible failure points:

* Creation of seg files without tags, one sentence per line

This takes place under user csa by the script cc-segment-stories.

The files end up in /mnt/uptv/CAS2/rzhu4/tSegment/upload/Data/Captions/daily/yyyy/yyyy-mm/yyyy-mm-dd

Find any missing days and run the script for those days: cc-segment-stories 4 sweep daily

The script runs back four days on its own, so it will typically recover from a crash by itself.

* Integration of untagged seg files into the tv trees

This takes place under user tna by the script cc-integrate-rongda-segmentation

It only runs on the day before yesterday, so a long crash might disrupt it

Find any missing days and run the script for those days: cc-integrate-rongda-segmentation 2015-10-05 (this takes an hour or more)

* Creation of missing tags

If the jobs in this section aren't running, the Stanford tags (NER_03 and POS_02) will be missing.

The script seg-redo will check all the tags and redo any that are missing

Find any missing days and run for each day: for i in `ls -1 *.seg` ; do seg-redo $i ; done

You can run several jobs in parallel to catch up

* Verify

grep '|NER' *.seg should produce lots of results

Cronjobs for NLP scripts

Sentence splitting

45 01 * * * cc-segment-stories 2 sweep daily

55 01 * * * cc-segment-stories 3 sweep daily

05 02 * * * cc-segment-stories 4 sweep daily

Annotate French-language shows

35 07 * * * for i in $( seq 10 -1 1 ) ; do seg-PartsOfSpeech-pattern_fr $i _FR_ ; done

Annotate German-language shows

35 08 * * * for i in $( seq 10 -1 1 ) ; do seg-PartsOfSpeech-pattern_de $i _DE_ ; done

Annotate Spanish-language shows from Spain

35 09 * * * for i in $( seq 10 -1 1 ) ; do seg-PartsOfSpeech-pattern_es $i _ES_ ; done

Annotate Spanish-language US shows

35 10 * * * for i in $( seq 10 -1 1 ) ; do seg-PartsOfSpeech-pattern_es $i _KMEX_ ; done

Frames -- FrameNet tagging


Manual mode when needed:

day

for i in {1..100} ; do seg-FrameNet-Semafor here _ clobber ; sweep - 1 ; done

MBSP -- Memory-Based Shallow Parser

MBSP starts up on its own -- from cc-integrate-rongda-segmentation

15521 ? Ss 49:37 /tvspare/software/python/MBSP-6060/MBSP/mbt/Mbt -C 100 --pidfile=/tmp/mbsp_6061_chunk.pid -S 6061 -s /tvspare/software/python/MBSP-6060/MBSP/models/train.tagchunker.settings

15528 ? Ss 32:53 /tvspare/software/python/MBSP-6060/MBSP/timbl/Timbl -f /tvspare/software/python/MBSP-6060/MBSP/models/em.data -C 100 -m M -k 5 -w 2 -S 6062 --pidfile=/tmp/mbsp_6062_lemma.pid

15572 ? Ss 0:18 /tvspare/software/python/MBSP-6060/MBSP/timbl/Timbl -d IL -a 0 --pidfile=/tmp/mbsp_6063_relation.pid -C 100 -m M -L 2 -i /tvspare/software/python/MBSP-6060/MBSP/models/train.instancebase -k 19 -w 1 -v s -S 6063

15586 ? Ss 0:03 /tvspare/software/python/MBSP-6060/MBSP/timbl/Timbl -d IL -f /tvspare/software/python/MBSP-6060/MBSP/models/pp.instances -C 100 -m M -L 2 -k 11 -w 0 +v di+db -S 6064 --pidfile=/tmp/mbsp_6064_preposition.pid


Sentiment tagger python script

Sentiment-02.py

#!/usr/bin/python -W ignore
#
# Master in /home/csa/Pattern2.6/
#
# This script reads a .seg file and parses each caption line for sentiment (polarity and subjectivity).
#
# First, it scores each sentence using the native sentiment detector in Pattern2.6
# https://github.com/clips/pattern and http://www.clips.ua.ac.be/pages/pattern-en#sentiment
#
# 20140710235636.492|20140710235648.904|SMT_01|0.1|0.45|different|0.0|0.6|very|0.2|0.3
# Start time|End time|Primary tag|Sentence polarity|Sentence subjectivity(|Word|Word polarity|Word subjectivity)*
#
# Second, it scores the words present in the SentiWordNet dictionary.
# http://sentiwordnet.isti.cnr.it/ and http://sentiwordnet.isti.cnr.it/docs/SWNFeedback.pdf
# cartago:/usr/share/pyshared/pattern/text/en/wordnet/SentiWordNet_3.0.0_20130122.txt
#
# 20140720230042.157|20140720230050.866|SMT_02|HOLLYWOOD|0.0|0.0|MOURNING|-0.625|0.625|MAVERICK|0.375|0.375|OVER|0.375
# Start time|End time|Primary tag(|Word|Word polarity|Word subjectivity)*
#
# Third, it scores the sentence using TextBlob 0.9-dev PatternAnalyzer sentiment detection, which may have a better lexicon
# http://textblob.readthedocs.org/en/dev/advanced_usage.html#advanced -- we seem to get the same scores as SMT_01
#
# 20140720230042.157|20140720230050.866|SMT_03|0.5|0.5
# Start time|End time|Primary tag|Sentence polarity|Sentence subjectivity
#
# Fourth, it scores the sentence using TextBlob 0.9-dev NaiveBayesAnalyzer, an NLTK classifier trained on a movie reviews corpus
# http://textblob.readthedocs.org/en/dev/advanced_usage.html#advanced -- these scores are very different
#
# 20140720230042.157|20140720230050.866|SMT_04|0.927279087353
# Start time|End time|Primary tag|Sentence positivity
#
# Direct the output to overwrite the input .seg file (using sponge) or to a new .smt file
# for FIL in `ls -1 /db/tv/2014/2014-07/2014-07-11/*seg` ; do python SentiWordNet.py $FIL > ${FIL%.*}.smt ; done
#
# Not all of these need to be activated at once -- in July-August 2014 we ran the first and second.
#
# Can be called by cc-integrate-rongda-segmentation and seg-Sentiment
#
# Written by FFS, 2014-07-18
#
# Changelog:
#
# 2014-11-09 Fixed SMT_02
# 2014-08-13 Set #!/usr/bin/python -W ignore to turn off Unicode warnings
# 2014-08-02 Added sentiment detection SMT_03 and 04 from TextBlob
# 2014-08-02 Forked from CLiPS-03.py for a pure sentiment script
# 2014-07-27 Learned enough python to control the format, logic improved
# 2014-05-18 First version SentiWordNet.py, poor sentiment output format
#
# --------------------------------------------------------------------------------------------------

# User input
import sys, os.path
scriptname = os.path.basename(sys.argv[0])
filename = sys.argv[1]

# Help screen
if filename == "-h" :
    print "".join([ "\n","\t","This is a production script for sentiment detection -- issue:","\n" ])
    print "".join([ "\t","\t","python ",scriptname," $FIL.seg > $FIL.smt" ])
    print "".join([ "\n","\t","or use the seg-Sentiment bash script for bulk processing.","\n" ])
    quit()

# Libraries
import nltk, datetime, re

# For Pattern2.6 native sentiment detection
# http://www.clips.ua.ac.be/pages/pattern-en#sentiment
from pattern.en import sentiment

# For the SentiWordNet dictionary
# http://www.clips.ua.ac.be/pages/pattern-en#wordnet
from pattern.en import wordnet
from pattern.en import ADJECTIVE

# For TextBlob PatternAnalyzer sentiment detection
# http://textblob.readthedocs.org/en/dev/advanced_usage.html#advanced
from textblob import TextBlob

# For TextBlob NLTK sentiment detection
# http://textblob.readthedocs.org/en/dev/advanced_usage.html#advanced
from textblob.sentiments import NaiveBayesAnalyzer

# Counter
n = 0

# A. Get the lines from the file
with open(filename) as fp:
    for line in fp:

        # B. Split each line into fields
        field = line.split("|")

        # Pretty debug
        # print('\n'.join('{}: {}'.format(*k) for k in enumerate(field)))

        # C. Header and footer
        if len(field[0]) != 18:
            print line,
            continue

        # D. Program credit
        if n == 0:
            credit=["SMT_01|",datetime.datetime.now().strftime("%Y-%m-%d %H:%M"),"|Source_Program=Pattern 2.6, ",scriptname,"|Source_Person=Tom De Smedt, FFS|Codebook=polarity, subjectivity"]
            print "".join(credit)
            credit=["SMT_02|",datetime.datetime.now().strftime("%Y-%m-%d %H:%M"),"|Source_Program=SentiWordNet 3.0, ",scriptname,"|Source_Person=Andrea Esuli, FFS|Codebook=polarity, subjectivity"]
            print "".join(credit)
            # credit=["SMT_03|",datetime.datetime.now().strftime("%Y-%m-%d %H:%M"),"|Source_Program=TextBlob 0.9-dev, ",scriptname,"|Source_Person=Steven Loria, FFS|Codebook=polarity, subjectivity"]
            # print "".join(credit)
            # credit=["SMT_04|",datetime.datetime.now().strftime("%Y-%m-%d %H:%M"),"|Source_Program=TextBlob 0.9-dev, NLTK, ",scriptname,"|Source_Person=Steven Loria, FFS|Codebook=positivity"]
            # print "".join(credit)
            n=1

        # E. Write segment tags and other non-caption lines
        if field[2] == "SEG":
            print line,
            continue
        elif len(field[2]) != 3:
            print line,
            continue

        # F. Get the text, clean leading chevrons -- if the replacement character is present, strip non-ascii, otherwise remove individually
        try:
            text = re.sub('^[>,\ ]{0,6}','', field[3])
            if re.search("(\xef\xbf\xbd)", text): text = ''.join([x for x in text if ord(x) < 128])
            text = str(text).replace('\x00 ','').replace('\xef\xbf\xbd','')
            text = str(text).replace('\xf7','').replace('\xc3\xba','').replace('\xb6','').replace('\xa9','').replace('\xe2\x99\xaa','')
            text = str(text).replace('\xc3\xaf','').replace('\x5c','').replace('\xf1','').replace('\xe1','').replace('\xe7','').replace('\xfa','')
            text = str(text).replace('\xf3','').replace('\xed','').replace('\xe9','').replace('\xe0','').replace('\xae','').replace('\xc2','')
            text = str(text).replace('\xc3','').replace('\xa2','').replace('\xbf','').replace('\xd1','').replace('\xb1','').replace('\xe2','')
            text = str(text).replace('\xb7','').replace('\xad','').replace('\xb0','').replace('\x84','').replace('\xf8','').replace('\xa1','')
            text = str(text).replace('\xa4','').replace('\xf6','').replace('\x89','').replace('\xa6','').replace('\xa7','').replace('\x96','')
            text = str(text).replace('\xe4','').replace('\xd9','').replace('\x91','').replace('\xcd','').replace('\xda','').replace('\xeb','')
            text = str(text).replace('\xa6','').replace('\xdc','').replace('\xb3','').replace('\xa7','')
            # print text
        except IndexError:
            print line
            continue

        # G. Remove clearly wrong unicode characters -- replacement character, NULL (only utf8 hex works)
        line = str(line).replace('\x00 ','').replace('\xef\xbf\xbd','')
        print line,
        snt = ""
        smt = ""
        terms = ""
        smt3 = ""
        smt4 = ""

        # You could split the text into sentences, but it breaks the adjacency of text and tags
        # http://www.clips.ua.ac.be/pages/pattern-en
        # tokenize(string, punctuation=".,;:!?()[]{}`''\"@#$^&*+-|=~_", replace={})
        # from pattern.en import tokenize
        # for sentence in tokenize(text):

        # H. Pattern2.6 built-in sentiment detection by sentence and words
        snt = sentiment(text)
        for tup in sentiment(text).assessments:
            words = " ".join(tup[0])
            terms = "".join([terms,"|",words,"|",str(tup[1]),"|",str(tup[2])])
        if snt != "": print "".join([field[0],"|",field[1],"|SMT_01|",str(snt[0]),"|",str(snt[1]),terms])

        # I. Word loop for the SentiWordNet dictionary
        try:
            for word in nltk.word_tokenize(text):
                try:
                    weight = wordnet.synsets(word, ADJECTIVE)[0].weight
                    smt = "".join([smt,"|",word,"|",str(weight[0]),"|",str(weight[1])])
                except (UnicodeDecodeError, UnicodeEncodeError, IndexError, AssertionError): pass
            if smt != "": print "".join([field[0],"|",field[1],"|SMT_02",smt])
        except (UnicodeDecodeError, UnicodeEncodeError): continue

        # Skip SMT_03 and SMT_04 below -- remove this continue to activate them
        continue

        # J. TextBlob default PatternAnalyzer sentiment detection
        try:
            smt3 = TextBlob(text)
            # Sentiment(polarity=0.13636363636363635, subjectivity=0.5)
        except UnicodeDecodeError: continue
        if smt3 != "": print "".join([field[0],"|",field[1],"|SMT_03|",str(smt3.sentiment[0]),"|",str(smt3.sentiment[1])])

        # K. TextBlob NLTK sentiment detection
        try:
            smt4 = TextBlob(text, analyzer=NaiveBayesAnalyzer())
            # Sentiment(classification='pos', p_pos=0.9272790873528883, p_neg=0.07272091264711199)
        except UnicodeDecodeError: continue
        if smt4 != "": print "".join([field[0],"|",field[1],"|SMT_04|",str(smt4.sentiment[1])])

# L. Close the file
fp.close()

# EOF


Sentiment tagger bash script

This script calls the Python script, feeds it one file at a time, and checks that the output is sane.

#!/bin/bash
#
# /usr/local/bin/seg-sentiment
#
# Written by FFS 2014-07-18 -- development in csa@ca:~/pattern2.6
#
# Changelog
#
# 2014-07-28 Add sentence-level sentiment annotation
#
#---------------------------------------------------------------------

# Parsing script and primary tag (note the parsescript currently adds SMT_01 and SMT_02)
ParseScript=Sentiment-02.py
PTAG=SMT_01
SCRIPT=`basename $0`

# Help screen
if [ "$1" = "-h" -o "$1" = "--help" -o "$1" = "help" ]
  then echo -e "\n\t$SCRIPT [<date> or #] [<partially matching filename>] [clobber]\n"
    echo -e "\tDetect parts of speech and sentiment in seg files using CLiPS MBSP 1.4 and Pattern2.6.\n"
    echo -e "\tExamples:\n"
    echo -e "\tProcess the .seg files from seven days ago:\n"
    echo -e "\t\t$SCRIPT 7 \n"
    echo -e "\tProcess only Aljazeera files from a given date, removing any pre-existing SMT codes:\n"
    echo -e "\t\t$SCRIPT 2006-12-28 Aljazeera clobber\n"
    echo -e "\tUse a for loop to process a series of days (see daysago) -- _ matches all files:\n"
    echo -e "\t\tfor d in {3701..2} ; do seg-sentiment \$d _ clobber ; done\n"
    echo -e "\tThe files are processed in /tv, where they will be sync'd.\n"
    echo -e "\tCodebook:\n"
    echo -e "\t20140710235636.492|20140710235648.904|SMT_01|0.1|0.45|different|0.0|0.6|very|0.2|0.3"
    echo -e "\tStart time|End time|Primary tag|Sentence polarity|Sentence subjectivity(|Word|Word polarity|Word subjectivity)*\n"
    echo -e "\t20140720230042.157|20140720230050.866|SMT_02|HOLLYWOOD|0.0|0.0|MOURNING|-0.625|0.625|MAVERICK|0.375|0.375|OVER|0.375"
    echo -e "\tStart time|End time|Primary tag(|Word|Word polarity|Word subjectivity)*\n"
    exit
fi

# Get the date to work on (today's date may change while the script is running)
if [ "$1" = "here" ] ; then DAY="$( pwd )" DAY=${DAY##*/}
  elif [ -z "$1" ] ; then DAY="$(date +%F)"
  elif [ "$(echo $1 | grep '[^0-9]')" = "" ]   # if the first parameter is an integer
    then DAY="$(date -d "-$1 day" +%F)"
  else DAY="$1"
fi

# Sanity check
if [ "$(date -d "$DAY" 2>/dev/null)" = "" ]
  then echo -e "\n\t$SCRIPT [<date> or #] [<partially matching filename>] [clobber]\n" ; exit
fi

# Partial file name?
if [ -z "$2" ]
  then NAM="_"
  else NAM="$2"
fi ; NAM=""$DAY"*"$NAM""

# Generate the date-dependent portion of the path
DDIR="$(date -d "$DAY" +%Y)/$(date -d "$DAY" +%Y-%m)/$(date -d "$DAY" +%F)"

# Base source, and target directories
if [ "$HOST" = "roma" ] ; then SDIR=/tv/$DDIR ; else SDIR=/tv/$DDIR ; fi

# Define the trouble log and e-mail recipients
FAILED="/tmp/$SCRIPT.log"
TO=`cat /usr/local/bin/e-mail`

# Temporary file extension
EXT=smt

# File counter
NUM=0

# Welcome
echo -e "\n\tSentiment detection with $ParseScript in seg files on $DAY at $( date )\n"

# Process each video file in turn
for FIL in $( ls -1 $SDIR/$NAM*.seg 2> /dev/null ); do

  # Strip path and extension
  FIL=${FIL##*/} FIL=${FIL%.*}

  # Skip non-English
  LAN=$( grep ^'LAN|' $FIL.seg ) ; if [ -n "$LAN" ] ; then if [ "${LAN#*|}" != "ENG" ] ; then continue ; fi ; fi

  # Skip KMEX and PT
  if [[ $FIL == *_KMEX_* || $FIL == *_PT_* ]] ; then continue ; fi

  # Check for existing $PTAG tags
  if [ "$( egrep ^"$PTAG" $SDIR/$FIL.seg )" != "" ] ; then
    if [ "$3" = "clobber" ] ; then
      # Tweak as needed to identify the version considered up to date
      if [ "$( egrep -m1 $ParseScript $SDIR/$FIL.seg | grep $PTAG 2>/dev/null )" != "" ]
        then echo -e "\t\t$ParseScript annotation $PTAG present in $FIL.seg -- skipping" ; continue
        else echo -en "\t\tRe-annotating $PTAG with $ParseScript $FIL.seg"
          sed -i "/${PTAG%_*}/d" $SDIR/$FIL.seg
      fi
    else echo -e "\t\t$ParseScript $PTAG completed in $FIL.seg" ; continue
    fi
  else echo -en "\t\tAnnotating $PTAG with $ParseScript $FIL.seg"
  fi

  # Get the size of the seg file
  S0="$( stat --format=%s $SDIR/$FIL.seg 2>/dev/null )"

  # Background the annotation process to catch hangs
  $ParseScript $SDIR/$FIL.seg > $SDIR/$FIL.$EXT &

  # Get the PID and the start time of the annotation
  PID=$! AGE="$( date +%s )"

  # Wait a few seconds for the temporary file to start growing
  n=0 ; while [ ! -s $SDIR/$FIL.$EXT -a $n -lt 200 ] ; do sleep 0.2 ; n=$[n+1] ; done ; S1=0 S2=0 n=0

  # Terminate if the file stops growing (see margin counter for files that enter this loop)
  while ps -p $PID > /dev/null ; do
    S1="$( stat --format=%s $SDIR/$FIL.$EXT 2>/dev/null )" ; sleep 1 ; NOW="$(date +%s)" ; LASTED="$[NOW-$AGE]"
    S2="$( stat --format=%s $SDIR/$FIL.$EXT 2>/dev/null )" n=$[n+1] ; tput cr ; echo -n $n
    if [ "$S1" -eq "$S2" -a $n -gt 30 ] ; then kill $PID ; rm -f $SDIR/$FIL.$EXT
      echo -e "\n\t`date +%F\ %H:%M` \t${SCRIPT%.*} \tNo grow \t$LASTED secs \t$FIL.seg" | tee -a $FAILED.$( date +%F )
    fi
  done ; LASTED="$[NOW-$AGE]" DAY=$( date +%F ) YAY=$( date -d "-1 day" +%F )

  # Check the temporary file
  S1="$( stat --format=%s $SDIR/$FIL.$EXT 2>/dev/null )"
  if [ "$S1" -gt "$S0" -a "$( tail -n1 $SDIR/$FIL.$EXT | grep END )" != "" ]
    then mv $SDIR/$FIL.$EXT $SDIR/$FIL.seg ; echo
    else echo -e "\t\tFailed annotation -- please check $FIL.seg" | tee -a $FAILED.$( date +%F ) ; continue
  fi

  NUM=$[NUM+1]

done

# Sanity check
if [ "$FIL" = "" ] ; then echo -e "\tUse a partially matching file name -- leave out the date and the extension.\n" ; exit ; fi

# Receipt
echo -e "\n\tCompleted scoring sentiments in $NUM seg files in $SDIR\n"

# EOF