An Application: Python, Stanford Named Entity Recognizer
How to set up the Stanford Named Entity Recognizer
NER models
Included with Stanford NER are a 4-class model trained on CoNLL data, a 7-class model trained on MUC data, and a 3-class model trained on both data sets for the intersection of those class sets.
3 class: Location, Person, Organization
4 class: Location, Person, Organization, Misc
7 class: Time, Location, Organization, Person, Money, Percent, Date
Also available, as part of Stanford's package of caseless models for several of their tools, are caseless versions of these same models.
The caseless models have been installed with the others, so Red Hen has these classifiers available:
tna@cartago:/tvspare/software/java/stanford-NER/stanford-ner-2014-06-16$ l classifiers/*gz
-rw-r--r-- 1 tna tna 27160028 Nov 12 2013 classifiers/english.all.3class.caseless.distsim.crf.ser.gz
-rw-rw-r-- 1 tna tna 24732086 Jun 16 00:46 classifiers/english.all.3class.distsim.crf.ser.gz
-rw-r--r-- 1 tna tna 22115850 Nov 12 2013 classifiers/english.conll.4class.caseless.distsim.crf.ser.gz
-rw-rw-r-- 1 tna tna 18350357 Jun 16 00:46 classifiers/english.conll.4class.distsim.crf.ser.gz
-rw-r--r-- 1 tna tna 20019633 Nov 12 2013 classifiers/english.muc.7class.caseless.distsim.crf.ser.gz
-rw-rw-r-- 1 tna tna 17824631 Jun 16 00:46 classifiers/english.muc.7class.distsim.crf.ser.gz
-rw-r--r-- 1 tna tna 20430544 Nov 12 2013 classifiers/english.nowiki.3class.caseless.distsim.crf.ser.gz
-rw-rw-r-- 1 tna tna 18954462 Jun 16 00:46 classifiers/english.nowiki.3class.distsim.crf.ser.gz
The classifiers differ in accuracy and should be evaluated before settling on one; a quick side-by-side comparison is sketched after the first tagging example below.
NER instructions
>>> from nltk.tag.stanford import NERTagger
>>> st = NERTagger(
'/tvspare/software/java/stanford-NER/stanford-ner-2014-06-16/classifiers/english.all.3class.distsim.crf.ser.gz',
'/tvspare/software/java/stanford-NER/stanford-ner-2014-06-16/stanford-ner.jar', 'utf-8')
The first argument is a classifier, the second the main jar file, and the third the encoding of the training data -- always utf-8.
>>> text = "The University of California professor downloaded the Stanford NER tagger in Rome."
>>> st.tag(text.split())
[(u'The', u'O'), (u'University', u'ORGANIZATION'), (u'of', u'ORGANIZATION'), (u'California', u'ORGANIZATION'),
(u'professor', u'O'), (u'downloaded', u'O'), (u'the', u'O'), (u'Stanford', u'O'), (u'NER', u'O'), (u'tagger',
u'O'), (u'in', u'O'), (u'Rome', u'LOCATION'), (u'.', u'O')]
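The differences between the classifiers are easy to eyeball by running the same sentence through each one, using the NERTagger interface just shown and the classifier files from the listing above. A minimal sketch -- not a real evaluation, which would need a hand-labeled gold sample:

# Sketch: run one sentence through each mixed-case classifier and compare.
from nltk.tag.stanford import NERTagger

BASE = '/tvspare/software/java/stanford-NER/stanford-ner-2014-06-16'
JAR = BASE + '/stanford-ner.jar'
CLASSIFIERS = ['english.all.3class.distsim.crf.ser.gz',
               'english.conll.4class.distsim.crf.ser.gz',
               'english.muc.7class.distsim.crf.ser.gz']

sentence = "Barack Obama paid CNN a visit in Atlanta on Tuesday."
for c in CLASSIFIERS:
    tagger = NERTagger(BASE + '/classifiers/' + c, JAR, 'utf-8')
    print c.split('.')[1], tagger.tag(sentence.split())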
The Red Hen collection has some files that are properly capitalized (e.g., Democracy Now), but most files are in all caps. To handle both, we also define a caseless tagger:
>>> ST = NERTagger(
'/tvspare/software/java/stanford-NER/stanford-ner-2014-06-16/classifiers/english.all.3class.caseless.distsim.crf.ser.gz',
'/tvspare/software/java/stanford-NER/stanford-ner-2014-06-16/stanford-ner.jar', 'utf-8')
>>> text = text.upper()
>>> ST.tag(text.split())
[(u'THE', u'O'), (u'UNIVERSITY', u'ORGANIZATION'), (u'OF', u'ORGANIZATION'), (u'CALIFORNIA', u'ORGANIZATION'),
(u'PROFESSOR', u'O'), (u'DOWNLOADED', u'O'), (u'THE', u'O'), (u'STANFORD', u'ORGANIZATION'), (u'NER', u'O'),
(u'TAGGER', u'O'), (u'IN', u'O'), (u'ROME', u'LOCATION'), (u'.', u'O')]
We can check the first couple of lines of a file for case and select the tagger once per file, or test the case per line -- likely extremely slow:

if text.isupper():
    ST.tag(text.split())
else:
    st.tag(text.split())
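A minimal sketch of the per-file approach, assuming the st and ST taggers defined above; the seg filename is hypothetical:

def pick_tagger(filename, sample_lines=5):
    # Sample the first few lines; if they are all caps, pick the caseless model.
    # In real seg files you may want to skip the header and sample caption lines only.
    with open(filename) as fp:
        sample = " ".join([fp.readline() for i in range(sample_lines)])
    if sample.isupper():
        return ST    # caseless model
    return st        # mixed-case model

tagger = pick_tagger('/db/tv/2014/2014-07/2014-07-11/example.seg')   # hypothetical file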
It's possible that running the Stanford NER in server mode would be faster.
Below is an example of a script, PartsOfSpeech-MBSP-05.py, that reads a file, parses it, and writes the results. An NER script can quickly be derived from it; such a script would be easy to maintain and would let us combine different tools. It uses seg files as input, with sentences already split. There are other working scripts in csa@ca:~/Pattern2.6. Modifying this script is the third task; a sketch of the derived script follows.
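The field layout below follows the seg conventions used in Sentiment-02.py later on this page; the output line format is illustrative only, not the production NER codebook:

#!/usr/bin/python
# Sketch: read a seg file, run NER on each caption line, emit an annotation line.
import sys
from nltk.tag.stanford import NERTagger

BASE = '/tvspare/software/java/stanford-NER/stanford-ner-2014-06-16'
st = NERTagger(BASE + '/classifiers/english.all.3class.distsim.crf.ser.gz',
               BASE + '/stanford-ner.jar', 'utf-8')

with open(sys.argv[1]) as fp:
    for line in fp:
        field = line.split("|")
        print line,                      # keep the original line in the output
        # Caption lines have an 18-character timestamp and a 3-character primary tag
        if len(field) < 4 or len(field[0]) != 18 or len(field[2]) != 3:
            continue
        entities = [w + "|" + t for w, t in st.tag(field[3].split()) if t != "O"]
        if entities:
            print "|".join([field[0], field[1], "NER"] + entities)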
We could also write a new script for parsing the raw incoming text into sentences. This should be done as a separate script. I expect that NLTK or Pattern2.6 will be adequate for this, and they're fast. I don't think we need Stanford CoreNLP for this, though we could certainly install it too.
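A minimal sketch with NLTK's pre-trained Punkt splitter (Pattern's tokenize() would serve equally well); the sample text is illustrative:

import nltk.data

# Load the Punkt sentence splitter once, then reuse it for every file.
splitter = nltk.data.load('tokenizers/punkt/english.pickle')

raw = "Pres. Obama spoke today. The speech ran long. It ended at 9 p.m. on CNN."
for sentence in splitter.tokenize(raw.strip()):
    print sentence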
NER installation
wget http://www-nlp.stanford.edu/software/stanford-ner-2014-06-16.zip   # version 3.4
wget http://nlp.stanford.edu/software/stanford-corenlp-caseless-2014-02-25-models.jar
jar -xvf stanford-corenlp-caseless-2014-02-25-models.jar
created: META-INF/
inflated: META-INF/MANIFEST.MF
created: edu/
created: edu/stanford/
created: edu/stanford/nlp/
created: edu/stanford/nlp/models/
created: edu/stanford/nlp/models/lexparser/
inflated: edu/stanford/nlp/models/lexparser/englishPCFG.caseless.ser.gz
created: edu/stanford/nlp/models/pos-tagger/
inflated: edu/stanford/nlp/models/pos-tagger/wsj-0-18-caseless-left3words-distsim.tagger
inflated: edu/stanford/nlp/models/pos-tagger/english-caseless-left3words-distsim.tagger.props
inflated: edu/stanford/nlp/models/pos-tagger/english-caseless-left3words-distsim.tagger
inflated: edu/stanford/nlp/models/pos-tagger/wsj-0-18-caseless-left3words-distsim.tagger.props
created: edu/stanford/nlp/models/ner/
inflated: edu/stanford/nlp/models/ner/english.conll.4class.caseless.distsim.prop
inflated: edu/stanford/nlp/models/ner/english.muc.7class.caseless.distsim.crf.ser.gz
inflated: edu/stanford/nlp/models/ner/english.all.3class.caseless.distsim.crf.ser.gz
inflated: edu/stanford/nlp/models/ner/english.muc.7class.caseless.distsim.prop
inflated: edu/stanford/nlp/models/ner/english.nowiki.3class.caseless.distsim.prop
inflated: edu/stanford/nlp/models/ner/english.all.3class.caseless.distsim.prop
inflated: edu/stanford/nlp/models/ner/english.conll.4class.caseless.distsim.crf.ser.gz
inflated: edu/stanford/nlp/models/ner/english.nowiki.3class.caseless.distsim.crf.ser.gz
I moved the files in edu/stanford/nlp/models/ner/ into classifiers/.
NER server mode
It does not look like server mode is compatible with NLTK's interface.
- http://nlp.stanford.edu/software/crf-faq.shtml#cc
- http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ie/NERServer.html
Define and start the 7-class mixed case server:
$ cp stanford-ner.jar stanford-ner-with-seven-class-classifier.jar
$ jar -uf stanford-ner-with-seven-class-classifier.jar classifiers/english.muc.7class.distsim.crf.ser.gz
$ java -mx500m -cp stanford-ner-with-seven-class-classifier.jar edu.stanford.nlp.ie.NERServer \
    -loadClassifier classifiers/english.muc.7class.distsim.crf.ser.gz -charset utf8 -port 2020 &
Define the 7-class caseless server:
$ cp stanford-ner.jar stanford-ner-with-caseless-seven-class-classifier.jar
$ jar -uf stanford-ner-with-caseless-seven-class-classifier.jar classifiers/english.muc.7class.caseless.distsim.crf.ser.gz
$ java -mx500m -cp stanford-ner-with-caseless-seven-class-classifier.jar edu.stanford.nlp.ie.NERServer \
    -loadClassifier classifiers/english.muc.7class.caseless.distsim.crf.ser.gz -charset utf8 -port 2021 &
Define the 3-class mixed case server:
$ cp stanford-ner.jar stanford-ner-with-three-class-classifier.jar
$ jar -uf stanford-ner-with-three-class-classifier.jar classifiers/english.all.3class.distsim.crf.ser.gz
Default output format is "-outputFormat slashTags" -- see http://nlp.stanford.edu/software/crf-faq.shtml#j
$ java -mx500m -cp stanford-ner-with-three-class-classifier.jar edu.stanford.nlp.ie.NERServer \
    -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -charset utf8 -port 2020 &
The server is now running; in a separate terminal, open a client to it:
$ java -cp stanford-ner-with-three-class-classifier.jar edu.stanford.nlp.ie.NERServer -port 2020 -client
Input some text and press RETURN to NER tag it, or just RETURN to finish.
President Barack Obama met Fidel Castro at the United Nations in New York.
President/O Barack/PERSON Obama/PERSON met/O Fidel/PERSON Castro/PERSON at/O the/O United/ORGANIZATION
Nations/ORGANIZATION in/O New/LOCATION York/LOCATION ./O
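pyner (below) is the convenient way to talk to the server from Python, but the wire protocol visible in the transcript above is simple: send one newline-terminated line of text, read back one slash-tagged line. A minimal raw-socket sketch:

import socket

def ner_tag(text, host='localhost', port=2020):
    # One newline-terminated line out, one slash-tagged line back.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((host, port))
    s.sendall(text.strip() + "\n")
    tagged = ""
    while not tagged.endswith("\n"):
        chunk = s.recv(4096)
        if not chunk:
            break
        tagged += chunk
    s.close()
    return tagged.strip()

print ner_tag("President Barack Obama met Fidel Castro at the United Nations.")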
The best companion for server mode appears to be pyner (see the demo) -- we built python-ner_0.1-1_all.deb from the GitHub source:
>>> import ner
>>> st = ner.SocketNER(host='localhost', port=2020, output_format='slashTags')
>>> ST = ner.SocketNER(host='localhost', port=2021, output_format='slashTags')
>>> st = ner.HttpNER(host='localhost', port=2020)  # if serving .war files via Tomcat
>>> st.get_entities("University of California is located in California, United States")
{'LOCATION': ['California', 'United States'],
'ORGANIZATION': ['University of California']}
>>> st.json_entities("Alice went to the Museum of Natural History.")
'{"ORGANIZATION": ["Museum of Natural History"], "PERSON": ["Alice"]}'
Note you have to include output_format='slashTags' or you get {} output.
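To fold the server's answers back into a seg file, something like the following can bridge pyner and the pipe-delimited format. The tag name and layout here are illustrative; the production NER_03 codebook may differ:

def ner_seg_line(field, tagger):
    # field is a caption line already split on "|": [start, end, primary tag, text, ...]
    entities = tagger.get_entities(field[3])
    out = [field[0], field[1], "NER_03"]
    for etype in sorted(entities):
        for name in entities[etype]:
            out += [etype, name]
    return "|".join(out)

# e.g. print ner_seg_line(line.split("|"), st)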
First session:
- st is using english.all.3class.distsim.crf.ser.gz on port 2020
- ST is using english.all.3class.caseless.distsim.crf.ser.gz on port 2021
Second session:
- st is using english.muc.7class.distsim.crf.ser.gz on port 2020
- ST is using english.muc.7class.caseless.distsim.crf.ser.gz on port 2021
The process may hang when stanford-ner logs a warning like:
Aug 10, 2014 11:25:39 AM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Or that may not be the actual hang point -- some or all of the following characters cause the script to stall. The thinking behind the first line is that if you detect \xef\xbf\xbd -- the UTF-8 bytes of the replacement character U+FFFD flagged in the warning above -- the rest of the line is likely also corrupt, so convert everything to ASCII. If not, remove the individual characters, simply because they stall the text-processing engines, but keep them in the line variable so they stay in the file:
if re.search("(\xef\xbf\xbd)", text): text = ''.join([x for x in text if ord(x) < 128])
text = str(text).replace('\x00 ','').replace('\xef\xbf\xbd','')
text = str(text).replace('\xf7','').replace('\xc3\xba','').replace('\xb6','').replace('\xa9','').replace('\xe2\x99\xaa','')
text = str(text).replace('\xc3\xaf','').replace('\x5c','').replace('\xf1','').replace('\xe1','').replace('\xe7','').replace('\xfa','')
text = str(text).replace('\xf3','').replace('\xed','').replace('\xe9','').replace('\xe0','')
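An alternative to enumerating bytes one at a time is to round-trip the line through the codecs, dropping anything that is not valid UTF-8 and then anything outside ASCII. A sketch (Python 2, where text is a byte string); like the chains above, it removes accented characters outright:

def to_ascii(text):
    # Drop malformed UTF-8 byte sequences, then drop all remaining non-ASCII.
    return text.decode('utf-8', 'ignore').encode('ascii', 'ignore')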
To discover the UTF-8 hex code for a character, assign it to a variable in the Python interpreter:
>>> m="á"
>>> m
'\xe1'
It may also be possible to add a timeout by editing pyner/ner/util.py -- the two settimeout lines below were added on 2014-08-10 and don't seem to interfere with normal operations; it is unconfirmed whether they actually enforce timeouts:
try:
    s.settimeout(10)
    s.connect((host, port))
    s.settimeout(None)
    yield s
See Python socket connection timeout.
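For context, the helper being edited is a contextmanager around a socket; with the two added lines it looks roughly like this (a sketch, not the verbatim upstream code):

import socket
from contextlib import contextmanager

@contextmanager
def tcpip_socket(host, port):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.settimeout(10)     # added: fail if the server does not accept within 10s
        s.connect((host, port))
        s.settimeout(None)   # added: restore blocking mode for the actual I/O
        yield s
    finally:
        s.close()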
The python script NER-Stanford-01.py talks to the server, and the bash script ca:/usr/local/bin/seg-NER runs the python script, including a check of the file produced.
It looks like the server cannot handle more than six scripts querying it at once without occasionally failing with socket errors, but six processes can run continuously with no problems.
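When running close to that limit, a thin retry wrapper keeps an occasional socket error from killing a whole batch; a sketch, assuming the pyner taggers defined above:

import socket, time

def get_entities_retry(tagger, text, tries=3, wait=2):
    # Retry transient socket errors with a growing pause before giving up.
    for attempt in range(tries):
        try:
            return tagger.get_entities(text)
        except socket.error:
            time.sleep(wait * (attempt + 1))
    return {}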
How to create a socket server
Python interface to the Stanford Named Entity Recognizer: https://github.com/dat/pyner
wget https://github.com/dat/pyner/archive/master.zip
python setup.py --command-packages=stdeb.command bdist_deb   # binary only
root@cartago:~/software/python/pyner/pyner-master/deb_dist# dpkg -i python-ner_0.1-1_all.deb
How to start the NLP socket server engines in a GNU screen session
Name the sessions as follows (the numbers refer to the screen session windows):
0 jobs 1 NER-st 2 NER-ST 3 POS-st 4 POS-ST 5 Frames 6 bash 7 bash
NER-st -- Named Entity Recognition mixed case
cd ~/software/java/stanford-NER/stanford-ner-2014-06-16
java -XX:+UseNUMA -mx500m -cp stanford-ner-with-seven-class-classifier.jar edu.stanford.nlp.ie.NERServer -port 2020 -loadClassifier classifiers/english.muc.7class.distsim.crf.ser.gz &
NER-ST -- Named Entity Recognition uppercase
cd ~/software/java/stanford-NER/stanford-ner-2014-06-16
java -XX:+UseNUMA -mx500m -cp stanford-ner-with-caseless-seven-class-classifier.jar edu.stanford.nlp.ie.NERServer -port 2021 -loadClassifier classifiers/english.muc.7class.caseless.distsim.crf.ser.gz &
POS-st -- Parts of Speech English mixed case
cd ~/software/java/stanford-POS/stanford-postagger-2014-06-16
java -XX:+UseNUMA -mx300m -cp stanford-postagger-with-MixCaseModel.jar edu.stanford.nlp.tagger.maxent.MaxentTaggerServer -model edu/stanford/nlp/models/english-left3words-distsim.tagger -sentenceDelimiter newline -tokenize false -charset utf8 -port 9020 &
POS-ST -- Parts of Speech English uppercase
cd ~/software/java/stanford-POS/stanford-postagger-2014-06-16
java -XX:+UseNUMA -mx300m -cp stanford-postagger-with-caselessModel.jar edu.stanford.nlp.tagger.maxent.MaxentTaggerServer -model edu/stanford/nlp/models/english-caseless-left3words-distsim.tagger -sentenceDelimiter newline -tokenize false -charset utf8 -port 9021 &
Catch up
To catch up on missed tasks, first find out what did not get done. Possible failure points:
* Creation of seg files without tags, one sentence per line
This runs under user csa via the script cc-segment-stories.
The files end up in /mnt/uptv/CAS2/rzhu4/tSegment/upload/Data/Captions/daily/yyyy/yyyy-mm/yyyy-mm-dd
Find any missing days and run the script for those days: cc-segment-stories 4 sweep daily
The script reaches back four days on its own, so it will typically recover from a crash unaided
* Integration of untagged seg files into the tv trees
This runs under user tna via the script cc-integrate-rongda-segmentation
It only runs on the day before yesterday, so a longer outage might disrupt it
Find any missing days and run the script for those days: cc-integrate-rongda-segmentation 2015-10-05 (this takes an hour or more)
* Creation of missing tags
If the jobs in this section aren't running, the Stanford tags (NER_03 and POS_02) will be missing
The script seg-redo will check all the tags and redo any that are missing
Find any missing days and run for each day: for i in `ls -1 *.seg` ; do seg-redo $i ; done
You can run several jobs in parallel to catch up
* Verify
grep '|NER' *.seg should produce lots of results
Cronjobs for NLP scripts
Sentence splitting
45 01 * * * cc-segment-stories 2 sweep daily
55 01 * * * cc-segment-stories 3 sweep daily
05 02 * * * cc-segment-stories 4 sweep daily
Annotate French-language shows
35 07 * * * for i in $( seq 10 -1 1 ) ; do seg-PartsOfSpeech-pattern_fr $i _FR_ ; done
Annotate German-language shows
35 08 * * * for i in $( seq 10 -1 1 ) ; do seg-PartsOfSpeech-pattern_de $i _DE_ ; done
Annotate Spanish-language Spanish shows
35 09 * * * for i in $( seq 10 -1 1 ) ; do seg-PartsOfSpeech-pattern_es $i _ES_ ; done
Annotate Spanish-language US shows
35 10 * * * for i in $( seq 10 -1 1 ) ; do seg-PartsOfSpeech-pattern_es $i _KMEX_ ; done
Frames -- FrameNet tagging
Manual mode when needed:
day
for i in {1..100} ; do seg-FrameNet-Semafor here _ clobber ; sweep - 1 ; done
MBSP -- Memory-Based Shallow Parser
MBSP starts up on its own -- from cc-integrate-rongda-segmentation
15521 ? Ss 49:37 /tvspare/software/python/MBSP-6060/MBSP/mbt/Mbt -C 100 --pidfile=/tmp/mbsp_6061_chunk.pid -S 6061 -s /tvspare/software/python/MBSP-6060/MBSP/models/train.tagchunker.settings
15528 ? Ss 32:53 /tvspare/software/python/MBSP-6060/MBSP/timbl/Timbl -f /tvspare/software/python/MBSP-6060/MBSP/models/em.data -C 100 -m M -k 5 -w 2 -S 6062 --pidfile=/tmp/mbsp_6062_lemma.pid
15572 ? Ss 0:18 /tvspare/software/python/MBSP-6060/MBSP/timbl/Timbl -d IL -a 0 --pidfile=/tmp/mbsp_6063_relation.pid -C 100 -m M -L 2 -i /tvspare/software/python/MBSP-6060/MBSP/models/train.instancebase -k 19 -w 1 -v s -S 6063
15586 ? Ss 0:03 /tvspare/software/python/MBSP-6060/MBSP/timbl/Timbl -d IL -f /tvspare/software/python/MBSP-6060/MBSP/models/pp.instances -C 100 -m M -L 2 -k 11 -w 0 +v di+db -S 6064 --pidfile=/tmp/mbsp_6064_preposition.pid
Sentiment tagger python script
Sentiment-02.py
#!/usr/bin/python -W ignore
#
# Master in /home/csa/Pattern2.6/
#
# This script reads a .seg file and parses each caption line for sentiment (polarity and subjectivity).
#
# First, it scores each sentence using the native sentiment detector in Pattern2.6
# https://github.com/clips/pattern and http://www.clips.ua.ac.be/pages/pattern-en#sentiment
#
# 20140710235636.492|20140710235648.904|SMT_01|0.1|0.45|different|0.0|0.6|very|0.2|0.3
# Start time|End time|Primary tag|Sentence polarity|Sentence subjectivity(|Word|Word polarity|Word subjectivity)*
#
# Second, it scores the words present in the SentiWordNet dictionary.
# http://sentiwordnet.isti.cnr.it/ and http://sentiwordnet.isti.cnr.it/docs/SWNFeedback.pdf
# cartago:/usr/share/pyshared/pattern/text/en/wordnet/SentiWordNet_3.0.0_20130122.txt
#
# 20140720230042.157|20140720230050.866|SMT_02|HOLLYWOOD|0.0|0.0|MOURNING|-0.625|0.625|MAVERICK|0.375|0.375|OVER|0.375
# Start time|End time|Primary tag(|Word|Word polarity|Word subjectivity)*
#
# Third, it scores the sentence using TextBlob 0.9-dev PatternAnalyzer sentiment detection, which may have a better lexicon
# http://textblob.readthedocs.org/en/dev/advanced_usage.html#advanced -- we seem to get the same scores as SMT_01
#
# 20140720230042.157|20140720230050.866|SMT_03|0.5|0.5
# Start time|End time|Primary tag|Sentence polarity|Sentence subjectivity
#
# Fourth, it scores the sentence using TextBlob 0.9-dev NaiveBayesAnalyzer, an NLTK classifier trained on a movie reviews corpus
# http://textblob.readthedocs.org/en/dev/advanced_usage.html#advanced -- these scores are very different
#
# 20140720230042.157|20140720230050.866|SMT_04|0.927279087353
# Start time|End time|Primary tag|Sentence positivity
#
# Direct the output to overwrite the input .seg file (using sponge) or to a new .smt file
# for FIL in `ls -1 /db/tv/2014/2014-07/2014-07-11/*seg` ; do python SentiWordNet.py $FIL > ${FIL%.*}.smt ; done
#
# Not all of these need to be activated at once -- in July-August 2014 we ran the first and second.
#
# Can be called by cc-integrate-rongda-segmentation and seg-Sentiment
#
# Written by FFS, 2014-07-18
#
# Changelog:
#
# 2014-11-09 Fixed SMT_02
# 2014-08-13 Set #!/usr/bin/python -W ignore to turn off Unicode warnings
# 2014-08-02 Added sentiment detection SMT_03 and 04 from TextBlob
# 2014-08-02 Forked from CLiPS-03.py for a pure sentiment script
# 2014-07-27 Learned enough python to control the format, logic improved
# 2014-05-18 First version SentiWordNet.py, poor sentiment output format
#
# --------------------------------------------------------------------------------------------------
# User input
import sys, os.path
scriptname = os.path.basename(sys.argv[0])
# Help screen -- check the arguments before reading them, so a bare invocation also prints help
if len(sys.argv) < 2 or sys.argv[1] == "-h":
    print "".join([ "\n","\t","This is a production script for sentiment detection -- issue:","\n" ])
    print "".join([ "\t","\t","python ",scriptname," $FIL.seg > $FIL.smt" ])
    print "".join([ "\n","\t","or use the seg-Sentiment bash script for bulk processing.","\n" ])
    quit()
filename = sys.argv[1]
# Libraries
import nltk, datetime, re
# For Pattern2.6 native sentiment detection
# http://www.clips.ua.ac.be/pages/pattern-en#sentiment
from pattern.en import sentiment
# For the SentiWordNet dictionary
# http://www.clips.ua.ac.be/pages/pattern-en#wordnet
from pattern.en import wordnet
from pattern.en import ADJECTIVE
# For TextBlob PatternAnalyzer sentiment detection
# http://textblob.readthedocs.org/en/dev/advanced_usage.html#advanced
from textblob import TextBlob
# For TextBlob NLTK sentiment detection
# http://textblob.readthedocs.org/en/dev/advanced_usage.html#advanced
from textblob.sentiments import NaiveBayesAnalyzer
# Counter
n = 0
# A. Get the lines from the file
with open(filename) as fp:
    for line in fp:

        # B. Split each line into fields
        field = line.split("|")

        # Pretty debug
        # print('\n'.join('{}: {}'.format(*k) for k in enumerate(field)))

        # C. Header and footer
        if len(field[0]) != 18:
            print line,
            continue

        # D. Program credit
        if n == 0:
            credit = ["SMT_01|",datetime.datetime.now().strftime("%Y-%m-%d %H:%M"),"|Source_Program=Pattern 2.6, ",scriptname,"|Source_Person=Tom De Smedt, FFS|Codebook=polarity, subjectivity"]
            print "".join(credit)
            credit = ["SMT_02|",datetime.datetime.now().strftime("%Y-%m-%d %H:%M"),"|Source_Program=SentiWordNet 3.0, ",scriptname,"|Source_Person=Andrea Esuli, FFS|Codebook=polarity, subjectivity"]
            print "".join(credit)
            # credit = ["SMT_03|",datetime.datetime.now().strftime("%Y-%m-%d %H:%M"),"|Source_Program=TextBlob 0.9-dev, ",scriptname,"|Source_Person=Steven Loria, FFS|Codebook=polarity, subjectivity"]
            # print "".join(credit)
            # credit = ["SMT_04|",datetime.datetime.now().strftime("%Y-%m-%d %H:%M"),"|Source_Program=TextBlob 0.9-dev, NLTK, ",scriptname,"|Source_Person=Steven Loria, FFS|Codebook=positivity"]
            # print "".join(credit)
            n = 1

        # E. Write segment tags and other non-caption lines
        if field[2] == "SEG":
            print line,
            continue
        elif len(field[2]) != 3:
            print line,
            continue

        # F. Get the text, clean leading chevrons -- if the replacement character is present,
        # strip non-ascii, otherwise remove troublesome characters individually
        try:
            text = re.sub('^[>,\ ]{0,6}','', field[3])
            if re.search("(\xef\xbf\xbd)", text): text = ''.join([x for x in text if ord(x) < 128])
            text = str(text).replace('\x00 ','').replace('\xef\xbf\xbd','')
            text = str(text).replace('\xf7','').replace('\xc3\xba','').replace('\xb6','').replace('\xa9','').replace('\xe2\x99\xaa','')
            text = str(text).replace('\xc3\xaf','').replace('\x5c','').replace('\xf1','').replace('\xe1','').replace('\xe7','').replace('\xfa','')
            text = str(text).replace('\xf3','').replace('\xed','').replace('\xe9','').replace('\xe0','').replace('\xae','').replace('\xc2','')
            text = str(text).replace('\xc3','').replace('\xa2','').replace('\xbf','').replace('\xd1','').replace('\xb1','').replace('\xe2','')
            text = str(text).replace('\xb7','').replace('\xad','').replace('\xb0','').replace('\x84','').replace('\xf8','').replace('\xa1','')
            text = str(text).replace('\xa4','').replace('\xf6','').replace('\x89','').replace('\xa6','').replace('\xa7','').replace('\x96','')
            text = str(text).replace('\xe4','').replace('\xd9','').replace('\x91','').replace('\xcd','').replace('\xda','').replace('\xeb','')
            text = str(text).replace('\xa6','').replace('\xdc','').replace('\xb3','').replace('\xa7','')
            # print text
        except IndexError:
            print line
            continue

        # G. Remove clearly wrong unicode characters from the line itself -- replacement character, NULL (only utf8 hex works)
        line = str(line).replace('\x00 ','').replace('\xef\xbf\xbd','')
        print line,
        snt = ""
        smt = ""
        terms = ""
        smt3 = ""
        smt4 = ""

        # You could split the text into sentences, but it breaks the adjacency of text and tags
        # http://www.clips.ua.ac.be/pages/pattern-en
        # tokenize(string, punctuation=".,;:!?()[]{}`''\"@#$^&*+-|=~_", replace={})
        # from pattern.en import tokenize
        # for sentence in tokenize(text):

        # H. Pattern2.6 built-in sentiment detection by sentence and words
        snt = sentiment(text)
        for tup in sentiment(text).assessments:
            words = " ".join(tup[0])
            terms = "".join([terms,"|",words,"|",str(tup[1]),"|",str(tup[2])])
        if snt != "": print "".join([field[0],"|",field[1],"|SMT_01|",str(snt[0]),"|",str(snt[1]),terms])

        # I. Word loop for the SentiWordNet dictionary
        try:
            for word in nltk.word_tokenize(text):
                try:
                    weight = wordnet.synsets(word, ADJECTIVE)[0].weight
                    smt = "".join([smt,"|",word,"|",str(weight[0]),"|",str(weight[1])])
                except (UnicodeDecodeError, UnicodeEncodeError, IndexError, AssertionError): pass
            if smt != "": print "".join([field[0],"|",field[1],"|SMT_02",smt])
        except (UnicodeDecodeError, UnicodeEncodeError): continue
        continue   # SMT_03 and SMT_04 below are currently disabled -- remove this line to activate them

        # J. TextBlob default PatternAnalyzer sentiment detection
        try:
            smt3 = TextBlob(text)
            # Sentiment(polarity=0.13636363636363635, subjectivity=0.5)
        except UnicodeDecodeError: continue
        if smt3 != "": print "".join([field[0],"|",field[1],"|SMT_03|",str(smt3.sentiment[0]),"|",str(smt3.sentiment[1])])

        # K. TextBlob NLTK sentiment detection
        try:
            smt4 = TextBlob(text, analyzer=NaiveBayesAnalyzer())
            # Sentiment(classification='pos', p_pos=0.9272790873528883, p_neg=0.07272091264711199)
        except UnicodeDecodeError: continue
        if smt4 != "": print "".join([field[0],"|",field[1],"|SMT_04|",str(smt4.sentiment[1])])

# L. Close the file (redundant under "with", which closes it automatically, but harmless)
fp.close()
# EOF
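To read the annotations back for analysis, the SMT_01 codebook above unpacks line by line; a minimal sketch:

def parse_smt01(line):
    # Start|End|SMT_01|polarity|subjectivity(|word|polarity|subjectivity)*
    f = line.rstrip("\n").split("|")
    words = [(f[i], float(f[i+1]), float(f[i+2])) for i in range(5, len(f), 3)]
    return {"start": f[0], "end": f[1], "polarity": float(f[3]),
            "subjectivity": float(f[4]), "words": words}

# The codebook example from the header:
print parse_smt01("20140710235636.492|20140710235648.904|SMT_01|0.1|0.45|different|0.0|0.6|very|0.2|0.3")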
Sentiment tagger bash script
This script calls the python script, feeds it a file at a time, and checks that the output is sane.
#!/bin/bash
#
# /usr/local/bin/seg-sentiment
#
# Written by FFS 2014-07-18 -- development in csa@ca:~/pattern2.6
#
# Changelog
#
# 2014-07-28 Add sentence-level sentiment annotation
#
#---------------------------------------------------------------------
# Parsing script and primary tag (note the ParseScript currently adds SMT_01 and SMT_02)
ParseScript=Sentiment-02.py
PTAG=SMT_01
SCRIPT=`basename $0`
# Help screen
if [ "$1" = "-h" -o "$1" = "--help" -o "$1" = "help" ]
then echo -e "\n\t$SCRIPT [<date> or #] [<partially matching filename>] [clobber]\n"
echo -e "\tDetect parts of speech and sentiment in seg files using CLiPS MBSP 1.4 and Pattern2.6.\n"
echo -e "\tExamples:\n"
echo -e "\tProcess the .seg files from seven days ago:\n"
echo -e "\t\t$SCRIPT 7 \n"
echo -e "\tProcess only Aljazeera files from a given date, removing any pre-existing SMT codes:\n"
echo -e "\t\t$SCRIPT 2006-12-28 Aljazeera clobber\n"
echo -e "\tUse a for loop to process a series of days (see daysago) -- _ matches all files:\n"
echo -e "\t\tfor d in {3701..2} ; do seg-sentiment \$d _ clobber ; done\n"
echo -e "\tThe files are processed in /tv, where they will be sync'd.\n"
echo -e "\tCodebook:\n"
echo -e "\t20140710235636.492|20140710235648.904|SMT_01|0.1|0.45|different|0.0|0.6|very|0.2|0.3"
echo -e "\tStart time|End time|Primary tag|Sentence polarity|Sentence subjectivity(|Word|Word polarity|Word subjectivity)*\n"
echo -e "\t20140720230042.157|20140720230050.866|SMT_02|HOLLYWOOD|0.0|0.0|MOURNING|-0.625|0.625|MAVERICK|0.375|0.375|OVER|0.375"
echo -e "\tStart time|End time|Primary tag(|Word|Word polarity|Word subjectivity)*\n"
exit
fi
# Get the date to work on (today's date may change while the script is running)
if [ "$1" = "here" ] ; then DAY="$( pwd )" DAY=${DAY##*/}
elif [ -z "$1" ] ; then DAY="$(date +%F)"
elif [ "$(echo $1 | grep '[^0-9]')" = "" ] # if the first parameter is an integer
then DAY="$(date -d "-$1 day" +%F)"
else DAY="$1"
fi
# Sanity check
if [ "$(date -d "$DAY" 2>/dev/null)" = "" ]
then echo -e "\n\t$SCRIPT [<date> or #] [<partially matching filename>] [clobber]\n" ; exit
fi
# Partial file name?
if [ -z "$2" ]
then NAM="_"
else NAM="$2"
fi ; NAM=""$DAY"*"$NAM""
# Generate the date-dependent portion of the path
DDIR="$(date -d "$DAY" +%Y)/$(date -d "$DAY" +%Y-%m)/$(date -d "$DAY" +%F)"
# Base source and target directory (the roma and default paths are identical)
SDIR=/tv/$DDIR
# Define the trouble log and e-mail recipients
FAILED="/tmp/$SCRIPT.log"
TO=`cat /usr/local/bin/e-mail`
# Temporary file extension
EXT=smt
# File counter
NUM=0
# Welcome
echo -e "\n\tSentiment detection with $ParseScript in seg files on $DAY at $( date )\n"
# Process each seg file in turn
for FIL in $( ls -1 $SDIR/$NAM*.seg 2> /dev/null ); do

  # Strip path and extension
  FIL=${FIL##*/} FIL=${FIL%.*}

  # Skip non-English files
  LAN=$( grep ^'LAN|' $SDIR/$FIL.seg ) ; if [ -n "$LAN" ] ; then if [ "${LAN#*|}" != "ENG" ] ; then continue ; fi ; fi

  # Skip KMEX and PT
  if [[ $FIL == *_KMEX_* || $FIL == *_PT_* ]] ; then continue ; fi

  # Check for existing $PTAG tags
  if [ "$( egrep ^"$PTAG" $SDIR/$FIL.seg )" != "" ] ; then
    if [ "$3" = "clobber" ] ; then
      # Tweak as needed to identify the version considered up to date
      if [ "$( egrep -m1 $ParseScript $SDIR/$FIL.seg | grep $PTAG 2>/dev/null )" != "" ]
        then echo -e "\t\t$ParseScript annotation $PTAG present in $FIL.seg -- skipping" ; continue
        else echo -en "\t\tRe-annotating $PTAG with $ParseScript $FIL.seg"
          sed -i "/${PTAG%_*}/d" $SDIR/$FIL.seg
      fi
    else echo -e "\t\t$ParseScript $PTAG completed in $FIL.seg" ; continue
    fi
  else echo -en "\t\tAnnotating $PTAG with $ParseScript $FIL.seg"
  fi

  # Get the size of the seg file
  S0="$( stat --format=%s $SDIR/$FIL.seg 2>/dev/null )"

  # Background the annotation process to catch hangs
  $ParseScript $SDIR/$FIL.seg > $SDIR/$FIL.$EXT &

  # Get the PID and the start time of the annotation
  PID=$! AGE="$( date +%s )"

  # Wait a few seconds for the temporary file to start growing
  n=0 ; while [ ! -s $SDIR/$FIL.$EXT -a $n -lt 200 ] ; do sleep 0.2 ; n=$[n+1] ; done ; S1=0 S2=0 n=0

  # Terminate if the file stops growing (the margin counter shows files that enter this loop)
  while ps -p $PID > /dev/null ; do
    S1="$( stat --format=%s $SDIR/$FIL.$EXT 2>/dev/null )" ; sleep 1 ; NOW="$(date +%s)" ; LASTED="$[NOW-$AGE]"
    S2="$( stat --format=%s $SDIR/$FIL.$EXT 2>/dev/null )" n=$[n+1] ; tput cr ; echo -n $n
    if [ "$S1" -eq "$S2" -a $n -gt 30 ] ; then kill $PID ; rm -f $SDIR/$FIL.$EXT
      echo -e "\n\t`date +%F\ %H:%M` \t${SCRIPT%.*} \tNo grow \t$LASTED secs \t$FIL.seg" | tee -a $FAILED.$( date +%F )
    fi
  done ; LASTED="$[NOW-$AGE]" DAY=$( date +%F ) YAY=$( date -d "-1 day" +%F )

  # Check that the temporary file grew and ends with the END footer before moving it into place
  S1="$( stat --format=%s $SDIR/$FIL.$EXT 2>/dev/null )"
  if [ "$S1" -gt "$S0" -a "$( tail -n1 $SDIR/$FIL.$EXT | grep END )" != "" ]
    then mv $SDIR/$FIL.$EXT $SDIR/$FIL.seg ; echo
    else echo -e "\t\tFailed annotation -- please check $FIL.seg" | tee -a $FAILED.$( date +%F ) ; continue
  fi
  NUM=$[NUM+1]

done
# Sanity check
if [ "$FIL" = "" ] ; then echo -e "\tUse a partially matching file name -- leave out the date and the extension.\n" ; exit ; fi
# Receipt
echo -e "\n\tCompleted scoring sentiments in $NUM seg files in $SDIR\n"
# EOF