Python
Resources
Bash to python
Pattern
- Pattern for Red Hen Lab
- https://github.com/clips/pattern -- supports Python 3.6 and 2.7
Machine learning
- pattern.vector
- pyTorch
- theano
Forced aligner
- 2011-06-05 Synchronizing transcripts with video (main reference, vr wiki)
- P2FA -- Penn Phonetics Lab Forced Aligner
- P2FA instructions
Praat
- praat-py -- extension to Praat that allows scripts to be written in Python (should work with Python 2.7, but maybe not with the most recent versions of Praat)
- P2TK -- Penn Phonetics Toolkit
- NLTK TextGrid (python parser for praat format -- see examples)
Python
- A Python primer
- The Python Tutorial
- Python Data Analysis Library (pandas)
- python-rpy2 -- provides interface to R from Python (py-rpy2 in macports) (see instructions below)
- python-beautifulsoup -- error-tolerant HTML parser
- python-feedparser -- parses a bunch of feeds
- python-tz -- timezones
- PyQtGraph -- Scientific Graphics and GUI Library for Python
Reading
History
import readline
for i in range(1, readline.get_current_history_length() + 1):
    print readline.get_history_item(i)  # history items are 1-indexed
Modules
Alias
from nltk.corpus.reader import framenet as fn
Module version
>>> nltk.__version__
'3.0.0b1'
Dict tuple list string
Dict entities have keywords and entries
>>> myDict = st.get_entities(sentence)
>>> myDict
{u'ORGANIZATION': [u'University of California'], u'LOCATION': [u'California', u'United States'],
u'O': [u'is located in', u',']}
To unfold, use .items():
>>> for tup in myDict.items():
...     print tup
...
(u'ORGANIZATION', [u'University of California'])
(u'LOCATION', [u'California', u'United States'])
(u'O', [u'is located in', u','])
Expand a list
Each element of myDict.items() is a tuple; each tuple in turn holds a unicode string and a list:
>>> type(tup[0])
<type 'unicode'>
>>> type(tup[1])
<type 'list'>
To expand that list:
print " ".join(tup[1])
Convert a string s to a tuple t
t = (s,)
Convert tuples to list -- one tuple at a time
l = list(t[0])
l1 = l[0:2] # slice from index 0 up to, but not including, index 2
Convert tuples to strings -- all at once
s = str(t)
Split a string into a list
field = filename.split("_")
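The conversion recipes above, collected into one runnable sketch (the sample values are illustrative):

```python
# String -> tuple: the trailing comma makes a one-element tuple
s = "hello"
t = (s,)                 # ("hello",)

# Tuple -> list, then slice: start index up to (not including) stop index
l = list(("a", "b", "c"))
l1 = l[0:2]              # ["a", "b"]

# Tuple -> string, all at once (parentheses and quotes included)
s2 = str(("a", "b"))     # "('a', 'b')"

# Delimited string -> list
field = "2006-10-20_1800_US_CNN".split("_")
print(field)             # ['2006-10-20', '1800', 'US', 'CNN']
```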
Assign directly and count words
text = line.split("|")[3]
WC = WC + len(text.split())
Assign by popping elements off the list
fdate = field.pop(0)
fhour = field.pop(0)
country = field.pop(0)
network = field.pop(0)
show = field
Strip a newline from a string
sentence_sub = fields.pop(0).rstrip()
Split a list into tuples (chunks) of 3
fields = ["can't", '-0.1', '0.1', 'modern', '0.2', '0.3']
zip(*[fields[i::3] for i in range(3)])
[("can't", '-0.1', '0.1'), ('modern', '0.2', '0.3')]
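Under Python 3, zip returns an iterator rather than a list; wrapping it in list() reproduces the output shown above:

```python
fields = ["can't", '-0.1', '0.1', 'modern', '0.2', '0.3']
# Three interleaved slices (offsets 0, 1, 2), rezipped into triples
chunks = list(zip(*[fields[i::3] for i in range(3)]))
print(chunks)   # [("can't", '-0.1', '0.1'), ('modern', '0.2', '0.3')]
```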
Concatenate strings
text += nltk.tag.tuple2str(tagged_text[x]) + " "
Join a list with an underscore or space as delimiter
show = "_".join(field)
phrase = " ".join(tup[0]) # If tup[0] only contains one word, no space is added
print "".join([stem,",SMT_01",",",str(tup[0]),",",str(tup[1]),",",str(tup[2]).rstrip()]) # strip newline
Join a list, keeping the UTF-8 encoding, and replacing spaces with pipe symbols
snt = parse(text, lemmata=True, relations=True)
text = re.sub('\ ', '|', snt)
if snt != "": print u"".join([field[0],"|",field[1],"|POS_03|",text]).encode('utf-8').strip()
Check for substring in string
if "SMT_" not in line: continue
Data type
>>> type(fff)
<class 'nltk.corpus.reader.framenet.PrettyList'>
>>> type(fff[0])
<class 'nltk.corpus.reader.framenet.AttrDict'>
Dates
Modules
import datetime
Time now
datetime.datetime.now().strftime("%Y-%m-%d %H:%M")
Replace
Remove parens from a string s
import re
re.sub('[()]', '', s)
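A quick check of the recipe (the sample string is made up):

```python
import re

s = "(left) and (right)"
clean = re.sub('[()]', '', s)   # the character class matches either paren
print(clean)                    # left and right
```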
Remove brackets and single quotes from tuple t -- all elements
b = str(t).replace('[','').replace(']','').replace("'",'')
Clean up unicode that halts stanford-ner, MBSP, et al
text = re.sub('^[>,\ ]{0,6}', '', field[3])
text = str(text).replace('\x00 ','').replace('\xef\xbf\xbd', '').replace('\xb6','').replace('\xa9','')
text = str(text).replace('\xc3\xaf', '').replace('\x5c','').replace('\xf1','').replace('\xe2\x99\xaa','')
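An alternative to enumerating byte sequences is to decode with errors='ignore', which silently drops anything that is not valid UTF-8 (Python 3 spelling shown; in 2.7 the same error handler works with str.decode). The sample bytes are made up:

```python
# \xb6 is not valid UTF-8 on its own, so the 'ignore' handler drops it;
# NUL bytes decode fine, so strip them separately
raw = b'University of Calif\xb6ornia\x00'
text = raw.decode('utf-8', errors='ignore').replace('\x00', '')
print(text)   # University of California
```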
Loops
Get lines from file in utf-8
import codecs
with codecs.open(filename, encoding='utf8') as fp:
    for line in fp:
        print line.encode('utf8')
Or without specifying character encoding
with open(filename) as fp:
    for line in fp:
        # Split each line into fields
        field = line.split("|")
        # Pretty debug
        print('\n'.join('{}: {}'.format(*k) for k in enumerate(field)))
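In Python 3 the codecs module is no longer needed: the built-in open() takes an encoding argument. A sketch that writes a throwaway file first so it is self-contained:

```python
import os
import tempfile

# Write a small UTF-8 file to demonstrate (the path is temporary)
with tempfile.NamedTemporaryFile('w', encoding='utf-8', delete=False) as tmp:
    tmp.write(u'caf\u00e9\n')
    filename = tmp.name

# Python 3: open() handles the decoding itself
with open(filename, encoding='utf-8') as fp:
    lines = [line.rstrip('\n') for line in fp]
print(lines)

os.remove(filename)
```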
For loop -- get separate tuples from the parse function and make a replacement in the first element before printing
tups = sentiment(line).assessments
for tup in tups:
    a = str(tup[0]).replace('[','').replace(']','').replace("'",'')
    smt = [a," (",str(tup[1]),", ",str(tup[2]),")"]
    print "".join(smt),
print "\r"
Rewritten in fluent Python (see SentiWordNet.py) -- note that terms must be initialized before the loop:
terms = ""
for tup in sentiment(sentence).assessments:
    words = " ".join(tup[0])
    terms = "".join([terms,",",words,",",str(tup[1]),",",str(tup[2])])
print terms
Catch failure
try:
    sentence = field[3]
except IndexError:
    print line
    continue
Endless
while True:
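A bare while True never exits on its own; break (and optionally continue) controls the loop. A minimal sketch with a made-up line source:

```python
lines = iter(['2006|1800|SEG', '', '2006|1800|CC1'])
count = 0
while True:
    line = next(lines, None)
    if line is None:
        break            # stop when the input runs out
    if not line.strip():
        continue         # skip empty lines
    count += 1
print(count)   # 2
```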
If statements
Test string and string length
if field[2] == "SEG":
    print line,
    continue
elif len(field[2]) != 3:
    print line,
    continue
Compound conditions with parens
if ( network == 'CampaignAds' ) or ( network == 'Shooters' ) or ( network == 'DigitalEphemera' ): continue
Substring test
if "POS_01" in line:
Test empty string
if not myString:
if myString == "":
Skip if the line is empty
if text.strip() == '' : continue
Regular expressions
import re
Types
re.match
re.search
re.sub
Skip header lines (no 14-digit timestamp at the start)
if not re.match("([0-9]{14})", line): continue
Teletext page
if re.search("(\|[0-9]{3}\|)", line):
Retrieve matching substring within a text
>>> a=str(fn.frame(114))
>>> b = re.search("Core\:\ [A-Za-z_]*\ \([0-9]{1,6}\)", a)
>>> b.group()
'Core: Hypothetical_event (563)'
>>> if b: print b.group()
You can turn it into a one-liner for a single hit:
COR = re.search("Core\:\ \w*\ \(\d*\)", str(fn.frame(ID))).group()
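Named capture groups pull the pieces out of the match directly; the sample string below stands in for str(fn.frame(ID)):

```python
import re

a = 'Core: Hypothetical_event (563)'
m = re.search(r'Core: (?P<name>\w+) \((?P<ID>\d+)\)', a)
if m:
    print(m.group('name'))   # Hypothetical_event
    print(m.group('ID'))     # 563
```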
Ingest old-style SMT_ lines from .seg files
line=' very, serious (-0.33, 0.67) failure (-0.316666666667, 0.3)'
pattern = '\ ?[a-zA-Z]{1,99},?\ ?[a-zA-Z]{1,99}?\ \(-?\d\.[0-9]{1,12},\ -?\d\.[0-9]{1,12}\)'
for match in re.finditer(pattern, line):
    s = match.start()
    e = match.end()
    print (line[s:e])
very, serious (-0.33, 0.67)
failure (-0.316666666667, 0.3)
Or more compact:
for match in re.finditer(pattern, line):
    print (line[match.start():match.end()])
Faster than grep in file fp -- output all lines
term = "ALASKA"
print ''.join((line for line in fp if term in line))
Fast egrep in file fp
pat = re.compile("^([A-Z][0-9]+)*$")
print sum(1 for line in fp if pat.search(line))
print ''.join((line for line in fp if pat.search(line)))
Output to file
Either use the 2.7 method (a for append or w for overwrite):
file=open('filename.txt','w')
file.write('some text')
Or the backported 3.x method (a for append or w for overwrite):
from __future__ import print_function
with open('filename.txt', 'a') as f:
    print("hi there", file=f)
Print without newline
print "hi",
-- but this adds a trailing space. To avoid it:
import sys
for i in range(10): sys.stdout.write("*")
sys.stdout.write("\n")
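With the print function (Python 3, or the __future__ import above), the end parameter does the same job; writing into a StringIO here only so the result is easy to inspect:

```python
import io

buf = io.StringIO()          # stand-in for sys.stdout
for i in range(10):
    print("*", end="", file=buf)   # end="" suppresses the newline
print(buf.getvalue())        # **********
```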
rpy2
- rpy2 modules -- table of contents
- Robjects
- Graphics examples (very helpful)
Test installation from commandline
python -m 'rpy2.robjects.tests.__init__'