basic-text-processing

Text processing: we'll do it live!!

Later on in the class, we'll get into some more formal methods for making decisions, identifying things, building structures, translating, etc...

But it's important to be able to do the practical day-to-day work! With text files and chopping them up!

Real life working in NLP (or programming in general, really) involves figuring out how to get some data in one format and munge it so that it's in the format that your other tool takes.

Let's show some basic Unix text processing tools, do some demos with pipes. Got a bunch of articles from wikinews. Let's see what we can do with them.

terminology about words

type is an element from the vocabulary

token is a particular instance in the wild.

So if you ask how many words, you might mean either of those things!
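A quick way to see the difference, as a minimal sketch on a made-up sentence (splitting on whitespace is a simplification; real tokenization comes up further down):

```python
# Toy text, invented for illustration.
text = "the cat sat on the mat and the dog sat too"

tokens = text.split()   # every running word is a token
types = set(tokens)     # each distinct word is a type

print(len(tokens))  # how many tokens
print(len(types))   # how many types
```

Repeated words like "the" and "sat" count once as types but every time as tokens.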

How many words are there?

On the web, in English (from the Google n-grams dataset): 13 million types or so.

In the entire works of Shakespeare: about 31K types, out of about 884K tokens total.

lemma: "same stem, part of speech, rough word sense"

"Seuss's cat in the hat is different from other cats!"

wordform: the full, inflected "surface form". The type!

Get each word on one line using tr.

$ cat *.txt | tr -sc 'A-Za-z' '\n'

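Also consider using sort and uniq -c: sort puts identical words next to each other, and uniq -c counts each run.

You can then sort the output of uniq by count with sort -n, or sort -rn to get the most frequent words first.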

Maybe fold upper and lowercase together!

Just add another link in the pipe:

... tr 'A-Z' 'a-z' ...
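For comparison, here is roughly the same pipeline in Python. This is a sketch, not the class's reference solution: the function name word_counts and the sample string are made up, and in practice you would read the text from the article files instead.

```python
import re
from collections import Counter

def word_counts(text):
    # Like: tr -sc 'A-Za-z' '\n' | tr 'A-Z' 'a-z' | sort | uniq -c
    # Keep runs of ASCII letters only, then fold case.
    words = re.findall(r"[A-Za-z]+", text)
    return Counter(word.lower() for word in words)

# Made-up sample text standing in for the articles.
sample = "The cat saw the other cat. The End."
counts = word_counts(sample)

# Like: sort -rn | head
for word, n in counts.most_common(3):
    print(n, word)
```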

It turns out that tokenization is slightly more complex than splitting on non-alphabetic characters, even for English.

Sentence splitting is even somewhat involved. When you see a period, is it an abbreviation? What if you're inside quotes?
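To see why, here is a deliberately naive splitter, as a sketch on a made-up example. The rule (end a sentence at ., ! or ? followed by whitespace) is an assumption chosen to show the failure, not a recommended method.

```python
import re

# Made-up text with two real sentences.
text = 'Dr. Smith arrived at 5 p.m. yesterday. "Is it late?" she asked.'

# Naive rule: split on whitespace that follows ., ! or ?
naive = re.split(r"(?<=[.!?])\s+", text)

for piece in naive:
    print(repr(piece))
```

The abbreviations "Dr." and "p.m." each trigger a bogus split, so two sentences come out as four pieces.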

For tokenizing words in English, you also have to consider contractions... don't, you actually want to split into

This is a problem even in formal writing, in many languages: consider Spanish with the clitics, and mandatory Spanish/French contractions:

dárselo (to give it to somebody)

del: depending on the application, split to "de el"?

l'ensemble (from French. slightly easier, because you have the apostrophe)
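A sketch of how the two easier cases might be handled. The helper names and the lookup table are hypothetical, and the French rule is a rough assumption (one or two letters before the apostrophe). Splitting dárselo into its verb and clitics needs real morphological analysis and is not attempted here.

```python
import re

# Spanish mandatory contractions form a closed list.
SPANISH_CONTRACTIONS = {"del": ["de", "el"], "al": ["a", "el"]}

def expand_spanish(tokens):
    # Replace each contraction with its parts; leave other tokens alone.
    out = []
    for tok in tokens:
        out.extend(SPANISH_CONTRACTIONS.get(tok.lower(), [tok]))
    return out

def split_french_elision(token):
    # Split after the apostrophe of a short elided word such as l', d', j'.
    m = re.match(r"([A-Za-z]{1,2}')(.+)", token)
    return [m.group(1), m.group(2)] if m else [token]

print(expand_spanish("el libro del profesor".split()))
print(split_french_elision("l'ensemble"))
```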

Quick corpus linguistics

What are the most common words in the wikinews articles?

Pick two users from Enron corpus: what are the most common words for them? Are they different?

What are the languages and programming languages most commonly used by people in the class? Use cut to demo this.
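The same idea in Python, as a sketch. The rows below are invented placeholders (not real class data), and the tab-separated layout of name, language, programming language is an assumption.

```python
from collections import Counter

# Invented rows: name <TAB> language <TAB> programming language.
rows = [
    "ana\tSpanish\tPython",
    "bo\tEnglish\tJava",
    "cy\tEnglish\tPython",
]

# Like: cut -f3 | sort | uniq -c | sort -rn
third_field = [row.split("\t")[2] for row in rows]
counts = Counter(third_field)

print(counts.most_common())
```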

Make sure we know how to bring up Python.

We'll be using Python 3 in this class!

Do some demos with regular expressions and the Python repl. What do we want to be able to match?

http://spark-public.s3.amazonaws.com/nlp/slides/textprocessingboth.pdf

Sheepspeak!

import re

re.match(PATTERN, TEXT_TO_MATCH)

--> returns a match object, or None if the pattern doesn't match at the start of the text.

You can do match.groups() to get back the things that were matched in parentheses!
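A small sheepspeak example, as a sketch. The particular pattern (with two groups, just to have something for groups() to return) is one choice among many.

```python
import re

# "ba", then one or more extra "a", then "!"
pattern = r"(ba)(a+)!"

m = re.match(pattern, "baaaa!")
print(m.groups())   # the two parenthesized pieces

# "ba!" has no extra "a", so there is no match at all.
no_match = re.match(pattern, "ba!")
print(no_match)
```

Always check for None before calling groups(), since a failed match returns None and not an empty match object.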

Show demo for how to get list of things in a Python module.
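One way to do this from the repl is the built-in dir(), sketched here on the re module. Filtering out names that start with an underscore is just a convenience.

```python
import re

# dir() lists the names defined in a module; skip the private ones.
public_names = [name for name in dir(re) if not name.startswith("_")]
print(public_names)

# help(re.match) would then show the documentation for one of them.
```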

Bring up documentation for Python's re.

http://docs.python.org/py3k/library/index.html

Bring up Eliza

http://en.wikipedia.org/wiki/ELIZA (Eliza originally by Joseph Weizenbaum)

python3 elizabot.py irc.soic.indiana.edu "#nlp" eliza "soic+irc"

13:45 <@alexr> eliza: Do you remember the time that I went to the beach in Oklahoma?

13:45 < eliza> You mentioned the time that I went to the beach in Oklahoma

Fix that live.
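The reply above echoes "I went" when it should say "you went". One possible fix is to swap first and second person before echoing. This is a sketch only: the names REFLECTIONS, reflect and respond are hypothetical, and nothing here is taken from the actual elizabot.py.

```python
import re

# A tiny, incomplete table of person swaps.
REFLECTIONS = {
    "i": "you", "me": "you", "my": "your",
    "you": "I", "your": "my", "am": "are",
}

def reflect(fragment):
    # Swap person word by word; unknown words pass through unchanged.
    return " ".join(REFLECTIONS.get(w.lower(), w) for w in fragment.split())

def respond(line):
    # Capture what follows "do you remember", minus a trailing "?".
    m = re.match(r"do you remember (.*?)\??$", line, re.IGNORECASE)
    if m:
        return "You mentioned " + reflect(m.group(1))
    return "Please go on."

print(respond("Do you remember the time that I went to the beach in Oklahoma?"))
```

The table is crude: "you" sometimes needs to become "me" and not "I", which a word-by-word swap cannot decide.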

Let's do just a few more regex examples.

Some functions from Python's re library...

- compile

(when do you compile? what does it buy you?)
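A sketch of the usual pattern: compile once, reuse many times. The name word_re and the sample lines are made up. Since the re module also caches recently used patterns internally, the gain is often more about giving the pattern a name than about raw speed.

```python
import re

# Compiled once, outside the loop.
word_re = re.compile(r"[A-Za-z]+")

lines = ["The cat", "sat on", "the mat"]

# The compiled object has the same methods as the module: match, search, findall...
tokens = [word for line in lines for word in word_re.findall(line)]
print(tokens)
```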

- match

Match at the beginning of the string. What if you want the string to match completely?

- search

Match anywhere in the string. How do you write match in terms of search?

(could you do the other one? search in terms of match?)
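Some possible answers to the questions above, as a sketch on a made-up string. Using \A for "match via search" and a lazy .*? prefix for "search via match" are two of several ways to do it.

```python
import re

text = "woodchuck wood"

# match only looks at the start of the string.
at_start = re.match(r"wood", text)
not_at_start = re.match(r"chuck", text)      # None

# search looks anywhere.
anywhere = re.search(r"chuck", text)

# Whole-string matching: fullmatch (or end the pattern with \Z).
partial_only = re.fullmatch(r"wood", text)   # None
whole = re.fullmatch(r"wood.*", text)

# match in terms of search: anchor the pattern with \A.
like_match = re.search(r"\Achuck", text)     # None, same as match

# search in terms of match: skip a lazy prefix, keep the rest in a group.
like_search = re.match(r"(?s).*?(chuck)", text)

print(anywhere.start(), like_search.start(1))
```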

Talk about anchors for a minute.

Also special character classes like \w and \d, plus the word-boundary anchor \b...

(and their complements \W, \D and \B)
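A few of these in action, as a sketch on a made-up sentence:

```python
import re

text = "Room 101 is in the cathedral, not the cat cafe."

digits = re.findall(r"\d+", text)          # runs of digits
anywhere = re.findall(r"cat", text)        # also hits "cathedral"
whole_word = re.findall(r"\bcat\b", text)  # "cat" as its own word
inside = re.findall(r"cat\B", text)        # "cat" followed by more word characters
non_words = re.findall(r"\W+", "a, b")     # runs of non-word characters

print(digits, len(anywhere), len(whole_word), len(inside), non_words)
```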

- split

Say we want to split on any amount of whitespace, and we didn't know about str.split ...
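A sketch of what that looks like, with made-up input. Note the one difference from the string method: leading whitespace gives re.split an empty first piece.

```python
import re

line = "one  two\tthree\n four"

# \s+ matches any run of spaces, tabs, or newlines.
parts = re.split(r"\s+", line)
print(parts)

# With leading whitespace, the two approaches differ.
with_regex = re.split(r"\s+", " a b")
with_method = " a b".split()
print(with_regex, with_method)
```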

- findall / finditer

- sub

The MLA recommends no longer using "groundhog"! Now the preferred nomenclature is "woodchuck"! But you've got all of these blog posts about groundhogs (now "woodchucks"), and you need to change them... how are you going to do it? I learned, just now, that woodchucks are a specific kind of marmot.

"""

Truly, the groundhog is the most honest of land mammals. Groundhogs are hard-working, sincere, and loyal. My sister

wrote that the North American groundhogs will only invest in green energy power companies, or failing that, those that

purchase Groundhog-carbon-offsets.

"""
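One way to do it with re.sub, as a sketch. Passing a function as the replacement lets us keep the capitalization of each match; the helper name swap is made up.

```python
import re

post = """
Truly, the groundhog is the most honest of land mammals. Groundhogs are hard-working, sincere, and loyal. My sister
wrote that the North American groundhogs will only invest in green energy power companies, or failing that, those that
purchase Groundhog-carbon-offsets.
"""

def swap(match):
    # Keep the capitalization of whatever was matched.
    word = match.group(0)
    return "Woodchuck" if word[0].isupper() else "woodchuck"

# Only the stem is matched, so a plural "s" is left in place.
fixed = re.sub(r"[Gg]roundhog", swap, post)
print(fixed)
```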

references and sources

Jurafsky and Manning's slides from Coursera: http://spark-public.s3.amazonaws.com/nlp/slides/textprocessingboth.pdf

http://gnosis.cx/TPiP/chap3.txt

http://docs.python.org/py3k/library/re.html

http://perldoc.perl.org/perlretut.html#Using-character-classes

http://www.regular-expressions.info/email.html

http://jones.ling.indiana.edu/~mdickinson/10/555/slides/09-regex/09-regex.pdf
