Important Terms
If you see a number in parentheses with a pound sign preceding it, it's an important term.
Chapter Notes
I. Introduction
A. you can use chatbots to accomplish everyday tasks, but you need to normalize the target text using these 3 steps:
1) sentence segmentation (#7)
2) tokenization (#4)
3) lemmatization (#5)
B. then you need to compare the normalized strings to see how similar they are (e.g. with minimum edit distance, covered below)
II. Regular Expressions
A. basic patterns
1) basic searching
def: returns instances of the string
/woodchuck/
The woodchuck chucks wood.
2) disjunctions
[]: matches any one of the characters within the brackets
/[wW]oodchuck/
Woodchuck colonies worship a solitary woodchuck.
| : matches either of the expressions on either side of the symbol
/cat|dog/
I have a cat and a dog.
3) ranges
def: matches any single character in the range designated by the -
/[A-Z]/
Apple, Banana, Cream... (any capital letter)
4) exceptions
def: when ^ is the first character inside brackets, it negates the set, matching any character not listed
/[^Ss]/ (^ only negates within brackets, so /a^b/ matches the literal string "a^b"; also, /[e^]/ matches either "e" or a literal "^")
Super scientific sound search sought
5) optional characters
def: will return text with or without the character preceding the ?
/colou?r/
In America, it's spelled color and in Britain, it's spelled colour.
/woodchucks?/
The woodchucks worship a sole woodchuck.
6) the Kleene *
def: matches 0 or more occurrences of the character (or bracketed set) preceding the *
/a*/ (0 or more "a"s)
a, aaaaaa, Kleene
/[ab]*/
aaaa, bbbb, abab, xylophone
7) the Kleene +
def: matches 1 or more occurrences of the character (or bracketed set) preceding the +
/[0-9]+/ (1 or more digits)
The professor wrote 19 on the board 9 times.
8) wildcard
def: the . matches any single character, so the pattern matches the surrounding characters with anything at all in between
/beg.n/
beg'n, begun, began, begin
/aardvark.*aardvark/ (if aardvark appears twice in any line)
The aardvark went to the other aardvark.
9) anchors (#9)
def: searches for something in a particular place in a line
$: the end of a line
/[tT]he dog\.$/ (the \. refers to a literal period and not a wildcard)
Lorem ipsum dolor sit amet.
The dog.
Sample text.
\b: a word boundary (the position between a word character, i.e. a letter, digit, or underscore, and a non-word character)
/\bthe\b/
I think the other day was better.
\B: a word non-boundary (the opposite)
/\Bthe\B/
I think the other day was better.
10) precedence
def: just like in math, parentheses take the highest precedence; counters (* + ? {}) bind tighter than sequences, which bind tighter than disjunction (|)
bad example:
/guppy|ies/ (matches "guppy" or the bare string "ies", not "guppies")
guppy gupster ies
/gupp(y|ies)/
I have one guppy and she has two guppies.
11) curly brackets
def: {n} specifies exactly n occurrences of the preceding character or group (so parentheses are needed to repeat a whole word)
/(happy ){3}/
The happy happy happy happy boy ate cake.
/(happy ){2,4}/ ({n,m} specifies a range of n to m occurrences)
The happy happy happy happy happy boy ate cake.
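a quick sanity check of the patterns above, using Python's re module (a sketch; Python's regex syntax is close to, but not identical to, the slash notation in these notes):

import re

# character class with a literal tail
print(re.findall(r"[wW]oodchuck", "Woodchuck colonies worship a solitary woodchuck."))
# ['Woodchuck', 'woodchuck']

print(re.findall(r"colou?r", "color and colour"))              # ['color', 'colour']
print(re.findall(r"[0-9]+", "wrote 19 on the board 9 times"))  # ['19', '9']
print(re.findall(r"beg.n", "begin began begun beg'n"))         # all four forms

# counters apply to the preceding group, hence the parentheses
print(re.findall(r"(?:happy ){3}", "The happy happy happy happy boy"))
# ['happy happy happy ']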
B. substitutions
1) basic substitutions
def: replace first string with second string
s/colour/color/
Nice colour. --> Nice color.
2) capture groups
def: the part of the pattern in parentheses is stored in a memory register and can be referred back to as \1, \2, \3..., numbered by the order in which the opening parentheses appear
s/([0-9]+)/<\1>/
35 boxes of 20934 bobbleheads each in 1 truck. --> <35> boxes of <20934> bobbleheads each in <1> truck.
/[tT]he (.*)er they were, the \1er they will be/
The bigger they were, the bigger they will be. The smaller they are, the bigger they will be. The faster they ran, the faster they ate.
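the substitutions and capture groups above can be reproduced with re.sub and re.search in Python (a sketch using the test sentences from these notes):

import re

print(re.sub(r"colour", "color", "Nice colour."))  # Nice color.
print(re.sub(r"([0-9]+)", r"<\1>", "35 boxes of 20934 bobbleheads each in 1 truck."))
# <35> boxes of <20934> bobbleheads each in <1> truck.

# a backreference (\1) inside the match pattern itself
m = re.search(r"[tT]he (.*)er they were, the \1er they will be",
              "The bigger they were, the bigger they will be.")
print(m.group(1) if m else "no match")  # bigg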
3) non-capturing groups
?:: marks a non-capturing group: it groups without storing, so it doesn't get a \1-style number
/(?:[sS]ome|[aA] few) (people|cats) like \1/
A few people like cats. A few cats like cats. A few cats like a few people.
4) lookahead assertions (#10)
?=: looks ahead for the target string but doesn't advance the match cursor (zero-width)
?!: negative lookahead, opposite of ?=
/^(?!Volcano)[A-Za-z]+/
--- will match the first word of any line that doesn't start with "Volcano" ---
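the same lookahead in Python (a sketch; re.match anchors at the start of the string, standing in for ^):

import re

for line in ["Volcano hiking is fun", "Kilimanjaro is in Tanzania"]:
    m = re.match(r"(?!Volcano)[A-Za-z]+", line)
    print(m.group(0) if m else "no match")
# no match
# Kilimanjaro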
III. Words and Corpora
A. in speech and text analysis, punctuation marks can be treated as words because they carry equally important information
ex) "I do uh main- mainly business data processing."
B. disfluencies (#12) can signal that a speaker is unsure, shifting an idea, or restarting a clause
C. if the vocabulary of a corpus is V, and the number of tokens is N, then according to Herdan's Law or Heaps' Law,
|V| = k(N^b) where k > 0 and 0 < b < 1
(b generally depends on corpus size and genre)
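a tiny illustration of the law in Python; the constants k and b below are made up for demonstration, not measured from any corpus:

k, b = 30, 0.7  # illustrative values only

for n_tokens in (10_000, 1_000_000, 100_000_000):
    print(f"N = {n_tokens:>11,}  predicted |V| ~ {k * n_tokens ** b:,.0f}")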
IV. Text Normalization
A. the steps:
1) segmentation/tokenization of words from running text
2) normalization of word formats
3) sentence segmentation in running text
B. crude Unix commands
1) tr
def: changes particular characters systematically
2) sort
def: sorts input alphabetically
3) uniq
def: collapses and counts adjacent identical lines
ex) tr -sc 'A-Za-z' '\n' < sh.txt
will print every word of the corpus on its own line (-c complements the letter set so every non-letter becomes a newline, and -s squeezes repeated newlines into one)
tr -sc 'A-Za-z' '\n' < sh.txt | sort | uniq -c
gives a count of each word, alphabetically (uniq -c prefixes each distinct line with the number of times it occurs; sort is needed first because uniq only collapses adjacent identical lines)
tr -sc 'A-Za-z' '\n' < sh.txt | tr A-Z a-z | sort | uniq -c
gives a count of all words, lowercased first (tr A-Z a-z), so different capitalizations of a word are counted together
tr -sc 'A-Za-z' '\n' < sh.txt | tr A-Z a-z | sort | uniq -c | sort -n -r
sorts the counts numerically (-n) in reverse, i.e. descending, order (-r), so the most frequent words come first
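a rough Python equivalent of the whole pipeline (a sketch; sh.txt stands for whatever corpus file is being counted):

import re
from collections import Counter

with open("sh.txt", encoding="utf-8") as f:
    text = f.read()

# tr -sc 'A-Za-z' '\n' | tr A-Z a-z : keep runs of letters, lowercased
words = re.findall(r"[a-z]+", text.lower())

# sort | uniq -c | sort -n -r : count, then list most frequent first
for word, count in Counter(words).most_common(10):
    print(count, word)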
C. word tokenization and normalization
1) sometimes, you can perform a sort of lookup so you can catch even words like "rock and roll" and "New York" with named entity detection
2) commonly, we'll use the Penn Treebank Tokenization standard
separates out clitics (#19) and punctuation, and keeps hyphenated words together
3) normalization then maps variant forms to a single canonical one, so that "USA" and "US" are treated as the same word
4) tokenization needs to be fast and efficient, with few errors
5) of course, doing that is easier said than done, and you need to see whether you can deal with ambiguities, such as distinguishing between quotations and apostrophes
D. MaxMatch
1) for Mandarin, it starts at the beginning of a sentence and greedily takes the longest dictionary word it can; if no dictionary word matches, it advances a single character (sketched in code after this list)
ex) 他特别喜欢北京烤鸭
他 (he) 特别 (especially) 喜欢 (likes) 北京烤鸭 (Beijing duck)
2) for English, it tries to make the longest word out of the letters present in the string until it can't, then it starts from where it left off
ex) we can only see a short distance ahead
we canon l y see ash ort distance ahead
3) as you can see, this works much better for Chinese than for English, because the words in Chinese are much shorter
4) word error rate
def: the normalized edit distance between the MaxMatch result and a gold-standard (hand-segmented) answer
5) sequence models work the best, but they require machine learning and such
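a minimal MaxMatch sketch in Python (the dictionary is a toy stand-in for a real wordlist):

def max_match(sentence, dictionary):
    words, i = [], 0
    while i < len(sentence):
        # try the longest remaining substring first, then shrink
        for j in range(len(sentence), i, -1):
            if sentence[i:j] in dictionary:
                words.append(sentence[i:j])
                i = j
                break
        else:
            # no dictionary word starts here: emit one character and advance
            words.append(sentence[i])
            i += 1
    return words

dictionary = {"他", "特别", "喜欢", "北京烤鸭"}
print(max_match("他特别喜欢北京烤鸭", dictionary))
# ['他', '特别', '喜欢', '北京烤鸭']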
E. lemmatization and stemming
ex) He is reading detective stories --> He be read detective story
1) morphological parsing
def: dividing a word into its stem and its affixes (the general term for prefixes and suffixes)
ex) fox --> fox
ex) cats --> cat, -s (plural)
ex) amaren --> amar, -en (3rd person plural, future subjunctive)
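NLTK's WordNet-based lemmatizer can reproduce mappings like these (requires pip install nltk plus nltk.download("wordnet"); note that the part of speech must be supplied, noun being the default):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("stories"))           # story
print(lemmatizer.lemmatize("reading", pos="v"))  # read
print(lemmatizer.lemmatize("cats"))              # cat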
2) the Porter stemmer
def: a crude, rule-based stemmer that simply strips off affixes in a cascade of rewrite passes (cheaper but blunter than true lemmatization)
ex) This was not the map we found in Billy Bones's chest, --> Thi wa not the map we found in Billi Bone s chest,
as you can see, it knows that -s is a plural suffix, but it doesn't know that it's simply part of the words this and was
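NLTK also ships an implementation of the Porter stemmer, so the example above is easy to reproduce (the stemmer lowercases its input by default):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["This", "was", "stories", "found"]:
    print(word, "->", stemmer.stem(word))
# This -> thi, was -> wa, stories -> stori, found -> found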
F. sentence segmentation
1) it looks simple, because ! and ? are fairly unambiguous and a period usually marks the end of a sentence
2) but what if the period is in an abbreviation like Mr. or Dr.?
3) so robust segmenters rely on abbreviation dictionaries and machine learning
V. Minimum Edit Distance
A. if a user types in graffe, we assume that they misspelled giraffe because there's only a difference of one letter
B. it's certainly more plausible than grafe or grief, both of which are off by more than a letter
C. coreference (#21): deciding whether two strings refer to the same entity
D. edit distance will help us decide coreference among other things because it tells us in a number how similar two strings are
1) minimum edit distance (#22)
def: the minimum number of editing operations (insertion, deletion, substitution) needed to transform one string into the other
ex) intention to execution
I N T E * N T I O N
| | | | | | | | | |
* E X E C U T I O N
d s s i s (d - deletion, s - substitution, i - insertion)
offsetting execution by one character (the leading *) lines up tion at the ends of both words, and the gap (*) inside intention makes room for the inserted c so the e's and the rest align
E. using a backtrace (#23), you can recover which operations produced the minimum distance: each cell of the edit-distance table stores a pointer to the cell it was computed from, and tracing the pointers back from the final cell yields the alignment
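a minimal dynamic-programming sketch of minimum edit distance in Python, with a cost of 1 for insertion, deletion, and substitution (a common Levenshtein variant charges 2 for substitution, which would give 8 below instead of 5):

def min_edit_distance(source, target):
    n, m = len(source), len(target)
    # d[i][j] = distance between source[:i] and target[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # i deletions
    for j in range(m + 1):
        d[0][j] = j  # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or copy
    # a backtrace would also record which of the three cases won in each cell
    return d[n][m]

print(min_edit_distance("intention", "execution"))  # 5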