Important Terms
If you see a number in parentheses with a pound sign preceding it, it's an important term.
Chapter Notes
I. Introduction
A. you can use chatbots to accomplish everyday tasks, but you need to normalize the target text using these 3 steps:
1) sentence segmentation (#7)
2) tokenization (#4)
3) lemmatization (#5)
B. then you need to compare the normalized strings to see how similar they are (e.g. with minimum edit distance, covered below)
II. Regular Expressions
A. basic patterns
1) basic searching
def: returns instances of the string
/woodchuck/
The woodchuck chucks wood.
2) disjunctions
[]: matches any one of the characters within the brackets
/[wW]oodchuck/
Woodchuck colonies worship a solitary woodchuck.
| : matches either of the expressions on either side of the symbol
/cat|dog/
I have a cat and a dog.
3) ranges
def: matches any single character in the range designated by the -
/[A-Z]/
Apple, Banana, Cream... (any capital letter)
4) exceptions
def: when ^ is the first character inside brackets, it negates the set, matching any character not listed
/[^Ss]/ (^ only negates within brackets, so /a^b/ matches the literal string "a^b"; also, /[e^]/ matches either "e" or a literal "^")
Super scientific sound search sought
5) optional characters
def: will return text with or without the character preceding the ?
/colou?r/
In America, it's spelled color and in Britain, it's spelled colour.
/woodchucks?/
The woodchucks worship a sole woodchuck.
6) the Kleene *
def: matches 0 or more occurrences of the character (or bracketed set) preceding the *
/a*/ (0 or more "a"s)
a, aaaaaa, Kleene
/[ab]*/
aaaa, bbbb, abab, xylophone
7) the Kleene +
def: matches 1 or more occurrences of the character (or bracketed set) preceding the +
/[0-9]+/ (1 or more digits)
The professor wrote 19 on the board 9 times.
8) wildcard
def: the . matches any single character, so the pattern matches the surrounding characters with anything at all in between
/beg.n/
beg'n, begun, began, begin
/aardvark.*aardvark/ (if aardvark appears twice in any line)
The aardvark went to the other aardvark.
9) anchors (#9)
def: searches for something in a particular place in a line
$: the end of a line
/[tT]he dog\.$/ (the \. refers to a literal period and not a wildcard)
Lorem ipsum dolor sit amet.
The dog.
Sample text.
\b: a word boundary (the position between a word character, i.e. a letter, digit, or underscore, and a non-word character)
/\bthe\b/
I think the other day was better.
\B: a word non-boundary (the opposite)
/\Bthe\B/
I think the other day was better.
10) precedence
def: just like in math, parentheses take the highest precedence; counters (* + ? {}) bind tighter than sequences, which bind tighter than disjunction (|)
bad example:
/guppy|ies/ (matches "guppy" or the bare string "ies", not "guppies")
guppy gupster ies
/gupp(y|ies)/
I have one guppy and she has two guppies.
11) curly brackets
def: {n} specifies exactly n occurrences of the preceding character or group (so parentheses are needed to repeat a whole word)
/(happy ){3}/
The happy happy happy happy boy ate cake.
/(happy ){2,4}/ ({n,m} specifies a range of n to m occurrences)
The happy happy happy happy happy boy ate cake.
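a quick sanity check of the patterns above, using Python's re module (a sketch; Python's regex syntax is close to, but not identical to, the slash notation in these notes):

import re

# character class with a literal tail
print(re.findall(r"[wW]oodchuck", "Woodchuck colonies worship a solitary woodchuck."))
# ['Woodchuck', 'woodchuck']

print(re.findall(r"colou?r", "color and colour"))              # ['color', 'colour']
print(re.findall(r"[0-9]+", "wrote 19 on the board 9 times"))  # ['19', '9']
print(re.findall(r"beg.n", "begin began begun beg'n"))         # all four forms

# counters apply to the preceding group, hence the parentheses
print(re.findall(r"(?:happy ){3}", "The happy happy happy happy boy"))
# ['happy happy happy ']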
B. substitutions
1) basic substitutions
def: replace first string with second string
s/colour/color/
Nice colour. --> Nice color.
2) capture groups
def: the part of the pattern in parentheses is stored in a memory register and can be referred back to as \1, \2, \3..., numbered by the order in which the opening parentheses appear
s/([0-9]+)/<\1>/
35 boxes of 20934 bobbleheads each in 1 truck. --> <35> boxes of <20934> bobbleheads each in <1> truck.
/[tT]he (.*)er they were, the \1er they will be/
The bigger they were, the bigger they will be. The smaller they are, the bigger they will be. The faster they ran, the faster they ate.
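the substitutions and capture groups above can be reproduced with re.sub and re.search in Python (a sketch using the test sentences from these notes):

import re

print(re.sub(r"colour", "color", "Nice colour."))  # Nice color.
print(re.sub(r"([0-9]+)", r"<\1>", "35 boxes of 20934 bobbleheads each in 1 truck."))
# <35> boxes of <20934> bobbleheads each in <1> truck.

# a backreference (\1) inside the match pattern itself
m = re.search(r"[tT]he (.*)er they were, the \1er they will be",
              "The bigger they were, the bigger they will be.")
print(m.group(1) if m else "no match")  # bigg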
3) non-capturing groups
?:: marks a non-capturing group: it groups without storing, so it doesn't get a \1-style number
/(?:[sS]ome|[aA] few) (people|cats) like \1/
A few people like cats. A few cats like cats. A few cats like a few people.
4) lookahead assertions (#10)
?=: looks ahead for the target string but doesn't advance the match cursor (zero-width)
?!: negative lookahead, opposite of ?=
/^(?!Volcano)[A-Za-z]+/
--- will match the first word of any line that doesn't start with "Volcano" ---
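the same lookahead in Python (a sketch; re.match anchors at the start of the string, standing in for ^):

import re

for line in ["Volcano hiking is fun", "Kilimanjaro is in Tanzania"]:
    m = re.match(r"(?!Volcano)[A-Za-z]+", line)
    print(m.group(0) if m else "no match")
# no match
# Kilimanjaro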
III. Words and Corpora
A. in speech and text analysis, punctuation marks can be treated as words because they carry equally important information
ex) "I do uh main- mainly business data processing."
B. disfluencies (#12) can signal that a speaker is unsure, shifting an idea, or restarting a clause
C. if the vocabulary of a corpus is V, and the number of tokens is N, then according to Herdan's Law or Heaps' Law,
|V| = k(N^b) where k > 0 and 0 < b < 1
(b generally depends on corpus size and genre)
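a tiny illustration of the law in Python; the constants k and b below are made up for demonstration, not measured from any corpus:

k, b = 30, 0.7  # illustrative values only

for n_tokens in (10_000, 1_000_000, 100_000_000):
    print(f"N = {n_tokens:>11,}  predicted |V| ~ {k * n_tokens ** b:,.0f}")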
IV. Text Normalization
A. the steps:
1) segmentation/tokenization of words from running text
2) normalization of word formats
3) sentence segmentation in running text
B. crude Unix commands
1) tr
def: changes particular characters systematically
2) sort
def: sorts input alphabetically
3) uniq
def: collapses and counts adjacent identical lines
ex) tr -sc 'A-Za-z' '\n' < sh.txt
will print every word of the corpus on its own line (-c complements the letter set so every non-letter becomes a newline, and -s squeezes repeated newlines into one)
tr -sc 'A-Za-z' '\n' < sh.txt | sort | uniq -c
gives a count of each word, alphabetically (uniq -c prefixes each distinct line with the number of times it occurs; sort is needed first because uniq only collapses adjacent identical lines)
tr -sc 'A-Za-z' '\n' < sh.txt | tr A-Z a-z | sort | uniq -c
gives a count of all words, lowercased first (tr A-Z a-z), so different capitalizations of a word are counted together
tr -sc 'A-Za-z' '\n' < sh.txt | tr A-Z a-z | sort | uniq -c | sort -n -r
sorts the counts numerically (-n) in reverse, i.e. descending, order (-r), so the most frequent words come first
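a rough Python equivalent of the whole pipeline (a sketch; sh.txt stands for whatever corpus file is being counted):

import re
from collections import Counter

with open("sh.txt", encoding="utf-8") as f:
    text = f.read()

# tr -sc 'A-Za-z' '\n' | tr A-Z a-z : keep runs of letters, lowercased
words = re.findall(r"[a-z]+", text.lower())

# sort | uniq -c | sort -n -r : count, then list most frequent first
for word, count in Counter(words).most_common(10):
    print(count, word)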
C. word tokenization and normalization
1) sometimes, you can perform a sort of lookup so you can catch even words like "rock and roll" and "New York" with named entity detection
2) commonly, we'll use the Penn Treebank Tokenization standard
separates out clitics (#19) and punctuation, and keeps hyphenated words together
3) normalization then maps variant forms to a single canonical one, so that "USA" and "US" are treated as the same word
4) tokenization needs to be fast and efficient, with few errors
5) of course, doing that is easier said than done, and you need to see whether you can deal with ambiguities, such as distinguishing between quotations and apostrophes
D. MaxMatch
1) for Mandarin, it starts at the beginning of a sentence and greedily takes the longest dictionary word it can; if no dictionary word matches, it advances a single character (sketched in code after this list)
ex) 他特别喜欢北京烤鸭
他 (he) 特别 (especially) 喜欢 (likes) 北京烤鸭 (Beijing duck)
2) for English, it tries to make the longest word out of the letters present in the string until it can't, then it starts from where it left off
ex) we can only see a short distance ahead
we canon l y see ash ort distance ahead
3) as you can see, this works much better for Chinese than for English, because the words in Chinese are much shorter
4) word error rate
def: the normalized edit distance between the MaxMatch result and a gold-standard (hand-segmented) answer
5) sequence models work the best, but they require machine learning and such
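a minimal MaxMatch sketch in Python (the dictionary is a toy stand-in for a real wordlist):

def max_match(sentence, dictionary):
    words, i = [], 0
    while i < len(sentence):
        # try the longest remaining substring first, then shrink
        for j in range(len(sentence), i, -1):
            if sentence[i:j] in dictionary:
                words.append(sentence[i:j])
                i = j
                break
        else:
            # no dictionary word starts here: emit one character and advance
            words.append(sentence[i])
            i += 1
    return words

dictionary = {"他", "特别", "喜欢", "北京烤鸭"}
print(max_match("他特别喜欢北京烤鸭", dictionary))
# ['他', '特别', '喜欢', '北京烤鸭']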
E. lemmatization and stemming
ex) He is reading detective stories --> He be read detective story
1) morphological parsing
def: dividing a word into its stem and its affixes (the general term for prefixes and suffixes)
ex) fox --> fox
ex) cats --> cat, -s (plural)
ex) amaren --> amar, -en (3rd person plural, future subjunctive)
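NLTK's WordNet-based lemmatizer can reproduce mappings like these (requires pip install nltk plus nltk.download("wordnet"); note that the part of speech must be supplied, noun being the default):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("stories"))           # story
print(lemmatizer.lemmatize("reading", pos="v"))  # read
print(lemmatizer.lemmatize("cats"))              # cat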
2) the Porter stemmer
def: a crude, rule-based stemmer that simply strips off affixes in a cascade of rewrite passes (cheaper but blunter than true lemmatization)
ex) This was not the map we found in Billy Bones's chest, --> Thi wa not the map we found in Billi Bone s chest,
as you can see, it knows that -s is a plural suffix, but it doesn't know that it's simply part of the words this and was
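NLTK also ships an implementation of the Porter stemmer, so the example above is easy to reproduce (the stemmer lowercases its input by default):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["This", "was", "stories", "found"]:
    print(word, "->", stemmer.stem(word))
# This -> thi, was -> wa, stories -> stori, found -> found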
F. sentence segmentation
1) it looks simple, because ! and ? are fairly unambiguous and a period usually marks the end of a sentence
2) but what if the period is in an abbreviation like Mr. or Dr.?
3) so robust segmenters rely on abbreviation dictionaries and machine learning
V. Minimum Edit Distance
A. if a user types in graffe, we assume that they misspelled giraffe because there's only a difference of one letter
B. it's certainly more plausible than grafe or grief, both of which are off by more than a letter
C. coreference (#21): deciding whether two strings refer to the same entity
D. edit distance will help us decide coreference among other things because it tells us in a number how similar two strings are
1) minimum edit distance (#22)
def: the minimum number of editing operations (insertion, deletion, substitution) needed to transform one string into the other
ex) intention to execution
I N T E * N T I O N
| | | | | | | | | |
* E X E C U T I O N
d s s i s (d - deletion, s - substitution, i - insertion)
offsetting execution by one character (the leading *) lines up tion at the ends of both words, and the gap (*) inside intention makes room for the inserted c so the e's and the rest align
E. using a backtrace (#23), you can recover which operations produced the minimum distance: each cell of the edit-distance table stores a pointer to the cell it was computed from, and tracing the pointers back from the final cell yields the alignment
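a minimal dynamic-programming sketch of minimum edit distance in Python, with a cost of 1 for insertion, deletion, and substitution (a common Levenshtein variant charges 2 for substitution, which would give 8 below instead of 5):

def min_edit_distance(source, target):
    n, m = len(source), len(target)
    # d[i][j] = distance between source[:i] and target[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # i deletions
    for j in range(m + 1):
        d[0][j] = j  # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or copy
    # a backtrace would also record which of the three cases won in each cell
    return d[n][m]

print(min_edit_distance("intention", "execution"))  # 5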