word-sense-disambiguation

So, before we talk about word sense disambiguation, let's talk about words, and the meanings of words.

"Not bad meaning bad, but bad meaning good!" -- Run DMC, "Peter Piper"

What is the meaning of a word?

The situation with words and meanings of words is pretty complicated. In the one-language case it's interesting enough...

But the thing to notice is that the relationships between words and meanings are complicated in different ways across languages.

So first let's look at relationships between words and meanings in English...

"One practical technique for determining if two senses are distinct is to conjoin two uses of a word in a single sentence; this kind of conjunction of antagonistic readings is called zeugma." (J&M, page 613).

"Which of those flights serve breakfast?"

"Does Midwest Express serve Philadelphia?"

"? Does Midwest Express serve breakfast ... and Philadelphia?"

So you would really want to say that these are two different senses of "serve".

"She left in a huff and a Hyundai."

Let's check out some wiktionary entries...

http://en.wiktionary.org/wiki/book#Noun

http://en.wiktionary.org/wiki/bomb

http://en.wiktionary.org/wiki/take

(I had no idea about "take a pitch" ...)

http://en.wiktionary.org/wiki/endure

http://en.wiktionary.org/wiki/call

http://en.wiktionary.org/wiki/explode

("call a function...") -- can you use that meaning outside of a programming context? ...

Verbs that contribute little meaning of their own and pick up a lot of different contextual meanings are often called light verbs, and to understand what's going on, you really have to look at the context.

Some examples of light verbs:

- take, give, make, do, have ...

"Have a cow", "have a bagel", "have a conniption", "have a child" ...

"do a barrel roll", "do drugs", "do the math", "do the hustle" ...

Then there are a lot of verbs that aren't light verbs, but they have different meanings based on the nearby preposition/"particle".

Compiled by my friend from college, Mr. A.P. Saulters: http://www.apsaulters.net/pv.html

Relevant terms we should nail down...

lemma: the citation form of a word. We want to say that "found", "find", "finds", and maybe "finding" all have the same lemma, "find".

But lemmatization can be ambiguous: "found" might be the past tense of "find", or the base form of the verb "found" (as in founding a company).
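A toy sketch of why the part of speech matters here (the lookup table is hand-built and hypothetical, not a real lemmatizer):

```python
# Toy POS-sensitive lemma lookup: "found" lemmatizes differently
# depending on whether it's the past tense of "find" or the base
# form of "found" (to establish).
LEMMA_TABLE = {
    ("found", "VBD"): "find",    # past tense of "find"
    ("found", "VB"): "found",    # base form of "found" (to establish)
    ("finds", "VBZ"): "find",
    ("finding", "VBG"): "find",
}

def lemmatize(wordform, pos_tag):
    """Return the lemma for (wordform, POS tag); fall back to the wordform."""
    return LEMMA_TABLE.get((wordform, pos_tag), wordform)
```

Without the POS tag (or with the wrong one), there's no way to decide which lemma "found" belongs to.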

So what's a word sense? That's a discrete meaning that we want to isolate from some other meanings. How fine-grained we make these depends on the task.

homonymy: when one wordform has two meanings with basically no semantic connection, like "bank" (side of a river) and "bank" (financial institution). That's not always the relationship between word-senses, though. You could also have...

polysemy: when there's some sort of relationship between the meanings. A good example is "blood bank" and "financial bank", or even "the building a financial bank lives in" (the latter is an instance of metonymy).

synonyms, antonyms... can be hard to tell apart!

hypernyms (aka superordinates), hyponyms: superclass/subclass relationships. Some words are more general than others.

meronymy: part/whole relationship

What are the senses of a word?

Really, the distinctions that you want to make are dependent on what you want to do...

What are some places you could get senses of a word, say if you want to start disambiguating words? ...

- dictionary?

- WordNet? (what is wordnet?)

- ... another language? ....

How to represent the meaning of a word, computationally?

It turns out:

- it depends on what you want to do!

- but it can be pretty complicated.

What if you want to do translation? ...

Moreover: different languages describe the world differently

Are meanings consistent across languages? They're just not.

Translation would be really, really easy if you could just look up the word for X in the target language, even if you had to rearrange the words to get the right syntax. But it's not that easy: you just say things differently, in different languages.

Some terms that we want to nail down:

lexical divergence: one word in the source language has several different translations in the target language (really this means the source word has several senses, which the target language splits apart).

lexical gap: there's a term in one language, but no really satisfactory way to say it in your target language.

Here's a fairly straightforward example, due to Hutchins and Somers. English and French are pretty similar languages, and they've had a whole lot of contact over the years. And yet...

Some more examples:

movement verbs...

English has a bunch of these. In English you can...

- bolt

- mosey

- storm

- wander

- amble

- jog

- trot ...

... and these are just verbs about a human being moving. You could also float, fly, slither...

In Spanish (for example), it's actually more common to say where you were going: instead of floating into the room, you "entró flotando", typically.

Just saw this one today: "abandonar el escenario intempestivamente" (literally, "leave the stage abruptly") --> "storm off the stage".

Semi-related anecdote: in Mexico in 2002, alexr was playing video games with his study-abroad host family. And in the Spider-Man video game, there was a button for "dar una patada" ("give a kick"), but apparently "patear" ("to kick") is also OK sometimes...

Cut/break verbs

There's been a lot of work on verbs about cutting/breaking/opening, because it's really interesting how much languages differ in their treatment of these. Majid et al. showed 61 videos to native speakers of 28 different languages.

The videos were things like:

- tearing cloth into pieces by hand

- breaking a stick over a knee

- slicing a carrot into multiple pieces with a knife

- taking the top off a pen

- opening a book

- making an incision in a melon with a knife ...

Some languages have quite a lot of these verbs; some have very few. The paper gives an extreme example: a language that apparently has only two! (Did you cut the thing with the grain, or against the grain?)

(table shamelessly stolen from Majid et al)

What are some ways you could get different senses of a word?

There's the mono-lingual sense, where you want to tell the difference between dictionary entries.

There's the cross-lingual sense, say if you're trying to do translation, or if you don't have a good lexicon for the language you care about...

A little more on WSD.

Where do the senses come from?

If you're not doing translation, a pretty good place to get senses from is WordNet!

Let's talk about WordNet

WordNet is this beautiful free database of English lexical information, from Princeton.

http://wordnet.princeton.edu/

http://en.wikipedia.org/wiki/WordNet

Started by the 7 ± 2 guy, who just recently passed away :-\

http://en.wikipedia.org/wiki/George_Armitage_Miller (he was 92)

It's actually from the cognitive science community. The original goal was to organize words and meanings in a way that's cognitively plausible, so that things that seem more similar to people are closer together in the network.

The one from Princeton is for English.

What's in it?

All kinds of stuff!

Like we mentioned before, there are hypernyms (superclasses), hyponyms (subclasses), holonyms (building is a holonym of window), meronyms (window is a meronym of building)...

The basic unit in WordNet is what's called the "synset", which is a set of synonyms that share the same basic meaning.

What's not in it?

Etymologies, translations.

Usage examples.

How to access WordNet?

There are a bunch of APIs you can use. You could write your own and parse the freely-redistributable files... but maybe it's better to use an API for the programming language you want to use.

Let's use the API that comes with NLTK!

What's NLTK?

NLTK has lots of interesting stuff in it. We'll use some of it this semester, but not all of it. Not everything in NLTK works with Python 3 yet, which is unfortunate; some people are working on it. alexr should help more.

from nltk.corpus import wordnet as wn

wn.synsets("building")

Neat! Now we can look at all of the things on the synsets...

Lemmatizing with WordNet.

The chair was playing the slap bass in the building.

OK, so why do you want to do this?

>>> from nltk import stem

>>> wnl = stem.WordNetLemmatizer()

>>> wnl.lemmatize('aardwolves')

'aardwolf'

>>> from nltk.corpus import wordnet as wn

>>> aw = wn.synsets('aardwolf')[0]

>>> hypernym = lambda s: s.hypernyms()

>>> list(aw.closure(hypernym))

[Synset('hyena.n.01'), Synset('canine.n.02'), Synset('carnivore.n.01'), Synset('placental.n.01'), Synset('mammal.n.01'), Synset('vertebrate.n.01'), Synset('chordate.n.01'), Synset('animal.n.01'), Synset('organism.n.01'), Synset('living_thing.n.01'), Synset('whole.n.02'), Synset('object.n.01'), Synset('physical_entity.n.01'), Synset('entity.n.01')]

synset.part_holonyms() to get the things that contain this thing. part_meronyms to get the things it contains...

>>> building = wn.synsets("building")[0]

>>> building

Synset('building.n.01')

>>> building.definition

'a structure that has a roof and walls and stands more or less permanently in one place'

>>> building.part_holonyms

<bound method Synset.part_holonyms of Synset('building.n.01')>

>>> building.part_holonyms()

[]

>>> building.part_meronyms()

[Synset('corner.n.03'), Synset('window.n.01'), Synset('cullis.n.01'), Synset('shaft.n.08'), Synset('heating_system.n.01'), Synset('court.n.10'), Synset('floor.n.02'), Synset('interior_door.n.01'), Synset('room.n.01'), Synset('crawlspace.n.01'), Synset('anteroom.n.01'), Synset('exterior_door.n.01'), Synset('skeleton.n.04'), Synset('upstairs.n.01'), Synset('corner.n.11'), Synset('wall.n.01'), Synset('stairway.n.01'), Synset('elevator.n.01'), Synset('annex.n.01'), Synset('roof.n.01'), Synset('cornerstone.n.02'), Synset('cornerstone.n.03'), Synset('foundation_stone.n.01'), Synset('scantling.n.01')]

So what could you use WordNet for?

>>> for word in "this sentence has a bunch of words".split():

...     lemmatized = wnl.lemmatize(word)

...     print len(wn.synsets(lemmatized))

At the dumbest baseline, it's a thesaurus.

It's an ontology! It gives you a tree of concepts. What are things you might want to do with that? Given two words, it tells you how similar they are -- you can take distances through the network.
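A minimal sketch of that idea: path-based similarity over a tiny hand-built hypernym tree. The taxonomy entries are illustrative, and the 1/(path length + 1) formula mimics WordNet's path_similarity measure:

```python
# Tiny hand-built hypernym tree: child -> parent (illustrative entries).
HYPERNYMS = {
    "aardwolf": "hyena",
    "hyena": "carnivore",
    "dog": "carnivore",
    "carnivore": "mammal",
    "mammal": "vertebrate",
    "trout": "fish",
    "fish": "vertebrate",
}

def path_to_root(word):
    """Follow hypernym links all the way up to the root."""
    path = [word]
    while path[-1] in HYPERNYMS:
        path.append(HYPERNYMS[path[-1]])
    return path

def path_similarity(a, b):
    """1 / (length of the shortest path through the lowest common ancestor + 1)."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    for i, node in enumerate(path_a):
        if node in path_b:
            return 1.0 / (i + path_b.index(node) + 1)
    return 0.0
```

On this toy tree, "aardwolf" comes out much more similar to "hyena" than to "trout", because the path through the tree is much shorter.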

You could use it for targeting ads! (surely this is the most important NLP task of all?)

Maybe if somebody's browsing a web page about aardwolves, you want to show them an ad about hyenas! Or vice-versa!

I would argue that if you could pick out the correct wordnet sense for every word in a sentence, you have a really good model of what that sentence means. You could probably translate the heck out of that sentence. Assuming you have some notion of how to translate each of those synsets...

This is sort of hard, though -- wordnets aren't necessarily available for every language. And if they exist, they might be expensive and have weird licenses.

Kind of surprisingly, even for Spanish, only a sample of the full wordnet is publicly available. (Through this package called FreeLing)

http://en.wikipedia.org/wiki/WordNet#Related_projects_and_extensions

Thinking a little bit more about WSD.

You could use a dictionary: Lesk Algorithm

HOW TO TELL PINE CONES FROM ICE-CREAM CONES (riffing on the subtitle of Lesk's original 1986 paper)

There's actually the original Lesk algorithm, and then Simplified Lesk, which seems to be a bit more effective.

Simplified Lesk algorithm just counts up the words in the surrounding context that are present in the gloss and the examples of a given sense. Shall we try to do this live?

Simplified Lesk goes like this:

best_sense = most frequent sense

max_overlap = 0

context = set of the words in the sentence

for sense in senses:

    signature = set of words in the gloss and the examples for sense

    overlap = compute_overlap(signature, context)

    if overlap > max_overlap:

        max_overlap = overlap

        best_sense = sense

return best_sense
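The pseudocode above can be sketched in runnable Python with a hand-built sense inventory (the sense names and glosses for "bank" below are made up for illustration, not real WordNet glosses):

```python
# Made-up sense inventory: sense name -> gloss.
SENSES = {
    "bank.n.01": "sloping land beside a body of water as in the river bank",
    "bank.n.02": "a financial institution that accepts deposits and lends money",
}

def compute_overlap(signature, context):
    """Count the words shared between the signature and the context."""
    return len(signature & context)

def simplified_lesk(senses, sentence):
    context = set(sentence.lower().split())
    best_sense = next(iter(senses))  # stand-in for the most frequent sense
    max_overlap = 0
    for sense, gloss in senses.items():
        signature = set(gloss.lower().split())
        overlap = compute_overlap(signature, context)
        if overlap > max_overlap:
            max_overlap = overlap
            best_sense = sense
    return best_sense
```

For "i sat on the bank of the river", the river gloss shares "the", "bank", "of", and "river" with the context, so the river sense wins; a deposits-and-lending sentence tips it the other way.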

Do we see any problems with this? How could we fix them?

Original Lesk algorithm compared the target signature with the signature of all of the context words...

We could also consider that some words are more informative than other words -- this weights every word as equally important, and that might not help.

You could train a classifier, do some supervised learning...

What features might you use? Words from the surrounding context? Maybe the n most common words, most discriminative words? Maybe the wordnet synsets of the surrounding words?
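A sketch of one possible feature extractor along those lines. The window size and the bag-of-words encoding are illustrative choices, not a recommendation:

```python
def context_features(tokens, target_index, window=3):
    """Bag-of-words features from a +/- `window` token context
    around the target word (the target itself is excluded)."""
    features = {}
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    for i in range(lo, hi):
        if i != target_index:
            features["bow=" + tokens[i].lower()] = 1
    return features
```

These feature dictionaries could then be fed to any off-the-shelf classifier.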

How do you evaluate WSD quality?

Well, first you have to measure how well you're doing.

There are two ways that you should consider. Firstly, you can do the INTRINSIC or "in vitro" evaluation. That's just a measure of how often you're making the right decision.

Secondly, if you're building a real NLP system that integrates WSD, you should have a way to measure how well *that* thing is doing. This is the EXTRINSIC evaluation, or "in vivo". Typically your goal is not actually to do WSD for its own sake, but as a component of a larger system.

Floors and ceilings

As a floor for intrinsic evaluation, if you're not beating the baseline of "always assume the most frequent sense", then you're not doing very well.

Something you should consider is that some senses of words are just more common than others.

Your WSD system, if it's not sure about a decision, should typically back off to the MFS.
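A sketch of that back-off logic. Here `classify` is a hypothetical classifier returning a (sense, confidence) pair, and the threshold is an arbitrary illustration:

```python
def disambiguate(word, context, classify, most_frequent_sense, threshold=0.6):
    """Use the classifier's answer when it's confident enough;
    otherwise back off to the most frequent sense (MFS)."""
    sense, confidence = classify(word, context)
    if confidence >= threshold:
        return sense
    return most_frequent_sense[word]
```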

Another consideration is that it's possible that your senses are too fine-grained, and maybe people don't even agree about the word senses. If you take your test set and show it to more than one person, and the different people disagree, then you probably can't hope to do any better than the proportion of the time that different people agree about word senses. Unless you've got some argument for why some of your raters are just wrong.

http://en.wikipedia.org/wiki/Inter-rater_reliability
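A quick sketch of the simplest such ceiling estimate, raw observed agreement between two annotators (more careful measures like Cohen's kappa also correct for chance agreement):

```python
def observed_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators chose the same sense."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)
```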

References

Jurafsky and Martin, chapters 19 and 20

Notes from Coursera's NLP class: http://spark-public.s3.amazonaws.com/nlp/slides/sem.pdf

Notes from Mike:

http://www.cs.indiana.edu/classes/b651/Notes/meaning.html

http://www.cs.indiana.edu/classes/b651/Notes/senses.html

Majid et al., the cut/break study: http://cognition.clas.uconn.edu/~jboster/articles/majid_etal.pdf