homework-ideas

These are not the official homeworks for the class. But some of them might become the official homeworks for the class.

build ELIZA with regular expressions.

This is neat because it lets you talk about regexes and groups, and the Python facilities for all of this are pretty nice. Also we can discuss the history of NLP/NLU...
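
A minimal sketch of the pattern/response machinery, with rules invented for illustration rather than taken from Weizenbaum's actual script (a fuller version would also swap pronouns, "my" → "your", before echoing):

```python
import re

# Each rule pairs a regex (with a capture group) with a response
# template; these rules are made up for illustration.
RULES = [
    (re.compile(r"\bi am (.*)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\bi feel (.*)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"\bmy (\w+)", re.IGNORECASE), "Tell me more about your {0}."),
]

def respond(utterance):
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return template.format(*match.groups())
    return "Please go on."  # default when nothing matches

print(respond("I am worried about my thesis"))
# -> "How long have you been worried about my thesis?"
```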

An alternative for teaching regular expressions is of course the SPAMLORD homework from nlp-class.org.

implement naïve Bayes

This lets us talk about classifiers and good machine learning practice, and it's a sane introduction to the general class of things that are classification problems.
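
To make the scope concrete, here's a bare-bones multinomial naïve Bayes with add-one smoothing; a sketch, not a reference implementation, and the toy training data is invented:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, documents, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)  # per-label word counts
        for doc, label in zip(documents, labels):
            self.word_counts[label].update(doc.split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}

    def predict(self, document):
        def log_score(label):
            prior = math.log(self.label_counts[label] / sum(self.label_counts.values()))
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            # add-one smoothing so unseen words don't zero out the score
            return prior + sum(math.log((self.word_counts[label][w] + 1) / denom)
                               for w in document.split())
        return max(self.label_counts, key=log_score)

nb = NaiveBayes()
nb.fit(["good great fun", "bad awful boring"], ["pos", "neg"])
print(nb.predict("great fun"))  # -> "pos"
```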

You basically have to do an n-gram poetrybot

Why would you even consider having an NLP class without teaching n-gram language models via sampling from your n-gram model?

The thing about this is that it doesn't address smoothing. You could imagine two extensions, perhaps: a rhyming/rhythm module that suggests words, for actually good poetry... and, for unseen n-grams, backing off to a model of n-grams over POS tags, or even a PCFG. Or perhaps you just do some sane kind of smoothing.
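
The core fits in a few lines; this sketch counts whitespace tokens into a bigram table and samples greedily, with no smoothing (exactly the gap noted above) and no rhyme module:

```python
import random
from collections import Counter, defaultdict

def train_bigrams(text):
    tokens = ["<s>"] + text.split() + ["</s>"]
    table = defaultdict(Counter)  # table[prev][cur] = count
    for prev, cur in zip(tokens, tokens[1:]):
        table[prev][cur] += 1
    return table

def sample_line(table, max_len=12):
    word, line = "<s>", []
    while len(line) < max_len:
        nexts = table[word]
        word = random.choices(list(nexts), weights=list(nexts.values()))[0]
        if word == "</s>":
            break
        line.append(word)
    return " ".join(line)

table = train_bigrams("the rose is red the violet is blue")
print(sample_line(table))  # e.g. "the rose is blue"
```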

Homework and a unit on regexes and FSTs?

The obvious thing to do is Mike-style FSTs for morphology; that's interesting because then we can talk about morphology some more.
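
For a flavor of what the FST framing buys you, here's a toy two-state transducer for the English plural e-insertion rule (fox+PL → foxes). This hand-rolled version is just illustrative; a real homework would use an actual FST toolkit and a lexicon, and cover more rules:

```python
def pluralize(lexical):
    """Transduce e.g. 'fox+PL' -> 'foxes' with two states: COPY, and
    SIBILANT (just saw s/x/z, so the plural needs an epenthetic e)."""
    state, out = "COPY", []
    for ch in lexical:
        if ch == "+":  # morpheme boundary: emit the plural allomorph
            out.append("es" if state == "SIBILANT" else "s")
            break      # ignore the rest of the +PL tag
        out.append(ch)
        state = "SIBILANT" if ch in "sxz" else "COPY"
    return "".join(out)

for word in ["cat+PL", "fox+PL", "buzz+PL"]:
    print(word, "->", pluralize(word))
# cat+PL -> cats, fox+PL -> foxes, buzz+PL -> buzzes
```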

language id as a classification problem.

There are a few different ways to handle language id. You could use a classifier with words or character-level n-grams as (binary?) features. Or you could use a character-level Markov model and estimate the probability of each candidate language generating that string of characters. So for a homework on this, I would probably give the students a trainable classifier, ready to go, and make them generate the features. If they're ambitious, they could think up more features (word bigrams, a vocabulary list, or something).
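
The feature-generation piece might look like this sketch: character n-grams with boundary markers, counted into a dict that any off-the-shelf classifier could consume (the boundary-marker convention and n=3 are arbitrary choices):

```python
from collections import Counter

def char_ngram_features(text, n=3):
    text = "^" + text.strip().lower() + "$"  # mark string boundaries
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

print(char_ngram_features("the cat"))
# Counter({'^th': 1, 'the': 1, 'he ': 1, 'e c': 1, ' ca': 1, 'cat': 1, 'at$': 1})
```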

Lessons learned from this would be: what a classifier is, how we train it, what good practice in supervised machine learning looks like, and some basic text processing to chop up the text and produce the features...

homograph disambiguation as a classification problem.

This could plug into speech synthesis!! We could use Festival or eSpeak, perhaps, or whatever you can plug pronunciations into. This would let us do problems like "how do I say this acronym?" Is it spelled out letter by letter, like EMNLP, or pronounced as a word, like IKEA? See Jurafsky and Martin, around page 256, for this problem. There's also "convert" as noun vs. verb, or "invalid", or plenty of others.
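
Framed as classification, the features are just the context around the target token. A sketch, where the feature-naming scheme and window size are arbitrary choices:

```python
def homograph_features(tokens, i, window=2):
    """Bag of positioned context words around the homograph at index i."""
    feats = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        if 0 <= j < len(tokens):
            feats["w[%+d]=%s" % (offset, tokens[j].lower())] = 1
    return feats

tokens = "she is a recent convert to tea".split()
print(homograph_features(tokens, tokens.index("convert")))
# {'w[-2]=a': 1, 'w[-1]=recent': 1, 'w[+1]=to': 1, 'w[+2]=tea': 1}
```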

vocabulary extraction from Wikipedia

Given a category in Wikipedia, grab all of the sub-articles, and also some articles in the super-category (or maybe just a random assortment of Wikipedia articles). Use this to extract the most salient technical terms for a field (you know, like Pokémon, or music theory) by calculating the tf-idf for n-grams in the pages inside that category. For HARD MODE: can you extract definitions for those terms? Can you determine what other uses those terms might have, outside of that field?
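
The scoring step, assuming the pages have already been fetched and tokenized (unigrams only here; pooling term frequency over the whole category is one of several reasonable choices):

```python
import math
from collections import Counter

def tfidf(category_docs, background_docs):
    all_docs = category_docs + background_docs
    df = Counter(term for doc in all_docs for term in set(doc))  # document frequency
    tf = Counter(term for doc in category_docs for term in doc)  # category term frequency
    n = len(all_docs)
    return {term: count * math.log(n / df[term]) for term, count in tf.items()}

category = [["tonic", "dominant", "chord"], ["chord", "scale", "tonic"]]
background = [["the", "cat", "sat"], ["chord", "of", "wood"]]
scores = tfidf(category, background)
print(sorted(scores, key=scores.get, reverse=True)[:3])
# e.g. ['tonic', 'dominant', 'scale'] -- "chord" is downweighted, it appears in the background too
```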

Write a parser. Everybody likes writing a CKY chart parser, right?
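
A recognizer over a toy CNF grammar is short enough to sketch here (grammar and lexicon invented; the homework version would keep backpointers to recover trees, and probabilities to rank them):

```python
from itertools import product

GRAMMAR = {  # binary CNF rules, keyed (B, C) -> set of parents A
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
LEXICON = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "saw": {"V"}}

def cky_recognize(words):
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(LEXICON.get(w, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point
                for b, c in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= GRAMMAR.get((b, c), set())
    return "S" in chart[0][n]

print(cky_recognize("the dog saw the cat".split()))  # -> True
```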

Maybe write a shift-reduce dependency parser? That's good because classifiers again.
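
The transition system underneath (arc-standard here) is tiny; the classifier's whole job is to predict which action to take next. A sketch with a hand-fed action sequence standing in for the classifier:

```python
def run_transitions(words, actions):
    """Execute arc-standard actions; returns (head, dependent) arcs."""
    stack, buffer, arcs = [], list(range(len(words))), []
    for action in actions:
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":   # top of stack heads the word under it
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif action == "RIGHT-ARC":  # word under the top heads the top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return [(words[h], words[d]) for h, d in arcs]

words = ["she", "eats", "fish"]
print(run_transitions(words, ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC"]))
# -> [('eats', 'she'), ('eats', 'fish')]
```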

Question answering system. This would be a good use of a dependency parser.

Write the decoder for a really simple MT system? Or maybe we could do Mike-style EBMT?
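
For scope, "really simple" could mean a greedy word-by-word noisy-channel decoder with no reordering; all the tables and probabilities below are invented toy numbers:

```python
import math

TM = {  # P(target | source), made up
    "le": {"the": 0.9, "it": 0.1},
    "chat": {"cat": 0.8, "chat": 0.2},
    "dort": {"sleeps": 0.7, "naps": 0.3},
}
LM = {  # bigram P(word | prev), made up, with a floor for unseen pairs
    ("<s>", "the"): 0.5, ("the", "cat"): 0.4, ("cat", "sleeps"): 0.3,
}

def greedy_decode(source):
    prev, out = "<s>", []
    for s in source.split():
        candidates = TM.get(s, {s: 1.0})  # pass unknown words through
        best = max(candidates, key=lambda t: math.log(candidates[t])
                                             + math.log(LM.get((prev, t), 1e-6)))
        out.append(best)
        prev = best
    return " ".join(out)

print(greedy_decode("le chat dort"))  # -> "the cat sleeps"
```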