state-of-nlp

Let's talk about NLP.

What is HLT vs NLP vs CL?

These are more or less synonymous these days; it depends mostly on which department you come from. Natural Language Processing roughly means that you came from CS, and computational linguistics means that you're coming at it from Linguistics. The speech processing community, which deals more with audio, came more out of the electrical engineering world. We've started seeing "human language technology" as well.

We can view all of this as a sub-branch of AI. What I mean by any of these terms is basically "any time you want to make use of human language in a computer program".

NLP generally, especially large-scale statistical NLP, is really interesting these days because there's so much data being produced all the time: there's the web (including social media), there's news, there's scholarly work, there are medical records, there are all the books that have been scanned. People complain about how nobody reads anymore, that we're basically semi-literate clods drooling onto our smartphones... but that's basically bunk. This is the most textual, information-rich time that has ever happened. (You might not like the *content* of the text, but that's sort of a separate issue.)

And moreover, we have so much compute and so much bandwidth: you can do automatic speech recognition (ASR) cheaply, on a phone. And that's ridiculous, and it's never been true before.

It turns out that getting a computer to handle language is really hard. You might want your computer to be like a personal assistant, but to do language like a human being, you need so much knowledge about the world! To be really useful, you have to be able to decide not just what the possible interpretations are, but which ones are likely. Language is extremely ambiguous, and sentences that humans have no problems with have all kinds of ridiculous interpretations that are valid if you don't know about the world.

So what are some interesting NLP applications?

big recent examples: Siri. Watson. Google Translate!!

But let's consider some more.

(ask the class to list some)

Just in case these don't get mentioned...

so many of these have to do with interfaces and simple desktop-computer use.

    • wc (example with xinhua-cyprus.txt on tank)
    • spellchecking (query correction on Google)
    • whrr cna i byu liqqqr fter 10pm http://www.google.com/search?client=safari&rls=en&q=whrr+cna+i+byu+liqqqr+fter+10pm
    • predictive text!
    • gmail gestures! http://mail.google.com/mail/help/motion.html
      • but seriously, how do you do input with sign language? this is also a language problem.
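To give a feel for how small the first item on that list is, here's a rough sketch of wc-style counting in Python. The function names (`wc`, `top_words`) are made up for illustration, and treating "a word" as "whatever is between whitespace" is an assumption that only sort of works, even for English.

```python
import re
from collections import Counter

def wc(text):
    """Return (lines, words, characters), roughly like the Unix wc tool."""
    lines = text.count("\n")   # wc counts newline characters
    words = len(text.split())  # assumption: words are whitespace-separated
    chars = len(text)
    return lines, words, chars

def top_words(text, n=3):
    """Return the n most frequent lowercased word-like tokens."""
    # Crude tokenizer: runs of ASCII letters and apostrophes only.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(n)

if __name__ == "__main__":
    sample = "the cat sat\non the mat\n"
    print(wc(sample))
    print(top_words(sample, 1))
```

Even this tiny example runs into real language questions: is "don't" one word or two, and what do you do for languages that don't put spaces between words?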

voice recognition things

    • desktop computers
    • phone interfaces: Jelly Bean! Siri!
    • menu systems when you call a big company
    • audio transcriptions

information extraction

    • document summarization
    • inference ("textual entailment")
    • large-scale sentiment analysis for business purposes? (opinion mining)

speech synthesis

    • XTRANORMAL: http://www.xtranormal.com/watch/12552466/the-state-of-the-machine-learning-labor-market
    • GPS directions
    • the parking payment machine at the airport

machine translation

    • general-purpose on the web
    • special-purpose for industrial uses
    • publication-quality? ...
    • translation is interesting because: it's so easy to get in contact with people across the world, who speak many languages (maybe ones that you don't speak)

search engines. holy cow, search engines.

    • on the web
    • queries over sets of documents (your email? ...)

spam filtering is like the reverse of search engines
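Where a search engine tries to find the documents you want, a spam filter tries to pick out the ones you don't. One common way to do that (not the only one) is a naive Bayes classifier over word counts. What follows is a minimal sketch under that assumption; the class name, the "spam"/"ham" labels, and the toy training messages are all invented for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase, whitespace-separated tokens are good enough here.
    return text.lower().split()

class NaiveBayesSpamFilter:
    """Toy naive Bayes text classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def score(self, text, label):
        # Log-probability of the label plus each word given the label.
        # Needs at least one training example per label (else log(0)).
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total = sum(self.word_counts[label].values())
        logp = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in tokenize(text):
            count = self.word_counts[label][word]
            logp += math.log((count + 1) / (total + len(vocab)))
        return logp

    def is_spam(self, text):
        return self.score(text, "spam") > self.score(text, "ham")

if __name__ == "__main__":
    f = NaiveBayesSpamFilter()
    f.train("cheap pills buy now", "spam")
    f.train("buy cheap watches now", "spam")
    f.train("lunch meeting at noon", "ham")
    f.train("notes from the meeting", "ham")
    print(f.is_spam("buy cheap pills"))
    print(f.is_spam("meeting notes"))
```

Real filters train on far more mail and use far more signal than bag-of-words counts, but the basic idea is the same: learn what unwanted text tends to look like, then score new text against it.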

references and sources

Talk I gave February 2012:

http://hackmode.org/wiki/CatalystIntroToNlpTalk

Mike Gasser's lecture notes:

http://www.cs.indiana.edu/classes/b651/Notes/tasks.html

http://www.cs.indiana.edu/classes/b651/Notes/kb.html

http://www.cs.indiana.edu/classes/b651/Notes/stats.html

http://www.cs.indiana.edu/classes/b651/Notes/methods.html

Notes from Jason Eisner:

http://cs.jhu.edu/~jason/465/PDFSlides/lect01-intro.pdf