Natural Language Processing

Natural language processing (NLP) is the ability of a computer program to understand human language as it is spoken or written. NLP is a component of artificial intelligence. Developing NLP applications is challenging because computers traditionally require humans to “speak” to them in a programming language that is precise, unambiguous and highly structured, or through a limited number of clearly enunciated voice commands. Human language, however, is not always precise; it is often ambiguous, and its linguistic structure can depend on many complex variables, including slang, regional dialects and social context.

Current approaches to NLP are based on machine learning, a type of artificial intelligence that examines and uses patterns in data to improve a program's own understanding. Much of the research being done on natural language processing revolves around search, especially enterprise search.

Common NLP tasks in software programs today include:

  • Sentence segmentation, part-of-speech tagging and parsing.

  • Deep analytics.

  • Named entity extraction (demonstrated in the NLTK sketch below).

  • Co-reference resolution.

It is a vast topic, and I have only worked with a simple Python toolkit called NLTK (the Natural Language Toolkit). Here I will describe some of the basic operations that can be executed with it. NLTK was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania. It is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing and semantic reasoning, and wrappers for industrial-strength NLP libraries.
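
As a quick illustration, here is a minimal sketch of sentence segmentation, tokenization, stemming and named entity extraction with NLTK. It assumes the relevant NLTK data packages (punkt, averaged_perceptron_tagger, maxent_ne_chunker and words) have already been fetched with nltk.download(), and the sentences are my own toy examples:

>>> import nltk
>>> text = "Steven Bird created NLTK. Cats are running faster than dogs."
>>> nltk.sent_tokenize(text)                 # sentence segmentation
['Steven Bird created NLTK.', 'Cats are running faster than dogs.']
>>> tokens = nltk.word_tokenize("Cats are running faster than dogs")
>>> [nltk.PorterStemmer().stem(t) for t in tokens]   # crude suffix stripping
['cat', 'are', 'run', 'faster', 'than', 'dog']
>>> tagged = nltk.pos_tag(nltk.word_tokenize("Steven Bird created NLTK."))
>>> print(nltk.ne_chunk(tagged))             # prints a chunk tree

The last call prints a tree marking spans such as (PERSON Steven/NNP Bird/NNP); the exact labels depend on the tagger and chunker models installed.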

Simple Processing Tasks

The NLTK modules include:

  • token: classes for representing and processing individual elements of text, such as words and sentences.

  • probability: classes for representing and processing probabilistic information.

  • tree: classes for representing and processing hierarchical information over text.

  • cfg: classes for representing and processing context free grammars.

  • fsa: classes for finite state automata.

  • tagger: The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The collection of tags used for a particular task is known as a tagset. The emphasis here is on exploiting tags and tagging text automatically. A part-of-speech tagger, or POS-tagger, processes a sequence of words and attaches a part-of-speech tag to each word. For example:

>>> import nltk
>>> from nltk import word_tokenize
>>> text = word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

  • parser: building trees over text (includes chart, chunk and probabilistic parsers; see the first sketch after this list).

  • classifier: classify text into categories (includes feature, featureSelection, maxent, naivebayes).

  • draw: visualize NLP structures and processes.

  • corpus: access (tagged) corpus data (see the second sketch after this list).
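
To make a couple of these modules concrete, here is a minimal sketch of the cfg and parser modules; the two-rule toy grammar below is my own illustration, not something shipped with NLTK:

>>> import nltk
>>> grammar = nltk.CFG.fromstring("""
... S -> NP VP
... NP -> 'dogs'
... VP -> 'bark'
... """)
>>> for tree in nltk.ChartParser(grammar).parse(['dogs', 'bark']):
...     print(tree)
(S (NP dogs) (VP bark))

And, continuing the same session, a similarly small sketch of the corpus and classifier modules. It assumes the Brown corpus has been fetched with nltk.download('brown'); the two training featuresets are toy examples of my own:

>>> from nltk.corpus import brown
>>> brown.tagged_words()[:3]                 # tagged corpus access
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL')]
>>> train = [({'last_letter': 'a'}, 'female'), ({'last_letter': 'k'}, 'male')]
>>> classifier = nltk.NaiveBayesClassifier.train(train)
>>> classifier.classify({'last_letter': 'a'})
'female'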

Two good resources for learning NLTK:

The official book is Natural Language Processing with Python by Steven Bird, Ewan Klein and Edward Loper. Though the explanations are brief, most of the topics have been touched upon (Python 3 and above).

Another good book is this one, which I followed for its expansiveness and completeness.

1. Automatic Text Summarization

  • J. Clarke and M. Lapata. Modeling Compression with Discourse Constraints. EMNLP-CoNLL 2007. (shows importance of joint inference)

  • K. Knight and D. Marcu. Summarization beyond sentence extraction. Artificial Intelligence 139, 2002. (opens the door to statistical approaches to sentence compression)

  • R. McDonald. A Study of Global Inference Algorithms in Multi-Document Summarization. ECIR 2007. (formulates the summarization task as a global optimization problem using integer linear programming)

  • W. Yih et al. Multi-Document Summarization by Maximizing Informative Content-Words. IJCAI 2007. (introduces stack decoding to this field)

2. Information Extraction

  • Hearst. Automatic Acquisition of Hyponyms from Large Text Corpora. COLING 1992. (The very first paper for all the bootstrapping methods in NLP. It is a hypothetical work in the sense that it doesn't give experimental results, but it influenced its followers a lot.)

  • Collins and Singer. Unsupervised Models for Named Entity Classification. EMNLP 1999. (It applies several variants of co-training-like IE methods to the NER task and gives the motivation for why they did so. We can learn the logic of writing a good NLP research paper from this work.)

3. Computational Semantics

  • Gildea and Jurafsky. Automatic Labeling of Semantic Roles. Computational Linguistics 2002. (It opened up the trend in NLP for semantic role labeling, followed by several CoNLL shared tasks dedicated to SRL. It shows how linguistics and engineering can collaborate with each other.)

  • Pantel and Lin. Discovering Word Senses from Text. KDD 2002. (Supervised WSD (word sense disambiguation) was explored a lot in the early 00s thanks to the Senseval workshops, but few systems actually benefit from WSD because manually crafted sense mappings are hard to obtain. These days we see a lot of evidence that unsupervised clustering improves NLP tasks such as NER, parsing, SRL, etc., and this work is one of the roots of unsupervised clustering of words.)

In India, reputed places where NLP work is going on are the Center for Indian Language Technology (CFILT) at IIT Bombay and C-DAC.

I have categorized some good papers according to their sub-fields within NLP. I prepared this list while interning at IIT Guwahati. If you encounter any broken links, please drop a message.