NAACL HLT 2009 Workshop on Semi-supervised Learning for Natural Language Processing

June 4, 2009
Boulder, Colorado, USA


  • Invited Keynote: Jason Eisner (JHU): Joint Models with Missing Data for Semi-supervised Learning
  • Panel Discussion: 
              Panelists: Hal Daume (U of Utah), Andrew Goldberg (U of Wisconsin-Madison), David McClosky (Brown University)
  • Invited Position Papers:
  • Technical Talks:
Latent Dirichlet Allocation with Topic-in-Set Knowledge
David Andrzejewski and Xiaojin Zhu

Coupling Semi-Supervised Learning of Categories and Relations
Andrew Carlson, Justin Betteridge, Estevam Rafael Hruschka Junior and Tom M. Mitchell

Can One Language Bootstrap the Other: A Case Study on Event Extraction
Zheng Chen and Heng Ji   

Keepin' It Real: Semi-Supervised Learning with Realistic Tuning
Andrew B. Goldberg and Xiaojin Zhu   

On Semi-Supervised Learning of Gaussian Mixture Models for Phonetic Classification
Jui-Ting Huang and Mark Hasegawa-Johnson   

A Simple Semi-supervised Algorithm For Named Entity Recognition
Wenhui Liao and Sriharsha Veeramachaneni   

An analysis of bootstrapping for the recognition of temporal expressions
Jordi Poveda, Mihai Surdeanu and Jordi Turmo   

A comparison of Structural Correspondence Learning and Self-training for Discriminative Parse Selection
Barbara Plank   

Surrogate Learning - From Feature Independence to Semi-Supervised Classification
Sriharsha Veeramachaneni and Ravi Kumar Kondadadi   

Is Unlabeled Data Suitable for Multiclass SVM-based Web Page Classification?
Arkaitz Zubiaga, Víctor Fresno and Raquel Martínez   

  • Semi-supervised Learning for NLP Bibliography
An editable wiki bibliography is available.


Machine learning, be it supervised or unsupervised, has become an indispensable tool for natural language processing (NLP) researchers. Highly developed supervised training techniques have led to state-of-the-art performance for many NLP tasks and provide foundations for deployable NLP systems. Similarly, unsupervised methods, such as those based on EM training, have also been influential, with applications ranging from grammar induction to bilingual word alignment for machine translation.

Unfortunately, given the limited availability of annotated data and the non-trivial cost of obtaining more, progress on supervised learning often yields diminishing returns. Unsupervised learning, on the other hand, is not bound by the same data resource limits. However, it is significantly harder than supervised learning and, although intriguing, has not produced consistently successful results for the complex structured prediction problems characteristic of NLP.

It is becoming increasingly important to leverage both types of data resources, labeled and unlabeled, to achieve the best performance on challenging NLP problems. Consequently, interest in semi-supervised learning has grown in the NLP community in recent years. Yet, although several papers have demonstrated promising results with semi-supervised learning for problems such as tagging and parsing, we suspect that good results might not be easy to achieve across the board. Many semi-supervised learning methods (e.g., transductive SVM, graph-based methods) were originally developed for binary classification problems. NLP problems often pose new challenges to these techniques, involving more complex structure that can violate many of their underlying assumptions.

We believe there is a need to take a step back and investigate why and how auxiliary unlabeled data can truly improve training for NLP tasks.

In particular, many open questions remain:

  1. Problem Structure: What are the different classes of NLP problem structures (e.g. sequences, trees, N-best lists) and what algorithms are best suited for each class? For instance, can graph-based algorithms be successfully applied to sequence-to-sequence problems like machine translation, or are self-training and feature-based methods the only reasonable choices for these problems?

  2. Background Knowledge: What kinds of NLP-specific background knowledge can we exploit to aid semi-supervised learning? Recent learning paradigms such as constraint-driven learning and prototype learning take advantage of our domain knowledge about particular NLP tasks; they represent a move away from purely data-agnostic methods and are good examples of how linguistic intuition can drive algorithm development.

  3. Scalability: NLP datasets are often large. What are the scalability challenges and solutions for applying existing semi-supervised learning algorithms to NLP data?

  4. Evaluation and Negative Results: What can we learn from negative results? Can we make an educated guess as to when semi-supervised learning might outperform supervised or unsupervised learning based on what we know about the NLP problem?

  5. To Use or Not To Use: Should semi-supervised learning be employed only in low-resource languages/tasks (i.e., little labeled data, much unlabeled data), or should we expect gains even in high-resource scenarios (e.g., improving on a supervised system that is already more than 95% accurate)?

This workshop aims to bring together researchers dedicated to making semi-supervised learning work for NLP problems. Our goal is to help build a community of researchers and foster deep discussions about insights, speculations, and results (both positive and negative) that may otherwise not appear in a technical paper at a major conference. We welcome submissions that address any of the above questions or other relevant issues, and especially encourage authors to provide a deep analysis of data and results.


Qin Iris Wang (AT&T)
Kevin Duh (University of Washington)
Dekang Lin (Google Research)