ACL 2010 Workshop on Domain Adaptation for Natural Language Processing (DANLP)
July 15, 2010
Most modern Natural Language Processing (NLP) systems are subject to the well-known problem of lack of portability to new domains or genres of language: there is a substantial drop in their performance when they are tested on data from a new domain, i.e., when their test data is drawn from a related but different distribution than their training data. This problem is inherent in the assumption of independent and identically distributed (i.i.d.) samples underlying most machine learning systems, but it has only recently begun to receive attention. The need for domain adaptation arises in almost all NLP tasks: part-of-speech tagging, semantic role labeling, statistical parsing and statistical machine translation, to name but a few.
The goal of this workshop is to provide a meeting point for research that approaches the problem of adaptation from the varied perspectives of machine learning and of NLP tasks such as parsing, machine translation, word sense disambiguation, etc. We believe there is much to gain by treating domain adaptation as a general learning strategy that utilizes prior knowledge of a specific or a general domain in learning about a new domain; here the notion of a "domain" could be as varied as child language versus adult language, or the source-side word order versus the target-side word order in a statistical machine translation system.
Sharing insights, methodologies and successes across tasks and perspectives will thus contribute to a better understanding of this problem. For instance, self-training the Charniak parser alone was not effective for adaptation (it had been common wisdom that self-training is generally not effective), but self-training with a re-ranker was surprisingly effective (McClosky et al., 2006). Is this an insight into adaptation that can be used elsewhere? We believe that the key to future success will be to exploit large collections of unlabeled data in addition to labeled data, not only because unlabeled data is easier to obtain, but also because existing labeled resources are often not even close to the envisioned target application domain. Directly related is the question of how to measure closeness (or difference) between domains.
We therefore especially encouraged submissions on semi-supervised approaches to domain adaptation with a deep analysis of models, data and results, although we did not exclude papers on supervised adaptation.
John Blitzer, University of California at Berkeley, USA: Unsupervised Domain Adaptation: From Practice to Theory.
Hal Daumé III, University of Utah, USA
Tejaswini Deoskar, University of Amsterdam, The Netherlands
David McClosky, Stanford University, USA
Barbara Plank, University of Groningen, The Netherlands
Jörg Tiedemann, Uppsala University, Sweden
This workshop is kindly supported by the Stevin project PaCo-MT (Parse and Corpus-based Machine Translation).