Shared Task on Domain Adaptation for Parsing the Web
At the First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL)
At HLT-NAACL 2012
Organizers: Slav Petrov and Ryan McDonald
The field of syntactic parsing has seen a lot of progress over the
last two decades. Current parsers achieve accuracies well above 90%,
and as such promise to become an integral part of downstream
applications that rely on high-accuracy syntactic analysis. However,
these 90%+ accuracies are typically limited to heavily edited domains
such as newswire. Unfortunately, applications that rely on parsing,
such as machine translation, sentiment analysis and information
extraction, are more often than not applied to unedited domains,
especially those common on the web. Such domains include blogs,
discussion forums and consumer reviews. In order to reliably
translate and extract information from the web, progress must be made
in parsing such texts.
There are multiple reasons why parsing the web is difficult, all of
which stem from a mismatch with the training data, which is typically
the Wall Street Journal (WSJ) portion of the Penn Treebank (PTB).
Punctuation and capitalization are often inconsistent, making it
difficult to rely on features that can be predictive for newswire.
There is often a lexical shift due to increased use of slang,
technical jargon or other phenomena. There is an increase in
ungrammatical sentences. Another important factor is that some
syntactic constructions are more frequent in web text than in
newswire: most notably questions, imperatives, long lists of names and
Unfortunately, there are currently few high-quality test sets
available for evaluating parsers on such noisy web texts, forcing
researchers to keep evaluating on a now 20-year-old test set (WSJ Section
However, with the recent construction of the Google Web Treebank and
the release of OntoNotes 4.0, there is now enough manually annotated
web text to evaluate parsing systems accurately. When coupled with a
large corpus of unlabeled data, the
possibilities for semi-supervised learning and domain adaptation
Participants in the shared task will be provided with three sets of data:
(1) WSJ portion of OntoNotes 4.0 (approx. 30,000 sentences, sections 02-21).
(2) Five sets of unlabeled sentences (5 x 100,000 sentences).
(3) Two domains from the new Google Web Treebank (2 x 2,000 parsed sentences).
The task is to build the best possible parser by using only data sets
(1) and (2). Data set (3) is provided as a development set, while the
official test set will consist of the remaining three domains of the
Google Web Treebank. The goal is to build a single system that can
robustly parse all domains, rather than to build several
domain-specific systems. All submitted results must come from systems
trained only on data sets (1) and (2); no other labeled or unlabeled
data may be added. In particular, the development data set (3) must
not be used for training the final system.
For the shared task we will be using the portion of the WSJ from
OntoNotes 4.0 and not the full original treebank. This is because
OntoNotes 4.0 and the Google Web Treebank share annotation standards,
which are slightly different from the original PTB in terms of
tokenization and noun-phrase analysis.
There will be two tracks, one for constituency parsers and one for
dependency parsers (we will also convert the output of the
constituency parsers to dependencies). The test data won’t be
annotated with part-of-speech (POS) tags, and the participants will be
expected to run their own POS tagger (either as part of the parser or
as a standalone pre-processing component).
The format of the data will be bracketed phrases for the constituent
trees, without function labels and empty nodes (as is standard in the
parsing community). Stanford dependencies in CoNLL06/07 data format
will be used for the dependency task.
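As a rough illustration of the dependency format, the sketch below
reads CoNLL-style text in Python. It assumes the usual ten
tab-separated columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD,
DEPREL, PHEAD, PDEPREL) with a blank line between sentences. The
function name read_conll and the sample sentences are hypothetical and
are not part of the task distribution.

```python
from typing import Dict, List

Token = Dict[str, object]


def read_conll(text: str) -> List[List[Token]]:
    """Parse CoNLL-style text into a list of sentences (lists of tokens)."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        if len(cols) < 8:
            raise ValueError(f"expected at least 8 columns: {line!r}")
        current.append({
            "id": int(cols[0]),     # 1-based token index
            "form": cols[1],        # word form
            "postag": cols[4],      # fine-grained POS tag
            "head": int(cols[6]),   # index of the head token, 0 = root
            "deprel": cols[7],      # dependency label
        })
    if current:
        sentences.append(current)
    return sentences


# Made-up sample, for illustration only.
SAMPLE = (
    "1\tgreat\t_\tJJ\tJJ\t_\t2\tamod\t_\t_\n"
    "2\tfood\t_\tNN\tNN\t_\t0\troot\t_\t_\n"
    "\n"
    "1\tthanks\t_\tNNS\tNNS\t_\t0\troot\t_\t_\n"
)

sentences = read_conll(SAMPLE)
```

Only the columns needed for scoring are kept here; a real reader would
likely preserve all ten so that files can be written back unchanged.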
Systems will be evaluated using standard tools: evalb (for constituent
labeled precision and recall) and the CoNLL 2006 eval.pl (for
unlabeled and labeled attachment score).
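To make the metrics concrete, here is a minimal Python sketch of
unlabeled and labeled attachment score and of labeled bracket
precision and recall. It is not a substitute for evalb or eval.pl: the
official tools apply additional conventions (for example around
punctuation) that this sketch does not reproduce. The function names
and the toy inputs are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Tuple

Arc = Tuple[int, str]            # (head index, dependency label) of one token
Bracket = Tuple[str, int, int]   # (label, start, end) of one constituent


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) over two aligned, non-empty token lists."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and predicted tokens must align and be non-empty")
    # UAS counts tokens with the correct head; LAS also requires the label.
    unlabeled = sum(g[0] == p[0] for g, p in zip(gold, pred))
    labeled = sum(g == p for g, p in zip(gold, pred))
    return unlabeled / len(gold), labeled / len(gold)


def bracket_precision_recall(
    gold: List[Bracket], pred: List[Bracket]
) -> Tuple[float, float]:
    """Return labeled (precision, recall) over constituent brackets."""
    # Multiset intersection so that duplicate brackets are matched once each.
    matched = sum((Counter(gold) & Counter(pred)).values())
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(gold) if gold else 0.0
    return precision, recall


# Toy dependency example: one label error and one head error in four tokens.
gold_arcs = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "dobj")]
pred_arcs = [(2, "det"), (3, "dobj"), (0, "root"), (2, "dobj")]
uas, las = attachment_scores(gold_arcs, pred_arcs)

# Toy constituency example: two of the predicted brackets match the gold ones.
gold_brackets = [("S", 0, 3), ("NP", 0, 2), ("VP", 2, 3)]
pred_brackets = [("S", 0, 3), ("NP", 0, 1), ("NP", 1, 2), ("VP", 2, 3)]
precision, recall = bracket_precision_recall(gold_brackets, pred_brackets)
```

In the toy dependency example three of four heads are correct and two
of four head-label pairs are correct; in the constituency example two
brackets match out of four predicted and three gold.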
Participating teams will be invited to submit a short system
description (2-3 pages), which will not be published, but posted on
the shared-task website. This leaves the option open to teams to
publish their results later or perhaps put together a special edition
of a journal. Participating teams will also be invited to the workshop
to either give a presentation or a poster presentation on their
To participate, please send an email to firstname.lastname@example.org and we
will follow up with more detailed instructions.
Jan. 20th: Release of training and development data plus unlabeled data sets.
April 23rd: Release of blind test sets
April 30th: Results due
May 14th: Short system descriptions due
June 8th: Workshop