Shared Task



The field of syntactic parsing has seen a lot of progress over the last two decades. Current parsers achieve accuracies well above 90%, and as such promise to become an integral part of downstream applications that rely on high-accuracy syntactic analysis. However, these 90%+ accuracies are typically limited to heavily edited domains such as newswire. Unfortunately, applications that rely on parsing, such as machine translation, sentiment analysis and information extraction, are more often than not applied to unedited domains, especially those common on the web. Such domains include blogs, discussion forums, consumer reviews, etc. In order to reliably translate and extract information from the web, progress must be made in parsing such texts.

There are multiple reasons that parsing the web is difficult, all of which stem from a mismatch with the training data, which is typically the Wall Street Journal (WSJ) portion of the Penn Treebank (PTB). Punctuation and capitalization are often inconsistent, making it difficult to rely on features that can be predictive for newswire. There is often a lexical shift due to increased use of slang, technical jargon or other phenomena. There is an increase in ungrammatical sentences. Another important factor is that some syntactic constructions are more frequent in web text than in newswire: most notably questions, imperatives, long lists of names and sentence fragments.

Unfortunately, there are currently few high-quality test sets available for evaluating parsers on such noisy web texts, forcing researchers to keep evaluating on a now 20-year-old test set (WSJ Section 23).

However, with the recent construction of the Google Web Treebank and the release of OntoNotes 4.0, there is now a large enough set of manually annotated web text in order to evaluate parsing systems accurately. When coupled with a large corpus of unlabeled data, the possibilities for semi-supervised learning and domain adaptation become tangible.


Participants in the shared task will be provided with three sets of data:

(1) WSJ portion of OntoNotes 4.0 (approx 30,000 sentences, sections 02-21).
(2) Five sets of unlabeled sentences (5 x 100,000 sentences).
(3) Two domains from the new Google Web Treebank (2 x 2,000 parsed sentences).

The task is to build the best possible parser by using only data sets (1) and (2). Data set (3) is provided as a development set, while the official test set will consist of the remaining three domains of the Google Web Treebank. The goal is to build a single system that can robustly parse all domains, rather than to build several domain-specific systems. We require all participating systems to submit only results trained on data sets (1) and (2), i.e., we do not allow the addition of other labeled or unlabeled data. In particular, the development data set (3) should not be used for training the final system.

On the use of additional resources: It is permissible to use previously constructed lexicons, word clusters or other resources provided that they are made available for other participants. Please email the organizers if you wish to ask about a resource. Furthermore, we ask that, if possible, you submit two sets of results, the first with the extra resources and the second without them.

For the shared task we will be using the portion of the WSJ from OntoNotes 4.0 and not the full original treebank. This is because OntoNotes 4.0 and the Google Web Treebank share annotation standards, which are slightly different from the original PTB in terms of tokenization and noun-phrase analysis.

There will be two tracks, one for constituency parsers and one for dependency parsers (we will also convert the output of the constituency parsers to dependencies). The test data won’t be annotated with part-of-speech (POS) tags, and the participants will be expected to run their own POS tagger (either as part of the parser or as a standalone pre-processing component).

The format of the data will be bracketed phrases for the constituent trees, without function labels and empty nodes (as is standard in the parsing community). Stanford dependencies in CoNLL06/07 data format will be used for the dependency task.
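As an illustration of the dependency format, the following is a minimal sketch of a reader for CoNLL06/07-style files. It assumes the usual layout of ten tab-separated columns per token (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with a blank line between sentences. The names Token, read_conll and SAMPLE, and the example sentence with its labels, are made up for this sketch and are not part of the shared-task data or tools.

```python
from typing import Iterator, List, NamedTuple


class Token(NamedTuple):
    """The subset of CoNLL columns needed for dependency evaluation."""
    idx: int      # column 1: token index within the sentence (1-based)
    form: str     # column 2: word form
    postag: str   # column 5: fine-grained POS tag
    head: int     # column 7: index of the head token (0 = artificial root)
    deprel: str   # column 8: dependency label


def read_conll(text: str) -> Iterator[List[Token]]:
    """Yield one list of Tokens per sentence; blank lines separate sentences."""
    sentence: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        sentence.append(
            Token(int(cols[0]), cols[1], cols[4], int(cols[6]), cols[7])
        )
    if sentence:  # last sentence may not be followed by a blank line
        yield sentence


# Hypothetical example sentence, invented for illustration only.
SAMPLE = (
    "1\tParse\t_\tVB\tVB\t_\t0\troot\t_\t_\n"
    "2\tthe\t_\tDT\tDT\t_\t3\tdet\t_\t_\n"
    "3\tweb\t_\tNN\tNN\t_\t1\tdobj\t_\t_\n"
)

sentences = list(read_conll(SAMPLE))
```

Each token carries its head index and label, so a sentence can be passed directly to an attachment-score computation.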

Systems will be evaluated using standard tools: evalb (for constituent labeled precision and recall) and the CoNLL 2006 evaluation script (for unlabeled and labeled attachment score).
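To make the dependency metrics concrete, here is a minimal sketch of how unlabeled and labeled attachment scores can be computed. It is not the official evaluation script: it scores every token, whereas the official tools may treat punctuation differently, and the function name attachment_scores and the toy data are assumptions made for this illustration.

```python
from typing import List, Sequence, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def attachment_scores(
    gold: Sequence[Sequence[Arc]], pred: Sequence[Sequence[Arc]]
) -> Tuple[float, float]:
    """Return (UAS, LAS) over all tokens of all sentences.

    UAS: fraction of tokens whose predicted head matches the gold head.
    LAS: fraction of tokens whose predicted head and label both match.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and predicted data differ in sentence count")
    total = correct_heads = correct_labeled = 0
    for gold_sent, pred_sent in zip(gold, pred):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("gold and predicted sentences differ in length")
        for (g_head, g_label), (p_head, p_label) in zip(gold_sent, pred_sent):
            total += 1
            if g_head == p_head:
                correct_heads += 1
                if g_label == p_label:
                    correct_labeled += 1
    if total == 0:
        raise ValueError("no tokens to score")
    return correct_heads / total, correct_labeled / total


# Toy data, invented for illustration: one three-token sentence.
gold_arcs: List[List[Arc]] = [[(0, "root"), (3, "det"), (1, "dobj")]]
pred_arcs: List[List[Arc]] = [[(0, "root"), (3, "nn"), (2, "dobj")]]

uas, las = attachment_scores(gold_arcs, pred_arcs)
```

In the toy data, the second token has the right head but the wrong label, and the third token has the wrong head, so two of three heads and one of three labeled arcs are correct.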

Participating teams will be invited to submit a short system description (2-3 pages), which will not be published, but posted on the shared-task website. This leaves the option open to teams to publish their results later or perhaps put together a special edition of a journal. Participating teams will also be invited to the workshop to give either a presentation or a poster presentation on their system.

To participate, please sign this form and send a scanned copy by email. You can also fax it to +1-215-573-2175.


January 20th: Release of training and development data plus unlabeled data sets.
April 23rd: Release of blind test sets
April 30th: Results due
May 14th: Short system descriptions due
June 7/8th: Workshop

Shared Task Organizers

Slav Petrov (Google)
Ryan McDonald (Google)