Participants in the shared task will be provided with three sets of data:
1. WSJ section of the Penn Treebank.
2. Five sets of unlabeled sentences (5 x 100,000 sentences).
3. Two domains from the new Google Web Treebank (2 x 2,000 parsed sentences).
The task will be to build the best possible parser using only data sets 1 and 2. Data set 3 is provided as a development set, while the official test set will consist of the remaining three domains of the Google Web Treebank. There will be two tracks, one for constituency parsers and one for dependency parsers (we will also convert the output of the constituency parsers to dependencies). The test data will not be annotated with part-of-speech (POS) tags, so participants will be expected to run their own POS tagger (either as part of the parser or as a standalone pre-processing component).
Systems will be evaluated using standard tools: evalb (for constituent labeled precision and recall) and the CoNLL 2006 eval.pl (for unlabeled and labeled attachment score).
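The attachment scores reported by eval.pl can be illustrated with a small sketch. This is not the official script; the token representation and labels below are hypothetical, chosen only to show how unlabeled (UAS) and labeled (LAS) attachment scores are computed:

```python
def attachment_scores(gold, pred):
    """Compute UAS and LAS for one sentence.

    gold, pred: per-token lists of (head_index, dependency_label),
    where head index 0 denotes the artificial root.
    """
    assert len(gold) == len(pred)
    n = len(gold)
    # UAS: fraction of tokens whose predicted head matches the gold head.
    uas = sum(1 for (gh, _), (ph, _) in zip(gold, pred) if gh == ph) / n
    # LAS: head AND dependency label must both match.
    las = sum(1 for g, p in zip(gold, pred) if g == p) / n
    return uas, las

# Toy 4-token sentence (labels are illustrative, not a fixed inventory).
gold = [(2, "nsubj"), (0, "root"), (2, "dobj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "iobj"), (3, "punct")]
uas, las = attachment_scores(gold, pred)  # uas = 0.75, las = 0.5
```

The official eval.pl additionally handles punctuation filtering and corpus-level aggregation; the sketch above only captures the core definition of the two metrics.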
Currently, our plan is to ask participants to submit a parser binary that can be run server-side on the test set using mlcomp.org. The primary advantage of this is that it allows us to measure parser speed, which is of critical importance when parsing the web. However, we are aware that this might pose problems: some participants may come from organizations that do not allow software releases, while others may rely on proprietary third-party components (e.g., CPLEX) that would prevent them from submitting a binary. We will work these issues out with participants. One solution would be to have two tracks, one where binaries are submitted and another where only system output is submitted. The fallback plan is to collect system outputs only, as is commonly done for CoNLL shared tasks.