SANCL 2012 Shared Task

Data


***To participate, please sign this form and send a scanned copy to ldc@ldc.upenn.edu and parsingtheweb@gmail.com. You can also fax it to +1-215-573-2175.***

The Google Web Treebank, which forms the basis of this shared task, was produced by Google in collaboration with the Linguistic Data Consortium (LDC) and contains five types of text typically found on the web: blogs, email, forums, reviews, and Q&A.

For each of the five domains (blogs, email, forums, reviews, Q&A) in the Google Web Treebank, a large pool of over 100,000 sentences was collected from the web. From each domain, 2,000 sentences were then selected and manually annotated with parse trees. The annotation guidelines are the same as those used for the Penn Treebank (PTB), making it possible to train parsers on the WSJ portion of the PTB and evaluate their performance on this new treebank. The constituency trees will additionally be converted to Stanford-style dependencies with the Stanford constituency-to-dependency converter. Furthermore, the availability of unlabeled data from the same domains makes it possible to explore semi-supervised learning approaches to domain adaptation.

The constituency trees will be in standard PTB bracketing format, and the dependencies will be in the CoNLL 2006/2007 format. Unlabeled sentences will be provided as text files with one pre-tokenized sentence per line. The LDC is supportive of this proposal and can help with data distribution and licensing.
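For participants who want a quick look at the three file types, the following is a minimal Python sketch of readers for them. It is not an official tool of the shared task. The names Token, read_conll, parse_ptb, leaves, and read_unlabeled are hypothetical, and the column positions assume the usual ten-column, tab-separated CoNLL 2006/2007 layout with a blank line between sentences. The released files may differ in detail, so check them before relying on this.

```python
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class Token(NamedTuple):
    """One row of a CoNLL 2006/2007 file (only the columns used here)."""
    index: int   # column 1: token position, starting at 1
    form: str    # column 2: word form
    postag: str  # column 5: fine-grained part-of-speech tag
    head: int    # column 7: index of the head token, 0 for the root
    deprel: str  # column 8: dependency label


def read_conll(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences; assumes tab-separated columns and blank-line separators."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        sentence.append(Token(int(cols[0]), cols[1], cols[4], int(cols[6]), cols[7]))
    if sentence:  # file may not end with a blank line
        yield sentence


def parse_ptb(text: str) -> Tuple[str, list]:
    """Parse one PTB-bracketed tree into nested (label, children) tuples.

    Literal parentheses in PTB text are escaped (-LRB-, -RRB-), so every
    "(" or ")" character can be treated as a bracket.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node() -> Tuple[str, list]:
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = ""  # the outermost bracket often carries no label
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children: list = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word under a POS tag
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return node()


def leaves(tree: Tuple[str, list]) -> List[str]:
    """Return the words of a parsed tree in left-to-right order."""
    words: List[str] = []
    for child in tree[1]:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


def read_unlabeled(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield token lists from a file with one pre-tokenized sentence per line."""
    for line in lines:
        tokens = line.split()
        if tokens:
            yield tokens
```

Each reader takes any iterable of lines, so an open file object can be passed directly, for example read_conll(open(path, encoding="utf-8")), where path is whatever location the data was unpacked to.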

The blogs, email, forums, and Q&A corpora were collected by the LDC following their usual procedures. The review corpus was created from user-contributed reviews for which Google has distribution rights. Only the title and body text of each review are provided. In addition, any review containing potentially identifying data such as phone numbers, credit card numbers, SSNs, or IP addresses was removed using the de-identification procedures that were developed for the Google N-gram corpus distributed by the LDC.