Gold standard

The EmpiriST 2015 gold standard comprises more than 20,000 tokens of CMC and Web corpora data that have been manually tokenized and annotated with PoS tags. It is divided into training and test sets, which correspond exactly to the data used in the shared task. The gold standard is released under a Creative Commons CC BY-SA 3.0 licence.

If you use these data, please cite our task description paper:

Michael Beißwenger, Sabine Bartsch, Stefan Evert and Kay-Michael Würzner (2016). EmpiriST 2015: A shared task on the automatic linguistic annotation of computer-mediated communication and web corpora. In Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task, pages 78–90. Berlin, Germany.

Data sizes

Training data:

Test data:

File formats

Software tools

The gold standard distribution includes several Perl scripts that make it easier to use the gold standard and to evaluate against it: