Gold standard
The EmpiriST 2015 gold standard comprises more than 20,000 tokens of CMC and Web corpora data that have been manually tokenized and annotated with PoS tags. It is divided into training and test sets, which correspond exactly to the data used in the shared task. The gold standard is released under a Creative Commons CC BY-SA 3.0 licence.
Download the gold standard: empirist_gold_standard.zip
If you use these data, please cite our task description paper:
Michael Beißwenger, Sabine Bartsch, Stefan Evert and Kay-Michael Würzner (2016). EmpiriST 2015: A shared task on the automatic linguistic annotation of computer-mediated communication and web corpora. In Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task, pages 78–90. Berlin, Germany. [pdf]
Data sizes
Training data:
CMC subset: 5,109 tokens in 8 files
Web corpora subset: 4,944 tokens in 11 files
Test data:
CMC subset: 5,234 tokens in 6 files
Web corpora subset: 7,568 tokens in 12 files
File formats
Raw texts are provided as plain text files in UTF-8 encoding with Unix line endings (LF). Empty lines indicate text structure (posting or paragraph boundaries), and metadata are included as empty XML elements on separate lines.
Manually tokenized texts are provided in one-word-per-line format (sometimes referred to as “verticalized text”), also in UTF-8 encoding with Unix line endings. No transformations were applied to the character data in the raw text except for deleting whitespace and inserting line breaks. XML elements and empty lines are preserved in the tokenized files.
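Because tokenization only deletes whitespace and inserts line breaks, a raw text and its tokenized version should contain the same non-whitespace characters in the same order. The following Python sketch checks that property. It is not part of the distribution; the function names and the sample text (including the `<posting .../>` metadata line) are made up for illustration.

```python
def strip_whitespace(text: str) -> str:
    """Delete every whitespace character (spaces, tabs, line breaks)."""
    # str.split() without arguments splits on any run of Unicode whitespace,
    # which is an assumption about what counts as "whitespace" in the data.
    return "".join(text.split())


def same_character_data(raw_text: str, tokenized_text: str) -> bool:
    """True if both texts hold the same non-whitespace characters in order."""
    return strip_whitespace(raw_text) == strip_whitespace(tokenized_text)


# Made-up sample; the <posting .../> element is a hypothetical metadata line.
raw = '<posting id="p1"/>\nHallo Welt! Wie gehts?\n\nGut, danke.\n'
tokenized = (
    '<posting id="p1"/>\nHallo\nWelt\n!\nWie\ngehts\n?\n\nGut\n,\ndanke\n.\n'
)

print(same_character_data(raw, tokenized))  # True
```

A check of this kind can be useful for catching system output that has silently normalized or altered characters during tokenization.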
Manually tagged texts are identical to the tokenized files except that a PoS tag (from the extended STTS_IBK tag set) is appended to each token, separated by a TAB character. Again, XML elements and empty lines are preserved without a PoS tag.
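The tagged format described above can be read with a few lines of Python. This is a minimal sketch rather than an official reader: the function names are hypothetical, the regular expression for recognizing metadata lines is a heuristic assumption, and treating XML elements as unit boundaries (in addition to empty lines) is a design choice made here. The sample lines and their tags are made up for illustration.

```python
import re
from typing import Iterable, List, Tuple

Token = Tuple[str, str]  # (token, PoS tag)

# Heuristic assumption: a metadata line is a single XML element, e.g. <x a="1"/>.
XML_ELEMENT = re.compile(r"<[^<>]+>")


def read_tagged(lines: Iterable[str]) -> List[List[Token]]:
    """Group (token, tag) pairs into units (postings or paragraphs)."""
    units: List[List[Token]] = []
    current: List[Token] = []
    for raw_line in lines:
        line = raw_line.rstrip("\r\n")
        if line == "" or XML_ELEMENT.fullmatch(line):
            # Structural line: close the unit collected so far, if any.
            if current:
                units.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"expected TOKEN<TAB>TAG, got: {line!r}")
        current.append((fields[0], fields[1]))
    if current:
        units.append(current)
    return units


# Made-up sample in the tagged format.
sample = [
    '<posting id="p1"/>',
    "Hallo\tITJ",
    "Welt\tNN",
    "!\t$.",
    "",
    "Das\tPDS",
    "ist\tVAFIN",
    "gut\tADJD",
    ".\t$.",
]

for unit in read_tagged(sample):
    print(unit)
```

To read an actual file, an open file handle can be passed directly, for example `read_tagged(open(path, encoding="utf-8"))`, where `path` is whichever tagged file is of interest.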
Software tools
The gold standard distribution includes several Perl scripts that facilitate working with the gold standard and evaluating system output against it:
validate_tokenization.perl: used by task participants to check the format of system output files in the tokenization phase
compare_tokenization.perl: the official scorer for the tokenization subtask
validate_tagging.perl: used by task participants to check the format of system output files in the tagging phase
compare_tagging.perl: the official scorer for the PoS tagging subtask
normalize_text.perl: text cleanup and whitespace tokenization (used as a basis for the manual tokenization of the gold standard)
line_count.perl: counts the number of tokens in one-word-per-line files (automatically skips empty lines and XML elements)
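For readers without a Perl installation, the counting rule described for line_count.perl (skip empty lines and XML elements, count everything else) can be approximated in Python. This sketch is not a port of the script and may differ from it in edge cases; the function names are hypothetical and the pattern used to detect XML elements is an assumption.

```python
import re
from typing import Iterable

# Heuristic assumption: a metadata line is a single XML element, e.g. <x a="1"/>.
XML_ELEMENT = re.compile(r"<[^<>]+>")


def is_token_line(line: str) -> bool:
    """True for lines that hold a token, False for empty or XML lines."""
    stripped = line.rstrip("\r\n")
    if stripped == "":
        return False
    return XML_ELEMENT.fullmatch(stripped) is None


def count_tokens(lines: Iterable[str]) -> int:
    """Count token lines in one-word-per-line data."""
    return sum(1 for line in lines if is_token_line(line))


# Made-up sample: one metadata line, one empty line, five token lines.
sample = ['<posting id="p1"/>', "Hallo", "Welt", "!", "", "<3", "Gut\tADJD"]
print(count_tokens(sample))  # 5
```

Note that a token such as `<3` is still counted, because it does not form a complete XML element; tagged lines (token, TAB, tag) are counted as well.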