Instructions & schedule
Team registration
Teams wishing to participate in the task are requested to register by 14.02.2016. Please send an e-mail with the following information to empirist@collocations.de:
Team name (will be used to identify submissions)
Name(s) of team member(s)
Affiliation(s)
Subtasks you plan to participate in (CMC Tok, CMC POS, Web Tok, Web POS)
Contact person and e-mail address
Task participants should also join our Google group at https://groups.google.com/d/forum/empirist2015.
General instructions
Participating systems are allowed to use external resources (lexicons, word lists, existing taggers, …) and training data (e.g. the TIGER treebank). Such resources must be declared with the result submission. Systems that use closed resources (such as newly annotated data or a commercial tokenizer/tagger) may be ranked separately.
Each team can submit up to 3 runs of their system (with different parameter settings) for each subtask. The best of these runs will be used in the official ranking (but the other runs will also be shown).
If a group develops multiple systems with substantially different approaches, these should be registered as different teams. For each system, up to 3 runs can be submitted.
The evaluation will be carried out in two phases: tokenization (15.02.–19.02.2016) and POS tagging (22.02.–26.02.2016). The evaluation data sets will be published on this Web site at the start of the evaluation period. System results have to be submitted before the end of the respective evaluation period.
All evaluation data sets will be provided as plain text files in UTF-8 encoding with Unix line breaks (LF, ASCII code 10) and no BOM. System results must be submitted in the same format. Participants are strongly encouraged to check their submissions with validation scripts that will be included in the release of the evaluation data.
Phase 1: Tokenization
The input data are raw text files from CMC logs or extracted from Web pages. Empty lines mark text structure boundaries (e.g. chat postings in the CMC subset, or headings and paragraphs in the Web corpora subset). Additional metadata may be provided in the form of empty XML elements on separate lines.
<posting id="chat51_03" author="schtepf"/>
Gehe heute ins Kino... :-)
<posting id="chat51_04" author="bucky"/>
*freu*
Fig. 1: Example of an input file raw/ibk001.txt for the tokenization phase.
Participating systems must tokenize these files according to the EmpiriST tokenization guidelines, so that each token appears on a separate line. Systems may only insert and delete whitespace in the text files; all other characters must remain unaltered, otherwise the submission cannot be evaluated. Except for metadata tags, the only whitespace allowed in the output files is Unix line breaks (LF). A validation script provided with the evaluation data can be used to verify the file format.
<posting id="chat51_03" author="schtepf"/>
Gehe
heute
ins
Kino
...
:-)
<posting id="chat51_04" author="bucky"/>
*
freu
*
Fig. 2: Example of an output file tokenized/ibk001.txt from the tokenization phase.
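The constraint that systems may only insert and delete whitespace can be checked by comparing the non-whitespace characters of input and output, e.g. (a minimal sketch, assuming metadata tags are preserved in the output; if a system deletes them, they would have to be stripped from the raw input first):

```python
def non_whitespace(text):
    # str.split() with no argument splits on any whitespace run
    # and discards it, so this keeps exactly the non-whitespace characters
    return "".join(text.split())

def check_tokenization(raw_text, tokenized_text):
    """True if the tokenized output differs from the raw input
    only in whitespace (insertions and deletions)."""
    return non_whitespace(raw_text) == non_whitespace(tokenized_text)
```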
Participating systems may preserve metadata tags and empty lines in their original form or delete them from the output altogether, as they will not be considered in the evaluation.
System results must be submitted as a ZIP archive of plain text files, whose filenames must be identical to those of the corresponding input files (cf. Fig. 2).
We will use precision, recall and F1-score for token boundaries as evaluation metrics (Jurish & Würzner 2013). Systems will be ranked based on their F1-score.
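The boundary-based metric can be illustrated as follows. Each token boundary is a position in the whitespace-free character stream where a token ends; precision and recall are then computed over the sets of gold and system boundaries. This is a simplified sketch of the idea behind Jurish & Würzner (2013), not the official evaluation script:

```python
def boundary_positions(tokens):
    """Character offsets (in the whitespace-free stream) at which
    each token ends."""
    positions, offset = set(), 0
    for tok in tokens:
        offset += len(tok)
        positions.add(offset)
    return positions

def boundary_f1(gold_tokens, system_tokens):
    """Precision, recall and F1-score for token boundaries."""
    gold = boundary_positions(gold_tokens)
    system = boundary_positions(system_tokens)
    tp = len(gold & system)
    precision = tp / len(system) if system else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For instance, a system that fails to split "Kino..." loses one gold boundary (lower recall) without introducing a spurious one (precision stays at 1).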
Phase 2: POS Tagging
Input data are tokenized text files in the same format as the output files of the tokenization phase (Fig. 2), with empty lines as text boundary markers and empty XML tags on separate lines providing additional metadata.
Participating systems must annotate each token line with a suitable POS tag from the extended STTS-EmpiriST tag set, according to the EmpiriST tagging guidelines. Tags are separated from tokens by a TAB character (\t, ASCII code 9). Metadata tags and blank lines should be ignored, but can be used to provide hints for the tagger. They may either be preserved without a POS tag or removed completely, and will not be considered in the evaluation.
<posting id="chat51_03" author="schtepf"/>
Gehe\tVVFIN
heute\tADV
ins\tAPPRART
Kino\tNN
...\t$.
:-)\tEMOASC
<posting id="chat51_04" author="bucky"/>
*\t$(
freu\tAKW
*\t$(
Fig. 3: Example of an output file tagged/ibk001.txt from the tagging phase.
System results must be submitted as a ZIP archive of plain text files, whose filenames must be identical to those of the corresponding input files (cf. Fig. 3).
We will use tagging accuracy as the main evaluation criterion for the official ranking.
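Since the tokenization is given, tagging accuracy reduces to comparing tags line by line. A minimal sketch (assuming metadata tags and blank lines have already been filtered out of both files, and that the token columns agree):

```python
def tagging_accuracy(gold_lines, system_lines):
    """Fraction of tokens with the correct POS tag, over
    TAB-separated 'token\\ttag' lines."""
    assert len(gold_lines) == len(system_lines), "line count mismatch"
    correct = 0
    for g, s in zip(gold_lines, system_lines):
        g_tok, g_tag = g.split("\t")
        s_tok, s_tag = s.split("\t")
        assert g_tok == s_tok, "token mismatch: %r vs %r" % (g_tok, s_tok)
        correct += (g_tag == s_tag)
    return correct / len(gold_lines)
```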
Mapping to STTS 1.0
In order to facilitate a comparison with existing taggers, we also compute accuracy based on the standard STTS 1.0 tag set, using the mapping defined by the table below:
References
Jurish, Bryan and Würzner, Kay-Michael (2013). Word and sentence tokenization with Hidden Markov Models. Journal for Language Technology and Computational Linguistics (JLCL), 28(2), 61–83.