Instructions & schedule

Team registration

Teams wishing to participate in the task are requested to register by 14.02.2016. Please send an e-mail with the following information to empirist@collocations.de:

Task participants should also join our Google group at https://groups.google.com/d/forum/empirist2015.

General instructions

Phase 1: Tokenization

The input data are raw text files from CMC logs or extracted from Web pages. Empty lines mark text structure boundaries (e.g. chat postings in the CMC subset, or headings and paragraphs in the Web corpora subset). Additional metadata may be provided in the form of empty XML elements on separate lines.

<posting id="chat51_03" author="schtepf"/>
Gehe heute ins Kino... :-)
<posting id="chat51_04" author="bucky"/>
*freu*

Fig. 1: Example of an input file raw/ibk001.txt for the tokenization phase.
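The structure above can be read with a short script. The following is a minimal sketch (our own helper, not part of the task's tooling), assuming metadata tags always stand alone on their own lines as in Fig. 1:

```python
def parse_raw(lines):
    """Group the lines of a raw input file into units, where each empty
    XML metadata element (e.g. a <posting .../> tag) starts a new unit."""
    units = []
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # empty lines only mark structure boundaries
        if line.startswith("<") and line.endswith("/>"):
            current = {"meta": line, "text": []}
            units.append(current)
        else:
            if current is None:  # text before any metadata tag
                current = {"meta": None, "text": []}
                units.append(current)
            current["text"].append(line)
    return units
```

Applied to the file in Fig. 1, this yields two units, one per posting, each carrying its metadata tag and its text lines.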

Participating systems must tokenize these files according to the EmpiriST tokenization guidelines, so that each token appears on a separate line. Systems may only insert and delete whitespace in the text files; all other characters must remain unaltered, otherwise the submission cannot be evaluated. Except for metadata tags, the only whitespace allowed in output files is Unix line breaks (LF). A validation script provided with the evaluation data can be used to verify the file format.
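The whitespace-only constraint can be checked mechanically: after dropping metadata lines, the raw input and the tokenized output must contain exactly the same non-whitespace characters in the same order. A rough sketch of such a check (the official validation script may differ in detail):

```python
import re

def whitespace_only_change(raw_text, tokenized_text):
    """Return True if the two texts differ only in whitespace,
    ignoring metadata lines (empty XML elements on their own lines)."""
    def significant_chars(text):
        lines = [l for l in text.split("\n")
                 if not (l.startswith("<") and l.rstrip().endswith("/>"))]
        return re.sub(r"\s+", "", "\n".join(lines))
    return significant_chars(raw_text) == significant_chars(tokenized_text)
```

Any edit other than splitting or joining on whitespace, such as changing a single character of a token, makes the check fail.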

<posting id="chat51_03" author="schtepf"/>
Gehe
heute
ins
Kino
...
:-)
<posting id="chat51_04" author="bucky"/>
*
freu
*

Fig. 2: Example of an output file tokenized/ibk001.txt from the tokenization phase.

Participating systems may either preserve metadata tags and empty lines in their original form or delete them from the output altogether; they will not be considered in the evaluation.

System results must be submitted as a ZIP archive of plain text files, whose filenames must be identical to those of the corresponding input files (cf. Fig. 2).

We will use precision, recall and F1-score for token boundaries as evaluation metrics (Jurish & Würzner 2013). Systems will be ranked based on their F1-score.
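Boundary-based precision and recall can be computed by comparing the character offsets at which the gold and system tokenizations place token boundaries. A simplified sketch (the official scorer follows Jurish & Würzner 2013 and may differ in detail):

```python
def boundaries(tokens):
    """Character offsets of the boundaries between consecutive tokens."""
    offsets, pos = set(), 0
    for token in tokens[:-1]:  # the end of the last token is not a boundary
        pos += len(token)
        offsets.add(pos)
    return offsets

def boundary_prf(gold_tokens, sys_tokens):
    """Precision, recall and F1-score over token boundary positions."""
    gold, sys = boundaries(gold_tokens), boundaries(sys_tokens)
    tp = len(gold & sys)
    precision = tp / len(sys) if sys else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For instance, a system that fails to split "Kino..." in the Fig. 2 example misses one of the five gold boundaries (recall 0.8) while all boundaries it does place are correct (precision 1.0).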

Phase 2: POS Tagging

Input data are tokenized text files in the same format as the output files of the tokenization phase (Fig. 2), with empty lines as text boundary markers and empty XML tags on separate lines containing additional metadata.

Participating systems must annotate each token line with a suitable POS tag from the extended STTS-EmpiriST tag set, according to the EmpiriST tagging guidelines. Tags are separated from tokens by a TAB character (\t, ASCII code 9). Metadata tags and blank lines should be ignored, but can be used to provide hints for the tagger. They may either be preserved without a POS tag or removed completely, and will not be considered in the evaluation.

<posting id="chat51_03" author="schtepf"/>
Gehe\tVVFIN
heute\tADV
ins\tAPPRART
Kino\tNN
...\t$.
:-)\tEMOASC
<posting id="chat51_04" author="bucky"/>
*\t$(
freu\tAKW
*\t$(

Fig. 3: Example of an output file tagged/ibk001.txt from the tagging phase.
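Files in this format are straightforward to read back for evaluation. A minimal reader (our own sketch, not the official tooling) that skips blank lines and metadata tags, as the evaluation does:

```python
def read_tagged(lines):
    """Parse token<TAB>tag lines into (token, tag) pairs, skipping blank
    lines and metadata tags (which are ignored in the evaluation)."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # blank line: text boundary marker
        if line.startswith("<") and line.endswith("/>"):
            continue  # empty XML metadata element
        token, _, tag = line.partition("\t")
        pairs.append((token, tag))
    return pairs
```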

System results must be submitted as a ZIP archive of plain text files, whose filenames must be identical to those of the corresponding input files (cf. Fig. 3).

We will use tagging accuracy as the main evaluation criterion for the official ranking.
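Since the tagging phase starts from gold tokenization, the gold and system files contain the same tokens in the same order, and accuracy reduces to the fraction of tokens whose tags match. A sketch under that assumption:

```python
def tagging_accuracy(gold_pairs, sys_pairs):
    """Fraction of correctly tagged tokens; assumes both inputs contain
    the same tokens in the same order (gold tokenization)."""
    if len(gold_pairs) != len(sys_pairs):
        raise ValueError("gold and system output must align token by token")
    if not gold_pairs:
        return 0.0
    correct = sum(g_tag == s_tag
                  for (_, g_tag), (_, s_tag) in zip(gold_pairs, sys_pairs))
    return correct / len(gold_pairs)
```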

Mapping to STTS 1.0

In order to facilitate a comparison with existing taggers, we also compute accuracy based on the standard STTS 1.0 tag set, using the mapping defined by the table below:

References

Jurish, Bryan and Würzner, Kay-Michael (2013). Word and sentence tokenization with hidden Markov models. Journal for Language Technology and Computational Linguistics (JLCL), 28(2), 61-83.