Instructions & schedule

01.10.2015  Release of trial data and annotation guidelines
15.12.2015  Release of the training data
14.02.2016  Deadline for team registration
15.02.2016  Release of the evaluation data for the tokenization subtask
19.02.2016  Submission of automatically tokenized evaluation data by the participants
22.02.2016  Release of the evaluation data for the POS tagging subtask
26.02.2016  Submission of automatically POS tagged evaluation data by the participants
22.03.2016  Evaluation results and gold standard data released to participants
08.05.2016  Submission of system description papers (8 pages + references)
12.08.2016  Presentation of systems and task results at WAC-X workshop (ACL 2016, Berlin)


Team registration

Teams wishing to participate in the task are requested to register by 14.02.2016. Please send an e-mail with the following information to empirist@collocations.de:
  • Team name (will be used to identify submissions)
  • Name(s) of team member(s)
  • Affiliation(s)
  • Subtasks you plan to participate in (CMC Tok, CMC POS, Web Tok, Web POS)
  • Contact person and e-mail address
Task participants should also join our Google group at https://groups.google.com/d/forum/empirist2015.


General instructions

  • Participating systems are allowed to use external resources (lexicons, word lists, existing taggers, …) and training data (e.g. the TIGER treebank). Such resources must be declared with the result submission. Systems that use closed resources (such as newly annotated data or a commercial tokenizer/tagger) may be ranked separately.
  • Each team can submit up to 3 runs of their system (with different parameter settings) for each subtask. The best of these runs will be used in the official ranking (but the other runs will also be shown).
  • If a group develops multiple systems with substantially different approaches, these should be registered as different teams.  For each system, up to 3 runs can be submitted.
  • The evaluation will be carried out in two phases: tokenization (15.02.–19.02.2016) and POS tagging (22.02.–26.02.2016). The evaluation data sets will be published on this Web site at the start of the evaluation period. System results have to be submitted before the end of the respective evaluation period.
  • All evaluation data sets will be provided as plain text files in UTF-8 encoding with Unix line breaks (LF, ASCII code 10) and no BOM. System results must be submitted in the same format. Participants are strongly encouraged to check their submissions with validation scripts that will be included in the release of the evaluation data.
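As an illustration of these formatting requirements, the following Python sketch performs the basic checks (valid UTF-8, no BOM, no CR characters). It is not the official validation script, which will be released together with the evaluation data.

import sys

def check_format(path):
    """Return a problem description, or None if the file looks OK."""
    data = open(path, "rb").read()
    try:
        data.decode("utf-8")                    # must be valid UTF-8
    except UnicodeDecodeError as err:
        return "not valid UTF-8: %s" % err
    if data.startswith(b"\xef\xbb\xbf"):
        return "file starts with a UTF-8 BOM"
    if b"\r" in data:
        return "contains CR characters (non-Unix line breaks)"
    return None

if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(path, ":", check_format(path) or "OK")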

Phase 1: Tokenization

The input data are raw text files from CMC logs or extracted from Web pages. Empty lines mark text structure boundaries (e.g. chat postings in the CMC subset, or headings and paragraphs in the Web corpora subset). Additional metadata may be provided in the form of empty XML elements on separate lines.

<posting id="chat51_03" author="schtepf"/>
Gehe heute ins Kino... :-)

<posting id="chat51_04" author="bucky"/>
*freu*

Fig. 1: Example of an input file raw/ibk001.txt for the tokenization phase.
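A system reading such a file might split it into text units as follows. This is only a sketch, assuming that metadata always appears as an empty XML element alone on its line (recognizable here by a leading "<"):

def read_units(path):
    """Split a raw input file into text units, treating empty lines as
    structure boundaries and skipping metadata lines."""
    units, current = [], []
    for line in open(path, encoding="utf-8"):
        line = line.rstrip("\n")
        if not line:                 # empty line = structure boundary
            if current:
                units.append(current)
                current = []
        elif line.startswith("<"):   # empty XML element with metadata
            continue
        else:
            current.append(line)
    if current:
        units.append(current)
    return units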

Participating systems must tokenize these files according to the EmpiriST tokenization guidelines, placing each token on a separate line. Systems are only allowed to insert and delete whitespace in the text files; all other characters must remain unaltered, otherwise the submission cannot be evaluated. Except for metadata tags, the only whitespace allowed in the output files is the Unix line break (LF). A validation script provided with the evaluation data can be used to verify the file format.

<posting id="chat51_03" author="schtepf"/>
Gehe
heute
ins
Kino
...
:-)

<posting id="chat51_04" author="bucky"/>
*
freu
*

Fig. 2: Example of an output file tokenized/ibk001.txt from the tokenization phase.

Participating systems may preserve metadata tags and empty lines in their original form or delete them from the output altogether, as they will not be considered in the evaluation.
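The constraint that only whitespace may be changed can be checked by stripping all whitespace from input and output and comparing the results. A minimal sketch of such a check (metadata lines are ignored, since systems may keep or delete them):

import re

def non_whitespace(path):
    """All non-whitespace characters of a file, metadata lines excluded."""
    chars = []
    for line in open(path, encoding="utf-8"):
        if not line.startswith("<"):
            chars.append(re.sub(r"\s+", "", line))
    return "".join(chars)

def whitespace_only_changes(raw_path, tok_path):
    return non_whitespace(raw_path) == non_whitespace(tok_path)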

System results must be submitted as a ZIP archive of plain text files, whose filenames must be identical to those of the corresponding input files (cf. Fig. 2).
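Such an archive can be created, for example, with Python's zipfile module (the archive and directory names below are placeholders):

import glob, os, zipfile

with zipfile.ZipFile("myteam_run1.zip", "w") as zf:
    for path in glob.glob("tokenized/*.txt"):
        # store each file under its bare name, matching the input files
        zf.write(path, arcname=os.path.basename(path))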

We will use precision, recall and F1-score for token boundaries as evaluation metrics (Jurish & Würzner 2013). Systems will be ranked based on their F1-score.
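For illustration, boundary-based precision, recall and F1-score can be computed by mapping each tokenization to the set of token end positions, counted over non-whitespace characters, and comparing the gold and system sets. The sketch below is not the official scorer:

def boundaries(tokens):
    """Set of token end positions, counted in non-whitespace characters."""
    pos, bounds = 0, set()
    for tok in tokens:
        pos += len(tok)
        bounds.add(pos)
    return bounds

def prf(gold_tokens, sys_tokens):
    gold, sys = boundaries(gold_tokens), boundaries(sys_tokens)
    tp = len(gold & sys)                       # boundaries found by both
    p = tp / len(sys) if sys else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# e.g. prf(["Gehe", "heute", "ins", "Kino", "...", ":-)"],
#          ["Gehe", "heute", "ins", "Kino...", ":-)"])
# yields precision 1.0 and recall 5/6, since one boundary was missed.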


Phase 2: POS Tagging

Input data are tokenized text files in the same format as the output files of the tokenization phase (Fig. 2), with empty lines as text boundary markers and empty XML tags containing additional metadata on separate lines.

Participating systems must annotate each token line with a suitable POS tag from the extended STTS-EmpiriST tag set, according to the EmpiriST tagging guidelines. Tags are separated from tokens by a TAB character (\t, ASCII code 9). Metadata tags and blank lines need not be tagged, but may be used to provide hints for the tagger. They may either be preserved without a POS tag or removed completely, and will not be considered in the evaluation.

<posting id="chat51_03" author="schtepf"/>
Gehe	VVFIN
heute	ADV
ins	APPRART
Kino	NN
...	$.
:-)	EMOASC

<posting id="chat51_04" author="bucky"/>
*	$(
freu	AKW
*	$(

Fig. 3: Example of an output file tagged/ibk001.txt from the tagging phase.

System results must be submitted as a ZIP archive of plain text files, whose filenames must be identical to those of the corresponding input files (cf. Fig. 3).

We will use tagging accuracy as the main evaluation criterion for the official ranking.
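A minimal sketch of such an accuracy computation over files in the format of Fig. 3 (again, not the official scorer):

def read_tags(path):
    """Read (token, tag) pairs, skipping metadata and blank lines."""
    pairs = []
    for line in open(path, encoding="utf-8"):
        line = line.rstrip("\n")
        if not line or line.startswith("<"):
            continue
        token, tag = line.split("\t")
        pairs.append((token, tag))
    return pairs

def accuracy(gold_path, sys_path):
    gold, sys = read_tags(gold_path), read_tags(sys_path)
    # input is pre-tokenized, so the token sequences must agree
    assert [t for t, _ in gold] == [t for t, _ in sys], "token mismatch"
    correct = sum(g == s for (_, g), (_, s) in zip(gold, sys))
    return correct / len(gold)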

Mapping to STTS 1.0
In order to facilitate a comparison with existing taggers, we also compute accuracy based on the standard STTS 1.0 tag set, using the mapping defined by the table below:

for gold tag   the following tags are also accepted                  comment
EMOASC         XY ITJ EMOIMG
EMOIMG         XY ITJ EMOASC
AKW            VVFIN VVIMP VVINF VVIZU VAFIN VAIMP VAINF VMFIN VMINF
HST            XY
ADR            XY NE
URL            XY
EML            XY
VVPPER         VVFIN VVIMP VVINF
VMPPER         VMFIN VMINF
VAPPER         VAFIN VAIMP VAINF
KOUSPPER       KOUS
PPERPPER       PPER
ADVART         ART
PTKIFG         ADV ADJD PTKMA PTKMWL
PTKMA          ADV ADJD PTKIFG PTKMWL
PTKMWL         ADV ADJD PTKIFG PTKMA
DM             KOUS ADV
ONO            ITJ VVFIN VVIMP VVINF
ADV            PTKIFG PTKMA PTKMWL DM                                avoid bias in favour of STTS-1.0 tagger
KOUS           DM                                                    avoid bias in favour of STTS-1.0 tagger
PIDAT          PIAT                                                  TIGER corpus doesn't have PIDAT, uses PIAT instead
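In a scorer, this mapping can be represented as a dictionary from gold tag to the set of additionally accepted tags. A sketch, with the contents taken from the table above:

ACCEPTED = {
    "EMOASC":   {"XY", "ITJ", "EMOIMG"},
    "EMOIMG":   {"XY", "ITJ", "EMOASC"},
    "AKW":      {"VVFIN", "VVIMP", "VVINF", "VVIZU", "VAFIN", "VAIMP",
                 "VAINF", "VMFIN", "VMINF"},
    "HST":      {"XY"},
    "ADR":      {"XY", "NE"},
    "URL":      {"XY"},
    "EML":      {"XY"},
    "VVPPER":   {"VVFIN", "VVIMP", "VVINF"},
    "VMPPER":   {"VMFIN", "VMINF"},
    "VAPPER":   {"VAFIN", "VAIMP", "VAINF"},
    "KOUSPPER": {"KOUS"},
    "PPERPPER": {"PPER"},
    "ADVART":   {"ART"},
    "PTKIFG":   {"ADV", "ADJD", "PTKMA", "PTKMWL"},
    "PTKMA":    {"ADV", "ADJD", "PTKIFG", "PTKMWL"},
    "PTKMWL":   {"ADV", "ADJD", "PTKIFG", "PTKMA"},
    "DM":       {"KOUS", "ADV"},
    "ONO":      {"ITJ", "VVFIN", "VVIMP", "VVINF"},
    "ADV":      {"PTKIFG", "PTKMA", "PTKMWL", "DM"},
    "KOUS":     {"DM"},
    "PIDAT":    {"PIAT"},
}

def is_correct_stts(gold_tag, sys_tag):
    """A system tag counts as correct under the STTS 1.0 mapping if it
    matches the gold tag exactly or is an accepted alternative for it."""
    return sys_tag == gold_tag or sys_tag in ACCEPTED.get(gold_tag, set())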



References

Jurish, Bryan and Würzner, Kay-Michael (2013). Word and sentence tokenization with hidden Markov models. Journal for Language Technology and Computational Linguistics (JLCL), 28(2), 61–83.
