Data sets

Trial data

The trial data may help developers analyze the target format for tokenization and PoS tagging before the shared task officially starts. The trial data set includes:

The trial data were tokenized and PoS tagged by at least one student annotator according to the tokenization and PoS guidelines/tagset defined in our annotation guidelines. The annotation results have not been systematically checked for correctness. The trial data are intended to give a first impression of the task. Participants are allowed to use them for training and developing their systems, but should do so with care.

Training data

The training data set includes:

The training data were independently tokenized and PoS tagged by at least two student annotators according to the tokenization and PoS guidelines/tagset defined in our annotation guidelines. Unclear cases were decided by an expert (project leader).
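The double-annotation step described above can be illustrated with a minimal sketch that lists the positions where two annotators assigned different tags, i.e. the cases that would go to the expert. This is not the project's actual tooling: the function name `find_disagreements`, the in-memory `(token, tag)` representation, and the example tokens and tag names are all assumptions made for illustration.

```python
from typing import List, Tuple

Tagged = List[Tuple[str, str]]  # (token, tag) pairs; hypothetical in-memory format


def find_disagreements(a: Tagged, b: Tagged) -> List[Tuple[int, str, str, str]]:
    """Return (index, token, tag_a, tag_b) for each position where the tags differ.

    Assumes both annotators worked on the same token sequence.
    """
    if len(a) != len(b):
        raise ValueError("annotations must cover the same token sequence")
    disagreements = []
    for i, ((tok_a, tag_a), (tok_b, tag_b)) in enumerate(zip(a, b)):
        if tok_a != tok_b:
            raise ValueError(f"token mismatch at position {i}: {tok_a!r} vs {tok_b!r}")
        if tag_a != tag_b:
            disagreements.append((i, tok_a, tag_a, tag_b))
    return disagreements


# Illustrative data only; the tag names are placeholders.
annotator_1 = [("ich", "PPER"), ("komme", "VVFIN"), ("gleich", "ADV")]
annotator_2 = [("ich", "PPER"), ("komme", "VVFIN"), ("gleich", "ADJD")]
print(find_disagreements(annotator_1, annotator_2))
# prints [(2, 'gleich', 'ADV', 'ADJD')]
```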

Test data (tokenization)

The test data set for the tokenization subtask includes:

The released test data for the tokenization subtask have been neither tokenized nor PoS tagged. They will be used to evaluate participants' systems against a manually created tokenization. Note that this data set includes filler data: the evaluation will be based on 5 000 tokens from each subset embedded in the text files.

Test data (PoS tagging)

The test data set for the PoS tagging subtask includes:

The released test data for the PoS tagging subtask have been manually tokenized but not PoS tagged. They will be used to evaluate participants' systems against a manually created PoS annotation.
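Because the tokenization is fixed for this subtask, a system's output can be compared with the manual annotation token by token. The sketch below assumes plain per-token accuracy as the measure; the text above does not state which metric is used, so treat this as one plausible way to check a system during development. The function name `tagging_accuracy` and the example tags are hypothetical.

```python
from typing import Sequence


def tagging_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Share of tokens whose predicted tag equals the gold tag.

    Assumes both sequences are aligned one tag per token, which holds
    when the system tags the pre-tokenized input without re-tokenizing.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have one tag per token")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Illustrative tags only; three of four tokens match.
gold_tags = ["PPER", "VVFIN", "ADV", "$."]
system_tags = ["PPER", "VVFIN", "ADJD", "$."]
print(tagging_accuracy(gold_tags, system_tags))
# prints 0.75
```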

Gold standard test data

Full gold standard tokenization and PoS tagging for all texts in the EmpiriST test set:

Sources and tools used for creating the data sets