Data sets

Trial data

The trial data may help developers analyze the target format for tokenization and PoS tagging before the shared task officially starts. The trial data set includes:

2 000 tokens in the CMC subset with data samples from different CMC genres: empirist_trial_cmc.zip
2 000 tokens in the web corpora subset with samples from text genres on the web: empirist_trial_web.zip

The trial data were tokenized and PoS tagged by at least one student annotator according to the tokenization and PoS guidelines/tagset defined in our annotation guidelines. The annotation results have not been systematically checked for correctness, so the trial data are intended only to give a first impression of the task. Participants may use them for training and developing their systems, but should do so with care.

Training data

The training data set includes:

5 000 tokens in the CMC subset sampled from different CMC genres (social and professional chat, tweets, Wikipedia talk pages, blog comments, WhatsApp conversations): empirist_training_cmc.zip
5 000 tokens in the web corpora subset sampled from different text genres on the Web: empirist_training_web.zip

The training data were independently tokenized and PoS tagged by at least two student annotators according to the tokenization and PoS guidelines/tagset defined in our annotation guidelines. Unclear cases were decided by an expert (project leader).

Test data (tokenization)

The test data set for the tokenization subtask includes:

CMC subset: text samples from different CMC genres (social and professional chat, tweets, Wikipedia talk pages, blog comments, WhatsApp conversations): empirist_test_tok_cmc.zip
Web corpora subset: text samples from different genres on the web: empirist_test_tok_web.zip

The released test data for the tokenization subtask have been neither tokenized nor PoS tagged. They will be used to evaluate participants' systems against a manually created tokenization. Note that this data set includes filler data: the evaluation will be based on 5 000 tokens from each subset embedded in the text files.
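One way to compare a system tokenization against a manual one is sketched below. It is only an illustration: the span-based F1 measure, the function names, and the example tokens are assumptions made for this sketch, not the official evaluation procedure of the shared task. The sketch assumes that tokens contain no whitespace and that both tokenizations cover the same text.

```python
def token_spans(tokens):
    """Map a token list to a set of (start, end) character offsets,
    counted over the text with all whitespace removed."""
    spans = set()
    pos = 0
    for tok in tokens:
        spans.add((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def tokenization_f1(gold_tokens, system_tokens):
    """F1 over token spans (an assumed measure, not the official one)."""
    # Both tokenizations must segment exactly the same character sequence.
    if "".join(gold_tokens) != "".join(system_tokens):
        raise ValueError("tokenizations do not cover the same text")
    gold = token_spans(gold_tokens)
    system = token_spans(system_tokens)
    true_positives = len(gold & system)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(system)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the system wrongly splits the emoticon.
gold = ["Das", "ist", "gut", ":)"]
system = ["Das", "ist", "gut", ":", ")"]
score = tokenization_f1(gold, system)
```

Here three of the five system tokens match gold spans, so precision is 3/5 and recall is 3/4.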

Test data (PoS tagging)

The test data set for the PoS tagging subtask includes:

Tokens from the CMC data set with data samples from different CMC genres (social and professional chat, tweets, Wikipedia talk pages, blog comments, WhatsApp conversations): empirist_test_pos_cmc.zip
Tokens from the web corpora data set with samples from text genres on the web: empirist_test_pos_web.zip

The released test data for the PoS tagging subtask have been manually tokenized but not PoS tagged. They will be used to evaluate participants' systems against a manually created PoS annotation.
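Because the tokenization is fixed for this subtask, a system's output can be compared with a manual annotation token by token. The following is a minimal sketch of such a comparison. The file layout (one token per line, token and tag separated by a tab), the function names, and the example tags are assumptions made for illustration; consult the released files and the annotation guidelines for the actual format and tagset.

```python
def read_tagged(lines):
    """Parse lines of the assumed form 'token<TAB>tag'; blank lines are skipped."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        token, tag = line.split("\t")
        pairs.append((token, tag))
    return pairs


def tagging_accuracy(gold, predicted):
    """Share of tokens whose predicted tag equals the gold tag."""
    if not gold:
        raise ValueError("gold annotation is empty")
    if len(gold) != len(predicted):
        raise ValueError("token counts differ")
    correct = 0
    for (gold_tok, gold_tag), (pred_tok, pred_tag) in zip(gold, predicted):
        # The tokenization is given, so the tokens themselves must agree.
        if gold_tok != pred_tok:
            raise ValueError(f"token mismatch: {gold_tok!r} vs {pred_tok!r}")
        if gold_tag == pred_tag:
            correct += 1
    return correct / len(gold)


# Hypothetical example with illustrative tags; one of four tags is wrong.
gold = read_tagged(["Das\tPDS\n", "ist\tVAFIN\n", "gut\tADJD\n", ".\t$.\n"])
predicted = read_tagged(["Das\tART\n", "ist\tVAFIN\n", "gut\tADJD\n", ".\t$.\n"])
accuracy = tagging_accuracy(gold, predicted)
```

To apply this to real files, pass an open file object to `read_tagged`, since it accepts any iterable of lines.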

Gold standard test data

Full gold standard tokenization and PoS tagging for all texts in the EmpiriST test set:

CMC subset: text samples from different CMC genres (social and professional chat, tweets, Wikipedia talk pages, blog comments, WhatsApp conversations): empirist_gold_cmc.zip
Web corpora subset: text samples from different genres on the web: empirist_gold_web.zip

Sources and tools used for creating the data sets

The CMC data sets comprise samples from various sources: the Dortmund Chat Corpus, the data set collected in the project "WhatsUp, Deutschland?", the DWDS blog corpus collected by Adrien Barbaresi, the German Wikipedia and a collection of donated tweets.
The CMC data sets have been annotated using the corpus annotation tool CorA. We thank Marcel Bollmann & Stefanie Dipper (Ruhr-Universität Bochum) for their support.
