Data sets
The data for the GermEval 2015: LexSub task is described in Cholakov et al. (2013). Altogether it consists of 2040 sentences from the German Wikipedia, each containing a target word and a list of substitutions proposed by human annotators. There are 153 unique target words, equally distributed across parts of speech (nouns, verbs, and adjectives) and three frequency groups. About half of this data (26 nouns, 26 verbs, and 26 adjectives in 1040 sentence contexts) forms the training set, which is made available to participants in advance. The remainder forms the test set, which will be used for the evaluation and published in full only after the shared task is completed.
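As a quick sanity check on the figures above, the sizes of the test set can be derived by simple subtraction. The following is a minimal sketch, not part of the official task software: the function name `split_summary` and the constant names are hypothetical, and the test-set target count assumes that training and test target words do not overlap, which the description does not state explicitly.

```python
# Figures quoted in the data set description.
TOTAL_SENTENCES = 2040
TOTAL_TARGETS = 153
TRAIN_SENTENCES = 1040
TRAIN_TARGETS_PER_POS = {"noun": 26, "verb": 26, "adjective": 26}


def split_summary():
    """Derive the split sizes implied by the quoted figures.

    Assumption (not stated in the source): training and test target
    words are disjoint, so test targets = total - training targets.
    """
    # Targets are equally distributed across the three parts of speech.
    targets_per_pos = TOTAL_TARGETS // len(TRAIN_TARGETS_PER_POS)
    train_targets = sum(TRAIN_TARGETS_PER_POS.values())
    return {
        "targets_per_pos": targets_per_pos,
        "train_targets": train_targets,
        "test_targets": TOTAL_TARGETS - train_targets,
        "test_sentences": TOTAL_SENTENCES - TRAIN_SENTENCES,
        "train_sentence_share": TRAIN_SENTENCES / TOTAL_SENTENCES,
    }


if __name__ == "__main__":
    for key, value in split_summary().items():
        print(f"{key}: {value}")
```

Under these assumptions the training set covers 78 of the 153 target words and roughly 51% of the sentences, which matches the "about half" stated above.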
Download
- Archive containing the complete GermEval 2015: LexSub data and software (including trial, training, and test data; baseline and system results; and the scoring software)