Data and Task Setup


The GermEval 2014 NER Shared Task builds on a new dataset with German Named Entity annotation [1] with the following properties:

  • The data was sampled from German Wikipedia and News Corpora as a collection of citations.
  • The dataset covers over 31,000 sentences corresponding to over 590,000 tokens.
  • The NER annotation uses the NoSta-D guidelines, which extend the Tübingen Treebank guidelines, using four main NER categories with sub-structure, and annotating embeddings among NEs such as [ORG FC Kickers [LOC Darmstadt]].

The data are available for download at the bottom of this page (look for the attachments and click the Download arrow). The dataset is distributed under the CC-BY license: https://creativecommons.org/licenses/by/4.0/.

Please note: The datasets were updated on July 22 to correct inconsistencies. Please use the July 22 train and dev versions for your submissions.

DATA FORMAT

The following snippet shows an example of the TSV format we use in this task. 

# http://de.wikipedia.org/wiki/Manfred_Korfmann [2009-10-17]
1 Aufgrund O O
2 seiner O O
3 Initiative O O
4 fand O O
5 2001/2002 O O
6 in O O
7 Stuttgart B-LOC O
8 , O O
9 Braunschweig B-LOC O
10 und O O
11 Bonn B-LOC O
12 eine O O
13 große O O
14 und O O
15 publizistisch O O
16 vielbeachtete O O
17 Troia-Ausstellung B-LOCpart O
18 statt O O
19 , O O
20 „ O O
21 Troia B-OTH B-LOC
22 - I-OTH O
23 Traum I-OTH O
24 und I-OTH O
25 Wirklichkeit I-OTH O
26 " O O
27 . O O

The example sentence, "Aufgrund seiner Initiative fand 2001/2002 in Stuttgart, Braunschweig und Bonn eine große und publizistisch vielbeachtete Troia-Ausstellung statt, Troia - Traum und Wirklichkeit.", contains five named entities: the locations Stuttgart, Braunschweig and Bonn; the noun Troia-Ausstellung, which includes a location part; and the event title Troia - Traum und Wirklichkeit, which contains the embedded location name Troia.

The sentence is encoded as one token per line, with information provided in tab-separated columns. Lines starting with # record the source from which the sentence is cited and the date it was retrieved. For token lines, the first column contains the token number within the sentence and the second column contains the token.
Name spans are encoded in the BIO scheme. Outer spans are encoded in the third column, embedded spans in the fourth column. We refrained from adding a further column for third-level embedded spans for the purpose of this evaluation, since they occurred only very rarely during annotation. See the paper [1] below for more information on the dataset and the annotation guidelines.
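As a sketch of how this format can be consumed, the following Python snippet (illustrative only, not part of the task distribution) groups the TSV lines into sentences and decodes one BIO column into (start, end, label) spans; the same decoder applies to the outer (third) and embedded (fourth) columns independently.

```python
# Sketch of a reader for the GermEval 2014 TSV format. Assumed column
# layout per token line: token number, token, outer BIO tag, embedded
# BIO tag. Function names are illustrative, not part of the task kit.

def read_sentences(lines):
    """Group raw lines into sentences; '#' lines carry source metadata,
    blank lines separate sentences."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            continue  # source URL and retrieval date
        if not line:
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        # cols: token number, token, outer tag, embedded tag
        sentence.append((cols[1], cols[2], cols[3]))
    if sentence:
        yield sentence

def bio_spans(tags):
    """Decode a BIO tag sequence into (start, end, label) spans,
    with end exclusive."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or tag == "O":
            if start is not None:  # close the currently open span
                spans.append((start, i, label))
                start, label = None, None
        if tag.startswith("B-"):  # open a new span
            start, label = i, tag[2:]
    if start is not None:  # span running to the end of the sentence
        spans.append((start, len(tags), label))
    return spans
```

For instance, the tag sequence ["B-OTH", "I-OTH", "O", "B-LOC"] decodes to [(0, 2, "OTH"), (3, 4, "LOC")].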

TASK SETUP


We split the dataset [1] into training, development and test sets and provide the datasets in a tab-separated (TSV) format.
Datasets can be downloaded here:
  • Training Set
  • Development Set
  • Test Set (Available August 1, 2014 in unannotated form, from September 1, 2014 in annotated form)

Further, we provide an evaluation script (adapted from the CoNLL competitions) that assesses a given TSV file against a gold standard, along with a rationale of the evaluation; see the Evaluation tab.
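For orientation, here is a minimal sketch of strict span-level scoring in Python, assuming spans are represented as (start, end, label) tuples; the official script may differ in details such as how the outer and embedded annotation levels are combined into the final metric.

```python
# Minimal sketch of strict span-level scoring: a predicted span counts
# as correct only if start, end, and label all match a gold span
# exactly (CoNLL-style exact match). Illustrative only; the official
# evaluation script is authoritative.

def prf(gold_spans, pred_spans):
    """Return (precision, recall, F1) over exact-match spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)  # true positives: exact matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```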

There is just one track: participants may use arbitrary knowledge sources to model the data. Participants may submit up to three runs.

Submissions consist of a TSV file providing predictions for the test data and a paper of up to 4 pages (including references) describing the chosen approach and analyzing the performance. Papers should follow the KONVENS 2014 style files. The papers will be published online. We expect authors to present summaries of their systems at the KONVENS workshop.

[1] D. Benikova, C. Biemann, M. Reznicek: NoSta-D Named Entity Annotation for German: Guidelines and Dataset. In: Proceedings of LREC 2014, Reykjavik, Iceland.