CRAFT Coreference Resolution Task

Extracting identity chains for all base noun phrases in full-text biomedical articles

Introduction

Coreference relations link strings of text that have the same referent. For the purposes of the CRAFT shared task, these strings of text (referred to as 'mentions' below) must exist in the same document, but can be located any distance from one another. Some mentions may be adjacent, for example, while others may appear only in the document title and conclusion. Because not every mention explicitly names its referent (a pronoun, for example, does not), resolving coreference can improve downstream tasks, such as information extraction, that require explicit knowledge of the referent. Two types of coreference have been resolved for all base noun phrases in the CRAFT corpus. Identity chains link mentions of the same referent, and can span the entire document. Apposition relations link adjacent noun phrases that have the same referent and are not linked by a copula. The CRAFT Shared Task on coreference resolution will focus on reproducing the manually curated identity chains. For further details on the definition of identity chains, and for information on the coreference annotations in CRAFT, please see:

Cohen, K.B., Lanfranchi, A., Choi, M.J., Bada, M., Baumgartner Jr., W.A., Panteleyeva, N., Verspoor, K., Palmer, M., and Hunter, L.E. (2017) Coreference annotation and resolution in the Colorado Richly Annotated Full Text (CRAFT) corpus of biomedical journal articles. BMC Bioinformatics 18:372.


The Task

The goal of the CRAFT Coreference Resolution task (CRAFT-CR) is to automatically reproduce identity chains in the documents of the CRAFT corpus. The challenges specific to this coreference task are that identity chains can span the length of entire documents, and that the base noun phrase mentions linked in these chains are permitted to have discontinuous spans (see example below).

Description of Data

Development Data

The training data for this task consist of the already publicly released identity chain annotations in the 67 articles of the CRAFT Corpus v3.1.3.

Evaluation Data

The testing data for this task consist of identity chains in 30 articles of the corpus that have not yet been publicly released. These articles have been annotated in exactly the same ways as the publicly released document set.

Evaluation Metrics

Evaluation of the CRAFT-CR task will use an adaptation of the evaluation code used for the CoNLL Coref 2011/12 shared tasks. The original evaluation code has been modified in two ways:

1. The code has been modified to handle mentions with discontinuous spans. The original code required mentions to have only contiguous spans.

2. The code has been modified to allow for partial mention matches, i.e. to allow mentions that overlap to match.
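The difference between exact and partial mention matching can be sketched as follows. This is only an illustration, not the shared-task scorer: it assumes that two mentions "overlap" when they share at least one token, and the names Span, tokens_of, exact_match and overlap_match are hypothetical. The criterion applied by the actual evaluation code may differ.

```python
from typing import List, Set, Tuple

# A span is an inclusive (start_token, end_token) pair; a mention is a list
# of spans, so a discontinuous mention simply has more than one span.
Span = Tuple[int, int]


def tokens_of(mention: List[Span]) -> Set[int]:
    """Return the set of token numbers covered by a mention."""
    return {t for start, end in mention for t in range(start, end + 1)}


def exact_match(gold: List[Span], pred: List[Span]) -> bool:
    """Strict matching: both mentions cover exactly the same tokens."""
    return tokens_of(gold) == tokens_of(pred)


def overlap_match(gold: List[Span], pred: List[Span]) -> bool:
    """Partial matching (assumed criterion): at least one shared token."""
    return bool(tokens_of(gold) & tokens_of(pred))


gold = [(16, 20), (27, 27)]   # a discontinuous gold mention
pred = [(16, 20)]             # a prediction that misses the second span
print(exact_match(gold, pred), overlap_match(gold, pred))
```

Under these assumptions, a system that recovers only the first span of a discontinuous mention fails strict matching but still receives a partial match.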

Coreference resolution system performance will be evaluated on all 6 metrics provided:

  • muc: MUCScorer (Vilain et al., 1995)
  • bcub: B-Cubed (Bagga and Baldwin, 1998)
  • ceafm: CEAF (Luo et al., 2005) using mention-based similarity
  • ceafe: CEAF (Luo et al., 2005) using entity-based similarity
  • blanc: BLANC (Luo et al., 2014) BLANC metric for gold and predicted mentions
  • lea: LEA (Moosavi and Strube, 2016) Link-based Entity Aware Metric

The evaluation platform is now available in the CRAFT Shared Task GitHub repository. See the wiki for installation and usage instructions.

Input/Output

For final submissions, users will be provided tokenized versions of each document. The evaluation script cannot handle differences in tokenization, so gold-standard tokens will be provided as input to your system. Input and output files for the CRAFT-CR task will both use the CoNLL Coref 2011/12 file format, with a modification to allow for mentions with discontinuous spans.

Discontinuous mentions are denoted by the addition of one or more non-digit characters after the chain identifier (an integer). In the example below, all spans denoted by 32a belong to the same discontinuous mention, which is a member of identity chain #32. The next discontinuous mention in identity chain #32, if there is one, would be denoted by 32b. There is no limit on the number of characters that can be used to identify a mention, e.g. 32aaaaaaa is perfectly valid. The only constraints are that the mention identifier contain exactly one integer, which names the identity chain, and that this integer occur at the beginning of the identifier string. Thus aaaaaa32 is not a valid mention identifier, nor is 32abc123.
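The identifier rule can be checked with a short regular expression. The sketch below is an illustration only; the function name parse_mention_id is hypothetical. It also assumes that parentheses, the pipe character and whitespace cannot appear in the suffix, since the file format uses them as delimiters.

```python
import re
from typing import Optional, Tuple

# Leading integer (the identity chain), then zero or more non-digit
# characters (the discontinuous-mention suffix). Excluding "(", ")", "|"
# and whitespace from the suffix is an assumption of this sketch.
_MENTION_ID = re.compile(r"(\d+)([^\d()|\s]*)")


def parse_mention_id(identifier: str) -> Optional[Tuple[int, str]]:
    """Return (chain_number, suffix), or None if the identifier is invalid."""
    match = _MENTION_ID.fullmatch(identifier)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


for candidate in ["32a", "32aaaaaaa", "33", "aaaaaa32", "32abc123"]:
    print(candidate, parse_mention_id(candidate))
```

An empty suffix, as in 33, corresponds to an ordinary contiguous mention. The last two candidates are rejected because the integer is not at the start, or because a second integer follows the suffix.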

Example of the CoNLL Coref 2011/12 file format with discontinuous span handling

There are three mentions in the sentence below extracted from document PMC194730 (PMID: 12925238). Each mention is a member of a different identity chain. Coreference information is located in the final column of this data file format.

1. "The protein"

    • a contiguous mention that starts at token 1 and ends at token 2 (inclusive)
    • a member of identity chain #2

2. "a variable N-terminal ... domain"

    • a discontinuous mention that includes tokens [16-20] and token 27
    • a member of identity chain #32

3. "a conserved C-terminal domain"

    • a contiguous mention that includes tokens [22-27]
    • a member of identity chain #33


194730 0 1 The DT - - - - - - - (2
194730 0 2 protein NN - - - - - - - 2)
194730 0 3 belongs VBZ - - - - - - - -
194730 0 4 to IN - - - - - - - -
194730 0 5 a DT - - - - - - - -
194730 0 6 family NN - - - - - - - -
194730 0 7 of IN - - - - - - - -
194730 0 8 evolutionarily RB - - - - - - - -
194730 0 9 conserved VBN - - - - - - - -
194730 0 10 proteins NN - - - - - - - -
194730 0 11 of IN - - - - - - - -
194730 0 12 a DT - - - - - - - -
194730 0 13 bipartite JJ - - - - - - - -
194730 0 14 structure NN - - - - - - - -
194730 0 15 with IN - - - - - - - -
194730 0 16 a DT - - - - - - - (32a
194730 0 17 variable JJ - - - - - - - -
194730 0 18 N NN - - - - - - - -
194730 0 19 - HYPH - - - - - - - -
194730 0 20 terminal JJ - - - - - - - 32a)
194730 0 21 and CC - - - - - - - -
194730 0 22 a DT - - - - - - - (33
194730 0 23 conserved VBN - - - - - - - -
194730 0 24 C NN - - - - - - - -
194730 0 25 - HYPH - - - - - - - -
194730 0 26 terminal JJ - - - - - - - -
194730 0 27 domain NN - - - - - - - (32a)|33)
194730 0 28 . . - - - - - - - -
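One way to decode the final column of the example above into identity chains is sketched below. It is not part of the official evaluation platform. The function name read_mentions and the output layout are hypothetical, and the sketch assumes whitespace-separated columns with the token number in the third column and the coreference field in the last.

```python
import re
from collections import defaultdict

# The example sentence from PMC194730, one token per row.
SAMPLE = """\
194730 0 1 The DT - - - - - - - (2
194730 0 2 protein NN - - - - - - - 2)
194730 0 3 belongs VBZ - - - - - - - -
194730 0 4 to IN - - - - - - - -
194730 0 5 a DT - - - - - - - -
194730 0 6 family NN - - - - - - - -
194730 0 7 of IN - - - - - - - -
194730 0 8 evolutionarily RB - - - - - - - -
194730 0 9 conserved VBN - - - - - - - -
194730 0 10 proteins NN - - - - - - - -
194730 0 11 of IN - - - - - - - -
194730 0 12 a DT - - - - - - - -
194730 0 13 bipartite JJ - - - - - - - -
194730 0 14 structure NN - - - - - - - -
194730 0 15 with IN - - - - - - - -
194730 0 16 a DT - - - - - - - (32a
194730 0 17 variable JJ - - - - - - - -
194730 0 18 N NN - - - - - - - -
194730 0 19 - HYPH - - - - - - - -
194730 0 20 terminal JJ - - - - - - - 32a)
194730 0 21 and CC - - - - - - - -
194730 0 22 a DT - - - - - - - (33
194730 0 23 conserved VBN - - - - - - - -
194730 0 24 C NN - - - - - - - -
194730 0 25 - HYPH - - - - - - - -
194730 0 26 terminal JJ - - - - - - - -
194730 0 27 domain NN - - - - - - - (32a)|33)
194730 0 28 . . - - - - - - - -
"""


def read_mentions(conll_text):
    """Map each chain number to its mentions; a mention is a list of
    inclusive (start_token, end_token) spans."""
    open_starts = defaultdict(list)  # identifier -> stack of open start tokens
    spans = []                       # (identifier, start, end) in closing order
    for line in conll_text.splitlines():
        cols = line.split()
        if not cols or cols[0].startswith("#"):
            continue  # skip blank lines and "#begin"/"#end" comment lines
        token_no, coref = int(cols[2]), cols[-1]
        if coref == "-":
            continue
        for part in coref.split("|"):
            ident = part.strip("()")
            if part.startswith("("):
                open_starts[ident].append(token_no)
            if part.endswith(")"):
                spans.append((ident, open_starts[ident].pop(), token_no))

    # Spans that share a suffixed identifier (e.g. 32a) form one discontinuous
    # mention; a span without a suffix is a contiguous mention by itself.
    grouped = {}
    for ident, start, end in spans:
        m = re.match(r"(\d+)(.*)", ident)
        chain, suffix = int(m.group(1)), m.group(2)
        key = (chain, suffix) if suffix else (chain, start, end)
        grouped.setdefault(key, []).append((start, end))

    chains = defaultdict(list)
    for key, mention_spans in grouped.items():
        chains[key[0]].append(sorted(mention_spans))
    return dict(chains)


print(read_mentions(SAMPLE))
# {2: [[(1, 2)]], 32: [[(16, 20), (27, 27)]], 33: [[(22, 27)]]}
```

The per-identifier stack lets bracketed spans nest or overlap, as they do at token 27, where a single-token span of mention 32a opens and closes while the span for chain #33 is still open.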