CRAFT Concept Annotation Task

Annotating full-text articles with concept mentions from 10 Open Biomedical Ontologies

Introduction

For the concept annotation task, participants will attempt to automatically annotate full-length biomedical journal articles of the CRAFT Corpus with ontology classes. These classes are either explicitly represented in, or defined in terms of classes represented in, selected ontologies from the library of Open Biomedical Ontologies (OBOs), a large collaborative effort to formally represent information and knowledge across the biomedical domain as well as overlapping and related domains.

All training and testing articles have been annotated by making use of 10 OBOs:

  • Chemical Entities of Biological Interest (CHEBI)
  • Cell Ontology (CL)
  • Gene Ontology Biological Process (GO_BP)
  • Gene Ontology Cellular Component (GO_CC)
  • Gene Ontology Molecular Function (GO_MF)
  • Molecular Process Ontology (MOP)
  • NCBI Taxonomy (NCBITaxon)
  • Protein Ontology (PR)
  • Sequence Ontology (SO)
  • Uberon (UBERON)

Descriptions of the contents and usages of these ontologies as well as example classes used for annotation may be examined here.

For each of these ten ontologies, two annotation sets have been created. The first, called the core set, consists solely of annotations made with proper classes of the original OBO. The second, called the core+extensions set, consists of annotations made with these proper OBO classes as well as annotations made with what we refer to as extension classes: classes that we have created as extensions of the ontologies but defined in terms of proper ontology classes. These extension classes have been created for various reasons, including semantic unification of classes from different ontologies, unification of multiple classes that were difficult to use reliably for annotation, creation of semantic abstractions to annotate correspondingly abstract textual mentions, and representation of corresponding types of concepts that we found easier to use for annotation than their original forms. An extension class is identifiable by its namespace prefix, which always ends in "_EXT". More documentation about the extension classes may be found at the link mentioned above.
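The "_EXT" convention described above can be checked programmatically. The following is a minimal sketch, assuming that class identifiers take the form `PREFIX:local_id` with a colon separating the namespace prefix from the rest; the function names and the example identifiers are hypothetical and are not taken from the CRAFT distribution.

```python
def namespace_prefix(class_id: str) -> str:
    """Return the part of a class identifier before the first colon.

    Assumption: identifiers look like 'PREFIX:local_id'.
    """
    prefix, sep, _local = class_id.partition(":")
    if not sep:
        raise ValueError(f"no namespace prefix in {class_id!r}")
    return prefix


def is_extension_class(class_id: str) -> bool:
    """True if the namespace prefix ends in '_EXT' (an extension class)."""
    return namespace_prefix(class_id).endswith("_EXT")


def split_core_and_extensions(class_ids):
    """Partition identifiers into (proper OBO classes, extension classes)."""
    core = [c for c in class_ids if not is_extension_class(c)]
    extensions = [c for c in class_ids if is_extension_class(c)]
    return core, extensions


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    ids = ["CHEBI:0000001", "CHEBI_EXT:example", "UBERON:0000002"]
    print(split_core_and_extensions(ids))
```

A filter of this kind could be used, for example, to derive core-only predictions from a system that produces core+extensions output.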

Subtasks

There will be two subtasks for the Concept Annotation task. The mechanics of the two subtasks are identical, as are the evaluation metrics; however, the input data will differ between them. Please read through this document to understand the differences between the core set and the core+extensions set of annotations.

1. CRAFT-CA-CORE

The goal of this subtask will be to identify mentions of concepts from the core set (described above), which includes only proper classes from the original 10 Open Biomedical Ontologies.

2. CRAFT-CA-EXTENSIONS

As its name suggests, the goal of the second subtask will be to identify mentions of concepts from the core+extensions set (described above), which includes proper classes from the original 10 Open Biomedical Ontologies as well as extension classes that have been defined in terms of the proper ontology classes.


Description of Data

Development Data

The training (development) data for this task consist of the already publicly released concept annotation sets for the 67 articles of the CRAFT Corpus v3.1.3. These data are available in version 3.1.3 of the CRAFT distribution.

Evaluation Data

The testing data for this task consist of the concept annotation sets for 30 articles of the corpus that have not yet been publicly released. These articles have been annotated in exactly the same way as the publicly released document set.

Evaluation Metrics

The evaluation measures for the CRAFT-CA task will be Slot Error Rate and F-score, as defined in Bossy et al. (2013). Both are computed by comparing the predicted ontology concept mentions against the manually annotated mentions of the CRAFT evaluation data set. Annotations will be evaluated on a per-ontology basis. Participants may choose to submit runs for any or all of the ontologies.

Bossy et al. (2013). BioNLP Shared Task 2013: An Overview of the Bacteria Biotope Task. Proceedings of the BioNLP Shared Task 2013 Workshop, 161-169.
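To make the two measures named above concrete, the sketch below computes them from slot counts. It assumes the commonly used formulation in which Slot Error Rate is the sum of substitutions, insertions, and deletions divided by the number of reference mentions, and in which precision and recall are the matched total divided by the number of predicted and reference mentions respectively. It is a simplification: the counts are allowed to be fractional because the cited definition may award graded credit for partial matches, but how that credit is assigned is not modelled here. The class and function names are hypothetical, and the evaluation platform remains the authoritative implementation.

```python
from dataclasses import dataclass


@dataclass
class SlotCounts:
    """Slot-level outcome totals for one ontology (may be fractional)."""
    matches: float
    substitutions: float
    insertions: float   # predicted mentions with no reference counterpart
    deletions: float    # reference mentions with no predicted counterpart


def slot_error_rate(c: SlotCounts) -> float:
    """SER = (S + I + D) / N, where N is the number of reference mentions."""
    n_reference = c.matches + c.substitutions + c.deletions
    if n_reference == 0:
        raise ValueError("SER is undefined without reference mentions")
    return (c.substitutions + c.insertions + c.deletions) / n_reference


def precision_recall_f1(c: SlotCounts):
    """Return (precision, recall, F1) derived from the same counts."""
    n_predicted = c.matches + c.substitutions + c.insertions
    n_reference = c.matches + c.substitutions + c.deletions
    precision = c.matches / n_predicted if n_predicted else 0.0
    recall = c.matches / n_reference if n_reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Illustrative numbers only, not results from any system.
    counts = SlotCounts(matches=8, substitutions=1, insertions=1, deletions=1)
    print(slot_error_rate(counts), precision_recall_f1(counts))
```

Note that, unlike F-score, Slot Error Rate is an error measure: lower is better, and it can exceed 1 when a system produces many spurious mentions.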

The evaluation platform is now available in the CRAFT Shared Task GitHub repository. See the wiki for installation and usage instructions.

Input/Output

For final submissions, participants will be provided with a plain-text version of each document. The ontologies provided with the development data are the same as those used for the evaluation data. Submissions of results for the CRAFT-CA task must use the BioNLP standoff annotation format. Submissions should include one file per document, using the same naming scheme as the training data. Files should be grouped in directories according to the ontology used, and the directories must be named according to the ontology keys defined by the evaluation platform. For details, see the evaluation platform wiki.
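The submission layout described above (one standoff file per document, grouped in one directory per ontology) might be produced along the following lines. This is a sketch under several assumptions that should be checked against the evaluation platform wiki: text-bound annotations are written as tab-separated lines of the form `T<n>`, then the concept identifier and character offsets, then the covered text; discontinuous spans are joined with semicolons; and output files carry a `.bionlp` suffix. The ontology key, document identifier, concept identifier, and function names used here are hypothetical placeholders.

```python
from pathlib import Path


def format_text_bound_line(index, concept_id, spans, text):
    """Format one text-bound annotation line.

    spans is a list of (start, end) character offsets; more than one
    pair denotes a discontinuous mention (assumed ';'-separated).
    """
    offsets = ";".join(f"{start} {end}" for start, end in spans)
    return f"T{index}\t{concept_id} {offsets}\t{text}"


def write_document_annotations(root, ontology_key, doc_id, annotations,
                               suffix=".bionlp"):
    """Write one file per document under <root>/<ontology_key>/.

    annotations is an iterable of (concept_id, spans, text) tuples.
    The suffix is an assumption; match the training data naming scheme.
    """
    out_dir = Path(root) / ontology_key
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{doc_id}{suffix}"
    lines = [
        format_text_bound_line(i, concept_id, spans, text)
        for i, (concept_id, spans, text) in enumerate(annotations, start=1)
    ]
    out_path.write_text("".join(line + "\n" for line in lines),
                        encoding="utf-8")
    return out_path


if __name__ == "__main__":
    # Hypothetical key, document id, and annotation, for illustration only.
    path = write_document_annotations(
        "submission", "example_ontology_key", "example_doc",
        [("CL:0000001", [(10, 14)], "cell")],
    )
    print(path.read_text(encoding="utf-8"))
```

Keeping the line formatter separate from the file writer makes it straightforward to adjust either the line syntax or the directory naming once the exact conventions have been confirmed.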

Additional Resources

For a description of additional resources provided for this task, please visit this page.