Task 2: Multilingual Information Extraction

News: Test set submission guidelines updated May 2

Task 2 continues the exploration of named entity recognition in written text, with a focus on less-explored language corpora; this year, the language is French. This builds upon the 2015 task, which addressed the analysis of French biomedical text; in 2016, participants have the opportunity to perform named entity recognition and normalization on a new dataset of scientific article titles and full-text drug inserts. In addition, participants will be challenged with the extraction of causes of death from a new corpus of French death reports. This new task can naturally be treated as a named entity recognition and normalization task, but also as a text classification task. Only fully automated means are allowed; human-in-the-loop approaches are not permitted.


Targeted Participants

The task is open for everybody. We particularly welcome academic and industrial researchers, scientists, engineers and graduate students in natural language processing, machine learning and biomedical/health informatics to participate. We also encourage participation by multi-disciplinary teams that combine technological skills with clinical expertise.


Data Set

The first data set is called the QUAERO French Medical Corpus. It was developed in 2013 as a resource for named entity recognition and normalization (Névéol et al. 2014) and was previously used in CLEF eHealth 2015. The data released in 2015 is used as a training set, and a new, unseen test set will be released to CLEF eHealth 2016 participants. The corpus comprises annotations for 10 types of clinical entities, normalized to the Unified Medical Language System (UMLS), and covers scientific article titles and drug inserts.

The second data set is called the CépiDC Causes of Death Corpus. It comprises free-text descriptions of causes of death as reported by physicians on standardized death certificate forms. Each document was manually coded by experts with ICD-10, per international WHO standards.


QUAERO Training Data

The QUAERO training data can be downloaded from https://clef2015.limsi.fr/train2016/CLEFeHealth2016-task2_train.zip. The credentials needed to access the data are supplied to CLEF eHealth 2016 Task 2 registered participants upon request to the Task 2 organizers.

The data set includes the following documents:

        - Train folder: 833 annotated MEDLINE titles, and 3 annotated EMEA documents; 3 BRAT configuration files
        - Devel folder: 832 annotated MEDLINE titles, and 3 annotated EMEA documents; 3 BRAT configuration files
        - Eval folder: the Java source of the tool that will be used for evaluation (brateval)


CépiDC Training Data

To obtain the CépiDC training data, participants need to fill in the data use agreement (in French). The data use agreement is a binding document that requires participants to:

  • use the corpus for research purposes only;

  • submit the results of their system to the CLEF eHealth 2016 lab, including a description of their system that will appear in the lab Working Notes;

  • not redistribute or otherwise disclose the contents of the corpus to any unauthorized third party; only short citations are permitted in publications, for illustration purposes.

    The list above summarizes the contents of the data use agreement in English; please note that this text is provided for information purposes only, and that the full document in French is legally binding.

To enter into the data use agreement, participants should print and fill in the document as follows:

  • On page 1, enter your personal information:
    • Nom de l’organisme, du laboratoire ou de la société participant : Name of your organisation, lab, or company

    • Nom : Last Name

    • Prénom : First Name

    • Fonction : Job Title/position

    • Adresse : Address

    • Téléphone : Phone number (please include all relevant area codes)

    • E-mail : email address

  • On page 3, in the line that follows « Pour le Participant : », replace « Mr/Mme Prénom NOM : e-mail » with your title, full name and valid email address (this must match the information entered on page 1).

  • At the bottom of page 5, date and sign

  • Send a PDF copy of the agreement to Ms Aude Robert (aude.robert@inserm.fr)

After Ms Robert has verified that the data use agreement was correctly filled in, participants will receive the credentials to access the data at the email address supplied in the agreement.

The data set includes the following documents:

  • Corpus folder: 65,843 death certificates processed by CépiDC over the period 2006-2012. The corpus is supplied in CSV format; each row contains twelve information fields associated with a raw line of text from an original death certificate.

  • Dictionary folder: 4 versions of a manually curated ICD-10 dictionary developed at CépiDC

  • Eval folder: the Perl source of the tool that will be used for evaluation

  • README.txt: a text document describing the data in further detail


French Version of ICD-10

If you are in Swiss territory, you can use the French version of ICD-10, which can be downloaded from the following address:

http://www.bfs.admin.ch/bfs/portal/fr/index/infothek/nomenklaturen/blank/blank/cim10/02/05.html

http://www.bfs.admin.ch/bfs/portal/fr/index/infothek/nomenklaturen/blank/blank/cim10/02/05.Document.187049.zip


After unzipping the above archive, you can extract a list of ICD codes and associated preferred terms using the following Unix command:

cut -d\; -f8,9 CIM10GM2014_ASCII_S_FR_versionmétadonnée_codes_20141031.txt
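The same extraction can also be done in Python. The following is only a sketch: the semicolon delimiter and the 1-indexed columns 8 and 9 follow the cut command above, while the file encoding is an assumption.

```python
import csv

def extract_codes(path):
    """Yield (ICD code, preferred term) pairs from the dictionary file,
    i.e. fields 8 and 9 of each semicolon-separated row (1-indexed,
    matching the cut command above)."""
    # The encoding is an assumption; adjust it if the file decodes incorrectly.
    with open(path, encoding="latin-1", newline="") as f:
        for row in csv.reader(f, delimiter=";"):
            if len(row) >= 9:  # skip rows too short to hold both fields
                yield row[7], row[8]
```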
 

Test Set and Submission Guidelines

The test data for CLEF eHealth 2016 Task 2 will be released in April 2016. Submissions will be expected in the same format as the training data. Participants may choose to work with the QUAERO data, the CépiDC data, or both. Participants working with the QUAERO data may submit runs for any combination of phases and datasets. Participants working with the CépiDC data are required to submit at least one run for the task, as per the data use agreement.

Runs should be submitted using the EasyChair system at: https://easychair.org/conferences/?conf=clefehealth2016runs

Guidelines for the QUAERO corpus

Documents in the test set are part of the same corpus as the training documents and were annotated by the same annotators using the same guidelines.
Teams are invited to submit runs in two subsequent phases: one for entity recognition, one for entity normalization.

Phase 1: entity recognition (data release scheduled May 2, 2016)

- Only text files are supplied.
Sample text file:

La contraception par les dispositifs intra utérins

- Teams need to supply entity annotations in the BRAT standoff format used in the training data set.
Sample entities output expected for the above sample text file:
T1       PROC 3 16   contraception
T2       DEVI 25 50   dispositifs intra utérins
T3       ANAT 43 50  utérins
- Additionally, teams may supply normalization information for the entity annotations, also in the BRAT format.

Sample normalized entities output expected for the above sample text file:

T1       PROC 3 16   contraception
#1       AnnotatorNotes T1  C0700589

T2       DEVI 25 50   dispositifs intra utérins

#2       AnnotatorNotes T2  C0021900

T3       ANAT 43 50  utérins

#3       AnnotatorNotes T3  C0042149
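The entity and note lines above can be read with a few lines of code. The following is a minimal sketch, assuming the standard BRAT convention of tab-separated columns; the field names (`type`, `start`, `end`) are our own, and real .ann files may also contain discontinuous spans, which this sketch ignores.

```python
def parse_brat(lines):
    """Parse BRAT standoff lines into entities and normalization notes."""
    entities, notes = {}, {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("T"):
            # Entity line: ID <tab> "TYPE START END" <tab> surface text
            tid, span, text = line.split("\t")
            etype, start, end = span.split()[:3]
            entities[tid] = {"type": etype, "start": int(start),
                             "end": int(end), "text": text}
        elif line.startswith("#"):
            # Note line: ID <tab> "AnnotatorNotes Tn" <tab> UMLS CUI
            nid, ref, cui = line.split("\t")
            notes[ref.split()[1]] = cui  # map entity ID -> CUI
    return entities, notes
```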

When submitting, the teams must select all the relevant tracks for their submission (due May 6, 2016):
2.Q.1: MEDLINE entities,
2.Q.2: EMEA entities,
2.Q.3: MEDLINE normalized entities,
2.Q.4: EMEA normalized entities.

Phase 2: entity normalization
(data release scheduled May 9, 2016)
- Text files and gold-standard entity annotations are supplied.
Sample text file:
La contraception par les dispositifs intra utérins
Sample entities annotations supplied along with the above sample text file:
T1       PROC 3 16   contraception
T2       DEVI 25 50   dispositifs intra utérins
T3       ANAT 43 50  utérins

- Teams need to supply normalization information for the gold-standard entity annotations, in the BRAT standoff format.
Sample normalization output expected for the above sample text file and gold standard entity annotations:
T1       PROC 3 16   contraception
#1       AnnotatorNotes T1  C0700589

T2       DEVI 25 50   dispositifs intra utérins

#2       AnnotatorNotes T2  C0021900

T3       ANAT 43 50  utérins

#3       AnnotatorNotes T3  C0042149

When submitting, the teams must select all the relevant tracks for their submission (due May 13, 2016):

2.Q.5: MEDLINE normalization,
2.Q.6: EMEA normalization.

Each team is allowed to submit up to 2 runs for each track.   

Guidelines for the CépiDC corpus (test data release scheduled May 6, 2016)

The CépiDC task consists of extracting ICD-10 codes from the raw lines of death certificate text. The process of identifying a single ICD code per certificate as the "primary cause" of death may build on this task, but is not evaluated here. This is an information extraction task that relies on the supplied text to extract ICD-10 codes from the certificates, line by line.

Sample text line:

Maladie de Parkinson idiopathique Angioedème membres sup récent non exploré par TDM (à priori pas de cause médicamenteuse)

ICD codes expected to be associated with this text line:

G200
R600

The sample text is given with associated metadata (9 fields):

80147;2013;2;85;4;5;Maladie de Parkinson idiopathique Angioedème membres sup récent non exploré par TDM (à priori pas de cause médicamenteuse);NULL;NULL

The output will be expected in the following format:

80147;2013;2;85;4;5;Maladie de Parkinson idiopathique Angioedème membres sup récent non exploré par TDM (à priori pas de cause médicamenteuse);NULL;NULL;Maladie de Parkinson  idiopathique;maladie Parkinson idiopathique;G200

80147;2013;2;85;4;5;Maladie de Parkinson idiopathique Angioedème membres sup récent non exploré par TDM (à priori pas de cause médicamenteuse);NULL;NULL;Angioedème membres sup;oedème membres supérieurs;R600

The output comprises the 9 input fields plus two text fields that participants must use to report evidence text supporting the ICD-10 code supplied in the twelfth, final field. The tenth field should contain the excerpt of the original text that supports the ICD code prediction. If the system uses a dictionary or another lexical resource linking to ICD-10 codes (including the dictionaries supplied by the organizers), the eleventh field should be the dictionary entry that supports the ICD code prediction.
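As an illustration of the expected row layout, the sketch below appends the three output fields to a parsed input row. The function name and the hard-coded predictions are hypothetical; the predictions themselves would come from a participant's system.

```python
import csv
import io

def format_output_rows(input_row, predictions):
    """Append, per prediction, the supporting excerpt (field 10), the
    dictionary entry or an empty string (field 11), and the ICD-10 code
    (field 12) to the 9 input fields."""
    return [input_row + [excerpt, dict_entry, code]
            for excerpt, dict_entry, code in predictions]

raw = ("80147;2013;2;85;4;5;Maladie de Parkinson idiopathique Angioedème "
       "membres sup récent non exploré par TDM (à priori pas de cause "
       "médicamenteuse);NULL;NULL")
row = next(csv.reader(io.StringIO(raw), delimiter=";"))  # the 9 input fields
rows = format_output_rows(row, [
    ("Maladie de Parkinson idiopathique", "maladie Parkinson idiopathique", "G200"),
    ("Angioedème membres sup", "oedème membres supérieurs", "R600"),
])
```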

Please note that in some cases, there is no ICD10 code associated with a given text line. In other cases, the ICD10 codes associated with a given line use the context provided in other lines of the same certificate.

Sample text:

14;2007;1;40;5;2;pendaison;NULL;NULL
14;2007;1;40;5;3;suicide ?;NULL;NULL

Sample associated codes:

14;2007;1;40;5;2;pendaison;NULL;NULL;;;
14;2007;1;40;5;3;suicide ?;NULL;NULL;2-1;suicide pendaison;X709

Participants are free to use the same system to analyze both the QUAERO and CépiDC corpora if desired. If so, please state it explicitly in your method description.


When submitting for this task, the teams must select the relevant track for their submission (due May 13, 2016):
2.C: CépiDC coding.

Replication track (submissions due May 20, 2016)

The goal of this track is to promote the sharing of tools and the dissemination of solid, reproducible results.
Participation in the replication track is open to all teams who submit results on the QUAERO or CepiDC test sets.

After submitting their result files, participating teams will have one extra week to submit the system used to produce them (or remote access to the system), along with instructions on how to install and operate it.
The replication track will consist of attempting to replicate a team's results by running the supplied system on the test data sets, following the team's instructions.

Depending on the number of system submissions received, one or more system analysts will spend a maximum of one working day (8 hours) with each system. The analysts will install and configure the system according to the instructions supplied (teams may also supply a contact address and make themselves available to answer any additional questions). The analysts will run the systems on the appropriate CLEF eHealth Task 2 test sets. The results obtained will be compared to those submitted by the team using the same system. During this process, the analysts will make notes on various aspects of working with the systems: ease of installation and use, clarity of the supplied instructions, and success of the replication attempt. The results of this replication study will be announced along with the other CLEF eHealth results.

Each submission must consist of the following items:

    Address for Correspondence: address, city, post code, (state), country
    Author(s): first name, last name, email, country, organisation
    Title: Instead of entering a paper title, please specify your team name here. A good name is something short but identifying. For example, Mayo, Limsi, and UTHealthCCB have been used before. If your team also participated in CLEF eHealth 2016 task 1 or task 3, we ask that you please use the same team name for your task 2 submission.
    Keywords: Instead of entering three or more keywords to characterize a paper, please use this field to describe your methods. We encourage using MeSH or ACM keywords.
    Topics: please tick all relevant tracks among QUAERO, CépiDC and Replication, reflecting the runs you are submitting.
    ZIP file: This file is an archive containing two files, and several folders with the results of your runs, organized as follows: 

       file 1: Team description as team.txt (max 100 words): Please write a short general description of your team. For example, you may report that "5 PhD students, supervised by 2 Professors, collaborated" or "A multi-disciplinary approach was followed by a clinician bringing in content expertise, a computational linguist capturing this as features of the learning method and two machine learning researchers choosing and developing the learning method".

        file 2: Method description as methods.txt (max 100 words per method): Please write a short general description of the method(s) used for each run. Please include the following information in the description: (1) whether the method was (a) statistical, (b) symbolic (expert or rule-based), or (c) hybrid (a combination of (a) and (b)); (2) whether the method used the supplied training data (EMEA, MEDLINE or both); (3) whether the method used outside data such as an additional corpus, annotations on the training corpus, or lexicons, with a brief description of these outside resources, including whether the data is public.

        Folder 1: Runs for tracks 2.Q.1, 2.Q.3 and 2.Q.5 should be stored in a folder called MEDLINE.
        Runs for track 2.Q.1 should be stored in a subfolder called entities, with in turn one subfolder for each run: run1 and run2.
        Runs for track 2.Q.3 should be stored in a subfolder called normalizedEntities, with in turn one subfolder for each run: run1 and run2.
        Runs for track 2.Q.5 should be stored in a subfolder called normalization, with in turn one subfolder for each run: run1 and run2.
        Each run folder should contain 833 BRAT standoff format .ann files with the results of your system for the run. Please make sure they are formatted in the BRAT standoff annotation format described here and exemplified in the training dataset. We recommend running the evaluation tool on your data to check format compliance.
       
        Folder 2: Runs for tracks 2.Q.2, 2.Q.4 and 2.Q.6 should be stored in a folder called EMEA.
        Runs for track 2.Q.2 should be stored in a subfolder called entities, with in turn one subfolder for each run: run1 and run2.
        Runs for track 2.Q.4 should be stored in a subfolder called normalizedEntities, with in turn one subfolder for each run: run1 and run2.
        Runs for track 2.Q.6 should be stored in a subfolder called normalization, with in turn one subfolder for each run: run1 and run2.
        Each run folder for task 2.Q should contain 15 BRAT standoff format .ann files with the results of your system for the run. Please make sure they are formatted in the BRAT standoff annotation format described here and exemplified in the training dataset. We recommend running the evaluation tool on your data to check format compliance.

        Folder 3: Runs for track 2.C should be stored in a folder called CepiDC.
        Each run should consist of a single .csv file containing twelve fields: the original 9 fields supplied as input, plus three additional fields: two optional text fields containing supporting evidence and one field with the extracted ICD-10 code. If you choose not to supply supporting information, the corresponding text fields must be left empty. The files corresponding to each run should be named run1.csv and run2.csv.
We recommend running the evaluation tool on your data to check format compliance.
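The folder skeleton described above can be created with a short script. This is only a sketch: the run files themselves must come from your system, and teams submitting fewer tracks can prune the layout accordingly.

```python
import os

SUBFOLDERS = ("entities", "normalizedEntities", "normalization")

def make_submission_skeleton(root):
    """Create the MEDLINE/EMEA run folders, the CepiDC folder, and the two
    top-level description files expected in the submission archive."""
    for corpus in ("MEDLINE", "EMEA"):
        for sub in SUBFOLDERS:
            for run in ("run1", "run2"):
                os.makedirs(os.path.join(root, corpus, sub, run), exist_ok=True)
    os.makedirs(os.path.join(root, "CepiDC"), exist_ok=True)
    for fname in ("team.txt", "methods.txt"):  # to be filled in by the team
        open(os.path.join(root, fname), "a").close()
```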

Do not hesitate to post questions and comments on the task mailing list (clefehealth2016-task2@limsi.fr) if you require further clarification or assistance.
Thank you for your participation!

Evaluation Methods

For the QUAERO test set, results will be evaluated with the brateval program supplied with the training data. System performance will be assessed by precision, recall and F-measure for entity recognition and entity normalization.

For the CépiDC test set, results will be evaluated with the evaluation program supplied with the training data.
System performance will be assessed by precision, recall and F-measure for ICD code extraction at the line level.
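For intuition, micro-averaged precision, recall and F-measure over per-line code sets can be sketched as follows. This is an illustration only; the official scores come from the supplied evaluation tools, which may differ in details such as the handling of duplicate codes.

```python
def prf(gold, predicted):
    """Micro-averaged precision, recall and F1.

    gold and predicted map a line identifier to a set of ICD-10 codes.
    """
    tp = sum(len(gold[k] & predicted.get(k, set())) for k in gold)
    fn = sum(len(gold[k] - predicted.get(k, set())) for k in gold)
    fp = sum(len(predicted[k] - gold.get(k, set())) for k in predicted)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```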


Registration and Data Access

  1. Please register on the main CLEF 2016 registration page
  2. Please join the task 2 mailing list (at https://sympa.limsi.fr/wws/info/clefehealth2016-task2) or contact the task organizers to receive credentials to access the task datasets.

Timeline

  • Training set release: QUAERO: December 2015; CépiDC: January 2016
  • Test set release: Early May 2016
  • Result submission: May 2016 (see details above)

Contact Information

The best (and probably fastest) way to get your questions answered is to join the clef-ehealth mailing lists: