Evaluation

Overview

Participants will be provided training and test datasets. The evaluation for all tasks will be conducted using the withheld test data (test queries for Task 3). Participating teams are asked to stop development as soon as they download the test data. Teams are allowed to use any outside resources in their algorithms. However, system output for systems that use annotations outside of those provided for Tasks 1 and 2 will be evaluated separately from system output generated without additional annotations.

Each of the three tasks has at least one evaluation. Each team may upload up to two system runs for each evaluation in the three tasks, for a maximum of eight submissions (there are two evaluations for Task 1). Some evaluations will provide scores for both a relaxed and a strict measure. In addition, we will evaluate performance separately for system runs that use annotations outside of those provided; however, such runs still count towards the limit of two runs per evaluation.

Task 1 Evaluations
(a) Boundary detection of disorder named entities – identify the character spans of disorders. This evaluation is required for participation in Task 1.
(b) Named entity recognition and normalization of SNOMED disorders – identify the character spans of disorders and map them to SNOMED codes. This evaluation is optional for participation in Task 1.

Task 2 Evaluations
Normalization of acronyms/abbreviations to UMLS codes

Task 3 Evaluations
Post-submission relevance assessment will be conducted on the test queries to generate the complete result set. To do this, task participants will be asked to submit up to seven ranked runs:
  • Run 1 (mandatory) is a baseline: only the title and description in the query can be used, and no external resources (including discharge summaries, corpora, ontologies, etc.) can be used.
  • Runs 2-4 (optional): any experiment WITH the discharge summaries.
  • Runs 5-7 (optional): any experiment WITHOUT the discharge summaries.
One of Runs 2-4 and one of Runs 5-7 must use only the title and desc fields from the queries. The runs must be ranked in order of priority (1-7, with 1 being the highest priority).
Submitted runs must follow the TREC format.
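
As an illustration only, the sketch below writes results in the six-column layout commonly used for TREC run files (query id, the literal Q0, document id, rank, score, run tag). The function name, query id, document ids, and run tag are hypothetical, and details such as the rank origin or run-tag naming should be checked against the task guidelines.

```python
from typing import Dict, List, Tuple


def format_trec_run(results: Dict[str, List[Tuple[str, float]]], run_tag: str) -> str:
    """Render per-query (docno, score) pairs as TREC-style run lines.

    Assumption: one line per retrieved document, in the form
    "<qid> Q0 <docno> <rank> <score> <run_tag>", ranks starting at 1.
    """
    lines = []
    for qid in sorted(results):
        # Highest score first; rank is the 1-based position in that ordering.
        ranked = sorted(results[qid], key=lambda pair: pair[1], reverse=True)
        for rank, (docno, score) in enumerate(ranked, start=1):
            lines.append(f"{qid} Q0 {docno} {rank} {score:.4f} {run_tag}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical query id, document ids, and run tag.
    demo = {"q1": [("docB", 0.5), ("docA", 1.5)]}
    print(format_trec_run(demo, "teamX_run1"), end="")
```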

Outcome Measures

Task 1 - Named entity recognition and normalization of disorders

A. Boundary detection of disorders: identify the span of all named entities that could be classified by the UMLS semantic group Disorder (excluding the semantic type Findings)

Evaluation measure: F1-score

F1-score = (2 * Recall * Precision) / (Recall + Precision)
Recall = TP / (TP + FN)
Precision = TP / (TP + FP)
TP = same span
FP = spurious span
FN = missing span

Exact F1-score: span is identical to the reference standard span
Overlapping F1-score: span overlaps reference standard span
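
The following is a minimal sketch of the exact and overlapping F1-scores defined above, not the official eval.pl logic. It assumes each disorder is a single contiguous (start, end) character span with an exclusive end offset, and, in the overlapping setting, that a system span counts as a TP if it overlaps any reference span while a reference span counts as an FN if no system span overlaps it. How the official script treats many-to-one overlaps is not stated here, so that counting rule is an assumption.

```python
from typing import Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive (assumption)


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def span_f1(gold: Set[Span], system: Set[Span], exact: bool = True) -> float:
    """F1-score over disorder spans for one set of annotations."""
    if exact:
        tp = len(gold & system)   # same span
        fp = len(system - gold)   # spurious span
        fn = len(gold - system)   # missing span
    else:
        # Assumed overlap rule: see the lead-in paragraph.
        tp = sum(1 for s in system if any(overlaps(s, g) for g in gold))
        fp = len(system) - tp
        fn = sum(1 for g in gold if not any(overlaps(g, s) for s in system))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return (2 * recall * precision) / (recall + precision)


if __name__ == "__main__":
    gold = {(0, 5), (10, 20)}
    system = {(0, 5), (12, 18), (30, 35)}
    print(span_f1(gold, system, exact=True))
    print(span_f1(gold, system, exact=False))
```

In this toy example the exact score is approximately 0.4 (one identical span out of three system spans and two reference spans), and the overlapping score is approximately 0.8 because the span (12, 18) also counts as a match.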


B. Named entity recognition and normalization of disorders: identify the boundaries of disorders and map them to a SNOMED code

Evaluation measure: Accuracy

Accuracy = Correct/Total
Correct =  Number of disorder named entities with strictly correct span and correctly generated code
Total = Number of disorder named entities, depending on strict or relaxed setting:

Strict: Total = Total number of reference standard named entities. In this case, the system is also penalized for reference standard annotations that it did not detect.
Relaxed: Total = Total number of named entities with strictly correct span generated by the system. In this case, the system is only evaluated on annotations that were detected by the system.
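
To make the difference between the two denominators concrete, here is a small sketch under stated assumptions: annotations are represented as a mapping from an exact (start, end) span to a code, and the codes shown are placeholders rather than real SNOMED codes. It is not the official scoring code.

```python
from typing import Dict, Tuple

Span = Tuple[int, int]


def task1b_accuracy(gold: Dict[Span, str], system: Dict[Span, str], strict: bool = True) -> float:
    """Accuracy = Correct / Total for span detection plus code assignment."""
    # Correct: the system span is identical to a reference span and carries the same code.
    correct = sum(1 for span, code in system.items() if gold.get(span) == code)
    if strict:
        # Every reference standard entity counts, detected or not.
        total = len(gold)
    else:
        # Only entities whose span the system found exactly.
        total = sum(1 for span in system if span in gold)
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Placeholder codes, for illustration only.
    gold = {(0, 5): "code-A", (10, 20): "code-B", (30, 40): "code-C", (50, 60): "code-D"}
    system = {(0, 5): "code-A", (10, 20): "code-X", (70, 80): "code-E"}
    print(task1b_accuracy(gold, system, strict=True))   # 1 correct out of 4 reference entities
    print(task1b_accuracy(gold, system, strict=False))  # 1 correct out of 2 exactly detected spans
```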


Task 2 - Normalization of acronyms/abbreviations to UMLS codes

Evaluation measure: Accuracy

Accuracy = Correct/Total
Correct =  Number of pre-annotated acronyms/abbreviations with correctly generated code
Total = Number of pre-annotated acronyms/abbreviations

Strict Accuracy score: Correct = number of pre-annotated acronyms/abbreviations with the top code selected by the annotators (one best)
Relaxed Accuracy score: Correct = number of pre-annotated acronyms/abbreviations for which the code is contained in a list of possibly matching codes generated by the annotators (n-best)
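
The strict (one best) and relaxed (n-best) scores can be sketched as follows. The data layout is an assumption made for illustration: each pre-annotated acronym/abbreviation has a hypothetical identifier, the annotators' codes are stored as a list whose first element is the top code, and all codes shown are placeholders.

```python
from typing import Dict, List


def task2_accuracy(gold_nbest: Dict[str, List[str]], system: Dict[str, str], strict: bool = True) -> float:
    """Accuracy = Correct / Total over all pre-annotated acronyms/abbreviations."""
    correct = 0
    for ann_id, codes in gold_nbest.items():
        predicted = system.get(ann_id)
        if predicted is None or not codes:
            continue  # no system code (or no reference code) counts as incorrect
        if strict:
            correct += predicted == codes[0]   # one best: must equal the top code
        else:
            correct += predicted in codes      # n-best: any listed code is accepted
    return correct / len(gold_nbest) if gold_nbest else 0.0


if __name__ == "__main__":
    # Placeholder identifiers and codes, for illustration only.
    gold = {"a1": ["X1", "X2"], "a2": ["Y1"], "a3": ["Z1", "Z2"], "a4": ["W1"]}
    system = {"a1": "X2", "a2": "Y1", "a3": "Q9"}
    print(task2_accuracy(gold, system, strict=True))   # only a2 matches the top code
    print(task2_accuracy(gold, system, strict=False))  # a1 and a2 match a listed code
```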

Task 3 - Retrieval of web documents to address queries
Evaluation will focus on mean average precision (MAP), but other evaluation metrics such as precision at 10 (P@10) and other suitable IR evaluation measures will also be computed for the submitted runs.
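
For readers unfamiliar with these measures, the sketch below computes P@10, average precision, and MAP under simplifying assumptions: relevance is binary, each ranked list contains no duplicate documents, and MAP is averaged over every query that has relevance judgments. Official scores come from the evaluation tool, whose handling of edge cases may differ; the function names and document ids here are hypothetical.

```python
from typing import Dict, List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at the rank of each relevant document retrieved."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]) -> float:
    """Average of per-query average precision over all judged queries (assumption)."""
    if not qrels:
        return 0.0
    return sum(average_precision(runs.get(q, []), rel) for q, rel in qrels.items()) / len(qrels)


if __name__ == "__main__":
    runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6"]}
    qrels = {"q1": {"d1", "d3"}, "q2": {"d6"}}
    print(precision_at_k(runs["q1"], qrels["q1"], k=10))
    print(mean_average_precision(runs, qrels))
```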

Tools for Evaluation

Tasks 1 and 2
Two tools will be provided to perform evaluations on the training and test sets.

(1) Evaluation script: eval.pl
A Perl evaluation script will calculate all outcome measures and print the results to a file. The results from the script will be used to rank all system runs within each task. The script requires as input the directory containing the pipe-delimited reference standard annotations and the directory containing files of the same format with system annotations.

Parameters:

-input (prediction directory containing one pipe-delimited file per report)

-gold (goldstandard directory containing one pipe-delimited file per report)

-n (specify name of run)

-r (specify 1 or 2 for which run)

-t (specify 1a or 1b)

-a (optional - include if you used additional annotations)

Output file: name + run + task + add/noadd (e.g. myrun.1.1a.add)

Example:

perl eval.pl -n myrun -r 1 -t 1a -input /Users/wendyc/Desktop/CLEF/Task1TrainSetSystem200pipe -gold /Users/wendyc/Desktop/CLEF/Task1TrainSetGOLD200pipe -a
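
If you score several runs, it may be convenient to build the command line programmatically. The sketch below only assembles the parameters listed above and the output file name described above; the helper names and the directory paths are hypothetical, and it assumes eval.pl is in the current directory.

```python
from typing import List


def eval_command(name: str, run: int, task: str, prediction_dir: str,
                 gold_dir: str, additional: bool = False) -> List[str]:
    """Build the argument list for eval.pl from the documented parameters."""
    if run not in (1, 2):
        raise ValueError("run must be 1 or 2")
    if task not in ("1a", "1b"):
        raise ValueError("task must be '1a' or '1b'")
    cmd = ["perl", "eval.pl", "-n", name, "-r", str(run), "-t", task,
           "-input", prediction_dir, "-gold", gold_dir]
    if additional:
        cmd.append("-a")  # only when additional annotations were used
    return cmd


def expected_output_name(name: str, run: int, task: str, additional: bool = False) -> str:
    """Output file name: name + run + task + add/noadd."""
    return f"{name}.{run}.{task}.{'add' if additional else 'noadd'}"


if __name__ == "__main__":
    # Hypothetical directories; replace with your own.
    cmd = eval_command("myrun", 1, "1a", "/path/to/system", "/path/to/gold", additional=True)
    print(" ".join(cmd))
    print(expected_output_name("myrun", 1, "1a", additional=True))
    # To actually run the script: subprocess.run(cmd, check=True)
```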

(2) Evaluation Workbench 

We are providing a graphical interface for calculating outcome measures and for visualizing system annotations against reference standard annotations. Use of the Evaluation Workbench is completely optional. Because the Evaluation Workbench is still under development, we would appreciate your feedback and questions if you choose to use it.

A. Memory issues. You need to allocate extra heap space when you run the Workbench with all the files, or you will get an "out of memory" error. To do so, open a terminal (or shell) program, go to the directory containing the startup.properties file, and type:

java -Xms512m -Xmx1024m -jar Eval*.jar 

B. Startup Properties file. The Evaluation Workbench relies on a parameter file called "startup.properties". Since the Workbench is a tool for comparing two sets of annotations, the properties refer to the first (or gold standard) and second (or system) annotators. The following properties will need to be set before running the Workbench:

WorkbenchDirectory:  Full path of the directory where the executable (.jar) file is located. For example, WorkbenchDirectory=/Users/wendyc/Desktop/CLEF/EvaluationWorkbench

TextInputDirectory:  Directory containing the clinical reports (each document is a single text file in the directory). For example,
TextInputDirectory=/Users/wendyc/Desktop/CLEF/EvaluationWorkbench/Task1TrainSetCorpus200EvaluationWorkbench

AnnotationInputDirectoryFirstAnnotator / AnnotationInputDirectorySecondAnnotator:  Directories containing the sets of annotations (gold standard annotations are first, system annotations are second). If you do not have system annotations but just want to view the gold standard annotations, point both input directories to the gold standard annotations.

AnnotationInputDirectoryFirstAnnotator=/Users/wendyc/Desktop/CLEF/Task1TrainSetGOLD200pipe

AnnotationInputDirectorySecondAnnotator=/Users/wendyc/Desktop/CLEF/Task1TrainSetSystem200pipe

Please remember to set pathnames appropriate for your operating system.  MacOS / Unix pathnames are in the form "/applications/EvaluationWorkbench/…", whereas Windows paths are in the form "c:\\Program Files\\Evaluation Workbench\\…" (escape characters included).  After setting paths appropriately for your computer and operating system, you can activate the Workbench by going to the distribution directory and using the mouse to double-click the EvaluationWorkbench.jar icon.
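
As a rough illustration of the escaping rule for Windows paths, the sketch below renders the four properties named above as text that can be copied into startup.properties. The function name and the example paths are hypothetical, and the distributed file may contain other properties that should be left untouched, so this sketch does not write to the file itself.

```python
def render_startup_properties(workbench_dir: str, text_dir: str,
                              gold_dir: str, system_dir: str) -> str:
    """Render the four directory properties as key=value lines."""
    props = {
        "WorkbenchDirectory": workbench_dir,
        "TextInputDirectory": text_dir,
        "AnnotationInputDirectoryFirstAnnotator": gold_dir,     # gold standard
        "AnnotationInputDirectorySecondAnnotator": system_dir,  # system output
    }
    lines = []
    for key, value in props.items():
        # Java-style properties files treat a backslash as an escape character,
        # so each backslash in a Windows path is doubled.
        lines.append(f"{key}={value.replace(chr(92), chr(92) * 2)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical Unix-style paths; Windows paths would have their backslashes doubled.
    print(render_startup_properties(
        "/path/to/EvaluationWorkbench",
        "/path/to/corpus",
        "/path/to/gold",
        "/path/to/system",
    ), end="")
```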

C. Short tutorial on Evaluation Workbench (5 minute video here: http://screencast.com/t/QzaMLwWwFe):

  • To open the workbench, double click on the EvaluationWorkbench.jar file
  • To navigate the Workbench, most operations will involve holding down the CTRL key until the mouse is moved to a desired position; once the desired position is reached, release the CTRL key. 
  • The Workbench displays information in several panes
    • Statistics pane: rows are classifications (e.g., Disorder CUI); columns display a contingency table of counts and several outcome measures (e.g., F-measure). The intersecting cell is the outcome measure for that particular classification. When a cell is highlighted, the reports generating that value are shown in the Reports pane. When you move the mouse over a report in the Reports pane, that report will appear in the Document pane.
    • The Document pane displays annotations for the selected document. The parameter button with label "Display=" selects whether to view a single annotation set at a time (gold or system), or to view both at once. Pink annotations are those that occur in only one source, and so indicate a false negative error (if it appears in the gold but not the system annotation set) or false positive (if it appears in the system but not the gold set). Highlighting an annotation in the document pane updates the statistics pane to reflect statistics for that classification. It also shows the attributes and relationships for that annotation (not relevant for this dataset but in other datasets you may have attributes like Negation status or relationships like Location of).
    • The Detail panel on the lower right side displays relevant parameters, report names, and attribute and relation information. The parameters include "Annotator" (whether the currently selected annotator is Gold or System), "Display" (whether you are viewing gold annotations, system annotations, or both), "MatchMode" (whether matches must be exact or any-character overlap) and "MouseCtrl" (whether the CTRL key must be held down to activate selections).
  • You can store the evaluation measures to a file by selecting File->StoreOutcomeMeasures, and entering a selected file name.
  • How to generate outcome measures for the tasks using the Workbench 
    • Task 1a - boundary detection without normalization
      • Select "Span" on the outcome measure pane to calculate outcome measures based only on boundary detection
      • Exact and overlapping span can be toggled by changing the MatchMode parameter
    • Task 1b strict - boundary detection and normalization
      • Select "exact" in the MatchMode parameter for all annotations
      • Select "Span&Class" on the outcome measure pane to calculate outcome measures requiring the same span and the same CUI code.
    • Task 1b relaxed - boundary detection and normalization
      • Select "exact" in the MatchMode parameter 
      • Select "Class" on the outcome measure pane to calculate outcome measures only on annotations that were correctly identified by the system (i.e., if the system did not identify the boundary of a disorder, that disorder will not be used in determining whether the CUI was correct or incorrect).
To participate in an electronic dialogue about use of the Workbench, please sign up for the Google group: https://groups.google.com/forum/?fromgroups#!forum/evaluation-workbench

Task 3

Evaluation metrics can be computed with the trec_eval evaluation tool, which is available from http://trec.nist.gov/trec_eval/.