Evaluation results
Participating teams
Ten teams participated in the six sub-tasks and submitted a total of 31 runs.
- AliAI: rel, rel+ner
- Amrita_Cen: rel (2)
- AmritaCen_healthcare: norm, norm+ner
- BLAIR_GMU: kb (2), kb+ner (2), norm (2), norm+ner (2), rel (2), rel+ner (2)
- BOUN-ISIK: norm (2), rel (2)
- MIC-CIS: norm+ner (2)
- PADIA_BacReader: norm
- UTU: rel (2), rel+ner (2)
- whunlp: rel
- Yuhang_Wu: rel
Evaluation
You may check and evaluate your predictions:
- on the training and development sets with the evaluation software;
- on the test set with the online evaluation service.
The evaluation is described on the BioNLP-ST 2016 Bacteria Biotope page.
You can download all the charts and tables shown below (BB-rel, BB-rel+ner, BB-norm, BB-norm+ner, BB-kb, BB-kb+ner).
Baseline
Submissions are compared to a simple baseline:
- Case-insensitive string matching for NER and normalization.
- All valid pairs of arguments inside a sentence for relation extraction.
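The two baseline strategies can be sketched as follows. This is an illustrative reimplementation, not the organizers' scripts; the lexicon and entity representations are assumptions.

```python
def baseline_ner(sentence_text, lexicon):
    """Case-insensitive string matching: predict an entity wherever a
    lexicon surface form occurs in the sentence.
    `lexicon` maps a surface string to its entity type."""
    lower = sentence_text.lower()
    predictions = []
    for surface, label in lexicon.items():
        start = lower.find(surface.lower())
        while start != -1:
            predictions.append((start, start + len(surface), label))
            start = lower.find(surface.lower(), start + 1)
    return predictions

def baseline_relations(entities, valid_pairs):
    """Predict every valid argument pair within a sentence.
    `entities` maps entity id -> type; `valid_pairs` is the set of
    (type_a, type_b) combinations allowed by the relation schema."""
    predicted = []
    for a, type_a in entities.items():
        for b, type_b in entities.items():
            if a != b and (type_a, type_b) in valid_pairs:
                predicted.append((a, b))
    return predicted
```

Because the relation baseline predicts every schema-compatible pair in a sentence, it tends to have high recall and low precision, which is visible in the charts below.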
Confidence intervals
Confidence intervals were obtained by bootstrap resampling (n=100).
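For illustration, a bootstrap interval on micro-F1 can be computed by resampling documents with replacement and taking empirical percentiles. The per-document counts and the percentile method are assumptions; the official procedure may differ in detail.

```python
import random

def bootstrap_f1_ci(doc_counts, n=100, alpha=0.05, seed=0):
    """doc_counts: list of (tp, fp, fn) per document.  Resample the
    documents with replacement n times, recompute micro-F1 each time,
    and return the empirical (alpha/2, 1 - alpha/2) interval."""
    rng = random.Random(seed)

    def micro_f1(counts):
        tp = sum(c[0] for c in counts)
        fp = sum(c[1] for c in counts)
        fn = sum(c[2] for c in counts)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    scores = sorted(
        micro_f1([rng.choice(doc_counts) for _ in doc_counts])
        for _ in range(n)
    )
    lo = scores[int((alpha / 2) * n)]
    hi = scores[min(n - 1, int((1 - alpha / 2) * n))]
    return lo, hi
```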
BB-rel
7 teams, 11 runs.
Global results
Recall, Precision and F1 for both relation types.
Lives_In
Recall, Precision and F1 for Lives_In relations.
The tick on top of each bar indicates the score for relations that do not cross sentence boundaries.
Exhibits
Recall, Precision and F1 for Exhibits relations.
The tick on top of each bar indicates the score for relations that do not cross sentence boundaries.
BB-rel+ner
3 teams, 5 runs.
Slot Error Rate
The Slot Error Rate (SER) is shown instead of F1, because substitution errors are penalized both in Recall and Precision.
SER is an error rate, therefore lower values are better.
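To see why a substitution hurts both measures, here is a sketch of how the slot counts combine under the usual SER definition. The exact alignment and weighting rules of the official scorer may differ (e.g. fractional substitutions for partial matches).

```python
def ser_scores(n_ref, n_pred, matches, substitutions):
    """Slot counts: a deletion is a reference slot with no aligned
    prediction, an insertion is a predicted slot matching nothing.
    A substitution (a predicted slot aligned to a reference slot but
    wrong) lowers both Recall and Precision, which is why SER is
    reported instead of F1."""
    deletions = n_ref - matches - substitutions
    insertions = n_pred - matches - substitutions
    recall = matches / n_ref
    precision = matches / n_pred
    ser = (substitutions + deletions + insertions) / n_ref
    return recall, precision, ser
```

Note that SER has no upper bound of 1: a system producing many spurious slots can exceed it.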
Named entity boundaries
Named-entity boundary accuracy is measured by the Jaccard index.
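On character spans, the Jaccard index is the overlap of the predicted and reference offsets divided by their union. A minimal sketch (the span representation is an assumption):

```python
def span_jaccard(pred, gold):
    """Jaccard index between two character spans (start, end):
    |intersection| / |union| of the character offsets covered.
    1.0 means exact boundaries, 0.0 means no overlap."""
    (ps, pe), (gs, ge) = pred, gold
    inter = max(0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union else 0.0
```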
Global results
Recall, Precision and SER for both relation types.
Lives_In (Habitat)
Recall, Precision and SER for Lives_In relations where the argument is of type Habitat.
The tick on each bar indicates the gain when entity boundary accuracy is ignored.
Lives_In (Geographical)
Recall, Precision and SER for Lives_In relations where the argument is of type Geographical.
The tick on each bar indicates the gain when entity boundary accuracy is ignored.
Exhibits
Recall, Precision and SER for Exhibits relations.
The tick on each bar indicates the gain when entity boundary accuracy is ignored.
BB-norm
4 teams, 6 runs.
Global results
The result is the average similarity between predicted and reference normalizations.
For Microorganism entities, strict equality is used.
For Habitat and Phenotype entities, the Wang similarity is used (w=0.65).
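The Wang measure propagates a weight up the ontology from each concept and compares the shared ancestors of the two concepts. A sketch under the standard definition (Wang et al., 2007), with a toy `parents` map standing in for the OntoBiotope hierarchy:

```python
def s_values(term, parents, w=0.65):
    """S-values of `term`: the term itself contributes 1.0, and each
    ancestor contributes w ** (path length) along the best path."""
    sv = {term: 1.0}
    frontier = [term]
    while frontier:
        nxt = []
        for t in frontier:
            for p in parents.get(t, ()):
                cand = sv[t] * w
                if cand > sv.get(p, 0.0):  # keep the best path only
                    sv[p] = cand
                    nxt.append(p)
        frontier = nxt
    return sv

def wang_similarity(a, b, parents, w=0.65):
    """sim(A, B) = sum over shared ancestors t of (S_A(t) + S_B(t)),
    divided by (SV(A) + SV(B)); equals 1.0 for identical concepts."""
    sa, sb = s_values(a, parents, w), s_values(b, parents, w)
    common = set(sa) & set(sb)
    num = sum(sa[t] + sb[t] for t in common)
    return num / (sum(sa.values()) + sum(sb.values()))
```

With two sibling habitat concepts sharing one parent, the similarity is 1.3/3.3 ≈ 0.39; an exact match scores 1.0, so the reported averages reward near-misses in the ontology.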
Microorganisms
Average strict equality of normalizations for Microorganism entities.
Habitats
Average Wang similarity of normalizations for Habitat entities.
Habitats (exact)
Average strict equality of normalizations for Habitat entities.
Habitats (new in test)
Average Wang similarity of normalizations for Habitat entities. Only normalizations with concepts absent from the training and development sets were considered.
Phenotypes
Average Wang similarity of normalizations for Phenotype entities.
Phenotypes (exact)
Average strict equality of normalizations for Phenotype entities.
Phenotypes (new in test)
Average Wang similarity of normalizations for Phenotype entities. Only normalizations with concepts absent from the training and development sets were considered.
BB-norm+ner
3 teams, 5 runs.
Slot Error Rate
The Slot Error Rate (SER) is shown instead of F1, because substitution errors are penalized both in Recall and Precision.
SER is an error rate, therefore lower values are better.
Named entity boundaries
Named-entity boundary accuracy is measured by the Jaccard index.
Global results
Recall, Precision, and SER for all entities.
The score for each individual entity is the product of its boundary accuracy (Jaccard) and its normalization score (as in BB-norm).
Microorganisms
Results for Microorganism entities only (Jaccard × Equality).
Habitats
Results for Habitat entities only (Jaccard × Wang).
Phenotypes
Results for Phenotype entities only (Jaccard × Wang).
Microorganisms NER
Boundary accuracy (Jaccard) for Microorganism entities.
Habitats NER
Boundary accuracy (Jaccard) for Habitat entities.
Phenotypes NER
Boundary accuracy (Jaccard) for Phenotype entities.
BB-kb and BB-kb+ner
1 team, 2 runs.
The evaluation emulates the capacity of systems to populate databases from a corpus. Pairs of database references (NCBI and OntoBiotope) are evaluated regardless of their text-bound anchors or of their redundancy in the corpus.
The Mean References score is the average Wang similarity (w=0.65) over the OntoBiotope arguments.