We use two metrics for evaluation:
Lenient F1
Exact F1
The main score is the lenient F1. It is computed using a Jaccard Index, similar to the one used in the 2013 BioNLP shared task [Bossy et al. 2013].
First, an optimal pairwise matching is performed between the reference and the predicted entities. The Jaccard Index is used as the similarity metric. The matching returns a number of substitutions S, a number of insertions I and a number of deletions D.
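The matching step can be illustrated with a small brute-force sketch. It assumes the pairwise similarities are already available as a matrix, that a pair with zero similarity is not counted as matched, and that unmatched predictions are counted as insertions and unmatched references as deletions. The function name `optimal_matching` and its signature are hypothetical, and the text does not say which algorithm the official scorer uses.

```python
from itertools import permutations


def optimal_matching(sim, n_ref, n_pred):
    """Brute-force one-to-one matching that maximises total similarity.

    sim[i][j] is the similarity between reference i and prediction j.
    Returns (pairs, insertions, deletions). Illustrative only: the
    running time is factorial, so this suits small examples.
    """
    best_total, best_pairs = -1.0, []
    if n_ref <= n_pred:
        # Assign each reference to a distinct prediction.
        for perm in permutations(range(n_pred), n_ref):
            pairs = [(i, j) for i, j in enumerate(perm) if sim[i][j] > 0]
            total = sum(sim[i][j] for i, j in pairs)
            if total > best_total:
                best_total, best_pairs = total, pairs
    else:
        # Assign each prediction to a distinct reference.
        for perm in permutations(range(n_ref), n_pred):
            pairs = [(i, j) for j, i in enumerate(perm) if sim[i][j] > 0]
            total = sum(sim[i][j] for i, j in pairs)
            if total > best_total:
                best_total, best_pairs = total, pairs
    # Assumption: unmatched predictions are insertions,
    # unmatched references are deletions.
    insertions = n_pred - len(best_pairs)
    deletions = n_ref - len(best_pairs)
    return best_pairs, insertions, deletions


# A case where picking the single best pair first would be suboptimal.
sim = [[0.6, 0.5],
       [0.5, 0.0]]
pairs, insertions, deletions = optimal_matching(sim, 2, 2)
print(pairs, insertions, deletions)
```

For realistic document sizes, an assignment solver such as `scipy.optimize.linear_sum_assignment` (with `maximize=True`) would be a more practical choice than enumeration.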
The Jaccard Index for a reference entity and a predicted entity is defined as follows:
\[ J(\mathit{ref}, \mathit{pred}) = \frac{|\mathit{ref} \cap \mathit{pred}|}{|\mathit{ref} \cup \mathit{pred}|} \]
J measures the ratio of intersection over union between the reference and the predicted entities. For a pair of entities with exactly the same boundaries, J equals 1.
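As a minimal sketch of this intersection-over-union ratio, the function below assumes each entity is a single contiguous character span given as a half-open `(start, end)` pair. That representation and the name `jaccard` are assumptions for illustration. Discontinuous entities and any entity-type constraint are not handled.

```python
def jaccard(ref, pred):
    """Jaccard Index of two half-open character spans (start, end)."""
    # Length of the overlapping region, or 0 if the spans are disjoint.
    overlap = max(0, min(ref[1], pred[1]) - max(ref[0], pred[0]))
    # Union length = sum of both lengths minus the shared part.
    union = (ref[1] - ref[0]) + (pred[1] - pred[0]) - overlap
    return overlap / union if union > 0 else 0.0


print(jaccard((0, 10), (0, 10)))  # identical boundaries
print(jaccard((0, 10), (5, 15)))  # partial overlap: 5 shared / 15 total
print(jaccard((0, 5), (5, 10)))   # adjacent, no overlap
```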
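We define the lenient Precision, Recall and F1 in the following way:
\[ \mathrm{Precision} = \frac{\mathcal{J}}{P}, \qquad \mathrm{Recall} = \frac{\mathcal{J}}{N}, \qquad \mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]
where $\mathcal{J}$ is the sum of the similarity J over all the pairs in the optimal matching, N is the total number of entities in the reference set and P is the number of entities in the prediction.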
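The sketch below shows how these quantities might combine. It assumes that precision divides the summed similarity by the number of predicted entities, that recall divides it by the number of reference entities, and that F1 is their harmonic mean. The function name `lenient_scores` and the choice to return 0.0 when a denominator is zero are illustrative assumptions, not part of the task definition.

```python
def lenient_scores(similarities, n_ref, n_pred):
    """Lenient precision, recall and F1 from matched-pair similarities.

    similarities: Jaccard values of the pairs in the optimal matching.
    n_ref: number of reference entities (N).
    n_pred: number of predicted entities (P).
    """
    total = sum(similarities)  # summed similarity over matched pairs
    # Assumption: a score with an empty denominator is reported as 0.0.
    precision = total / n_pred if n_pred else 0.0
    recall = total / n_ref if n_ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Two matched pairs (one exact, one partial), 2 references, 3 predictions.
precision, recall, f1 = lenient_scores([1.0, 0.5], n_ref=2, n_pred=3)
print(precision, recall, f1)
```

In this example the summed similarity is 1.5, so precision is 1.5 / 3 and recall is 1.5 / 2.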
[Bossy et al. 2013] Bossy, R., Golik, W., Ratkovic, Z., Bessières, P., Nédellec, C.: BioNLP shared task 2013 – an overview of the bacteria biotope task. In: Proceedings of the BioNLP Shared Task 2013 Workshop. pp. 161–169. Association for Computational Linguistics, Sofia, Bulgaria (Aug 2013)
[Makhoul et al. 1999] Makhoul, J., Kubala, F., Schwartz, R., Weischedel, R.: Performance measures for information extraction. In: Proceedings of DARPA Broadcast News Workshop. pp. 249–252 (1999)