Evaluation metrics
For Subtask 1, we will use the metrics proposed by Piad-Morffis et al. (2020), which define correct, partial, missing, incorrect, and spurious matches for the span classification task.
Correct matches are reported when a text span in the predicted file matches a span in the gold file exactly in both start and end indices, as well as in the 5W1H label. Each entry in the gold file can produce at most one correct match.
Incorrect matches are reported when the start and end indices match but the 5W1H label does not.
Partial matches are reported when two intervals [start, end] have a non-empty intersection and the 5W1H label matches, as in the case of “scientists” and “scientists specialised in biophysics” in the previous example. Notice that a partial phrase is matched against a single gold phrase only. For example, “researchers from the GPLSI Group” could partially match both “researchers” and “the GPLSI Group”, but it is counted only once, as a partial match with “researchers”; “the GPLSI Group” is then counted as missing. This discourages submissions in which a few large text spans covering most of the document obtain a very high score.
Missing matches are those that appear in the gold file but not in the predicted file.
Spurious matches are those that appear in the predicted file but not in the gold file.
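To make these definitions concrete, the following is a minimal Python sketch (not the official scorer) of how predicted spans might be categorized, assuming spans are represented as (start, end, label) tuples with end-exclusive indices:

def evaluate_spans(gold, predicted):
    """Categorize predicted spans against gold spans.

    Both arguments are lists of (start, end, label) tuples.
    Returns the counts (correct, partial, incorrect, missing, spurious).
    """
    correct = partial = incorrect = 0
    matched_gold, matched_pred = set(), set()

    # First pass: exact boundary matches, split into correct (label agrees)
    # and incorrect (label differs). Each gold entry is consumed at most once.
    for i, (ps, pe, pl) in enumerate(predicted):
        for j, (gs, ge, gl) in enumerate(gold):
            if j in matched_gold:
                continue
            if (ps, pe) == (gs, ge):
                if pl == gl:
                    correct += 1
                else:
                    incorrect += 1
                matched_gold.add(j)
                matched_pred.add(i)
                break

    # Second pass: partial matches -- overlapping intervals with the same
    # label. Each predicted span pairs with at most one gold span, so a
    # large span cannot collect credit for several gold phrases.
    for i, (ps, pe, pl) in enumerate(predicted):
        if i in matched_pred:
            continue
        for j, (gs, ge, gl) in enumerate(gold):
            if j in matched_gold:
                continue
            if max(ps, gs) < min(pe, ge) and pl == gl:
                partial += 1
                matched_gold.add(j)
                matched_pred.add(i)
                break

    missing = len(gold) - len(matched_gold)        # gold-only entries
    spurious = len(predicted) - len(matched_pred)  # prediction-only entries
    return correct, partial, incorrect, missing, spurious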
From these definitions, we compute precision, recall, and a standard F1-Score as follows:
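Assuming the standard formulation of these metrics in Piad-Morffis et al. (2020), where partial matches receive half credit:

\[
\text{Precision} = \frac{\text{correct} + \frac{1}{2}\,\text{partial}}{\text{correct} + \text{partial} + \text{incorrect} + \text{spurious}},
\qquad
\text{Recall} = \frac{\text{correct} + \frac{1}{2}\,\text{partial}}{\text{correct} + \text{partial} + \text{incorrect} + \text{missing}},
\]
\[
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
\]

Missing and spurious matches thus lower recall and precision, respectively, while partial matches contribute half credit to both.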
For Subtask 2, which is a classification task, the proposed evaluation metrics are: accuracy, which measures the proportion of correctly classified news items; precision and recall, which assess the model's ability to avoid false positives and to capture all actual positives, respectively; and the F1-Score, the harmonic mean of precision and recall, which balances both metrics.
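As an illustration, these metrics can be computed with scikit-learn; the binary label set and macro-averaging below are assumptions for the sketch, not the official scoring configuration:

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical gold labels and predictions for a news classification task.
y_true = ["real", "fake", "real", "fake", "real"]
y_pred = ["real", "real", "real", "fake", "fake"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"Accuracy={accuracy:.2f}  Precision={precision:.2f}  "
      f"Recall={recall:.2f}  F1={f1:.2f}")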