Evaluation metrics

For Subtask 1, we will use the metrics proposed by Piad-Morffis et al. (2020), which define correct, partial, missing, incorrect, and spurious matches for the span classification task.

From these definitions, we compute precision, recall, and a standard F1-Score as follows: 
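Here, following the scheme of Piad-Morffis et al. (2020), let $C$, $P$, $M$, $I$, and $S$ denote the counts of correct, partial, missing, incorrect, and spurious matches, with partial matches receiving half credit:

$$\text{Precision} = \frac{C + \frac{1}{2}P}{C + P + I + S}, \qquad \text{Recall} = \frac{C + \frac{1}{2}P}{C + P + I + M}, \qquad \text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

As a minimal sketch of the computation (function and argument names are ours, not those of the official scorer), assuming the five match counts have already been tallied:

```python
def span_metrics(correct, partial, missing, incorrect, spurious):
    """Precision, recall, and F1 with half credit for partial matches,
    following Piad-Morffis et al. (2020)."""
    credited = correct + 0.5 * partial
    predicted = correct + partial + incorrect + spurious  # all system predictions
    gold = correct + partial + incorrect + missing        # all gold-standard spans
    precision = credited / predicted if predicted else 0.0
    recall = credited / gold if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```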

For Subtask 2, which is a classification task, the proposed evaluation metrics are accuracy, precision, recall, and the F1-Score. Accuracy measures the proportion of correctly classified news items; precision and recall assess the model's ability to distinguish true positives from false positives and to capture all actual positives, respectively; and the F1-Score is the harmonic mean of precision and recall, balancing both.
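In standard notation, with $TP$, $TN$, $FP$, and $FN$ the counts of true positives, true negatives, false positives, and false negatives:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad \text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

These can be computed, for instance, with scikit-learn (an illustrative sketch, not the official scorer; the labels below are made up):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Illustrative gold and predicted labels (1 = positive class)
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary")
```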