The evaluation scheme assesses retrieval effectiveness and the accuracy of veracity predictions separately.
Subtask 1 Evaluation:
Evidence retrieval will be evaluated using Success@3 and nDCG@3. Success@3 measures whether at least one relevant piece of evidence appears among the top 3 retrieved results. nDCG@3 measures how well the retrieval system orders the top 3 results, rewarding rankings that place highly relevant evidence first.
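As an illustration only, the two retrieval metrics could be computed for a single query roughly as follows. This is a minimal sketch, not the official scorer: the function names, the input format (a list of relevance grades in ranked order plus the grades of all gold evidence for the query), and the linear-gain, log2-discount form of DCG are assumptions.

```python
import math

def success_at_k(ranked_rels, k=3):
    """Return 1.0 if any of the top-k retrieved items is relevant, else 0.0."""
    return 1.0 if any(rel > 0 for rel in ranked_rels[:k]) else 0.0

def dcg_at_k(rels, k=3):
    """Discounted cumulative gain over the first k relevance grades.

    Assumes linear gain and a log2(rank + 1) discount with 1-based ranks.
    """
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg_at_k(ranked_rels, gold_rels, k=3):
    """DCG of the system ranking divided by the DCG of an ideal ranking.

    ranked_rels: relevance grades of the retrieved items, in ranked order.
    gold_rels:   relevance grades of all gold evidence for the query
                 (hypothetical input; used to build the ideal ranking).
    """
    ideal = dcg_at_k(sorted(gold_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / ideal if ideal > 0 else 0.0

# Example: one relevant document exists and the system ranks it second.
ranked = [0, 1, 0]
gold = [1]
success = success_at_k(ranked)
ndcg = ndcg_at_k(ranked, gold)
```

In this example Success@3 is satisfied because the relevant item is within the top 3, while nDCG@3 is below 1 because the item is not ranked first. Per-query scores would typically be averaged over all claims.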
Veracity prediction will be evaluated using macro F1-score over the "SUPPORTS" and "REFUTES" classes.
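For reference, macro F1 over the two classes is the unweighted mean of the per-class F1 scores. The sketch below is an assumption about how such a score could be computed, not the organizers' scoring script; the function name and the list-of-labels input format are hypothetical.

```python
def macro_f1(gold, pred, labels=("SUPPORTS", "REFUTES")):
    """Unweighted mean of per-class F1 over the given labels.

    gold, pred: equal-length sequences of label strings.
    A class with no gold and no predicted instances scores 0.0 here,
    which is an assumption; an official scorer may treat it differently.
    """
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN)
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Example with four claims, one SUPPORTS claim misclassified as REFUTES.
gold = ["SUPPORTS", "SUPPORTS", "REFUTES", "REFUTES"]
pred = ["SUPPORTS", "REFUTES", "REFUTES", "REFUTES"]
score = macro_f1(gold, pred)
```

Because each class contributes equally to the mean, a system that predicts only the majority class is penalized even when its overall accuracy is high.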
Subtask 2 Evaluation:
The primary metric will be the macro F1-score for veracity prediction.
Generated evidence summaries and justifications will be manually evaluated.