Evaluation

The evaluation of performance on this task will involve several steps. First, we will assess performance on the original NLI4CT statements without any interventions. This assessment will be based on Macro F1-score.

Next, we will evaluate performance on the contrast set, which includes all statements with interventions. For this evaluation, we will use two new metrics: faithfulness and consistency, which are defined below. The overall ranking of a system will be determined by calculating the average faithfulness and consistency scores across all intervention types.


Faithfulness is a measure of the extent to which a given system arrives at the correct prediction for the correct reason. Intuitively, this is estimated as the ability of a model to correctly change its prediction when exposed to a semantics-altering intervention. Given N statements x_i in the contrast set C, their respective original statements y_i, and model predictions f(·), we compute faithfulness using Equation 1.
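The intuition above can be sketched as a simple scoring function: count how often the model's prediction on a semantics-altering contrast statement differs from its prediction on the corresponding original statement. This is an illustrative sketch, not the official scorer; the function and variable names are our own, and the official metric may additionally restrict the computation to cases where the original statement was predicted correctly.

```python
def faithfulness(contrast_preds, original_preds):
    """Fraction of semantics-altering contrast statements whose predicted
    label differs from the prediction on the corresponding original
    statement (an illustrative approximation of Equation 1)."""
    assert len(contrast_preds) == len(original_preds)
    changed = sum(c != o for c, o in zip(contrast_preds, original_preds))
    return changed / len(contrast_preds)


# Example: the model flips its label for the first pair but not the second.
score = faithfulness(["Contradiction", "Entailment"],
                     ["Entailment", "Entailment"])
print(score)  # 0.5
```

A higher score indicates that the system reacts to interventions that genuinely change the meaning of a statement, rather than relying on surface cues.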

Consistency is a measure of the extent to which a given system produces the same output for semantically equivalent problems. It is therefore measured as the ability of a system to predict the same label for an original statement and its contrast statement under a semantics-preserving intervention. That is, even if the final prediction is incorrect, the representation of the semantic phenomenon is consistent across the two statements. Given N statements x_i in the contrast set C, their respective original statements y_i, and model predictions f(·), we compute consistency using Equation 2.
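By the same logic, consistency can be sketched as the fraction of semantics-preserving pairs that receive the same predicted label, regardless of whether that label is correct. As above, this is an illustrative sketch with invented names, not the official scorer.

```python
def consistency(contrast_preds, original_preds):
    """Fraction of semantics-preserving contrast statements that receive
    the same predicted label as their original statement (an illustrative
    approximation of Equation 2)."""
    assert len(contrast_preds) == len(original_preds)
    same = sum(c == o for c, o in zip(contrast_preds, original_preds))
    return same / len(contrast_preds)


# Example: the model agrees with itself on the first pair only.
score = consistency(["Entailment", "Entailment"],
                    ["Entailment", "Contradiction"])
print(score)  # 0.5
```

Note that correctness of the labels never enters this computation; the metric only checks that the two predictions match, which is what makes it complementary to faithfulness.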

Evaluation data

During the 'practice' phase, the prediction files submitted by participants will be evaluated against the gold practice test set.

During the 'evaluation' phase, the prediction files submitted by participants will be evaluated against the gold test set.