Evaluation
All submissions are evaluated automatically on the official Codabench server. Participants submit a single prediction file for the released public test set, and the public leaderboard reports one Overall Score.
Metrics
Submissions are ranked by a normalized Overall Score that aggregates the following metrics:
Accuracy: For single-choice questions (Cause & Handling, Factuality).
F1-Score: For defect existence detection.
MAE (Mean Absolute Error): For counting tasks, bounded and normalized.
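The exact bounding, normalization, and weighting used by the official scorer are not specified here, so the following is only a rough sketch of how the three metrics might be combined into a single score in [0, 1]. The function names, the `max_error` cap, the "1 minus normalized error" conversion, and the equal weights are all assumptions made for illustration, not the organizers' implementation.

```python
from typing import Sequence


def _check(y_true: Sequence, y_pred: Sequence) -> None:
    """Reject empty or mismatched inputs before scoring."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of single-choice answers that match exactly."""
    _check(y_true, y_pred)
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Binary F1 for defect existence (True means a defect is present)."""
    _check(y_true, y_pred)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bounded_mae_score(
    y_true: Sequence[float],
    y_pred: Sequence[float],
    max_error: float = 10.0,  # hypothetical cap; the official bound is not stated
) -> float:
    """Assumed scheme: clip each absolute error at max_error, average,
    divide by max_error, and return 1 - result so that higher is better."""
    _check(y_true, y_pred)
    errors = [min(abs(t - p), max_error) for t, p in zip(y_true, y_pred)]
    return 1.0 - (sum(errors) / len(errors)) / max_error


def overall_score(
    acc: float,
    f1: float,
    mae_score: float,
    weights: Sequence[float] = (1.0, 1.0, 1.0),  # assumed equal weighting
) -> float:
    """Weighted mean of the three component scores."""
    total = sum(weights)
    return (weights[0] * acc + weights[1] * f1 + weights[2] * mae_score) / total
```

Under these assumptions every component lies in [0, 1] with higher being better, so a plain weighted mean stays in the same range. The official Overall Score may use a different cap, different weights, or per-task averaging.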
Coming Soon