Submissions will be assessed using a combination of automated metrics and expert evaluation.
Task 1: Causal Explanation Generation Evaluation
We will use a multi-faceted evaluation framework that was validated in the NTCIR-18 HIDDEN-RAD task.
Automatic Evaluation (80%):
Semantic Similarity:
BERTScore: Measures contextual semantic similarity at the token level.
BioSentVec: Assesses clinical semantic similarity using sentence embeddings specialized for the biomedical domain.
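For illustration, the sketch below shows how these two similarity metrics can be computed with the publicly available bert-score and sent2vec packages. The model checkpoint, preprocessing, and score aggregation are assumptions for this sketch; the official scoring pipeline may differ.

```python
# Minimal sketch of the two semantic-similarity metrics (illustrative only).
import numpy as np
from bert_score import score as bertscore  # pip install bert-score
import sent2vec                            # pip install sent2vec

candidates = ["The consolidation suggests pneumonia causing the effusion."]
references = ["Pneumonia is the likely cause of the pleural effusion."]

# BERTScore: token-level contextual similarity; F1 is typically reported.
_, _, f1 = bertscore(candidates, references, lang="en")
print(f"BERTScore F1: {f1.mean().item():.4f}")

# BioSentVec: sentence-level cosine similarity with biomedical embeddings.
# Pretrained binary available from https://github.com/ncbi-nlp/BioSentVec.
model = sent2vec.Sent2vecModel()
model.load_model("BioSentVec_PubMed_MIMICIII-bigram_d700.bin")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Lowercasing stands in for BioSentVec's full preprocessing (assumed here).
cand_vec = model.embed_sentence(candidates[0].lower())[0]
ref_vec = model.embed_sentence(references[0].lower())[0]
print(f"BioSentVec cosine similarity: {cosine(cand_vec, ref_vec):.4f}")
```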
LLM-based Evaluation:
LLM-White: Evaluates contextual and semantic overlap with the ground-truth explanation. See https://github.com/hidden-rad/Evaluation-Scheme-Experiment- for the evaluation scheme used in the NTCIR-18 HIDDEN-RAD task.
LLM-Black: Evaluates the internal completeness, logical flow, and causal validity of the generated text using a scoring system with bonuses and penalties.
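The authoritative prompts and scoring logic live in the repository linked above. As a rough sketch only, an LLM-White-style judgment could be issued as follows; the prompt wording, judge model, and 0-10 scale below are placeholder assumptions.

```python
# Hedged sketch of an LLM-as-judge overlap score (not the official prompt).
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = """You are grading a generated causal explanation against a
ground-truth explanation from a radiology report.

Ground truth: {reference}
Generated: {candidate}

Rate the contextual and semantic overlap on a 0-10 scale and reply with
only the number."""

def llm_white_score(candidate: str, reference: str) -> float:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model; the task may specify another
        messages=[{"role": "user",
                   "content": PROMPT.format(reference=reference,
                                            candidate=candidate)}],
        temperature=0,
    )
    return float(response.choices[0].message.content.strip())

print(llm_white_score(
    "The consolidation suggests pneumonia causing the effusion.",
    "Pneumonia is the likely cause of the pleural effusion."))
```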
Qualitative Evaluation by Experts (20%):
Radiologists will manually review top-performing submissions to assess the clinical validity, readability, and completeness of the explanations.
Task 2: Causal Verification and Correction Evaluation
Detection Performance: Measured using Precision, Recall, and F1-score to evaluate how accurately systems identify errors (see the first sketch after this list).
Correction Quality: The semantic and clinical accuracy of the corrected text will be measured using BERTScore, BioSentVec, and GPT-based evaluations.
Confidence Calibration: Evaluates how well the model's self-reported confidence score aligns with its actual correction accuracy (see the calibration sketch below).
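As referenced above, the detection metrics can be computed as in the following sketch, which assumes binary per-statement error labels (1 = contains a causal error); the official label granularity, e.g. span-level, may differ.

```python
# Minimal sketch of the detection metrics under a binary-label assumption.
from sklearn.metrics import precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # gold: which statements are erroneous
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # system predictions

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary")
print(f"P={precision:.3f}  R={recall:.3f}  F1={f1:.3f}")
```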
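The task description does not name a specific calibration metric; expected calibration error (ECE) is one common choice, sketched here under that assumption.

```python
# Sketch of expected calibration error (ECE): the gap between self-reported
# confidence and observed accuracy, averaged over confidence bins.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Mean |accuracy - confidence| per equal-width confidence bin,
    weighted by the fraction of samples in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean()
                                     - confidences[mask].mean())
    return ece

conf = [0.95, 0.80, 0.60, 0.90, 0.30]  # self-reported confidences
hit = [1, 1, 0, 1, 0]                  # whether each correction was right
print(f"ECE: {expected_calibration_error(conf, hit):.3f}")
```

A well-calibrated system yields an ECE near zero, meaning a correction reported at 0.9 confidence is in fact correct about 90% of the time.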