Submissions will be assessed using a combination of automated metrics and expert evaluation.
Task 1: Causal Explanation Generation Evaluation
We will use a multi-faceted evaluation framework that was validated in the NTCIR-18 HIDDEN-RAD task.
Automatic Evaluation (80%):
Semantic Similarity:
BERTScore: Measures contextual semantic similarity at the token level.
BioSentVec: Assesses clinical semantic similarity using sentence embeddings specialized for the biomedical domain.
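For illustration, the sketch below shows how these two similarity metrics can be computed with the publicly available bert-score and sent2vec packages. The model checkpoint, preprocessing, and score aggregation are assumptions for this sketch; the official scoring pipeline may differ.

```python
# Minimal sketch of the two semantic-similarity metrics (illustrative only).
import numpy as np
from bert_score import score as bertscore  # pip install bert-score
import sent2vec                            # pip install sent2vec

candidates = ["The consolidation suggests pneumonia causing the effusion."]
references = ["Pneumonia is the likely cause of the pleural effusion."]

# BERTScore: token-level contextual similarity; F1 is typically reported.
_, _, f1 = bertscore(candidates, references, lang="en")
print(f"BERTScore F1: {f1.mean().item():.4f}")

# BioSentVec: sentence-level cosine similarity with biomedical embeddings.
# Pretrained binary available from https://github.com/ncbi-nlp/BioSentVec.
model = sent2vec.Sent2vecModel()
model.load_model("BioSentVec_PubMed_MIMICIII-bigram_d700.bin")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Lowercasing stands in for BioSentVec's full preprocessing (assumed here).
cand_vec = model.embed_sentence(candidates[0].lower())[0]
ref_vec = model.embed_sentence(references[0].lower())[0]
print(f"BioSentVec cosine similarity: {cosine(cand_vec, ref_vec):.4f}")
```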
LLM-based Evaluation:
LLM-White: Evaluates contextual and semantic overlap with the ground-truth explanation. See https://github.com/hidden-rad/Evaluation-Scheme-Experiment- for the evaluation scheme used in the NTCIR-18 HIDDEN-RAD task.
LLM-Black: Evaluates the internal completeness, logical flow, and causal validity of the generated text using a scoring system with bonuses and penalties.
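The authoritative prompts and scoring logic live in the repository linked above. As a rough sketch only, an LLM-White-style judgment could be issued as follows; the prompt wording, judge model, and 0-10 scale below are placeholder assumptions.

```python
# Hedged sketch of an LLM-as-judge overlap score (not the official prompt).
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = """You are grading a generated causal explanation against a
ground-truth explanation from a radiology report.

Ground truth: {reference}
Generated: {candidate}

Rate the contextual and semantic overlap on a 0-10 scale and reply with
only the number."""

def llm_white_score(candidate: str, reference: str) -> float:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model; the task may specify another
        messages=[{"role": "user",
                   "content": PROMPT.format(reference=reference,
                                            candidate=candidate)}],
        temperature=0,
    )
    return float(response.choices[0].message.content.strip())

print(llm_white_score(
    "The consolidation suggests pneumonia causing the effusion.",
    "Pneumonia is the likely cause of the pleural effusion."))
```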
Qualitative Evaluation by Experts (20%):
Radiologists will manually review top-performing submissions to assess the clinical validity, readability, and completeness of the explanations.
Task 2: Causal Verification and Correction Evaluation
Detection Performance: Measured using Precision, Recall, and F1-score to evaluate how accurately systems identify errors (see the first sketch after this list).
Correction Quality: The semantic and clinical accuracy of the corrected text will be measured using BERTScore, BioSentVec, and GPT-based evaluations.
Confidence Calibration: Evaluates how well the model's self-reported confidence score aligns with its actual correction accuracy (see the calibration sketch below).
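As referenced above, the detection metrics can be computed as in the following sketch, which assumes binary per-statement error labels (1 = contains a causal error); the official label granularity, e.g. span-level, may differ.

```python
# Minimal sketch of the detection metrics under a binary-label assumption.
from sklearn.metrics import precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # gold: which statements are erroneous
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # system predictions

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary")
print(f"P={precision:.3f}  R={recall:.3f}  F1={f1:.3f}")
```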
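The task description does not name a specific calibration metric; expected calibration error (ECE) is one common choice, sketched here under that assumption.

```python
# Sketch of expected calibration error (ECE): the gap between self-reported
# confidence and observed accuracy, averaged over confidence bins.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Mean |accuracy - confidence| per equal-width confidence bin,
    weighted by the fraction of samples in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean()
                                     - confidences[mask].mean())
    return ece

conf = [0.95, 0.80, 0.60, 0.90, 0.30]  # self-reported confidences
hit = [1, 1, 0, 1, 0]                  # whether each correction was right
print(f"ECE: {expected_calibration_error(conf, hit):.3f}")
```

A well-calibrated system yields an ECE near zero, meaning a correction reported at 0.9 confidence is in fact correct about 90% of the time.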