ChemoTimelines 2025 - Evaluation

Chemotherapy Treatment Timelines Extraction from the Clinical Narrative: Evaluation

Evaluation Process and Details

Teams may enter either or both subtasks. Systems are evaluated against a held-out test set, which will be made available only for a short period of time. Participants submit the output of their systems in the format specified by the organizers (see Submission of Test Output). The organizers then run the evaluation script distributed with the training and development data to produce the final results.

System-extracted timelines are evaluated against the gold patient-level timelines by comparing their <'EVENT', 'temporal_relation', 'TIMEX3'> tuples. An F1 score is computed for each patient, and the reported score is the macro F1 score, i.e. the average F1 score across all patients.
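To make the macro-averaging concrete, here is a minimal Python sketch of per-patient F1 over exact tuple matches, averaged across patients. It is an illustration only, not the official evaluation script: the function names and the dictionary layout (patient id mapped to a list of tuples) are hypothetical, and the handling of a patient with no gold and no predicted tuples (scored 1.0 here) is an assumption.

```python
from typing import Dict, Iterable, Sequence

def patient_f1(pred: Iterable[Sequence[str]], gold: Iterable[Sequence[str]]) -> float:
    """F1 for one patient over exact (EVENT, relation, TIMEX3) matches."""
    pred_set = {tuple(t) for t in pred}
    gold_set = {tuple(t) for t in gold}
    if not pred_set and not gold_set:
        return 1.0  # assumption: an empty timeline predicted as empty is perfect
    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)

def macro_f1(pred_by_patient: Dict[str, list], gold_by_patient: Dict[str, list]) -> float:
    """Average the per-patient F1 scores over every patient in the gold set."""
    patients = list(gold_by_patient)
    if not patients:
        return 0.0
    scores = [
        patient_f1(pred_by_patient.get(p, []), gold_by_patient[p])
        for p in patients
    ]
    return sum(scores) / len(scores)

# Hypothetical data for two patients.
gold = {
    "p1": [["taxol", "begins-on", "2013-06-17"], ["taxol", "ends-on", "2013-09-01"]],
    "p2": [["carboplatin", "contains-1", "2014-01-05"]],
}
pred = {
    "p1": [["taxol", "begins-on", "2013-06-17"]],
    "p2": [["carboplatin", "contains-1", "2014-01-05"]],
}
score = macro_f1(pred, gold)
```

Because the average is taken over patients rather than over tuples, a patient with a short timeline carries the same weight as a patient with a long one.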

There are two evaluation settings:

Strict evaluation, in which the systemic anticancer therapy (SACT) EVENT, the TIMEX3, and the temporal relation between them must all match the gold standard exactly to be counted as a match. For example, if the system predicted ['taxol', 'contains-1', '2013-06-17'], but the gold tuple is ['taxol', 'begins-on', '2013-06-17'], the system prediction would not be a match because the temporal relations differ. Please note that we use strict evaluation results as the official metric for the leaderboard.

Relaxed evaluation, in which 'contains-1' is treated as interchangeable with 'begins-on' and with 'ends-on'. Therefore, ['taxol', 'contains-1', '2013-06-17'] and ['taxol', 'begins-on', '2013-06-17'] are counted as a match. In this setting, system predictions that fall within the timeframe of the gold timeline annotations are also counted as a match. For example, if the system predicts ['taxol', 'contains-1', '2013-06-17'], and the gold timeline annotations contain ['taxol', 'begins-on', '2013-03'] and ['taxol', 'ends-on', '2013-09'], the system prediction will be counted as a match because '2013-06-17' falls in the gold timeframe, i.e. from '2013-03' to '2013-09'.
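The relaxed rules can be sketched in Python as follows. This is a hedged illustration, not the logic of the official script, and it rests on several assumptions: TIMEX3 values are 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD' strings; a partial date spans its whole month or year; the timeframe check is applied only to 'contains-1' predictions, as in the example above; and when an event has several gold start or end dates, the widest window is used. All function names are hypothetical.

```python
import calendar
from datetime import date
from typing import Iterable, Sequence, Tuple

# Label pairs treated as interchangeable in the relaxed setting.
INTERCHANGEABLE = {
    frozenset({"contains-1", "begins-on"}),
    frozenset({"contains-1", "ends-on"}),
}

def labels_compatible(a: str, b: str) -> bool:
    return a == b or frozenset({a, b}) in INTERCHANGEABLE

def date_bounds(timex: str) -> Tuple[date, date]:
    """Earliest and latest calendar day covered by a (possibly partial) date."""
    parts = [int(p) for p in timex.split("-")]
    if len(parts) == 1:
        return date(parts[0], 1, 1), date(parts[0], 12, 31)
    if len(parts) == 2:
        year, month = parts
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last_day)
    year, month, day = parts
    return date(year, month, day), date(year, month, day)

def relaxed_match(pred: Sequence[str], gold: Iterable[Sequence[str]]) -> bool:
    event, relation, timex = pred
    gold = [tuple(g) for g in gold]
    # Rule 1: same event and date, with identical or interchangeable labels.
    for g_event, g_relation, g_timex in gold:
        if g_event == event and g_timex == timex and labels_compatible(relation, g_relation):
            return True
    # Rule 2 (assumed to cover only 'contains-1'): date inside the gold window.
    if relation != "contains-1":
        return False
    starts = [date_bounds(t)[0] for e, r, t in gold if e == event and r == "begins-on"]
    ends = [date_bounds(t)[1] for e, r, t in gold if e == event and r == "ends-on"]
    if not starts or not ends:
        return False
    earliest, latest = date_bounds(timex)
    return min(starts) <= earliest and latest <= max(ends)

gold_window = [("taxol", "begins-on", "2013-03"), ("taxol", "ends-on", "2013-09")]
inside = relaxed_match(("taxol", "contains-1", "2013-06-17"), gold_window)
outside = relaxed_match(("taxol", "contains-1", "2013-10-01"), gold_window)
```

Note that 'begins-on' and 'ends-on' are not interchangeable with each other; each is interchangeable only with 'contains-1'.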

The relaxed evaluation has two additional settings, “relaxed to month” and “relaxed to year”, in which only the year and month, or only the year, respectively, need to match the gold annotation.
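One way to picture the month and year settings is as truncating both dates to the chosen granularity before comparing them. The short Python sketch below assumes TIMEX3 values are hyphen-separated 'YYYY-MM-DD' style strings; the helper names are hypothetical and the official script may handle other date forms differently.

```python
def truncate_timex(timex: str, granularity: str) -> str:
    """Keep only the date components needed for the given granularity."""
    keep = {"day": 3, "month": 2, "year": 1}[granularity]
    return "-".join(timex.split("-")[:keep])

def dates_match(pred_timex: str, gold_timex: str, granularity: str = "day") -> bool:
    """Compare two dates after truncating both to the same granularity."""
    return truncate_timex(pred_timex, granularity) == truncate_timex(gold_timex, granularity)

same_month = dates_match("2013-06-17", "2013-06-02", "month")
same_year = dates_match("2013-06-17", "2013-11-30", "year")
```

Under this reading, a prediction whose day is off still matches at month granularity, and one whose month is off still matches at year granularity.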

Evaluation Code

The evaluation code is available here: https://github.com/HealthNLPorg/chemoTimelinesEval.

The expected input to the evaluation code is the json format described in Submission of Test Output.

Official Evaluation Metric

For the ChemoTimelines 2025 shared task, we use the strict metric to rank the submissions. The metric for the first edition of the shared task (in 2024) was relaxed-to-month.
