Chemotherapy Treatment Timelines Extraction
from the Clinical Narrative:
Evaluation
Evaluation Process and Details
Participants may take part in either or both subtasks. Evaluation will be done against a held-out test set, which will be made available only for a short period of time. Participants will be instructed to submit the output of their systems in a format specified by the organizers (see Submission of Test Output). The organizers will run the evaluation script distributed with the train and development data to produce the final results.
System-extracted timelines will be evaluated against the gold patient-level timelines by comparing their <chemo, temporal_relation, TIMEX3> tuples. The F1 score is computed at the patient level as a macro F1, i.e. the average of the per-patient F1 scores.
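The patient-level macro F1 described above can be sketched as follows. This is an illustrative reimplementation, not the organizers' evaluation script; the function names and the dict-of-tuples input format are assumptions for the example.

```python
def patient_f1(pred_tuples, gold_tuples):
    """F1 over <chemo, temporal_relation, TIMEX3> tuples for one patient
    (strict setting: a tuple matches only if all three elements agree)."""
    pred, gold = set(pred_tuples), set(gold_tuples)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)                      # exact tuple matches
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(predictions, golds):
    """Macro F1: the average of per-patient F1 scores.
    `predictions` and `golds` map patient IDs to lists of tuples."""
    scores = [patient_f1(predictions.get(pid, []), tuples)
              for pid, tuples in golds.items()]
    return sum(scores) / len(scores)
```

A patient with one correct prediction out of two gold tuples scores F1 = 2/3; averaging these per-patient scores yields the final metric.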
There are two evaluation settings:
Strict evaluation, in which case the chemotherapy EVENT, TIMEX3, and the temporal relation between them must match exactly the gold standard to be counted as a match. For example, if the system predicted ['taxol', 'contains-1', '2013-06-17'], but the gold tuple is ['taxol', 'begins-on', '2013-06-17'], the system prediction would not be a match because the values of the temporal relations are different.
Relaxed evaluation, in which case 'contains-1' is considered interchangeable with both 'begins-on' and 'ends-on'. Therefore, ['taxol', 'contains-1', '2013-06-17'] and ['taxol', 'begins-on', '2013-06-17'] are counted as a match. In this setting, system predictions that fall into the correct timeframe of the gold timeline annotations are also counted as a match. For example, if the system predicts ['taxol', 'contains-1', '2013-06-17'], and the gold timeline annotations contain ['taxol', 'begins-on', '2013-03'] and ['taxol', 'ends-on', '2013-09'], the system prediction will be counted as a match because '2013-06-17' falls in the gold timeframe, i.e. from '2013-03' to '2013-09'.
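The relaxed matching logic for a single predicted tuple can be sketched as below. This is a simplified illustration, not the official evaluation code; it assumes ISO-style TIMEX3 strings ('YYYY', 'YYYY-MM', or 'YYYY-MM-DD') and the function names are made up for the example.

```python
import calendar
from datetime import date

def parse_timex(timex):
    """Expand a TIMEX3 string ('YYYY', 'YYYY-MM', or 'YYYY-MM-DD')
    into an inclusive (start, end) date range."""
    parts = [int(p) for p in timex.split("-")]
    if len(parts) == 1:                        # year only
        return date(parts[0], 1, 1), date(parts[0], 12, 31)
    if len(parts) == 2:                        # year and month
        y, m = parts
        return date(y, m, 1), date(y, m, calendar.monthrange(y, m)[1])
    return date(*parts), date(*parts)          # full date

def rels_compatible(a, b):
    """'contains-1' is interchangeable with 'begins-on' and 'ends-on'."""
    if a == b:
        return True
    pair = {a, b}
    return "contains-1" in pair and bool(pair & {"begins-on", "ends-on"})

def relaxed_match(pred, gold_tuples):
    """True if the predicted tuple matches the gold timeline under the
    relaxed rules: same chemo and TIMEX3 with a compatible relation, or
    a TIMEX3 falling inside the gold begins-on/ends-on timeframe."""
    chemo, rel, timex = pred
    for g_chemo, g_rel, g_timex in gold_tuples:
        if g_chemo == chemo and g_timex == timex and rels_compatible(rel, g_rel):
            return True
    # timeframe check: prediction falls between gold begins-on and ends-on
    starts = [parse_timex(t)[0] for c, r, t in gold_tuples
              if c == chemo and r == "begins-on"]
    ends = [parse_timex(t)[1] for c, r, t in gold_tuples
            if c == chemo and r == "ends-on"]
    if starts and ends:
        lo, hi = parse_timex(timex)
        return min(starts) <= lo and hi <= max(ends)
    return False
```

With the example from the text, ['taxol', 'contains-1', '2013-06-17'] matches a gold timeline containing ['taxol', 'begins-on', '2013-03'] and ['taxol', 'ends-on', '2013-09'] via the timeframe check.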
The relaxed evaluation also includes two additional settings, “relaxed to month” and “relaxed to year”, in which only the month and the year, or only the year, of a TIMEX3 need to match the gold annotation, respectively.
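The month- and year-level settings amount to truncating both TIMEX3 strings before comparing. A minimal sketch, again assuming ISO-style TIMEX3 strings and hypothetical function names:

```python
def truncate_timex(timex, granularity):
    """Truncate an ISO-style TIMEX3 string to the given granularity,
    e.g. '2013-06-17' -> '2013-06' ('month') or '2013' ('year')."""
    keep = {"year": 1, "month": 2, "day": 3}[granularity]
    return "-".join(timex.split("-")[:keep])

def granularity_match(pred, gold, granularity):
    """Under 'relaxed to month'/'relaxed to year', two tuples match when
    chemo and relation agree and the TIMEX3s agree after truncation."""
    p_chemo, p_rel, p_timex = pred
    g_chemo, g_rel, g_timex = gold
    return (p_chemo == g_chemo and p_rel == g_rel
            and truncate_timex(p_timex, granularity)
            == truncate_timex(g_timex, granularity))
```

For instance, '2013-06-17' and '2013-06-02' match under “relaxed to month”, and '2013-06-17' and '2013-11-01' match only under “relaxed to year”.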
Each system will be evaluated on a held-out test set. To discourage manual annotation by participants, the released test set mixes the manually labeled data points with a significant portion of machine-labeled data points (silver labels); systems will be scored solely on the manually labeled data points. This approach is adopted because the task involves two subtasks and the datasets are small enough that participants could otherwise annotate them by hand. Participants are prohibited from attempting to differentiate between the two types of data points. Noise was intentionally added to the entire test set for obfuscation purposes; it is expected to have little or no effect on system performance.
Evaluation Code
The evaluation code is available here: https://github.com/HealthNLPorg/chemoTimelinesEval.
The expected input to the evaluation code is the JSON format described in Submission of Test Output.