SemEval 2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials


Codalab competition

Please visit the Codalab to participate in the task, and join the Slack channel for more information and help!

Motivation

Large Language Models (LLMs) achieve state-of-the-art performance on many NLP tasks (Brown et al., 2020; Chowdhery et al., 2022). However, they are highly susceptible to shortcut learning (Geirhos et al., 2020; Poliak et al., 2018; Tsuchiya, 2018), factual inconsistency (Elazar et al., 2021), and performance degradation when exposed to word distribution shifts (Miller et al., 2020; Lee et al., 2020), data transformations (Xing et al., 2020; Stolfo et al., 2022), and adversarial examples (Li et al., 2020). Crucially, these limitations can lead to an overestimation of real-world performance (Patel et al., 2008; Recht et al., 2019), and are therefore of particular concern in the context of medical applications.

Given the increasing deployment of LLMs in real-world scenarios, we propose a textual entailment task to advance our understanding of model behaviour and to improve existing evaluation methodologies for clinical Natural Language Inference (NLI). Through the systematic application of controlled interventions, each engineered to probe a specific semantic phenomenon involved in natural language and numerical inference, we investigate the robustness, consistency, and faithfulness of the reasoning performed by LLMs in the clinical setting.

Clinical trials are conducted to assess the effectiveness and safety of new treatments and are crucial for the advancement of experimental medicine (Avis et al., 2006). Clinical Trial Reports (CTRs) outline the methodology and findings of a clinical trial, and healthcare professionals use them to design and prescribe experimental treatments. However, with over 400,000 CTRs available, and more being published at an increasing pace (Bastian et al., 2010), it is impractical to conduct a thorough assessment of all relevant literature when designing treatments (DeYoung et al., 2020). NLI (Bowman et al., 2015) presents a possible solution, enabling the large-scale interpretation and retrieval of medical evidence and connecting the most recent findings to facilitate personalised care (Sutton et al., 2020). To advance research at the intersection of NLI and Clinical Trials (NLI4CT), we organised "SemEval-2023 Task 7: Multi-Evidence Natural Language Inference for Clinical Trial Data" (Jullien et al., 2023); see SemEval 2023. Task 7 comprised an entailment subtask and an evidence selection subtask, which received 643 submissions from 40 participants and 364 submissions from 23 participants, respectively. While the previous iteration of NLI4CT led to the development of LLM-based models (Zhou et al., 2023; Kanakarajan and Sankarasubbu, 2023; Vladika and Matthes, 2023) achieving high performance (i.e., F1 score ≈ 85%), the application of LLMs in critical domains, such as real-world clinical trials, requires further investigation, accompanied by the development of novel evaluation methodologies grounded in more systematic behavioural and causal analyses (Wang et al., 2021).

This second iteration grounds NLI4CT in interventional and causal analyses of NLI models (Yu et al., 2022), enriching the original NLI4CT dataset with a novel contrast set developed by applying a set of interventions to the statements in the NLI4CT test set.

Thanks to the explicit causal relation between the designed interventions and the expected labels, the proposed methodology allows us to explore the following research aims through a causal lens:

RA1: "To investigate the consistency of NLI models in their representation of semantic phenomena necessary for complex inference in clinical NLI settings."

RA2: "To investigate the ability of clinical NLI models to perform faithful reasoning, i.e., to make correct predictions for the correct reasons."

Task Overview

Multi-evidence Natural Language Inference for Clinical Trial Data (NLI4CT)

This task is based on a collection of breast cancer CTRs (extracted from https://clinicaltrials.gov/ct2/home), together with statements, explanations, and labels annotated by clinical domain experts.

Task: Textual Entailment

For the purposes of the task, we have summarised the collected CTRs into 4 sections: Eligibility criteria, Intervention, Results, and Adverse events.

The annotated statements are sentences, with an average length of 19.5 tokens, that make some type of claim about the information contained in one of the sections of the CTR premise. A statement may make a claim about a single CTR or compare two CTRs. The task is to determine the inference relation (entailment vs. contradiction) between CTR–statement pairs. The training set we provide is identical to the training set used in our previous task; however, we have applied a variety of interventions to the statements in the test and development sets, either preserving or inverting the entailment relation. To guarantee fair competition, and to encourage approaches that are genuinely robust rather than designed to tackle these specific interventions, we will not disclose the technical details of the interventions during the competition; they will be made publicly available after the evaluation phase and in our task description paper.
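As a rough illustration of the task format, the Python sketch below shows a hypothetical single-CTR instance and a hypothetical comparison instance, together with a label-preserving and a label-inverting rewrite of a statement. The field names, trial identifiers, and statements are invented for illustration and do not necessarily match the schema of the released data files.

# Illustrative only: field names and values are hypothetical and may differ
# from the released NLI4CT data files.

# A hypothetical instance grounded in a single CTR section.
single_instance = {
    "type": "Single",
    "primary_ctr": "NCT00000000",          # placeholder trial identifier
    "section": "Eligibility criteria",     # one of the 4 summarised sections
    "statement": "Patients must be over 18 years old to enrol in the primary trial.",
    "label": "Entailment",
}

# A hypothetical instance comparing two CTRs.
comparison_instance = {
    "type": "Comparison",
    "primary_ctr": "NCT00000000",
    "secondary_ctr": "NCT00000001",
    "section": "Results",
    "statement": "The primary trial reports a higher response rate than the secondary trial.",
    "label": "Contradiction",
}

# Interventions rewrite a statement while either preserving or inverting its label.
label_preserving = {
    "statement": "Enrolment in the primary trial requires patients to be older than 18.",
    "label": "Entailment",     # paraphrase: meaning unchanged, label preserved
}
label_inverting = {
    "statement": "Patients under 18 years old are eligible for the primary trial.",
    "label": "Contradiction",  # meaning negated, label inverted
}

In every case, a system receives the statement together with the relevant CTR section(s) as the premise and must predict one of the two labels.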

Intervention targets

Organisers

References