SemEval 2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials


Codalab competition

Please visit the Codalab to participate in the task, and join the Slack channel for more information and help!

Motivation

Large Language Models (LLMs) achieve state-of-the-art performance on many NLP tasks (Brown et al., 2020; Chowdhery et al., 2022). However, they are highly susceptible to shortcut learning (Geirhos et al., 2020; Poliak et al., 2018; Tsuchiya, 2018), factual inconsistency (Elazar et al., 2021), and performance degradation when exposed to word distribution shifts (Miller et al., 2020; Lee et al., 2020), data transformations (Xing et al., 2020; Stolfo et al., 2022), and adversarial examples (Li et al., 2020). Crucially, these limitations can lead to an overestimation of real-world performance (Patel et al., 2008; Recht et al., 2019), and are therefore of particular concern in the context of medical applications.

Given the increasing deployment of LLMs in real-world scenarios, we propose a textual entailment task to advance our understanding of model behaviour and to improve existing evaluation methodologies for clinical Natural Language Inference (NLI). Through the systematic application of controlled interventions, each engineered to probe a specific semantic phenomenon involved in natural language and numerical inference, we investigate the robustness, consistency, and faithfulness of the reasoning performed by LLMs in the clinical setting.

Clinical trials are conducted to assess the effectiveness and safety of new treatments and are crucial for the advancement of experimental medicine (Avis et al., 2006). Clinical Trial Reports (CTRs) outline the methodology and findings of a clinical trial, and healthcare professionals use them to design and prescribe experimental treatments. However, with over 400,000 CTRs available, and more being published at an increasing pace (Bastian et al., 2010), it is impractical to conduct a thorough assessment of all relevant literature when designing treatments (DeYoung et al., 2020). NLI (Bowman et al., 2015) presents a possible solution, enabling the large-scale interpretation and retrieval of medical evidence and connecting the most recent findings to facilitate personalised care (Sutton et al., 2020). To advance research at the intersection of NLI and Clinical Trials (NLI4CT), we organised "SemEval-2023 Task 7: Multi-Evidence Natural Language Inference for Clinical Trial Data" (Jullien et al., 2023); see SemEval 2023. Task 7 comprised an entailment subtask and an evidence selection subtask, which received 643 submissions from 40 participants and 364 submissions from 23 participants, respectively. While the previous iteration of NLI4CT led to the development of LLM-based models (Zhou et al., 2023; Kanakarajan and Sankarasubbu, 2023; Vladika and Matthes, 2023) achieving high performance (i.e., F1 score ≈ 85%), the application of LLMs in critical domains, such as real-world clinical trials, requires further investigation, accompanied by the development of novel evaluation methodologies grounded in more systematic behavioural and causal analyses (Wang et al., 2021).

This second iteration grounds NLI4CT in interventional and causal analyses of NLI models (Yu et al., 2022), enriching the original NLI4CT dataset with a novel contrast set developed by applying a set of interventions to the statements in the NLI4CT test set.

Thanks to the explicit causal relation between the designed interventions and the expected labels, the proposed methodology allows us to explore the following research aims through a causal lens:

RA1: "To investigate the consistency of NLI models in their representation of semantic phenomena necessary for complex inference in clinical NLI settings."

RA2: "To investigate the ability of clinical NLI models to perform faithful reasoning, i.e., to make correct predictions for the correct reasons."

Task Overview

Multi-evidence Natural Language Inference for Clinical Trial Data (NLI4CT)

This task is based on a collection of breast cancer CTRs (extracted from https://clinicaltrials.gov/ct2/home), together with statements, explanations, and labels annotated by clinical domain experts.

Task: Textual Entailment

For the purposes of the task, we have summarised the collected CTRs into 4 sections: Eligibility criteria, Intervention, Results, and Adverse events.

The annotated statements are sentences, with an average length of 19.5 tokens, that make some type of claim about the information contained in one of the sections of the CTR premise. A statement may make a claim about a single CTR or compare two CTRs. The task is to determine the inference relation (entailment vs. contradiction) between CTR–statement pairs. The training set we provide is identical to the training set used in our previous task; however, we have applied a variety of interventions to the statements in the test and development sets, either preserving or inverting the entailment relation. To guarantee fair competition, and to encourage approaches that are genuinely robust rather than designed to tackle these specific interventions, we will not disclose the technical details of the interventions during the competition; they will be made publicly available after the evaluation phase and in our task description paper.
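As a rough illustration of the task format, the Python sketch below shows a hypothetical single-CTR instance and a hypothetical comparison instance, together with a label-preserving and a label-inverting rewrite of a statement. The field names, trial identifiers, and statements are invented for illustration and do not necessarily match the schema of the released data files.

# Illustrative only: field names and values are hypothetical and may differ
# from the released NLI4CT data files.

# A hypothetical instance grounded in a single CTR section.
single_instance = {
    "type": "Single",
    "primary_ctr": "NCT00000000",          # placeholder trial identifier
    "section": "Eligibility criteria",     # one of the 4 summarised sections
    "statement": "Patients must be over 18 years old to enrol in the primary trial.",
    "label": "Entailment",
}

# A hypothetical instance comparing two CTRs.
comparison_instance = {
    "type": "Comparison",
    "primary_ctr": "NCT00000000",
    "secondary_ctr": "NCT00000001",
    "section": "Results",
    "statement": "The primary trial reports a higher response rate than the secondary trial.",
    "label": "Contradiction",
}

# Interventions rewrite a statement while either preserving or inverting its label.
label_preserving = {
    "statement": "Enrolment in the primary trial requires patients to be older than 18.",
    "label": "Entailment",     # paraphrase: meaning unchanged, label preserved
}
label_inverting = {
    "statement": "Patients under 18 years old are eligible for the primary trial.",
    "label": "Contradiction",  # meaning negated, label inverted
}

In every case, a system receives the statement together with the relevant CTR section(s) as the premise and must predict one of the two labels.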

Intervention targets

Organisers

References