Heart failure (HF) poses a significant public health challenge, with a rising global mortality rate. Early detection and prevention of HF could substantially reduce its impact. In this study, we introduce a novel methodology for predicting HF risk using 12-lead electrocardiograms (ECGs) with a focus on robustness and explainability. Specifically, we leverage a large language model (LLM) with a public ECG-report dataset for pretraining on an ECG-report alignment task. Our LLM-informed pretraining can handle labeling uncertainty and mixed languages in the dataset while ensuring pathologically informed representation learning. The network is then fine-tuned for HF risk prediction using two specific cohorts from the UK Biobank study: patients with hypertension (UKB-HYP) and those who have had a myocardial infarction (UKB-MI). To enhance explainability, we present a novel, lightweight dual-attention ECG network featuring a cross-lead attention module and twelve lead-specific temporal attention modules, which can visualize cross-lead interactions and each lead's local dynamics. The results reveal that LLM-informed pretraining substantially enhances HF risk prediction in these cohorts. The dual-attention design improves not only interpretability but also predictive accuracy, outperforming existing competitive methods with C-index scores of 0.6349 for UKB-HYP and 0.5805 for UKB-MI. This demonstrates our method's potential in advancing HF risk assessment using clinically complex ECG data.
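As an illustration of the dual-attention design described above, here is a minimal PyTorch sketch with a shared per-lead encoder, twelve lead-specific temporal attention modules, and one cross-lead attention module; the layer types, dimensions, pooling, and output head are assumptions rather than the authors' exact architecture.

    import torch
    import torch.nn as nn

    class DualAttentionECG(nn.Module):
        """Sketch of the dual-attention idea: per-lead temporal attention plus cross-lead attention."""
        def __init__(self, n_leads=12, d_model=64):
            super().__init__()
            self.encoder = nn.Conv1d(1, d_model, kernel_size=7, padding=3)  # shared per-lead encoder (assumed)
            self.temporal_attn = nn.ModuleList([nn.Linear(d_model, 1) for _ in range(n_leads)])
            self.cross_lead_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
            self.head = nn.Linear(d_model, 1)  # scalar risk score, e.g. for a Cox-type survival loss

        def forward(self, x):                                # x: (batch, 12, time)
            b, L, T = x.shape
            h = self.encoder(x.reshape(b * L, 1, T))         # (b*L, d, T)
            h = h.transpose(1, 2).reshape(b, L, T, -1)       # (b, L, T, d)
            pooled = []
            for i, attn in enumerate(self.temporal_attn):    # lead-specific temporal attention
                w = torch.softmax(attn(h[:, i]), dim=1)      # (b, T, 1) weights over time
                pooled.append((w * h[:, i]).sum(dim=1))      # (b, d) per-lead summary
            z = torch.stack(pooled, dim=1)                   # (b, 12, d)
            z, lead_weights = self.cross_lead_attn(z, z, z)  # lead_weights visualises cross-lead interactions
            return self.head(z.mean(dim=1)).squeeze(-1), lead_weights

    model = DualAttentionECG()
    risk, lead_attention = model(torch.randn(2, 12, 500))    # two ECGs, 12 leads, 500 samples each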
Domain adaptation allows generative language models to address specific flaws caused by the domain shift of their application. However, the traditional adaptation by further training on in-domain data rapidly weakens the model’s ability to generalize to other domains, making the open-ended deployments of the adapted models prone to errors. This work introduces novel training objectives built upon a semantic similarity of the predicted tokens to the reference. Our results show that (1) avoiding the single-truth assumption can largely mitigate catastrophic forgetting of adaptation while (2) preserving the adaptation in-domain improvements (3) with negligible additions to compute costs. In the broader context, the objectives grounded in a continuous token similarity pioneer the exploration of the middle ground between the efficient but naive exact-match token-level objectives and expressive but computationally- and resource-intensive sequential objectives.
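One plausible instantiation of such a similarity-based token objective, sketched in PyTorch: the one-hot target of standard cross-entropy is replaced by a soft distribution that spreads target mass over tokens in proportion to their embedding similarity to the reference token. The temperature and the use of the model's own token embeddings are assumptions, not necessarily the paper's formulation.

    import torch
    import torch.nn.functional as F

    def soft_similarity_loss(logits, target_ids, token_embeddings, temperature=0.1):
        # logits: (batch, vocab); target_ids: (batch,); token_embeddings: (vocab, d)
        emb = F.normalize(token_embeddings, dim=-1)
        ref = emb[target_ids]                                        # embeddings of the reference tokens
        soft_targets = F.softmax(ref @ emb.T / temperature, dim=-1)  # similar tokens share target mass
        log_probs = F.log_softmax(logits, dim=-1)
        return -(soft_targets * log_probs).sum(dim=-1).mean()        # cross-entropy against the soft target

    logits = torch.randn(8, 1000)
    targets = torch.randint(0, 1000, (8,))
    token_embeddings = torch.randn(1000, 32)
    print(soft_similarity_loss(logits, targets, token_embeddings))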
The distribution of the weights of modern deep neural networks (DNNs) - crucial for uncertainty quantification and robustness - is an eminently complex object due to its extremely high dimensionality. This paper proposes one of the first large-scale explorations of the posterior distribution of deep Bayesian Neural Networks (BNNs), expanding its study to real-world vision tasks and architectures. Specifically, we investigate the optimal approach for approximating the posterior, analyze the connection between posterior quality and uncertainty quantification, delve into the impact of modes on the posterior, and explore methods for visualizing the posterior. Moreover, we uncover weight-space symmetries as a critical aspect for understanding the posterior. To this end, we develop an in-depth assessment of the impact of both permutation and scaling symmetries that tend to obfuscate the Bayesian posterior. While the first type of transformation is known for duplicating modes, we explore the relationship between the latter and L2 regularization, challenging previous misconceptions. Finally, to help the community improve our understanding of the Bayesian posterior, we will shortly release the first large-scale checkpoint dataset, including thousands of real-world models, along with our code.
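The scaling symmetry mentioned above can be checked directly: in a ReLU network, multiplying a hidden unit's incoming weights and bias by alpha > 0 and its outgoing weights by 1/alpha leaves the function unchanged while moving the weights (and their L2 norm) elsewhere in weight space. A minimal PyTorch check:

    import torch
    import torch.nn as nn

    # Rescale one hidden unit of a two-layer ReLU network: same function, different weights,
    # hence a different point of the posterior with a different L2 penalty.
    torch.manual_seed(0)
    net = nn.Sequential(nn.Linear(5, 8), nn.ReLU(), nn.Linear(8, 1))
    x = torch.randn(3, 5)
    y_before = net(x).detach()

    alpha, unit = 3.0, 2
    with torch.no_grad():
        net[0].weight[unit] *= alpha       # incoming weights of the chosen hidden unit
        net[0].bias[unit] *= alpha
        net[2].weight[:, unit] /= alpha    # outgoing weights of the same unit

    y_after = net(x).detach()
    print(torch.allclose(y_before, y_after, atol=1e-6))  # True: the function is unchanged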
Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails. In many domains, adversarial training has proven to be one of the most promising methods to reliably improve robustness against such attacks. Yet, in the context of LLMs, current methods for adversarial training are hindered by the high computational costs required to perform discrete adversarial attacks at each training iteration. We address this problem by instead calculating adversarial attacks in the continuous embedding space of the LLM, which is orders of magnitude more efficient. We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses: the first makes the model robust to continuous embedding attacks computed on an adversarial behaviour dataset; the second ensures the usefulness of the final model by fine-tuning on utility data. Moreover, we introduce C-AdvIPO, an adversarial variant of IPO that does not require utility data for adversarially robust alignment. Our empirical evaluation on four models from different families (Gemma, Phi3, Mistral, Zephyr) and at different scales (2B, 3.8B, 7B) shows that both algorithms substantially enhance LLM robustness against discrete attacks (GCG, AutoDAN, PAIR), while maintaining utility. Our results demonstrate that robustness to continuous perturbations can extrapolate to discrete threat models. Thereby, we present a path toward scalable adversarial training algorithms for robustly aligning LLMs.
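A rough sketch of such a training step, assuming a Hugging Face-style causal LM that accepts inputs_embeds and labels: the inner loop runs a signed-gradient attack on the input embeddings to raise the likelihood of a harmful continuation, and the outer step combines the robustness loss (answering safely under the perturbation) with a utility loss. Step sizes, the perturbation bound, the batch field names, and the loss weighting are illustrative assumptions, not the paper's exact recipe.

    import torch

    def embedding_attack(model, embeds, harmful_labels, eps=0.05, steps=10, lr=0.01):
        delta = torch.zeros_like(embeds, requires_grad=True)
        for _ in range(steps):
            loss = model(inputs_embeds=embeds + delta, labels=harmful_labels).loss
            grad, = torch.autograd.grad(loss, delta)
            with torch.no_grad():
                delta -= lr * grad.sign()   # lower the loss on the harmful target = stronger attack
                delta.clamp_(-eps, eps)
        return delta.detach()

    def cadvul_style_step(model, adv_batch, utility_batch, optimizer, lam=1.0):
        embeds = model.get_input_embeddings()(adv_batch["input_ids"]).detach()
        delta = embedding_attack(model, embeds, adv_batch["harmful_labels"])
        robust_loss = model(inputs_embeds=embeds + delta, labels=adv_batch["safe_labels"]).loss
        utility_loss = model(input_ids=utility_batch["input_ids"], labels=utility_batch["labels"]).loss
        (robust_loss + lam * utility_loss).backward()
        optimizer.step()
        optimizer.zero_grad()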
Current research in adversarial robustness of LLMs focuses on discrete input manipulations in the natural language space, which can be directly transferred to closed-source models. However, this approach neglects the steady progression of open-source models. As open-source models advance in capability, ensuring their safety becomes increasingly imperative. Yet, attacks tailored to open-source LLMs that exploit full model access remain largely unexplored. We address this research gap and propose the embedding space attack, which directly attacks the continuous embedding representation of input tokens. We find that embedding space attacks circumvent model alignments and trigger harmful behaviors more efficiently than discrete attacks or model fine-tuning. Additionally, we demonstrate that models compromised by embedding attacks can be used to create discrete jailbreaks in natural language. Lastly, we present a novel threat model in the context of unlearning and show that embedding space attacks can extract supposedly deleted information from unlearned LLMs across multiple datasets and models. Our findings highlight embedding space attacks as an important threat model in open-source LLMs.
This work investigates Retrieval-Augmented Generation (RAG) aimed at reducing hallucinations in table question answering. RAG provides the Large Language Model with the required context to answer the query. Table question answering using Retrieval-Augmented Generation involves two crucial tasks: table retrieval and answer generation. Important questions in this research field remain open. First, there is a lack of rigorous evaluation on datasets with diverse properties. We perform experiments on two datasets with different characteristics: open-domain NQ-Tables based on Wikipedia and closed-domain AIT-QA based on corporate data from the airline industry. While open-domain datasets cover many different topics, closed-domain datasets operate only in one topical area. Second, we study two approaches to table retrieval to find the method that performs best in identifying the ground truth table. We compare a sparse bag-of-words information retrieval method to a dense information retrieval method. Additionally, we introduce a novel re-ranking approach that re-orders the top k tables with the goal of enhancing an existing ranking. Third, the optimal representation of tables and the optimal prompt template remain unclear. To address this, we experiment with various table serialization schemes, prompt templates and an approach to compress long contexts. Our evaluation reveals substantial differences in the performance on the two datasets. We find the dense retrieval method SBERT using CSV table representation to perform better on NQ-Tables than sparse retrieval approaches such as BM25. However, AIT-QA benefits more from sparse retrieval methods, with CSV table representation achieving the best results in combination with our novel re-ranking approach. To conclude, our contributions identify important future research directions to boost Retrieval-Augmented Generation for table question answering tasks.
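As a concrete illustration of the two retrieval routes and the CSV serialisation discussed above, a minimal sketch using rank_bm25 for sparse retrieval and Sentence-Transformers for dense retrieval; the toy tables, the SBERT checkpoint, and the tokenisation are illustrative assumptions.

    import re
    from rank_bm25 import BM25Okapi
    from sentence_transformers import SentenceTransformer, util

    # Toy CSV-serialised tables and a natural-language query.
    tables = [
        "airline,destination,price\nKLM,Tokyo,740\nANA,Osaka,810",
        "player,team,goals\nKane,Bayern,36\nHaaland,City,27",
    ]
    query = "Which airline flies to Tokyo?"

    def tokenize(text):
        return re.findall(r"\w+", text.lower())

    # Sparse retrieval: BM25 over tokenised CSV strings.
    bm25 = BM25Okapi([tokenize(t) for t in tables])
    sparse_scores = bm25.get_scores(tokenize(query))

    # Dense retrieval: SBERT embeddings and cosine similarity.
    sbert = SentenceTransformer("all-MiniLM-L6-v2")
    dense_scores = util.cos_sim(sbert.encode(query, convert_to_tensor=True),
                                sbert.encode(tables, convert_to_tensor=True))[0]

    print("BM25 ranking: ", sparse_scores.argsort()[::-1])
    print("SBERT ranking:", dense_scores.argsort(descending=True))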
Applications of large language models (LLMs) often involve the generation of free-form responses, in which case uncertainty quantification becomes challenging. This is due to the need to identify task-specific uncertainties (e.g., about the semantics), which appear difficult to define in general. This work addresses these challenges from the perspective of Bayesian decision theory, starting from the assumption that our utility is characterized by a similarity measure that compares a generated response with a hypothetical true response. We discuss how this assumption enables principled quantification of the model's subjective uncertainty and its calibration. We further derive a measure for epistemic uncertainty, based on a missing data perspective and its characterization as an excess risk. The proposed measures can be applied to black-box LLMs. We present preliminary experiments in question answering and machine translation, where we extracted broadly meaningful uncertainty estimates from GPT and Gemini models and quantified their calibration.
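A minimal sketch of how such a similarity-based utility can yield an uncertainty estimate from a black-box model: sample several responses, encode them with a semantic similarity model, and report one minus the mean pairwise similarity as a subjective predictive uncertainty. The similarity model, the aggregation, and the sample count are assumptions, not the paper's exact estimator.

    from sentence_transformers import SentenceTransformer, util

    sbert = SentenceTransformer("all-MiniLM-L6-v2")

    def subjective_uncertainty(sampled_responses):
        emb = sbert.encode(sampled_responses, convert_to_tensor=True, normalize_embeddings=True)
        sims = util.cos_sim(emb, emb)
        n = len(sampled_responses)
        mean_pairwise = (sims.sum() - sims.diag().sum()) / (n * (n - 1))  # off-diagonal average
        return 1.0 - mean_pairwise.item()   # low when the sampled answers agree semantically

    samples = ["Paris is the capital of France.",
               "The capital of France is Paris.",
               "It might be Lyon."]
    print(subjective_uncertainty(samples))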
Recent works show the impressive effectiveness of an agent framework in solving problems with language models. In this work, we apply two key features from the framework, interaction with tools and goal-oriented training, to improve models' arithmetical reasoning.
First, we curate and transform existing datasets to create Calc-X, a standardized collection of over 300,000 problems with step-by-step solutions. We use Calc-X to train models we call Calcformers that interact with a calculator during inference (sketched below). Calcformers achieve twice the accuracy of standard baselines.
Finally, we optimize Calcformers via self-training, using preference optimization and a supervised loss based on checking the model's predicted results. We find that self-training can achieve substantial improvements on out-of-domain problems and that the traditional supervised loss is a strong baseline for preference optimization. Our results show that preference optimization converges faster and is not prone to forgetting pre-trained abilities.
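A minimal sketch of the calculator interaction used by the Calcformers described above, assuming the model emits calls wrapped in <gadget> tags and expects the result back inside <output> tags; the tag format and the model_generate callable are illustrative assumptions.

    import re
    import sympy

    CALL = re.compile(r"<gadget>(.*?)</gadget>$", re.DOTALL)

    def generate_with_calculator(model_generate, prompt, max_calls=10):
        """model_generate(text, stop=...) is a hypothetical callable that returns the
        prompt plus the model's continuation, stopping after the given string."""
        text = prompt
        for _ in range(max_calls):
            text = model_generate(text, stop="</gadget>")   # model pauses after emitting a call
            match = CALL.search(text)
            if match is None:                               # no further tool calls: solution finished
                return text
            result = sympy.sympify(match.group(1))          # evaluate the arithmetic expression
            text += f"<output>{result}</output>"            # feed the result back to the model
        return text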
Uncertainty quantification in Large Language Models (LLMs) is crucial for applications where safety and reliability are important. In particular, uncertainty can be used to improve the trustworthiness of LLMs by detecting factually incorrect model responses, commonly called hallucinations. Critically, one should seek to capture the model's semantic uncertainty, i.e., the uncertainty over the meanings of LLM outputs, rather than uncertainty over lexical or syntactic variations that do not affect answer correctness. To address this problem, we propose Kernel Language Entropy (KLE), a novel method for uncertainty estimation in white- and black-box LLMs. KLE defines positive semidefinite unit trace kernels to encode the semantic similarities of LLM outputs and quantifies uncertainty using the von Neumann entropy. It considers pairwise semantic dependencies between answers (or semantic clusters), providing more fine-grained uncertainty estimates than previous methods based on hard clustering of answers. We theoretically prove that KLE generalizes the previous state-of-the-art method called semantic entropy and empirically demonstrate that it improves uncertainty quantification performance across multiple natural language generation datasets and LLM architectures.
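The core computation is compact enough to sketch in NumPy: build a positive semidefinite semantic-similarity kernel over sampled answers, normalise it to unit trace, and take its von Neumann entropy from the eigenvalues. The toy similarity values below are made up; KLE derives them from semantic comparisons of the answers.

    import numpy as np

    def von_neumann_entropy(K):
        """Von Neumann entropy -tr(K log K) of a positive semidefinite, unit-trace kernel."""
        eigvals = np.clip(np.linalg.eigvalsh(K), 1e-12, None)
        return float(-(eigvals * np.log(eigvals)).sum())

    # Illustrative semantic-similarity kernel over three sampled answers:
    # the first two are near-paraphrases, the third disagrees.
    S = np.array([[1.0, 0.9, 0.1],
                  [0.9, 1.0, 0.2],
                  [0.1, 0.2, 1.0]])
    K = S / np.trace(S)                 # normalise to unit trace
    print(von_neumann_entropy(K))       # higher when answers are semantically diverse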
Adapting large language models (LLMs) for specific tasks usually involves fine-tuning through reinforcement learning with human feedback (RLHF) on preference data. While these data often come from diverse labelers' groups (e.g., different demographics, ethnicities, company teams, etc.), traditional RLHF approaches adopt a "one-size-fits-all" approach, i.e., they indiscriminately assume and optimize a single preference model, thus not being robust to unique characteristics and needs of the various groups. To address this limitation, we propose a novel Group Robust Preference Optimization (GRPO) method to align LLMs to individual groups' preferences robustly. Our approach builds upon reward-free direct preference optimization methods, but unlike previous approaches, it seeks a robust policy which maximizes the worst-case group performance. To achieve this, GRPO adaptively and sequentially weights the importance of different groups, prioritizing groups with worse cumulative loss. We theoretically study the feasibility of GRPO and analyze its convergence for the log-linear policy class. By fine-tuning LLMs with GRPO using diverse group-based global opinion data, we significantly improved performance for the worst-performing groups, reduced loss imbalances across groups, and improved probability accuracies compared to non-robust baselines.
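The adaptive weighting can be sketched as a multiplicative, mirror-ascent-style update that upweights groups with higher loss, with the policy then trained on the weighted sum of per-group preference losses; the step size and the exact update rule are assumptions, and group_losses stands in for per-group DPO-style losses.

    import torch

    def grpo_weight_update(weights, group_losses, eta=0.1):
        new_w = weights * torch.exp(eta * group_losses)   # upweight groups with worse (cumulative) loss
        return new_w / new_w.sum()

    weights = torch.full((3,), 1 / 3)                     # three labeler groups, uniform start
    group_losses = torch.tensor([0.8, 0.3, 0.5])          # e.g., current per-group preference losses
    weights = grpo_weight_update(weights, group_losses)
    weighted_loss = (weights * group_losses).sum()        # backpropagate this through the policy
    print(weights, weighted_loss)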
Models trained on different datasets can be merged by weighted averaging of their parameters, but why does it work and when can it fail? Here, we connect the inaccuracy of weighted averaging to mismatches in the gradients and propose a new uncertainty-based scheme to improve the performance by reducing the mismatch. The connection also reveals implicit assumptions in other schemes such as averaging, task arithmetic, and Fisher-weighted averaging. Our new method gives consistent improvements for large language models and vision transformers, both in terms of performance and robustness to hyperparameters.
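For reference, the baseline operation being analysed is plain weighted averaging of parameters, sketched below; the proposed method instead derives the weights from uncertainty estimates, which this sketch does not compute.

    import torch

    def merge_state_dicts(state_dicts, weights):
        total = sum(weights)
        return {name: sum(w * sd[name] for w, sd in zip(weights, state_dicts)) / total
                for name in state_dicts[0]}

    model_a = torch.nn.Linear(4, 2)
    model_b = torch.nn.Linear(4, 2)
    merged = merge_state_dicts([model_a.state_dict(), model_b.state_dict()], weights=[0.7, 0.3])
    model_a.load_state_dict(merged)   # reuse one module as the container for the merged weights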
In-context learning (ICL) can solve new tasks on pre-trained Large Language Models (LLMs) given a few demonstrations as input. However, so far there is little understanding of how many demonstrations are required in real-world scenarios, e.g., large-label-space classification. In this work, we conduct a meticulous study under various settings, with different LLMs and datasets. Our insights suggest that no demonstrations might be required, especially when the class names are descriptive and the model is strong-performing (e.g., GPT-4). Nevertheless, datasets with an extremely large label space can benefit from additional human-created demonstrations, while automatically generated ones might not yield additional benefits.
Early detection of wildfires is essential to prevent large-scale fires resulting in extensive environmental and structural damage. Modern deep learning-based computer vision methods enable high-resolution detection of smoke which can be used for early wildfire detection and localisation. Specialised models can even be employed on lightweight drone-carried computers to enable detection in remote areas with low infrastructure. However, such methods suffer from limited coverage of the training data distribution and are thus unable to perform in unseen conditions. This is particularly dangerous for methods intended to recognise emergencies such as wildfires. Multimodal large language models (LLMs) can identify various phenomena in a zero-shot manner, offering more robust scaling of the detection domain, but they suffer from computationally heavy inference, meaning that specialised models are still required to enable real-time inference with limited computational resources. These models can, however, be trained with LLM pseudo-labelling, which requires only unlabelled images and language queries. This builds on previous knowledge of weakly supervised learning for computer vision models through combined vision-language understanding, which has been studied in various contexts. The proposed method could improve the domain adaptability of wildfire detection over previous deep learning-based wildfire smoke segmentation methods. A simple way to formulate the language queries for the pseudo-labelling is through visual question answering (VQA). With feature maps, the language query results can also be transformed into segmentation masks for training models capable of pixel-level detection. LLMs also offer the possibility to evaluate additional features not captured by typical detection or segmentation labels, such as rough distance estimates. Using LLMs can thus improve the reliability of smaller detection models in various conditions for which training data collection would otherwise be extremely laborious or even practically impossible.
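A hypothetical pseudo-labelling loop of the kind described, using the Hugging Face visual-question-answering pipeline: the model's answer to a smoke-related query becomes a weak image-level label for training a small specialised detector. The checkpoint, the query wording, and the yes/no thresholding are assumptions, and the step that turns feature maps into segmentation masks is not shown.

    from transformers import pipeline

    # Checkpoint chosen only for illustration; the text above refers to multimodal LLMs in general.
    vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

    def pseudo_label(image_paths, question="Is there smoke in this image?"):
        labels = {}
        for path in image_paths:
            answer = vqa(image=path, question=question)[0]["answer"].lower()
            labels[path] = answer.startswith("yes")   # weak binary label for the detector
        return labels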
Citation practices are crucial in shaping the structure of scientific knowledge, yet they are often influenced by contemporary norms and biases. The emergence of Large Language Models (LLMs) like GPT-4 introduces a new dynamic to these practices. Interestingly, the characteristics and potential biases of references recommended by LLMs that entirely rely on their parametric knowledge, and not on search or retrieval-augmented generation, remain unexplored. Here, we analyze these characteristics in an experiment using a dataset of 166 papers from AAAI, NeurIPS, ICML, and ICLR, published after GPT-4's knowledge cut-off date, encompassing 3,066 references in total. In our experiment, GPT-4 was tasked with suggesting scholarly references for the anonymized in-text citations within these papers. Our findings reveal a remarkable similarity between human and LLM citation patterns, but with a more pronounced bias toward highly cited works in GPT-4, which persists even after controlling for publication year, title length, number of authors, and venue. Additionally, we observe a strong consistency between the characteristics of GPT-4's existing and non-existent generated references, indicating the model's internalization of citation patterns. By analyzing citation graphs, we show that the references recommended by GPT-4 are embedded in the relevant citation context, suggesting an even deeper conceptual internalization of the citation networks. While LLMs can aid in citation generation, they may also amplify existing biases and introduce new ones, potentially skewing scientific knowledge dissemination. Our results underscore the need for identifying the model's biases and for developing balanced methods to interact with LLMs in general.
Despite its success, the robustness aspect of reinforcement learning from human feedback (RLHF) remains unclear when applied to fine-tune unsupervised language models. In this paper, we theoretically study the robustness of the optimally aligned model against contamination in the fine-tuning dataset. We first model the generating process of the contaminated dataset in the RLHF context by extending the popular contamination model in robust and Bayesian statistics. Our analysis based on this contamination model reveals the key role of β, a hyperparameter that controls the level of alignment, in profoundly influencing contamination robustness. In particular, it is suggested that the β values reported to empirically achieve good performance are highly vulnerable to contamination, while values that do not achieve good performance may conversely acquire robustness. This implies a trade-off between robustness and alignment performance governed by β.
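For context, β here is the coefficient of the KL penalty in the standard RLHF objective, written below in its usual form (the paper's exact formulation may differ): small β lets the aligned model move far from the reference model, while large β keeps it close.

    \max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\!\left[ r(x, y) \right] \;-\; \beta \, \mathbb{E}_{x \sim \mathcal{D}}\!\left[ \mathrm{KL}\!\left( \pi(\cdot \mid x) \,\middle\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right) \right]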
Generative Vision-Language Models (VLMs) are prone to generate plausible-sounding textual answers that, however, are not always grounded in the input image. We investigate this phenomenon, usually referred to as "hallucination" and show that it stems from an excessive reliance on the language prior. In particular, we show that as more tokens are generated, the reliance on the visual prompt decreases, and this behavior strongly correlates with the emergence of hallucinations. To reduce hallucinations, we introduce Multi-Modal Mutual-Information Decoding (M3ID), a new sampling method for prompt amplification. M3ID amplifies the influence of the reference image over the language prior, hence favoring the generation of tokens with higher mutual information with the visual prompt. M3ID can be applied to any pre-trained autoregressive VLM at inference time without necessitating further training and with minimal computational overhead. If training is an option, we show that M3ID can be paired with Direct Preference Optimization (DPO) to improve the model's reliance on the prompt image without requiring any labels. Our empirical findings show that our algorithms maintain the fluency and linguistic capabilities of pre-trained VLMs while reducing hallucinations by mitigating visually ungrounded answers. Specifically, for the LLaVA 13B model, M3ID and M3ID+DPO reduce the percentage of hallucinated objects in captioning tasks by 25% and 28%, respectively, and improve the accuracy on VQA benchmarks such as POPE by 21% and 24%.
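A generic sketch of mutual-information-style decoding in PyTorch: contrast the image-conditioned next-token distribution with a text-only one so that visually grounded tokens are boosted. M3ID's exact weighting and scheduling are not reproduced here; lam is an assumed constant.

    import torch

    def pmi_adjusted_logits(logits_with_image, logits_text_only, lam=0.5):
        logp_img = torch.log_softmax(logits_with_image, dim=-1)
        logp_txt = torch.log_softmax(logits_text_only, dim=-1)
        return logp_img + lam * (logp_img - logp_txt)   # favour tokens made more likely by the image

    # Toy usage with random logits over a 32k vocabulary.
    next_token = pmi_adjusted_logits(torch.randn(1, 32000), torch.randn(1, 32000)).argmax(dim=-1)
    print(next_token)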
This research assesses the effectiveness of state-of-the-art large language models (LLMs)—ChatGPT, Llama, Aya, and ACEGPT—in handling Arabic automated essay scoring (AES) using the AR-AES dataset. It explores various adaptation strategies, including zero-shot and few-shot prompting as well as fine-tuning. Additionally, it investigates whether LLMs' enhanced reasoning abilities and instruction-following capabilities improve AES performance. The study addresses significant tokenization challenges inherent in processing Arabic, a language with unique morphological and syntactic structures. These challenges include character-level processing, which increases sequence length, and decoder-level representation, which can obscure semantic clarity at the word level. Among the models tested, ACEGPT demonstrated the strongest performance across the dataset, achieving an F1 score of 0.74 and a Quadratic Weighted Kappa (QWK) of 0.67. Performance variation across different courses underscores the need for adaptive models capable of handling diverse assessment formats and highlights the positive impact of effective prompt engineering on enhancing LLM outputs. Furthermore, this research advances the understanding of LLMs in Arabic AES by not relying solely on traditional metrics such as QWK. It introduces innovative metrics that assess the accuracy of models in predicting outcomes within a single mark range of actual scores, providing a nuanced perspective on their practical utility in educational settings.
Cutting-edge abstractive summarisers generate fluent summaries, but the factuality of the generated text is not guaranteed. Many techniques for detecting factual inconsistencies between summaries and sources build pipelines around natural language inference (NLI) or question-answering (QA) models with additional supervised learning steps. In contrast, similarity-based metrics, which simply compare a summary's embeddings to those of a reference or source text, have been reported to fail in reflecting summaries' factuality. In this paper, we revisit similarity-based metrics, showing that this failure stems from the granularity of embeddings used and the use of reference summaries for comparison. We propose a new factuality evaluation metric, Sentence-BERT Score (SBERTScore), which compares sentences in the summary to sentences in the source document. We evaluate SBERTScore against human factuality labels using a benchmark consisting of nine datasets. SBERTScore outperforms previous word-word metrics including BERTScore and can compete with existing NLI and QA-based factuality metrics without needing any fine-tuning. We find low agreement between factuality metrics, and determine that a combination of complementary metrics can better capture factual inconsistency than any individual technique.
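A sketch of the sentence-level comparison in the spirit of SBERTScore: each summary sentence is matched against its most similar source sentence and the maxima are averaged. The SBERT checkpoint, the sentence splitter, and the aggregation (mean versus min) are assumptions rather than the paper's exact configuration.

    import nltk  # requires: nltk.download("punkt")
    from sentence_transformers import SentenceTransformer, util

    sbert = SentenceTransformer("all-MiniLM-L6-v2")

    def sbert_score(summary, source):
        sum_sents = nltk.sent_tokenize(summary)
        src_sents = nltk.sent_tokenize(source)
        sum_emb = sbert.encode(sum_sents, convert_to_tensor=True, normalize_embeddings=True)
        src_emb = sbert.encode(src_sents, convert_to_tensor=True, normalize_embeddings=True)
        sims = util.cos_sim(sum_emb, src_emb)         # (n_summary_sents, n_source_sents)
        return sims.max(dim=1).values.mean().item()   # best-matching source sentence per summary sentence

    print(sbert_score("The cat sat on the mat. It was grey.",
                      "A grey cat was sitting on a mat in the kitchen."))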
Conversational large language models are fine-tuned for safety, resulting in models that refuse harmful requests. In this work, we show that this refusal behavior is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size. Specifically, for each model, we find a single direction such that erasing this direction from the model's activations prevents it from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions. Leveraging this insight, we propose a novel white-box jailbreak method that surgically disables refusal, with minimal effect on other capabilities. Finally, we mechanistically analyze how adversarial suffixes suppress propagation of the refusal-mediating direction. Our findings underscore the brittleness of current safety fine-tuning methods. More broadly, our work showcases how an understanding of model internals can be leveraged to develop practical methods for controlling model behavior.
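The intervention itself is a simple projection, sketched below: removing the component of an activation along the (unit-normalised) refusal direction suppresses refusal, while adding a multiple of the direction induces it. How the direction is estimated, e.g., from differences of mean activations on harmful versus harmless prompts, is not shown, and the dimensions are placeholders.

    import torch

    def ablate_direction(activations, direction):
        r = direction / direction.norm()
        return activations - (activations @ r).unsqueeze(-1) * r   # remove the component along r

    def add_direction(activations, direction, alpha=1.0):
        return activations + alpha * direction / direction.norm()

    acts = torch.randn(4, 4096)          # e.g., residual-stream activations at one layer
    refusal_dir = torch.randn(4096)      # placeholder for the extracted refusal direction
    residual = ablate_direction(acts, refusal_dir) @ (refusal_dir / refusal_dir.norm())
    print(residual)                      # approximately zero: nothing left along the refusal direction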
Forecasting is a task that is difficult to evaluate: the ground truth can only be known in the future. Recent work showing LLM forecasters rapidly approaching human-level performance raises the question: how can such forecasters be benchmarked and evaluated instantaneously? Following the consistency check framework, we measure forecasting performance on certain topics according to how consistent the predictions on different logically related questions are. The main consistency metric we use is one of arbitrage: for example, if a forecasting AI predicts 60% probability for both the Democratic and Republican parties to win the 2024 US presidential election, an arbitrageur could trade against the forecaster's predictions and make a profit. We build an automated evaluation system: starting from the instruction "query the forecaster's predictions on the topic of X," our evaluation system generates a set of base questions, instantiates the consistency checks from these questions, elicits the predictions of the forecaster, and measures the consistency of the predictions. We conclude with the possible applications of our work in steering and evaluating superhuman AI oracle systems.
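The arbitrage example above can be made concrete in a few lines: assuming the outcomes are mutually exclusive and exhaustive, any deviation of the quoted probabilities' sum from 1 is a guaranteed profit per unit stake for an arbitrageur who buys or sells contracts at those prices (the paper's metric may be normalised differently).

    def arbitrage_violation(probs):
        """Guaranteed profit per unit stake against a forecaster quoting these probabilities
        for mutually exclusive, exhaustive outcomes; 0 if the forecasts are consistent."""
        total = sum(probs)
        return max(total - 1.0, 1.0 - total)

    print(arbitrage_violation([0.6, 0.6]))    # 0.2: the 60%/60% example from the text
    print(arbitrage_violation([0.55, 0.45]))  # 0.0: internally consistent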
Safety fine-tuning is widely used to align Large Language Models (LLMs) with human preferences for their safe deployment. In this work, we design a synthetic data generation framework to carefully investigate and understand the underlying factors that make LLMs safe via safety fine-tuning. Our framework allows controlled generation of samples by capturing key aspects of a real-world instruction corresponding to multiple types of safe and unsafe samples. Using this, we investigate three well-known safety fine-tuning methods: (1) Supervised safety fine-tuning; (2) Direct preference optimization; and (3) Unlearning, and provide insights on what makes the corresponding models safe and why their safety is compromised via jailbreaking and adversarial attacks. We also validate our findings, wherever possible, on real-world models: Llama-2 chat 7B and Llama-3 chat 8B.
Seamless, multi-modal analysis of large biological data sets is essential for increasing our understanding of human disease. Techniques like genomics, transcriptomics, proteomics, epigenetics and digital pathology are creating vast amounts of data from human and animal studies. These data need to be carefully integrated and analysed to understand disease mechanisms, develop new biomarkers and identify novel drug targets. Multi-Dimensional Viewer (MDV) is a tool for analysing, annotating and sharing such multi-dimensional biological and omics data sets. It has a user-friendly, powerful web-based interface that allows clinicians and scientists to gain insight and to present and share their analyses and interactive data visualisations. Our aim is, first, to enable users with little or no programming experience to analyse and visualise their data in MDV through natural language querying. Second, we wish to reduce the overhead of generating views programmatically or through the web-based interface by automating the generation of visualisations with Large Language Models (LLMs) driven by natural language. We present ChatMDV, a novel pipeline that turns MDV into an easy-to-use text-to-visualisation tool, combining the power of a graphical “point and click” interface with natural language enhancement.
Citation practices influence the architecture of scientific knowledge and are frequently shaped by prevailing norms and biases. The potential use of Large Language Models (LLMs) to conduct research introduces novel complexities into these practices. In our experiment, we use a dataset from major conferences like AAAI, NeurIPS, ICML, and ICLR, and analyze how LLMs process and recommend scholarly citations (from memory). Our analysis evaluates the LLMs' effectiveness at this task by comparing LLM-generated citations against ground-truth references from the dataset. In addition, we examine citation graph properties, including connectivity, clustering coefficients, centrality measures, and path lengths, for both sets. Finally, we evaluate whether graph neural networks (GNNs) can significantly improve the prediction of which graph was LLM-generated and which corresponds to the dataset ground truth.
Large Language Models (LLMs), renowned for their remarkable performance across diverse domains, present a challenge when it comes to practical deployment due to their colossal model size. In response to this challenge, efforts have been directed toward the application of traditional network pruning techniques to LLMs, uncovering a massive number of parameters that can be pruned in one shot without hurting performance. Prevailing LLM pruning strategies have consistently adhered to the practice of uniformly pruning all layers at equivalent sparsity, resulting in robust performance. However, this observation stands in contrast to the prevailing trends observed in the field of vision models, where non-uniform layerwise sparsity typically yields stronger results. To understand the underlying reasons for this disparity, we conduct a comprehensive study and discover a strong correlation with the emergence of activation outliers in LLMs. Inspired by this finding, we introduce a novel LLM pruning methodology that incorporates a tailored set of non-uniform layerwise sparsity ratios, termed Outlier Weighed Layerwise sparsity (OWL). The sparsity ratio of OWL is proportional to the outlier ratio observed within each layer, facilitating a more effective alignment between layerwise weight sparsity and outlier ratios. Our empirical evaluation, conducted across the LLaMA-V1 family and OPT, spanning various benchmarks, demonstrates the distinct advantages offered by OWL over previous methods. For instance, OWL exhibits a remarkable performance gain, surpassing the state-of-the-art Wanda and SparseGPT by 61.22 and 6.80 perplexity at a high sparsity level of 70%, respectively, while delivering a 2.6x end-to-end inference speed-up in the DeepSparse inference engine.
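A sketch of how a per-layer outlier ratio of the kind OWL relies on can be measured, using a Wanda-style importance score (weight magnitude times input activation norm) and counting entries that exceed a multiple of the layer mean; the threshold multiplier and the subsequent mapping from outlier ratios to layerwise sparsities are simplified assumptions, not the paper's exact recipe.

    import torch

    def layer_outlier_ratio(weight, act_norm, m=5.0):
        """Fraction of 'outlier' entries in one linear layer.
        weight: (out_features, in_features); act_norm: (in_features,) input activation norms.
        Entries whose score exceeds m times the layer mean are counted as outliers."""
        score = weight.abs() * act_norm                  # broadcast the activation norms over rows
        return (score > m * score.mean()).float().mean().item()

    # Example: one layer with a few high-magnitude input channels. OWL turns such per-layer
    # ratios into non-uniform layerwise sparsities instead of pruning every layer at the same rate.
    w = torch.randn(256, 512)
    w[:, :4] *= 20
    act_norm = torch.ones(512)
    print(layer_outlier_ratio(w, act_norm))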