Sensitivity-Aware Neural Networks for Robust Bayesian Inference (Lasse Elsemüller, Hans Olischläger, Marvin Schmitt, Paul-Christian Bürkner, Ullrich Koethe, Stefan T. Radev)
Sensitivity analysis is an essential method for understanding the robustness and reliability of statistical inference in applied settings. While theoretically appealing, it is overwhelmingly inefficient for complex Bayesian models. In this work, we propose sensitivity-aware amortized Bayesian inference (SA-ABI), a multifaceted approach to efficiently integrate sensitivity analyses into simulation-based inference with neural networks. First, we utilize weight sharing to encode the structural similarities between alternative likelihood and prior specifications in the training process with minimal computational overhead. Second, we leverage the rapid inference of neural networks to assess sensitivity to data perturbations and preprocessing steps. In contrast to most other Bayesian approaches, both steps circumvent the costly bottleneck of refitting the model for each choice of likelihood, prior, or data set. Finally, we propose to use deep ensembles to detect sensitivity arising from unreliable approximation (e.g., due to model misspecification). We demonstrate the effectiveness of our method in applied modeling problems, ranging from disease outbreak dynamics and global warming thresholds to human decision-making. Our results support sensitivity-aware inference as a default choice for amortized Bayesian workflows, automatically providing modelers with insights into otherwise hidden dimensions.
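A rough sketch of the weight-sharing idea described above, under illustrative assumptions (a Gaussian toy model, two alternative priors, a hand-rolled summary function, and hypothetical names such as make_training_batch): alternative prior specifications are encoded as a one-hot context vector that is concatenated with the simulated data summaries, so a single amortized network is trained across all specifications.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two alternative prior specifications for the same Gaussian toy likelihood.
PRIORS = [
    lambda: rng.normal(0.0, 1.0),   # prior A: N(0, 1)
    lambda: rng.normal(0.0, 10.0),  # prior B: N(0, 10), weakly informative
]

def simulate(theta, n_obs=50):
    """Toy likelihood: y_i ~ N(theta, 1)."""
    return rng.normal(theta, 1.0, size=n_obs)

def summary(y):
    """Hand-crafted summary statistics of a simulated data set."""
    return np.array([y.mean(), y.std()])

def make_training_batch(batch_size=128):
    """Build (conditioning input, target parameter) pairs for one shared network.

    The one-hot prior indicator is appended to the data summary, so a single
    set of network weights is shared across all prior specifications.
    """
    inputs, targets = [], []
    for _ in range(batch_size):
        m = rng.integers(len(PRIORS))           # sample a prior specification
        theta = PRIORS[m]()                     # draw a parameter from that prior
        y = simulate(theta)                     # simulate data from the likelihood
        onehot = np.eye(len(PRIORS))[m]         # context vector encoding the choice
        inputs.append(np.concatenate([summary(y), onehot]))
        targets.append(theta)
    return np.stack(inputs), np.array(targets)

X, theta = make_training_batch()
print(X.shape, theta.shape)  # (128, 4) (128,)
```

At inference time, the same trained network can be queried under each prior indicator without refitting, which is the source of the efficiency gain described above.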
Computer vision models typically suffer degraded performance when deployed in real-world scenarios, due to unexpected changes in inputs that were not accounted for during training. Data augmentation is commonly used to address this issue, as it aims to increase data variety and reduce the distribution gap between training and test data. However, common visual augmentations might not guarantee extensive robustness of computer vision models. In this paper, we propose Auxiliary Fourier-basis Augmentation (AFA), a complementary technique targeting augmentation in the frequency domain and filling the augmentation gap left by visual augmentations. We demonstrate the utility of augmentation via Fourier-basis additive noise in a straightforward and efficient adversarial setting. Our results show that AFA improves the robustness of models against common corruptions, their OOD generalization, and the consistency of their performance under increasing perturbations, with a negligible deficit to standard performance. It can be seamlessly integrated with other augmentation techniques to further boost performance. Code and models can be found at: https://github.com/nis-research/afa-augment.
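As a minimal, hypothetical illustration of additive Fourier-basis noise (not the authors' implementation; the frequency range, normalization, and adversarial selection of the basis are simplified away), the following adds a planar sinusoid at a random frequency, orientation, and phase to an image:

```python
import numpy as np

rng = np.random.default_rng(0)

def fourier_basis_noise(h, w, strength=0.2):
    """Planar sinusoid at a random spatial frequency, orientation, and phase."""
    fy, fx = rng.integers(1, 8, size=2)          # integer frequencies (cycles per image)
    phase = rng.uniform(0, 2 * np.pi)
    yy, xx = np.meshgrid(np.arange(h) / h, np.arange(w) / w, indexing="ij")
    basis = np.sin(2 * np.pi * (fy * yy + fx * xx) + phase)
    return strength * rng.choice([-1.0, 1.0]) * basis

def augment(image):
    """Add Fourier-basis noise to every channel of an image in [0, 1]."""
    noise = fourier_basis_noise(*image.shape[:2])
    return np.clip(image + noise[..., None], 0.0, 1.0)

image = rng.uniform(size=(32, 32, 3))            # stand-in for a real image
print(augment(image).shape)                      # (32, 32, 3)
```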
Determining the sensitivity of the posterior to perturbations of the prior and likelihood is an important part of the Bayesian workflow. We introduce a practical and computationally efficient sensitivity analysis approach that uses importance sampling to estimate properties of posteriors resulting from power-scaling the prior or likelihood. On this basis, we suggest a diagnostic that can indicate the presence of prior-data conflict or likelihood noninformativity, and we discuss limitations of this power-scaling approach. The approach can be easily included in Bayesian workflows with minimal effort by the model builder, and we present an implementation in our new R package priorsense. We further demonstrate the workflow on case studies of real data using models varying in complexity from simple linear models to Gaussian process models.
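A bare-bones sketch of the power-scaling computation (an illustrative Gaussian example; priorsense additionally applies Pareto-smoothed importance sampling and diagnostics): reweighting base posterior draws by p(theta)^(alpha - 1) targets the posterior with the prior raised to the power alpha.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are posterior draws of theta and the log-prior density at each draw.
theta = rng.normal(1.0, 0.3, size=4000)
log_prior = -0.5 * theta**2 - 0.5 * np.log(2 * np.pi)   # standard normal prior

def power_scaled_mean(theta, log_prior, alpha):
    """Posterior mean under the prior raised to the power alpha.

    The power-scaled posterior is proportional to p(theta)^alpha * p(y | theta),
    so reweighting the base draws by p(theta)^(alpha - 1) targets it.
    """
    log_w = (alpha - 1.0) * log_prior
    w = np.exp(log_w - log_w.max())      # stabilize before normalizing
    w /= w.sum()
    return np.sum(w * theta)

for alpha in (0.8, 1.0, 1.25):
    print(alpha, power_scaled_mean(theta, log_prior, alpha))
```

Large changes in such quantities as alpha moves away from 1 indicate sensitivity; power-scaling the likelihood works analogously, with the log-likelihood in place of the log-prior.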
We propose a method to improve the efficiency and accuracy of amortized Bayesian inference by leveraging universal symmetries in the joint probabilistic model of parameters and data. In a nutshell, we invert Bayes' theorem and estimate the marginal likelihood based on approximate representations of the joint model. Upon perfect approximation, the marginal likelihood is constant across all parameter values by definition. However, errors in approximate inference lead to undesirable variance in the marginal likelihood estimates across different parameter values. We penalize violations of this symmetry with a self-consistency loss which significantly improves the quality of approximate inference in low data regimes and can be used to augment the training of popular neural density estimators. We apply our method to a number of synthetic problems and realistic scientific models, discovering notable advantages in the context of both neural posterior and likelihood approximation.
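One way to write the symmetry exploited here, using assumed notation for the neural-posterior variant: for an approximate posterior \(q_\phi(\theta \mid x)\), every parameter value yields an estimate of the same marginal likelihood,
\[
\log \hat{p}(x;\theta) \;=\; \log p(x,\theta) \;-\; \log q_\phi(\theta \mid x),
\]
which is constant in \(\theta\) only if \(q_\phi\) is exact. A self-consistency penalty can therefore measure the spread of this estimate across draws \(\theta_1,\dots,\theta_K\), e.g.
\[
\mathcal{L}_{\mathrm{SC}}(\phi; x) \;=\; \frac{1}{K}\sum_{k=1}^{K}\Bigl(\log \hat{p}(x;\theta_k) - \overline{\log \hat{p}(x;\cdot)}\Bigr)^{2},
\]
added to the usual density-estimation loss; the exact form and weighting used by the authors may differ.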
A key step in the Bayesian workflow for model building is the graphical assessment of predictions generated by a given model, whether from the prior or posterior. The goal of these assessments is to identify whether the model is a reasonable (and ideally accurate) representation of the observed data. Despite the key role of these visual predictive checks in a Bayesian workflow, there is a lack of clear, evidence-based guidance for selecting, interpreting, and diagnosing appropriate visualisations. To reduce errors and ad-hoc decision-making during these steps, we present recommendations for visual predictive checks for observations that are continuous, discrete, or a mixture of the two. To support the application of these recommendations, we also discuss additional diagnostics to aid in the selection of visual methods, specifically for detecting an incorrect assumption of continuously-distributed data: identifying when data are likely to be discrete or to contain discrete components, detecting and estimating possible bounds in the data, and assessing the goodness-of-fit of density plots constructed through kernel density estimation (KDE). Since a visual predictive check can itself be viewed as a model fit to the data, assessing when this model fails to represent the data is important for drawing well-informed conclusions.
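For concreteness, a minimal sketch of the kind of KDE-based density check discussed above, using synthetic data and stand-in posterior predictive draws:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
y_obs = rng.gamma(shape=2.0, scale=1.5, size=200)         # observed data (synthetic)
y_rep = rng.gamma(shape=2.0, scale=1.5, size=(20, 200))   # posterior predictive draws (stand-in)

grid = np.linspace(0, y_obs.max() * 1.2, 300)
for rep in y_rep:                                         # light lines: replicated data sets
    plt.plot(grid, gaussian_kde(rep)(grid), color="lightblue", lw=0.8)
plt.plot(grid, gaussian_kde(y_obs)(grid), color="black", lw=2, label="observed")
plt.legend()
plt.xlabel("y")
plt.ylabel("density")
plt.show()
```

If the observed data were actually discrete or bounded, the KDE overlay itself would misrepresent them, which is exactly the failure mode the proposed diagnostics are designed to flag.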
Detecting out-of-distribution (OOD) instances is crucial for the reliable deployment of machine learning models in real-world scenarios. OOD inputs are commonly expected to cause a more uncertain prediction in the primary task; however, there are OOD cases for which the model returns a highly confident prediction. This phenomenon, denoted as "overconfidence", presents a challenge to OOD detection. Specifically, theoretical evidence indicates that overconfidence is an intrinsic property of certain neural network architectures, leading to poor OOD detection. In this work, we address this issue by measuring extreme activation values in the penultimate layer of neural networks and then leveraging this proxy of overconfidence to improve on several OOD detection baselines. We test our method on a wide array of experiments spanning synthetic and real-world data, tabular and image datasets, multiple architectures such as ResNet and Transformer, and different training loss functions, and we include the scenarios examined in previous theoretical work. Compared to the baselines, our method often grants substantial improvements, with double-digit increases in OOD detection AUC, and it does not damage performance in any scenario.
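A rough sketch of a score along these lines (the specific combination with a baseline below is an assumption for illustration, not the paper's recipe): compute the largest absolute activation in the penultimate layer and use it to temper a maximum-softmax-probability score.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ood_score(features, logits, eps=1e-8):
    """Higher score = more likely OOD (illustrative combination only).

    features: (n, d) penultimate-layer activations
    logits:   (n, c) classifier outputs
    """
    msp = softmax(logits).max(axis=1)              # baseline: max softmax probability
    extreme = np.abs(features).max(axis=1)         # overconfidence proxy
    return -(msp / (extreme + eps))                # discount confident but extreme inputs

rng = np.random.default_rng(0)
feats, logits = rng.normal(size=(8, 128)), rng.normal(size=(8, 10))
print(ood_score(feats, logits))
```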
Model selection is a commonly suggested approach for mitigating the risk of poor generalisation or overfitting in a range of modelling scenarios, especially when models of increasing complexity are considered. Bayesian modelling workflows often require the consideration of different candidate models, and approaches for model selection in the Bayesian framework aim to support the modeller in navigating potential trade-offs between model complexity and generalisability of the results to yet unobserved data. In this work, we propose a change of perspective towards choosing generative priors, instead of relying on model selection after the fact. We revisit the issue of overfitting, and clarify why model selection is not necessarily needed and can even be harmful in some modelling scenarios with finite data. When integrating over the posterior and using generatively consistent priors, even if those priors can be considered weakly informative, we can safely use flexible models with a large number of parameters. We illustrate the relevance of appropriate prior choices, as well as the limitations and alternatives for model selection in different modelling tasks in simulated and real-data examples.
The distribution of the weights of modern deep neural networks (DNNs), crucial for uncertainty quantification and robustness, is an eminently complex object due to its extremely high dimensionality. This paper proposes one of the first large-scale explorations of the posterior distribution of deep Bayesian Neural Networks (BNNs), expanding its study to real-world vision tasks and architectures. Specifically, we investigate the optimal approach for approximating the posterior, analyze the connection between posterior quality and uncertainty quantification, delve into the impact of modes on the posterior, and explore methods for visualizing the posterior. Moreover, we uncover weight-space symmetries as a critical aspect for understanding the posterior. To this end, we develop an in-depth assessment of the impact of both permutation and scaling symmetries that tend to obfuscate the Bayesian posterior. While the first type of transformation is known for duplicating modes, we explore the relationship between the latter and L2 regularization, challenging previous misconceptions. Finally, to help the community improve our understanding of the Bayesian posterior, we will shortly release the first large-scale checkpoint dataset, including thousands of real-world models, along with our code.
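The permutation symmetry mentioned above is easy to make concrete. The following self-contained sketch (a tiny tanh MLP with random weights) shows that permuting the hidden units, together with the matching rows and columns of the weight matrices, leaves the network function unchanged, which is why such symmetries duplicate posterior modes:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 5, 16, 3
W1, b1 = rng.normal(size=(d_hidden, d_in)), rng.normal(size=d_hidden)
W2, b2 = rng.normal(size=(d_out, d_hidden)), rng.normal(size=d_out)

def mlp(x, W1, b1, W2, b2):
    return W2 @ np.tanh(W1 @ x + b1) + b2

perm = rng.permutation(d_hidden)                  # relabel the hidden units
W1_p, b1_p, W2_p = W1[perm], b1[perm], W2[:, perm]

x = rng.normal(size=d_in)
print(np.allclose(mlp(x, W1, b1, W2, b2), mlp(x, W1_p, b1_p, W2_p, b2)))  # True
```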
Detecting out-of-distribution (OOD) data is a critical challenge in machine learning, as models are often overconfident and unaware of their epistemological limits. We hypothesize that "neural collapse", a phenomenon affecting in-distribution data for models trained beyond loss convergence, also influences OOD data. To benefit from this interplay, we introduce NECO, a novel post-hoc method for OOD detection, which leverages the geometric properties of "neural collapse" and of principal component spaces to identify OOD data. Our extensive experiments demonstrate that NECO achieves state-of-the-art results on both small- and large-scale OOD detection tasks while exhibiting strong generalization capabilities across different network architectures. Furthermore, we provide a theoretical explanation for the effectiveness of our method in OOD detection.
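A simplified, PCA-style reading of such a score (an illustration in the spirit of the description, not the exact NECO score; the feature matrices are random stand-ins): fit a principal subspace on in-distribution features and score a test input by how much of its feature norm survives projection onto that subspace.

```python
import numpy as np

def fit_subspace(id_features, d=32):
    """Top-d principal directions of in-distribution features."""
    mu = id_features.mean(axis=0)
    _, _, vt = np.linalg.svd(id_features - mu, full_matrices=False)
    return mu, vt[:d]                      # (d, feat_dim) basis

def subspace_score(features, mu, basis, eps=1e-8):
    """Fraction of the (centered) feature norm captured by the ID subspace."""
    z = features - mu
    proj = z @ basis.T                     # coordinates in the principal subspace
    return np.linalg.norm(proj, axis=1) / (np.linalg.norm(z, axis=1) + eps)

rng = np.random.default_rng(0)
id_feats, test_feats = rng.normal(size=(1000, 256)), rng.normal(size=(8, 256))
mu, basis = fit_subspace(id_feats)
print(subspace_score(test_feats, mu, basis))   # lower values suggest OOD
```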
Fairness via Transparency: We use conformal prediction to provide individualised uncertainty for personalised medicine. Using a publicly available TCGA breast cancer dataset, we compute conformal prediction sets for the Oncotype DX recurrence score and compare both performance and uncertainty across ancestral subpopulations.
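The abstract is brief, so here is a generic split-conformal sketch (synthetic scores, not the TCGA data or an actual Oncotype DX model) of how per-patient prediction intervals can be computed:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1                                           # target 90% marginal coverage

# Stand-ins for a held-out calibration set: true scores and model predictions.
y_cal = rng.normal(25, 10, size=500)
yhat_cal = y_cal + rng.normal(0, 5, size=500)

# Split-conformal quantile of the absolute residuals.
n = len(y_cal)
scores = np.abs(y_cal - yhat_cal)
q_level = np.ceil((n + 1) * (1 - alpha)) / n
qhat = np.quantile(scores, q_level, method="higher")

# Prediction intervals for new patients.
yhat_new = rng.normal(25, 10, size=5)
lower, upper = yhat_new - qhat, yhat_new + qhat
print(np.c_[lower, upper])
```

Coverage can then be compared across ancestral subpopulations by checking how often the true score falls inside its interval within each group.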
In this paper we construct and theoretically analyse group-equivariant convolutional kernel networks (equiv-CKNs), which are useful for understanding the geometry of (equivariant) CNNs through the lens of reproducing kernel Hilbert spaces (RKHSs). We then study the stability of such equiv-CKNs under the action of diffeomorphisms and draw a connection to equiv-CNNs, the goal being to analyse the geometry of the inductive biases of equiv-CNNs through the RKHS lens. Traditional deep learning architectures, including CNNs trained with sophisticated optimization algorithms, are vulnerable to perturbations, including ‘adversarial examples’. Understanding the RKHS norm of such models through CKNs informs the design of appropriate architectures and can help in building robust equivariant representation learning models.
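For reference, the standard group-equivariant convolution underlying both equiv-CNNs and the equiv-CKNs studied here is, for a group \(G\) with Haar measure \(\mu\) (notation chosen here for illustration),
\[
(f \star_G \psi)(g) \;=\; \int_{G} f(h)\, \psi\!\left(g^{-1}h\right)\, \mathrm{d}\mu(h), \qquad g \in G,
\]
which satisfies \((L_{u} f) \star_G \psi = L_{u}\,(f \star_G \psi)\) for the left-translation operator \(L_u\); the kernel-network construction replaces the filter \(\psi\) by a feature map into an RKHS.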
Machine learning (ML) is increasingly applied across various fields, with its predictions rapidly impacting individuals. Consequently, ML fairness has emerged as a critical issue, and research in ML fairness has proliferated and evolved. This study addresses three key challenges in this domain. First, the dominance of group fairness and the coarse categorisation of sensitive attributes, such as reducing skin colour to skin types, lead to the neglect of latent biases within groups. Second, previous fairness research has predominantly been outcome-oriented, ignoring biases present in the model's prediction process. Third, the robustness of models with respect to fairness has received little attention. To tackle these challenges, we define a multi-perspective, fairness-centred evaluation framework. Our framework applies statistical distance measures to assess model fairness and robustness. Furthermore, we employ explainability techniques to validate the fairness justification. In addition, we introduce a bias mitigation mechanism based on skin colour nuance using our performance evaluation algorithm. Our approach is empirically validated on five datasets in image classification tasks, focusing on skin colour as a case study. Our novel fairness evaluation algorithm and framework enhance the measurement and assurance of fairness in ML.
We propose a novel generalized Bayesian inference (GBI) framework called Hölder-Bayes, based on the Hölder score. Our framework can be used both in GBI with intractable likelihoods and in the context of likelihood-free inference. We theoretically show that Hölder-Bayes has an ideal robustness property: it asymptotically ignores the effects of outliers without assuming a bounded generic loss function. Furthermore, we provide some concentration bounds. As this is an ongoing study, further theoretical guarantees and numerical experiments will be carried out as future work leading up to publication.
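For context, the generic generalized (Gibbs) posterior that GBI frameworks of this kind build on replaces the log-likelihood with a loss \(\ell\),
\[
\pi_{\omega}(\theta \mid x_{1:n}) \;\propto\; \pi(\theta)\, \exp\!\Bigl(-\,\omega \sum_{i=1}^{n} \ell(\theta, x_i)\Bigr),
\]
with Hölder-Bayes taking \(\ell\) to be a loss derived from the Hölder score; the precise form is given in the paper and is not reproduced here.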
Raising likelihoods to a power has become a popular approach for calibrating parameter uncertainty in Bayesian analysis. Here, we instead focus on the induced posterior predictive. This leads to a surprising discovery: predictively, in regular parametric models, Bayes posteriors perform as well as power posteriors. Further, even for moderate sample sizes, the choice of temperature is inconsequential to predictive performance if it scales at least at rate n^-1/2.
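For reference, the power (tempered) posterior and the induced posterior predictive referred to above are
\[
\pi_t(\theta \mid x_{1:n}) \;\propto\; \pi(\theta)\, \prod_{i=1}^{n} p(x_i \mid \theta)^{t}, \qquad
p_t(\tilde{x} \mid x_{1:n}) \;=\; \int p(\tilde{x} \mid \theta)\, \pi_t(\theta \mid x_{1:n})\, \mathrm{d}\theta,
\]
with \(t = 1\) recovering the standard Bayes posterior; the result concerns how \(p_t\) behaves as a function of the temperature \(t\) and the sample size \(n\).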
Brain tumor segmentation (BTS) is conventionally formulated as a voxel-level classification that relies on binary segmentation masks. However, such an approach treats foreground and background voxels equally, neglecting the intricacies and uncertainties of annotated voxels. In this paper, we approach BTS as a voxel-level regression. We propose a novel image transformation, termed the Signed Normalized Geodesic Transform, which takes into account the difficulty of individual voxels when converting binary masks to soft labels. Furthermore, to address the imbalance of foreground and background voxels, we introduce a Focal-like regression loss. Empirically, our method shows superior performance across various network architectures when compared to voxel-wise classification.
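As a rough, hypothetical sketch of the soft-label idea (a signed, normalized Euclidean distance transform stands in for the geodesic transform actually proposed, and the focal-style weighting below is illustrative rather than the paper's exact loss):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_normalized_distance(mask):
    """Soft label in [-1, 1]: positive inside the object, negative outside."""
    mask = mask.astype(bool)
    inside = distance_transform_edt(mask)          # distance to the background
    outside = distance_transform_edt(~mask)        # distance to the foreground
    signed = inside - outside
    return signed / (np.abs(signed).max() + 1e-8)

def focal_like_l1(pred, target, gamma=1.0):
    """L1 regression loss that up-weights voxels with larger errors."""
    err = np.abs(pred - target)
    return np.mean(err ** (1.0 + gamma))

mask = np.zeros((64, 64, 64), dtype=np.uint8)
mask[20:44, 20:44, 20:44] = 1                      # toy "tumor" region
soft = signed_normalized_distance(mask)
pred = soft + np.random.default_rng(0).normal(0, 0.1, soft.shape)
print(soft.min(), soft.max(), focal_like_l1(pred, soft))
```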
Accurate retinal vessel segmentation is a crucial step in the quantitative assessment of retinal vasculature, which is needed for the early detection of retinal diseases and other conditions. Numerous studies have tackled the problem of segmenting vessels automatically using a pixel-wise classification approach. The common practice when creating ground truth labels is to categorize pixels as foreground or background. This approach, however, is biased and ignores the uncertainty of a human annotator when annotating, e.g., thin vessels. In this work, we propose a simple and effective method that casts the retinal image segmentation task as an image-level regression. For this purpose, we first introduce a novel Segmentation Annotation Uncertainty-Aware (SAUNA) transform, which adds pixel uncertainty to the ground truth using the pixel's closeness to the annotation boundary and the vessel thickness. To train our model with soft labels, we generalize the earlier proposed Jaccard metric loss to arbitrary hypercubes, which constitutes the second contribution of this work. The proposed SAUNA transform and the new theoretical results allow us to directly train a standard U-Net-like architecture at the image level, outperforming all recently published methods. We conduct thorough experiments and compare our method to a diverse set of baselines across 5 retinal image datasets.
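For illustration, one L1-based soft extension of the Jaccard index that accepts soft labels (a reading of the Jaccard-metric-loss family; the hypercube generalization proposed in the paper may differ in detail) reduces, for binary labels, to one minus the usual intersection-over-union:

```python
import numpy as np

def soft_jaccard_loss(pred, target, eps=1e-8):
    """1 - soft IoU on flattened predictions/targets in [0, 1].

    For binary targets this equals 1 - |intersection| / |union|, since
    |A & B| = (|A| + |B| - |A - B|) / 2 and |A | B| = (|A| + |B| + |A - B|) / 2.
    """
    p, t = pred.ravel(), target.ravel()
    sums = p.sum() + t.sum()
    diff = np.abs(p - t).sum()
    return 1.0 - (sums - diff) / (sums + diff + eps)

rng = np.random.default_rng(0)
target = (rng.uniform(size=(64, 64)) > 0.7).astype(float)                  # hard labels
soft_target = np.clip(target + rng.normal(0, 0.1, target.shape), 0, 1)     # soft labels
pred = np.clip(soft_target + rng.normal(0, 0.05, target.shape), 0, 1)
print(soft_jaccard_loss(pred, target), soft_jaccard_loss(pred, soft_target))
```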