February 1st - 10h - 17h
10h - 10h15: Opening (Willem Waegeman, Universiteit Gent)
10h15-11h: General overview of uncertainty in machine learning (Eyke Huellermeier, LMU München)
11h - 12h30: Calibration I
Fit-on-Test View on Evaluating Calibration of Classifiers (Markus Kängsepp, University of Tartu) - presentation
Generality-training of a Classifier for Improved Calibration in Unseen Contexts (Bhawani Shankar Leelar, University of Tartu) - presentation
12h30 - 13h30: Lunch break
13h30 - 15h: Imprecise probabilities
Structural Causal Models Are (Solvable by) Credal Networks (Alessandro Antonucci, IDSIA) - presentation
Prime implicants as explanation of robust classification (Sébastien Destercke & colleagues, CNRS-Heudiasyc) - presentation
15h - 15h30: Break
15h30 - 17h: Cross-validation
Towards practical applications of cross-validation error combinatorics, new statistical significance tests and beyond (Tapio Pahikkala, University of Turku)
The out-of-sample R2: estimation and inference (Stijn Hawinkel, Universiteit Gent) - presentation
19h - 22h: Workshop dinner in restaurant Pakhuis (City Center)
February 2nd - 9h - 17h
9h - 10h30: Uncertainty in deep learning
Autoinverse: Uncertainty Aware Inversion of Neural Networks (Navid Ansari, Max Planck Institute for Informatics) - presentation
Rethinking the choice of loss functions for classification with deep learning (Viacheslav Komisarenko, University of Tartu) - presentation
10h30 - 11h: Break
11h - 12h30: Epistemic uncertainty quantification using risk minimization
On the Fundamental Flaw of Quantifying Epistemic Uncertainty through Loss Minimisation (Viktor Bengs, LMU München) - presentation
The Unreasonable Effectiveness of Deep Evidential Regression (Nis Meinert, Pasteur Labs) - presentation
12h30 - 13h30: Lunch break
13h30 - 15h45: Conformal prediction
Calibrated multi-probabilistic prediction as a defense against adversarial attacks (Jonathan Peck, Universiteit Gent) - presentation
Conditional conformal prediction: An overview (Nicolas Dewolf, Universiteit Gent) - presentation
Conformal prediction intervals for remaining useful lifetime estimation (Alireza Javanmardi, LMU München) - presentation
15h45 - 16h15: Break
16h15 - 17h45: Calibration II
Guarantees for High Confidence Predictions in Binary Scoring Classifiers (Mari-Liis Allikivi, University of Tartu) - presentation
On the calibration of probabilistic classifier sets (Thomas Mortier and Mira Kristin Juergens, Universiteit Gent)
February 3rd - 9h - 14h
9h - 10h30: Human annotations
Learning of individual preferences under uncertainty (Loic Adam, CNRS-Heudiasyc) - presentation
Modelling Human Biases and Uncertainties in Temporal Annotations (Taku Yamagata, University of Bristol) - presentation
10h30 - 11h: Break
11h - 12h30: Downstream usage of uncertainty measures
Aligning evaluation of uncertainty-aware forecasts to their downstream usage (Novin Shahroudi, University of Tartu) - presentation
Robust Bayesian optimization under input uncertainty (Jixiang Qing, Universiteit Gent)
12h30 - 13h30: Lunch
13h30 - ?: Informal discussion and project meetings (the official program ends here, but we can still use the room for pop-up talks and meetings in small groups)
List of participants:
Abstracts:
Fit-on-Test View on Evaluating Calibration of Classifiers (Markus Kängsepp)
Calibrated uncertainty estimates are essential for classifiers used in safety-critical applications. Every uncalibrated classifier has a corresponding idealistic true calibration map that calibrates its uncertainty. Deviations of this idealistic map from the identity map reveal miscalibration. Such calibration errors can be reduced with many post-hoc calibration methods which fit some family of functions on a validation dataset. In contrast, evaluation of calibration with the expected calibration error (ECE) on the test set does not explicitly involve fitting. With our proposed fit-on-test view on evaluating calibration, we demonstrate the benefits of fitting on the test set for evaluation purposes. This enables the usage of any post-hoc calibration method as an evaluation measure, unlocking missed opportunities in the development of evaluation methods. We prove that ECE is actually also a fit-on-test measure, and thus its number of bins can be tuned with cross-validation.
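For reference, a minimal sketch of the standard equal-width binned ECE the abstract refers to (not the authors' fit-on-test implementation); the bin count n_bins is the hyperparameter that, under the fit-on-test view, could be tuned by cross-validation. All names are illustrative.

    import numpy as np

    def binned_ece(probs, labels, n_bins=15):
        """Equal-width binned expected calibration error for binary classification.

        probs  : predicted probabilities of the positive class, shape (n,)
        labels : binary ground-truth labels in {0, 1}, shape (n,)
        """
        probs = np.asarray(probs, dtype=float)
        labels = np.asarray(labels, dtype=float)
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            # include the right edge in the last bin
            in_bin = (probs >= lo) & ((probs < hi) | (hi == 1.0))
            if not in_bin.any():
                continue
            conf = probs[in_bin].mean()   # average predicted probability in the bin
            acc = labels[in_bin].mean()   # empirical frequency of positives in the bin
            ece += in_bin.mean() * abs(acc - conf)
        return ece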
Generality-training of a Classifier for Improved Calibration in Unseen Contexts (Bhawani Shankar Leelar)
Artificial neural networks tend to output class probabilities that are miscalibrated, i.e. their reported uncertainty is not a very good indicator of how much we should trust the model. Consequently, methods have been developed to improve the model's predictive uncertainty, both during training and post-hoc. Even if the model is then calibrated on the domain used in training, it typically becomes over-confident when applied on slightly different target domains, e.g. due to perturbations or shifts in the data. The model can be recalibrated for a fixed list of target domains, but its performance can still be poor on any unseen target domains. To address this issue, we propose a generality-training procedure that learns a modified head for the neural network, to achieve better calibration generalization to new domains while retaining calibration performance on the given domains. This model is trained on multiple domains using a new objective function with increased emphasis on the calibration loss compared to cross-entropy. This results in a more general model, in the sense of not only better calibration but also better accuracy on unseen domains, as we demonstrate experimentally on multiple datasets.
Structural Causal Models Are (Solvable by) Credal Networks (Alessandro Antonucci)
A structural causal model is made of endogenous (manifest) and exogenous (latent) variables. We show that endogenous observations induce linear constraints on the probabilities of the exogenous variables. This allows a causal model to be exactly mapped into a credal network. Causal inferences, such as interventions and counterfactuals, can consequently be obtained by standard algorithms for the updating of credal nets. These natively return sharp values in the identifiable case, while intervals corresponding to the exact bounds are produced for unidentifiable queries. A characterization of the causal models that allow the map above to be compactly derived is given, along with a discussion about the scalability for general models. This contribution should be regarded as a systematic approach to represent structural causal models by credal networks and hence to systematically compute causal inferences. A number of demonstrative examples are presented to clarify our methodology. Extensive experiments show that approximate algorithms for credal networks can immediately be used to do causal inference in real-size problems.
Prime implicants as explanation of robust classification (Sébastien Destercke & colleagues)
Providing elements of explanations is becoming an essential part of modern AI. In contrast with explanations based on LIME or SHAP, explanations based on logical notions provide elements of explanation with stronger guarantees. This talk will briefly review some of these logical explanations, before focusing on the case of prime implicants for imprecise probabilistic classifiers, extending standard approaches and providing preliminary ideas about how to explain abstention.
Towards practical applications of cross-validation error combinatorics, new statistical significance tests and beyond (Tapio Pahikkala)
Cross-validation (CV) is a popular method for estimating supervised binary classifiers' prediction performance with limited amounts of data. Given the estimate, the next natural question is how probable it is to obtain an equally good or even better result if the labels of the data were randomly assigned. Building on the no-free-lunch theorems for machine learning on one hand and the theory of error detecting codes on the other hand, we seek a general answer to this question, one that would hold over all possible learning algorithms. As an immediate analogy, we see that the maximal number of binary classification problems for which a machine learning method can have no CV errors is equal to the maximum number of code words in an error detecting code of the same word length as the number of available data points and the same error detecting capability as the size of the hold-out set in CV. As a case study, we consider area under ROC curve (AUC) estimation via leave-pair-out cross-validation (LPOCV). Within this case study, we present an extended result: by introducing what we call light error detecting codes, we bound the probability of obtaining equally good or better LPOCV-based AUC estimates. If this probability is smaller than some pre-defined significance threshold, one can reject the null hypothesis that the class labels are randomly assigned. Accordingly, our results enable the design of new kinds of tests of statistical significance for CV-based classification performance estimates. More generally, our results provide novel ways to analyze learning algorithms in terms of their CV error distributions.
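As a rough illustration of the LPOCV estimator the case study builds on (not of the combinatorial machinery or the significance test from the talk), a naive leave-pair-out AUC estimate might look as follows; the base estimator and all names are placeholder choices.

    import numpy as np
    from itertools import product
    from sklearn.base import clone
    from sklearn.linear_model import LogisticRegression

    def lpocv_auc(X, y, base_estimator=None):
        """Naive leave-pair-out cross-validation estimate of AUC.

        Each positive-negative pair is held out in turn, the model is refit on the
        remaining data (assumed to still contain both classes), and we record whether
        the positive example receives a higher score than the negative one (ties 0.5).
        """
        if base_estimator is None:
            base_estimator = LogisticRegression(max_iter=1000)
        X, y = np.asarray(X), np.asarray(y)
        pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
        wins = 0.0
        for i, j in product(pos, neg):
            keep = np.setdiff1d(np.arange(len(y)), [i, j])
            model = clone(base_estimator).fit(X[keep], y[keep])
            s_pos, s_neg = model.decision_function(X[[i, j]])
            wins += 1.0 if s_pos > s_neg else (0.5 if s_pos == s_neg else 0.0)
        return wins / (len(pos) * len(neg))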
The out-of-sample R2: estimation and inference (Stijn Hawinkel)
Out-of-sample prediction is the acid test of predictive models, yet an independent test dataset is often not available for assessment of the prediction error. For this reason, out-of-sample performance is commonly estimated using data splitting algorithms such as cross-validation or the bootstrap. For quantitative outcomes, the ratio of variance explained by the model to total variance can be summarized by the coefficient of determination or in-sample R2, which is easy to interpret and to compare across different outcome variables. As opposed to the in-sample R2, the out-of-sample R2 has not been well defined, and the variability of the out-of-sample R2 estimator has been largely ignored. Usually only its point estimate is reported, hampering formal comparison of the predictability of different outcome variables. Here we explicitly define the out-of-sample R2 as a comparison of two predictive models, provide an unbiased estimator for it, and exploit recent theoretical advances on the uncertainty of data splitting estimates to provide a standard error for the R2 estimate. The bias and variance of the estimators for the R2 and its standard error are investigated in a simulation study. Next, we demonstrate our new method by constructing confidence intervals and comparing models for a real data example involving prediction of quantitative Brassica napus and Zea mays phenotypes based on gene expression data.
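A minimal sketch of a cross-validated point estimate, framing the out-of-sample R2 as a comparison of a fitted model against the trivial mean-only model as in the abstract; the unbiasedness corrections and the standard-error construction from the talk are not reproduced, and the Ridge model is an arbitrary placeholder.

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.linear_model import Ridge

    def cv_out_of_sample_r2(X, y, model=None, n_splits=5, random_state=0):
        """Cross-validated point estimate of the out-of-sample R2: one minus the
        ratio of the model's out-of-fold squared error to that of the mean-only model."""
        if model is None:
            model = Ridge()
        X, y = np.asarray(X), np.asarray(y)
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
        sse_model, sse_mean = 0.0, 0.0
        for train, test in kf.split(X):
            model.fit(X[train], y[train])
            pred = model.predict(X[test])
            baseline = y[train].mean()            # mean-only competitor model
            sse_model += np.sum((y[test] - pred) ** 2)
            sse_mean += np.sum((y[test] - baseline) ** 2)
        return 1.0 - sse_model / sse_mean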
Rethinking the choice of loss functions for classification with deep learning (Viacheslav Komisarenko)
The loss function has an important role in the training process, heavily influencing backward gradients, network parameter updates, and, therefore, the performance of the final model. Model performance can be evaluated from different perspectives. Firstly, how well the predicted logits are ranked among each other, relating to metrics based on the AUC score. Secondly, how well the predicted logits are located with respect to a specific point (e.g. zero), yielding metrics such as the error rate. Thirdly, how well the predicted probabilities align with one-hot encoded ground truth labels, relating to cross-entropy and mean squared error, for example. Finally, how well the predicted probabilities align with the ground truth frequencies of each (or some) class. Each group of evaluation metrics could require a different approach to choosing a loss function. However, typically the same loss function is used for training, regardless of which evaluation measure is ultimately of interest. The training loss is usually either cross-entropy or a loss based on cross-entropy (focal loss, label smoothing and their modifications). We studied the choice of the loss function under class cost uncertainty and derived a new family of loss functions that outperforms the commonly used cross-entropy and focal loss on both standard and cost-sensitive evaluation metrics on binary image classification datasets. We also suggest a unified view that combines probability-level losses and link functions during training and evaluation. This theory could explain recent successes of focal loss and its latest modifications and suggest further directions for improvement.
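For reference only, the standard binary focal loss mentioned in the abstract (the new loss family proposed in the talk is not shown); setting gamma = 0 recovers alpha-weighted cross-entropy.

    import numpy as np

    def binary_focal_loss(probs, labels, gamma=2.0, alpha=0.25, eps=1e-12):
        """Standard binary focal loss; down-weights well-classified examples
        via the (1 - p_t)**gamma modulating factor."""
        probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
        labels = np.asarray(labels, dtype=float)
        p_t = np.where(labels == 1, probs, 1.0 - probs)   # probability of the true class
        a_t = np.where(labels == 1, alpha, 1.0 - alpha)   # class weighting factor
        return np.mean(-a_t * (1.0 - p_t) ** gamma * np.log(p_t))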
On the Fundamental Flaw of Quantifying Epistemic Uncertainty through Loss Minimisation (Viktor Bengs)
The increasing use of machine learning methods in safety-critical applications has led to an increase in research interest in uncertainty quantification. In this context, a distinction between aleatoric and epistemic uncertainty has proven useful, the latter referring to the learner's (lack of) knowledge and being particularly difficult to measure and quantify. A steadily growing branch of the literature proposes the use of a second-order learner that provides predictions in terms of distributions over probability distributions. While standard (first-order) learners can be trained to predict accurate probabilities, namely by minimising suitable loss functions on sample data, recent work has shown serious shortcomings for second-order predictors based on loss minimisation. In this talk, we highlight a fundamental problem with these approaches: All of the proposed loss functions provide no incentive for second-order learners to represent their epistemic uncertainty in the same faithful way as first-order learners.
The Unreasonable Effectiveness of Deep Evidential Regression (Nis Meinert)
There is a significant need for principled uncertainty reasoning in machine learning systems as they are increasingly deployed in safety-critical domains. A new approach with uncertainty-aware regression-based neural networks (NNs), based on learning evidential distributions for aleatoric and epistemic uncertainties, shows promise over traditional deterministic methods and typical Bayesian NNs, notably with the capabilities to disentangle aleatoric and epistemic uncertainties. Despite some empirical success of Deep Evidential Regression (DER), there are important gaps in the mathematical foundation that raise the question of why the proposed technique seemingly works. We detail the theoretical shortcomings and analyze the performance on synthetic and real-world data sets, showing that Deep Evidential Regression is a heuristic rather than an exact uncertainty quantification. We go on to discuss corrections and redefinitions of how aleatoric and epistemic uncertainties should be extracted from NNs.
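For context, the quantities under discussion: a deep evidential regression head outputs Normal-Inverse-Gamma parameters (gamma, nu, alpha, beta), from which the point prediction and the commonly used aleatoric and epistemic uncertainty estimates are read off as below. This is a sketch of the original DER recipe, not of the corrections proposed in the talk.

    import numpy as np

    def nig_uncertainties(gamma, nu, alpha, beta):
        """Given Normal-Inverse-Gamma outputs (gamma, nu, alpha, beta) of a deep
        evidential regression head, return the point prediction together with the
        usual aleatoric and epistemic uncertainty estimates (alpha > 1 assumed)."""
        prediction = gamma                        # most likely value of the target
        aleatoric = beta / (alpha - 1.0)          # expected data-noise variance
        epistemic = beta / (nu * (alpha - 1.0))   # variance of the predicted mean
        return prediction, aleatoric, epistemic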
Quantifying Prediction Uncertainty in Regression using Random Fuzzy Sets: the ENNreg model (Thierry Denoeux)
I introduce a neural network model for regression in which prediction uncertainty is quantified by Gaussian random fuzzy numbers (GRFNs), a newly introduced family of random fuzzy subsets of the real line that generalizes both Gaussian random variables and Gaussian possibility distributions. The output GRFN is constructed by combining GRFNs induced by prototypes using a combination operator that generalizes Dempster's rule of Evidence Theory. The three output units indicate the most plausible value of the response variable, variability around this value, and epistemic uncertainty. The network is trained by minimizing a loss function that generalizes the negative log-likelihood. Comparative experiments show that this method is competitive, both in terms of prediction accuracy and calibration error, with state-of-the-art techniques such as random forests or deep learning with Monte Carlo dropout. In addition, the model outputs a predictive belief function that can be shown to be calibrated, in the sense that it allows us to compute conservative prediction intervals with specified belief degree.
Calibrated multi-probabilistic prediction as a defense against adversarial attacks (Jonathan Peck)
Machine learning (ML) classifiers - in particular deep neural networks - are surprisingly vulnerable to so-called adversarial examples. These are small modifications of natural inputs which drastically alter the output of the model even though no relevant features appear to have been modified. One explanation that has been offered for this phenomenon is the calibration hypothesis, which states that the probabilistic predictions of typical ML models are miscalibrated. As a result, classifiers can often be very confident in completely erroneous predictions. Based on this idea, we propose the MultIVAP algorithm for defending arbitrary ML models against adversarial examples. Our method is inspired by the inductive Venn-ABERS predictor (IVAP) technique from the field of conformal prediction. The IVAP enjoys the theoretical guarantee that its predictions will be perfectly calibrated, thus addressing the problem of miscalibration. Experimental results on five image classification tasks demonstrate empirically that the MultIVAP has a reasonably small computational overhead and provides significantly higher adversarial robustness without sacrificing accuracy on clean data. This increase in robustness is observed both against defense-oblivious attacks and against a defense-aware white-box attack specifically designed for the MultIVAP.
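A naive sketch of the inductive Venn-ABERS step the defense builds on: one isotonic-regression refit per tentative label and per test point. Efficient algorithms exist, and the multiclass MultIVAP extension itself is not shown; function names are illustrative.

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    def ivap_predict(cal_scores, cal_labels, test_score):
        """Inductive Venn-ABERS predictor for one test score: return the
        multiprobability pair (p0, p1) obtained by tentatively labelling the
        test point 0 and then 1 before fitting isotonic regression."""
        probs = []
        for tentative in (0, 1):
            scores = np.append(cal_scores, test_score)
            labels = np.append(cal_labels, tentative)
            iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
            iso.fit(scores, labels)
            probs.append(float(iso.predict([test_score])[0]))
        # (p0, p1); one common way to merge them into a single probability
        # is p1 / (1 - p0 + p1)
        return tuple(probs)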
Conditional conformal prediction: An overview (Nicolas Dewolf)
We compare and analyze different approaches to conditional uncertainty estimation based on conformal prediction. The main competitors in this category are normalized and Mondrian conformal prediction. Whereas the latter has theoretical conditionality guarantees, the former has the benefit of avoiding further data splits. In particular, we consider the situation where the conditioning happens with respect to the uncertainty. We consider both theoretical relations and empirical trade-offs.
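For comparison with the variants discussed above, a minimal sketch of plain split-conformal regression intervals, the common baseline that both variants modify: normalized conformal prediction rescales the residuals by a per-example difficulty estimate, while Mondrian conformal prediction computes the quantile within each category. The sketch assumes any sklearn-style regressor; names are illustrative.

    import numpy as np

    def split_conformal_interval(model, X_train, y_train, X_cal, y_cal, X_new, alpha=0.1):
        """Plain split-conformal prediction intervals for regression.

        Fit on a proper training set, compute absolute residuals on a separate
        calibration set, and use their (1 - alpha) empirical quantile (with the
        finite-sample correction) as a symmetric half-width around new predictions.
        """
        model.fit(X_train, y_train)
        residuals = np.abs(np.asarray(y_cal) - model.predict(X_cal))
        n = len(residuals)
        level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
        q = np.quantile(residuals, level)
        preds = model.predict(X_new)
        return preds - q, preds + q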
Guarantees for High Confidence Predictions in Binary Scoring Classifiers (Mari-Liis Allikivi)
High confidence predictions differ from lower ones since trusting them can lead to bigger bets or riskier actions, and overconfidence in those cases can lead to big losses. Confidences can be improved via calibration, but there is no certainty of being fully calibrated rather than just more calibrated than before. We propose an idea for a method to obtain guarantees for high confidence predictions. Whenever the output of a model can be turned into a binary sequence ordered by the predicted confidences, the method gives guarantees for each point in the higher end of this ordered sequence. Guarantee calculation goes beyond binary classification, as the method can be applied to any problem that is representable as a binary evaluation (e.g. right/wrong) with a score attached to the decision. The guarantee at each point can be represented by a value q, indicating that the true probability at this point is larger than q with probability q. Intuitively, high values of q indicate that we can be very certain that the true probability is very high. The method is not yet finalized as we still need to figure out some pieces of the puzzle.
On the calibration of probabilistic classifier sets (Thomas Mortier and Mira Kristin Juergens)
Multi-class classification methods that produce sets of probabilistic classifiers, such as ensemble learning methods, are able to model aleatoric and epistemic uncertainty. Aleatoric uncertainty is then typically quantified via the Bayes error, and epistemic uncertainty via the size of the set. In this paper, we extend the notion of calibration, which is commonly used to evaluate the validity of the aleatoric uncertainty representation of a single probabilistic classifier, to assess the validity of an epistemic uncertainty representation obtained by sets of probabilistic classifiers. Broadly speaking, we call a set of probabilistic classifiers calibrated if one can find a calibrated convex combination of these classifiers. To evaluate this notion of calibration, we propose a novel nonparametric calibration test that generalizes an existing test for single probabilistic classifiers to the case of sets of probabilistic classifiers. Making use of this test, we empirically show that ensembles of deep neural networks are often not well calibrated.
Learning of individual preferences under uncertainty (Loic Adam)
The aim of this short presentation is to first introduce the listener to preference elicitation, where we want to determine the preferences of a single user rather than of a population. We also want to show why uncertainty is a problem in preference elicitation, and why we want to manage/repair it. We focus on incremental elicitation and its robust counterpart. The first point of the talk shows why it is a nice method in general, but does not work well with uncertainty. The second point is on possibility theory, and how it can manage uncertainty rather effectively. The last point is on how to repair inconsistency, focusing on two proposed solutions: changing the fusion rule, or using Maximal Coherent Subsets.
Modelling Human Biases and Uncertainties in Temporal Annotations (Taku Yamagata)
In supervised learning, low quality annotations lead to poorly performing classification and detection models while also rendering evaluation unreliable. Annotation quality is affected by multiple factors. For example, in the post-hoc self-reporting of daily activities, cognitive biases are among the most common factors. In particular, reporting the start and duration of an activity after its finalisation may incorporate biases introduced by personal time perceptions, as well as the imprecision and lack of granularity caused by time rounding. When dealing with time-bounded data, the annotations' consistency over the event is particularly important for both event detection and classification. Here we propose a method to model human biases and uncertainties in temporal annotations, as well as the use of soft labels. Experimental results on synthetic data show that soft labels are a better approximation of the ground truth for several metrics. We showcase the method on a real dataset of daily activities.
Autoinverse: Uncertainty Aware Inversion of Neural Networks (Navid Ansari)
Neural networks are powerful surrogates for numerous forward processes. The inversion of such surrogates is extremely valuable in science and engineering. The most important property of a successful neural inverse method is the performance of its solutions when deployed in the real world, i.e., on the native forward process (and not only the learned surrogate). We propose Autoinverse, a highly automated approach for inverting neural network surrogates. Our main insight is to seek inverse solutions in the vicinity of reliable data which have been sampled from the forward process and used for training the surrogate model. Autoinverse finds such solutions by taking into account the predictive uncertainty of the surrogate and minimizing it during the inversion. Apart from high accuracy, Autoinverse enforces the feasibility of solutions, comes with embedded regularization, and is initialization-free. We verify our proposed method by addressing a set of real-world problems in control, fabrication, and design.
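One way to make the idea concrete (an illustrative sketch, not the authors' Autoinverse implementation): invert an ensemble surrogate by minimizing the mismatch to a target output plus a penalty on the ensemble's disagreement, so that solutions stay in regions where the surrogate is confident. The ensemble-variance uncertainty proxy, the derivative-free optimizer, and the weight lam are all placeholder choices.

    import numpy as np
    from scipy.optimize import minimize

    def uncertainty_aware_inversion(models, y_target, x0, lam=1.0):
        """Illustrative uncertainty-aware inversion of a surrogate ensemble.

        models   : list of fitted regressors with a scalar-output .predict method
        y_target : desired output of the forward process
        x0       : starting point for the search (1-D numpy array)
        """
        def objective(x):
            preds = np.array([m.predict(x.reshape(1, -1))[0] for m in models])
            mismatch = (preds.mean() - y_target) ** 2   # match the target output
            uncertainty = preds.var()                   # ensemble disagreement as uncertainty proxy
            return mismatch + lam * uncertainty
        return minimize(objective, x0, method="Nelder-Mead").x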
Aligning evaluation of uncertainty-aware forecasts to their downstream usage (Novin Shahroudi)
Every forecast is ultimately used downstream; however, forecast evaluation typically does not consider its downstream implications, e.g., how much value the downstream task would gain, or how much cost it would bear, by using that forecast. A more general phenomenon is that a better forecast according to a specific evaluation measure may not necessarily lead to a higher value (or lower cost) for the downstream task. This issue could also be rephrased as a misalignment of the forecast evaluation process with the downstream objective. One can evaluate forecasts directly on the downstream objective, but this is usually costly; moreover, a forecast might be of use to multiple downstream tasks with different objectives, so some assessment from the viewpoint of both is needed. Furthermore, in many cases, at the time of issuing a forecast or building a forecasting model, its potential downstream applications may not be accessible or even known, and a downstream task may not have access to the forecasting model but only to the already issued forecasts. In this talk, we aim to answer the question: Is it possible to align the evaluation with the downstream task's objective? And would it be possible to use existing evaluation frameworks, such as scoring rules, to achieve this goal? To answer these questions, we go through a few of our empirical results and give an overview of the directions we took to understand and address the problem of aligning forecast evaluation with a downstream task's objective.