# Statistics in Data-Centric Engineering (S-DCE)

# Seminar Series

### The Alan Turing Institute, London

## Overview

The S-DCE seminar series is a weekly online seminar of the Data-Centric Engineering Programme at The Alan Turing Institute. Talks are usually at 11am on Wednesdays, though when we have speakers from abroad we occasionally meet earlier or later to accommodate time differences. We combine internal speakers from the group at the ATI with invited speakers from all over the world, who most commonly present their recent research but occasionally give a broader survey of a topic. (We used to call ourselves a reading group, but seminar series is more what we've become.) Talks cover a variety of subjects, ranging from theoretical statistics to methodological developments to the engineering applications of machine learning. The group is open to everyone. Please contact the organisers if you would like to join our mailing list, to which we send the link for each online talk.

Past talks are archived below and, further back, at the group's old site https://dce-rg.github.io/

## Upcoming Talks

**Here is the link for the talk:**

Join Zoom Meeting: https://turing-uk.zoom.us/j/2523406049?pwd=TFpGUURncVl2WE9ZbDZSTmd3M3Fydz09

Meeting ID: 252 340 6049

Passcode: 428671

### 8 June 11:00

Alexander Terenin (University of Cambridge)

**Non-Euclidean Matérn Gaussian Processes**

In recent years, the machine learning community has become increasingly interested in learning in settings where data lives in non-Euclidean spaces, for instance in applications to physics and engineering, or other settings where it is important that symmetries are enforced. In this talk, we will develop a class of Gaussian process models defined on Riemannian manifolds and graphs, and show how to effectively perform all computations needed to train these models using standard automatic-differentiation-based methods. This gives an effective framework to deploy data-efficient interactive decision-making systems such as Bayesian optimization to settings with symmetries and invariances.

### 1 June 11:00

Siu Lun Chau (University of Oxford)

**Deconditional Downscaling with Gaussian Processes**

Refining low-resolution (LR) spatial fields with high-resolution (HR) information, often known as statistical downscaling, is challenging as the diversity of spatial datasets often prevents direct matching of observations. Yet, when LR samples are modeled as aggregate conditional means of HR samples with respect to a mediating variable that is globally observed, the recovery of the underlying fine-grained field can be framed as taking an "inverse" of the conditional expectation, namely a deconditioning problem. In this work, we propose a Bayesian formulation of deconditioning which naturally recovers the initial reproducing kernel Hilbert space formulation from Hsu and Ramos (2019). We extend deconditioning to a downscaling setup and devise an efficient conditional mean embedding estimator for multiresolution data. By treating conditional expectations as inter-domain features of the underlying field, a posterior for the latent field can be established as a solution to the deconditioning problem. Furthermore, we show that this solution can be viewed as a two-staged vector-valued kernel ridge regressor and show that it has a minimax optimal convergence rate under mild assumptions. Lastly, we demonstrate its proficiency in a synthetic and a real-world atmospheric field downscaling problem, showing substantial improvements over existing methods.

### 25 May 11:00

Veit D. Wild (University of Oxford)

**Generalized Variational Inference in Function Spaces: Gaussian Measures meet Bayesian Deep Learning**

We develop a framework for generalized variational inference in infinite-dimensional function spaces and use it to construct a method termed Gaussian Wasserstein inference (GWI). GWI leverages the Wasserstein distance between Gaussian measures on the Hilbert space of square-integrable functions in order to determine a variational posterior using a tractable optimization criterion and avoids pathologies arising in standard variational function space inference. An exciting application of GWI is the ability to use deep neural networks in the variational parametrisation of GWI, combining their superior predictive performance with the principled uncertainty quantification analogous to that of Gaussian processes. The proposed method obtains state-of-the-art performance on several benchmark datasets.

### 18 May 11:00

Guanyang Wang (Rutgers University, New Brunswick)

**Unbiased Multilevel Monte Carlo methods for intractable distributions: MLMC meets MCMC**

Constructing unbiased estimators from MCMC outputs has recently attracted much attention in the statistics and machine learning communities. However, the existing unbiased MCMC framework only works when the quantity of interest is an expectation. In this work, we propose unbiased estimators for functions of expectations. Our idea is based on the combination of the unbiased MCMC and MLMC methods. We prove the theoretical properties of our estimator. We also illustrate our estimator on several examples, including estimating the ratio of normalizing constants and the nested expectation. This is joint work with Tianze Wang.

### 11 May 11:00

Yuchen Zhu (University College London)

**Relaxing Observability Conditions in Causal Inference**

Causal inference is necessary in many social science domains for understanding the effects of interventions such as that of a new drug, or that of educational policy changes. A fundamental obstacle in achieving the consistent estimation of such effects is the existence of latent variables. Often practitioners have to deal with such latent variables using observed covariates, which can be seen as unclean records of the latent variable. Moreover, departing from traditional statistical methods, which often exhibit consistency guarantees at the cost of restrictive modelling assumptions, kernel methods are a flexible approach for nonparametric estimation for which guarantees can still be achieved. With these goals in mind, in this talk I will describe ways to formalise the problem and outline kernel-based methods to solve it.

### 4 May 11:00

Binxin Ru (University of Oxford)

**Bayesian Optimisation for Neural Architecture Search**

Bayesian optimisation (BO) has been widely used for hyperparameter optimisation but its application in neural architecture search (NAS) is limited due to the non-continuous, high-dimensional and graph-like search spaces. This talk will cover two novel methods to enable effective application of BO on NAS: 1) integrating the Weisfeiler-Lehman graph kernel into a Gaussian process surrogate to naturally handle the graph nature of architectures in a highly data-efficient manner and also afford interpretability by discovering useful network features and their corresponding impact on the network performance and 2) recasting NAS as a problem of finding the optimal network generator instead of a single optimal architecture so as to significantly reduce the search dimension, making NAS amenable to BO.

## Past Talks

### 16 Mar 11:00

Alessandro Rudi (INRIA & ENS)

**Representing non-negative functions, with applications in non-convex optimization, probability representation and beyond**

Many problems in applied mathematics are expressed naturally in terms of non-negative functions. While linear models are well suited to represent functions with output in R, being at the same time very expressive and flexible, the situation is different for the case of non-negative functions, where the existing models lack one of these good properties. In this talk we present a rather flexible and expressive model for non-negative functions. We will show direct applications in probability representation and non-convex optimization. In particular, the model allows us to derive an algorithm for non-convex optimization that is adaptive to the degree of differentiability of the objective function and achieves optimal rates of convergence. Finally, we show how to apply the same technique to other interesting problems in applied mathematics that can be easily expressed in terms of inequalities.

### 9 Mar 11:00

Kamyar Azizzadenesheli (Purdue)

**Neural Operators: Learn to Solve Partial Differential Equations**

Traditional deep neural networks are maps between finite-dimensional spaces and hence are not suitable for modeling phenomena such as those arising from the solution of partial differential equations (PDEs). We introduce neural operators that can learn operators, which are maps between infinite-dimensional spaces. By framing neural operators as non-linear compositions of kernel integrations, we establish that they are universal approximators of operators. They are independent of the resolution or grid of the training data and allow for zero-shot generalization to higher-resolution evaluations. We find that neural operators can solve turbulent fluid flow, seismic wave equation, CO2 storage, and many more hard problems with a 100000x speedup compared to numerical solvers. I will outline several applications where neural operators have shown order-of-magnitude speedups.

### 2 Mar 11:00

Tim Wolock (Imperial)

**Evaluating distributional regression strategies for modelling self-reported sexual age-mixing**

Predicting complex data with parsimonious and interpretable models is a persistent challenge in applied statistics. By combining distributional regression with flexible probability distributions, we can use simple linear models to fit datasets that conventional regression models would predict poorly. In this work, we built the four-parameter sinh-arcsinh distribution into a distributional regression framework to predict self-reported sexual partner age distribution data. These data measure the rate of sexual partnership formation across ages and are an important input to epidemiological models of HIV. To validate our approach, we conducted two model comparison studies on three geographically diverse datasets. In this talk, I will introduce the sinh-arcsinh distribution and provide an overview of the fundamentals of distributional regression, including a brief demonstration of how we have implemented our model in BRMS. I will then describe the design and results of our two model comparison studies. Finally, I will discuss how the framework we have proposed could be extended with well-known hierarchical modelling tools and how distributional regression methods could be applied more broadly.

Paper link: https://elifesciences.org/articles/68318

### 23 Feb 11:00

Samuel Livingstone (UCL)

**The Barker proposal and other locally-balanced Markov chain Monte Carlo algorithms**

I will introduce a class of π-reversible Markov processes termed ‘locally-balanced’. Any member of the class can be used to design Metropolis-Hastings algorithms. I will discuss a couple of prominent members of the class, one of which is in fact the well-known Metropolis-adjusted Langevin algorithm, and another is an approach that we call the ‘Barker proposal’, which is inspired by Barker’s alternative acceptance rate within the Metropolis-Hastings algorithm. I will explore the pros and cons of each algorithm through some theory and examples, before then discussing how to choose an optimal algorithm within the locally-balanced class. This is based on joint work with Giacomo Zanella, Jure Vogrinc and Max Hird.

### 16 Feb 14:30

**NOTE DIFFERENT TIME**

Matthew Reimherr (Penn State)

**Pure Differential Privacy in Functional Data Analysis**

We consider the problem of achieving pure differential privacy in the context of functional data analysis, or more general nonparametric statistics, where the summary of interest can naturally be viewed as an element of a function space. In this talk I will give a brief overview and motivation for differential privacy before delving into the challenges that arise in the sanitization of an infinite-dimensional summary. I will present a new mechanism for achieving privacy, called the Independent Component Laplace Process, followed by applications to mean function estimation and nonparametric density estimation.

### 9 Feb 11:00

Athénaïs Gautier (Bern)

**The Spatial Logistic Gaussian Process, and how estimating spatially dependent distributions can accelerate Bayesian inference**

When studying natural or artificial systems, it is common for the response of interest to not be fully determined by the system parameters x, but rather to be random and to follow a probability distribution that depends on x. In this talk we want to show that it is possible to estimate the underlying field based only on a finite number of observations, and that the associated uncertainty quantification can be highly instrumental for Bayesian inversion. The approach that we investigate here generalizes to spatial contexts a class of non-parametric Bayesian density models based on logistic Gaussian processes, and allows modelling (probability) density-valued fields with complex dependences on x while accommodating heterogeneous sample sizes. The main strength of the Spatial Logistic Gaussian Process (SLGP) is that it draws its flexibility from an underlying Gaussian process, allowing us to incorporate knowledge and structural information within the model while conserving the non-parametric nature of the latter. The considered models allow, for instance, performing (approximate) posterior simulations of probability density functions as well as jointly predicting multiple moments or other functionals of target distributions. We propose an implementation of the SLGP and investigate ways of using the proposed class of models to speed up Approximate Bayesian Computation (ABC) methods.

### 2 Feb 11:00

Valentin De Bortoli (Oxford)

**Diffusion Schrödinger Bridge with Applications to Score-Based Generative Modeling**

Progressively applying Gaussian noise transforms complex data distributions to approximately Gaussian. Reversing this dynamic defines a generative model. When the forward noising process is given by a Stochastic Differential Equation (SDE), Song et al. (2021) demonstrate how the time inhomogeneous drift of the associated reverse-time SDE may be estimated using score-matching. A limitation of this approach is that the forward-time SDE must be run for a sufficiently long time for the final distribution to be approximately Gaussian. In contrast, solving the Schrödinger Bridge problem (SB), i.e. an entropy-regularized optimal transport problem on path spaces, yields diffusions which generate samples from the data distribution in finite time. We present Diffusion SB (DSB), an original approximation of the Iterative Proportional Fitting (IPF) procedure to solve the SB problem, and provide theoretical analysis along with generative modeling experiments. The first DSB iteration recovers the methodology proposed by Song et al. (2021), with the flexibility of using shorter time intervals, as subsequent DSB iterations reduce the discrepancy between the final-time marginal of the forward (resp. backward) SDE with respect to the prior (resp. data) distribution. Beyond generative modeling, DSB offers a widely applicable computational optimal transport tool as the continuous state-space analogue of the popular Sinkhorn algorithm (Cuturi, 2013).

### 26 Jan 11:00

Harita Dellaporta (Warwick)

**Robust Bayesian Inference for Simulator-based Models via the MMD Posterior Bootstrap**

Simulator-based models are models for which the likelihood is intractable but simulation of synthetic data is possible. They are often used to describe complex real-world phenomena, and as such can often be misspecified in practice. Unfortunately, existing Bayesian approaches for simulators are known to perform poorly in those cases. In this paper, we propose a novel algorithm based on the posterior bootstrap and maximum mean discrepancy estimators. This leads to a highly-parallelisable Bayesian inference algorithm with strong robustness properties. This is demonstrated through an in-depth theoretical study which includes generalisation bounds and proofs of frequentist consistency and robustness of our posterior. The approach is then assessed on a range of examples including a g-and-k distribution and a toggle-switch model.

### 15 Dec 11:00

Amy Parkes (Southampton)

**Error Measures for Machine Learning Regression to Approximate the Ground Truth**

As machine learning technology improves, it is increasingly relied upon when making significant decisions which require a high level of trust. Accuracy and interpretability are paramount for trust in regression methods, which comprise a large portion of the field. To apply these methods with confidence there needs to be a certainty that they have modelled the ground truth of a dataset: the correct input-output relationships. Conventional regression error measures, however, do not ensure that the correct relationships are modelled, as they only require accurate point predictions to assign low error to a method. A case study of power prediction for merchant vessels is used to illustrate the problem, where accurate prediction and correct input-output relationship modelling are required, although there is limited understanding of these input-output relationships. A new error measure, the Mean Fit to Median Error, is presented which ensures networks approximate the conditional averages and is applicable to any dataset. Networks reporting low Mean Fit to Median errors model more consistent and correct input-output relationships and are robust to areas of sparse data.

### 17 Nov 11:00

David Bossens (Southampton)

**Beyond MDPs: reinforcement learning in unknown long-term environments**

Traditionally, reinforcement learning is considered within the Markov Decision Process (MDP) framework. This presentation discusses challenges that come up with applying reinforcement learning within unknown long-term environments, including exploration, long-term dependencies, task sequences, and long-term safety constraints. The presentation then proposes solutions that go significantly beyond the traditional MDP framework, including self-improvement, lifelong reinforcement learning, and constrained MDPs.

### 10 Nov 11:00

Rachel Prudden (Met Office & Exeter)

**Random fields for generative multi-scale modelling**

Gaussian random fields are a commonly used method in spatial statistics. I will give an overview of how they can be applied to problems involving multiple spatial scales, such as super-resolution, and discuss extensions to non-Gaussian data.

### 03 Nov 11:00

Jonathan Schmidt (Tübingen)

**A Probabilistic State Space Model for Joint Inference from Differential Equations and Data**

Mechanistic models with differential equations are a key component of scientific applications of machine learning. Inference in such models is usually computationally demanding, because it involves repeatedly solving the differential equation. The main problem here is that the numerical solver is hard to combine with standard inference techniques. Recent work in probabilistic numerics has developed a new class of solvers for ordinary differential equations (ODEs) that phrase the solution process directly in terms of Bayesian filtering. We here show that this allows such methods to be combined very directly, with conceptual and numerical ease, with latent force models in the ODE itself. It then becomes possible to perform approximate Bayesian inference on the latent force as well as the ODE solution in a single, linear complexity pass of an extended Kalman filter / smoother - that is, at the cost of computing a single ODE solution. We demonstrate the expressiveness and performance of the algorithm by training, among others, a non-parametric SIRD model on data from the COVID-19 outbreak.

### 20 Oct 11:00

George Wynne (Imperial)

**MMD and KSD, two sides of the same coin?**

Maximum Mean Discrepancy (MMD) and Kernel Stein Discrepancy (KSD) are two kernel-based non-parametric methods for forming a discrepancy between probability measures. Their study has been a very active area in statistical machine learning and increasingly so in computational statistics. The idea for both these methodologies revolves around using kernels to facilitate easy-to-estimate discrepancies which can then be used as estimators in a wide range of tasks, such as two-sample testing, goodness-of-fit testing, parameter inference, measure transport and MCMC output quality assessment, to name but a few. So far, though, MMD has enjoyed much wider theoretical investigation than KSD, mostly due to the KSD formulation being somewhat more complicated. The aim of this talk is to outline how MMD and KSD are actually more related than one might think. This relationship can then be leveraged to provide conditions for when KSD can separate measures in the general setting where the base space is a separable Hilbert space. This generality encompasses distributions over function spaces, which will be used in numerical examples.

### 13 Oct 11:00

Simulation models of scientific interest often lack a tractable likelihood function, precluding standard likelihood-based statistical inference. As a result, likelihood-free approaches have emerged in recent decades as a means of performing statistical inference for such models, which typically involve comparing simulated and observed data in some fashion. An example is approximate Bayesian computation, in which the pertinence of parameter settings is assessed by some meaningful notion of distance between the simulated and observed data. Time-series data is a particular challenge in this respect, often being high-dimensional and complex in structure. In this talk, we will discuss the use of path signatures as a means of performing likelihood-free inference with time-series simulators. We will first discuss the problem of likelihood-free inference for simulation models and the properties of the path signature. We will then discuss their use in traditional approaches to likelihood-free inference, such as approximate Bayesian computation, and in more recently developed approaches based on the likelihood-ratio trick. In each case, we will present experimental results and discuss some of the properties of path signatures which make them a desirable tool for learning with time-series data.

### 29 Sept 14:30

### 28 July 11:00

Johanna Meier (Hannover)

**Discrepancy-based inference for intractable generative models using quasi-Monte Carlo**

Intractable generative models are models for which the likelihood is unavailable but sampling is possible. Most approaches to parameter inference in this setting require the computation of some discrepancy between the data and the generative model. This is for example the case for minimum distance estimation and approximate Bayesian computation. These approaches require sampling a high number of realisations from the model for different parameter values, which can be a significant challenge when simulating is an expensive operation. In this paper, we propose to enhance this approach by enforcing "sample diversity" in simulations of our models. This will be implemented through the use of quasi-Monte Carlo (QMC) point sets. Our key results are sample complexity bounds which demonstrate that, under smoothness conditions on the generator, QMC can significantly reduce the number of samples required to obtain a given level of accuracy when using three of the most common discrepancies: the maximum mean discrepancy, the Wasserstein distance, and the Sinkhorn divergence. This is complemented by a simulation study which highlights that an improved accuracy is sometimes also possible in some settings which are not covered by the theory.

### 21 July 11:00

Takuo Matsubara (Newcastle & ATI)

**Robust Generalised Bayesian Inference for Intractable Likelihoods**

Generalised Bayesian inference updates prior beliefs using a loss function, rather than a likelihood, and can therefore be used to confer robustness against possible misspecification of the likelihood. Here we consider generalised Bayesian inference with a Stein discrepancy as a loss function, motivated by applications in which the likelihood contains an intractable normalisation constant. In this context, the Stein discrepancy circumvents evaluation of the normalisation constant and produces generalised posteriors that are either closed form or accessible using standard Markov chain Monte Carlo. On a theoretical level, we show consistency, asymptotic normality, and bias-robustness of the generalised posterior, highlighting how these properties are impacted by the choice of Stein discrepancy. Then, we provide numerical experiments on a range of intractable distributions, including applications to kernel-based exponential family models and non-Gaussian graphical models.

### 7 July 11:00

Toni Karvonen (ATI)

**Maximum likelihood estimation of the length-scale parameter in Gaussian process regression**

Maximum likelihood estimation is often used to select hyperparameters of the covariance kernel in Gaussian process regression. Not much is known about how these estimates behave in the deterministic interpolation regime, where the data are assumed to have been generated without noise. We consider the length-scale parameter of a stationary kernel, which has a substantial effect on the predictions of the Gaussian process model, and show that its maximum likelihood estimate is very sensitive to small perturbations in the data. Specifically, under the assumption that the stationary kernel induces a Sobolev space (e.g., a Matérn kernel), the maximum likelihood estimate is infinite if and only if the data could have been generated by a constant function. We also discuss several common additional modelling choices which do not alleviate this problem.

### 30 June 11:00

Juan Kuntz Nussio (Warwick)

**Product-form estimators: exploiting independence to scale up Monte Carlo**

We introduce a class of Monte Carlo estimators for product-form target distributions that aim to overcome the rapid growth of variance with dimension often observed for standard estimators. We identify them with a class of generalized U-Statistics, and thus establish their unbiasedness, consistency, and asymptotic normality. Moreover, we show that they achieve lower variances than their conventional counterparts given the same number of samples drawn from the target, investigate the gap in variance via several examples, and identify the situations in which the difference is most, and least, pronounced. We further study the estimators' computational cost and delineate the settings in which they are most efficient. We illustrate their utility beyond the setting of product-form distributions by detailing two simple extensions (one to targets that are mixtures of product-form distributions and another to targets that are absolutely continuous with respect to product-form distributions) and conclude by discussing further possible uses.

### 23 June 11:00

We propose a new method for Bayesian prediction that caters for models with a large number of parameters and is robust to model misspecification. Given a class of high-dimensional (but parametric) predictive models, this new approach constructs a posterior predictive using a variational approximation to a loss-based, or Gibbs, posterior that is directly focused on predictive accuracy. The theoretical behavior of the new prediction approach is analyzed and a form of optimality demonstrated. Applications to both simulated and empirical data using high-dimensional Bayesian neural network and autoregressive mixture models demonstrate that the approach provides more accurate results than various alternatives, including misspecified likelihood-based predictions.

### 16 June 11:00

This talk is about Stein optimal transport (Stein-OT), a novel methodology for Bayesian inference that pushes an ensemble of particles along a predefined curve of tempered probability distributions. The driving vector field is chosen from a reproducing kernel Hilbert space and can equivalently be obtained from either a suitable kernel ridge regression formulation or as an infinitesimal optimal transport map. The update equations of Stein-OT resemble those of Stein variational gradient descent (SVGD), but introduce a time-varying score function as well as specific weights attached to the particles. I will discuss the geometric underpinnings of Stein-OT and SVGD, and -- time permitting -- connections to MCMC and the theory of large deviations.

### 09 June 11:00

Maud Lemercier (Warwick)

**Higher Order Mean Embeddings for Stochastic Processes**

Stochastic processes are random variables with values in some space of paths. However, reducing a stochastic process to a path-valued random variable ignores its filtration, i.e. the flow of information carried by the process through time. By conditioning the process on its filtration, we introduce a family of higher order kernel mean embeddings (KMEs) that generalizes the notion of KME and captures additional information related to the filtration. We derive empirical estimators for the associated higher order maximum mean discrepancies (MMDs) and construct a filtration-sensitive kernel two-sample test able to pick up information that gets missed by the standard MMD test. In addition, leveraging our higher order MMDs, we construct a family of universal kernels on stochastic processes that allows us to solve real-world optimal stopping problems in quantitative finance (such as the pricing of American options) via classical kernel-based regression methods.

### 02 June 11:00

Lorenzo Pacchiardi (Oxford)

**Generalized Bayesian Likelihood-Free Inference Using Scoring Rules Estimators**

We propose a framework for Bayesian Likelihood-Free Inference (LFI) based on Generalized Bayesian Inference using scoring rules (SRs). SRs are used to evaluate probabilistic models given an observation; a proper SR is minimised in expectation when the model corresponds to the data generating process for the observations. Using a strictly proper SR, for which the above minimum is unique, ensures posterior consistency of our method. Further, we prove finite sample posterior consistency and outlier robustness of our posterior for the Kernel and Energy Scores. As the likelihood function is intractable for LFI, we employ consistent estimators of SRs using model simulations in a pseudo-marginal MCMC; we show the target of such chain converges to the exact SR posterior by increasing the number of simulations. Furthermore, we note popular LFI techniques such as Bayesian Synthetic Likelihood (BSL) can be seen as special cases of our framework using only proper (but not strictly so) SR. We empirically validate our consistency and outlier robustness results and show how related approaches do not enjoy these properties. Practically, we use the Energy and Kernel Scores, but our general framework sets the stage for extensions with other scoring rules.

### 26 May 11:00

Hans Kersting (INRIA Paris)

**Uncertainty-Aware Numerical Solutions of ODEs by Bayesian Filtering**

Numerical approximations can be regarded as statistical inference, if one interprets the solution of the numerical problem as a parameter in a statistical model whose likelihood links it to the information ('data') available from evaluating functions. This view is advocated by the field of Probabilistic Numerics and has already yielded two successes: Bayesian Optimization and Bayesian Quadrature. In an analogous manner, we construct a Bayesian probabilistic-numerical method for ODEs. To this end, we construct a probabilistic state space model for ODEs which enables us to borrow the machinery of Bayesian filtering. This unlocks the application of all Bayesian filters from signal processing to ODEs, which we name ODE filters. We theoretically analyse the convergence rates of the most elementary one, the Kalman ODE filter, and discuss its uncertainty quantification. Lastly, we demonstrate how employing these ODE filters as forward simulators engenders new ODE inverse problem solvers that outperform their classical 'likelihood-free' counterparts.

### 19 May 11:00

Solving decision-making problems in a variety of domains such as healthcare or operations research requires experimentation. By performing interventions one can understand how a system behaves when an action is taken and thus infer the cause-effect relationships of a phenomenon. Experiments are usually expensive, time-consuming, and may present ethical issues. Therefore, researchers generally have to trade off cost, time, and other practical considerations to decide which experiments to conduct in order to learn about a system. In this talk I will present two methodologies that, by linking causal inference, experimental design and Gaussian process (GP) modelling, allow one to efficiently learn the causal effects in a graph and identify the optimal intervention to perform. Firstly, I will show how to construct a multi-task causal GP model, the DAG-GP model, which captures the non-trivial correlation structure across different experimental outputs. By sharing experimental information, the DAG-GP model accurately estimates the causal effects in a variety of experimental settings while enabling proper uncertainty quantification. I will then demonstrate how this model, and more generally GP models, can be used within decision-making algorithms to choose experiments to perform. In particular, I will introduce the Causal Bayesian Optimization algorithm and show how incorporating knowledge of the causal graph in Bayesian Optimization improves the ability to reason about optimal decision making while decreasing the optimization cost and avoiding suboptimal solutions.

### 12 May 11:00

Christian Fröhlich (University of Tübingen, Germany)

**Bayesian Quadrature on Riemannian Data Manifolds** [**URL**]

Riemannian manifolds provide a principled way to model nonlinear geometric structure inherent in data. A Riemannian metric on said manifolds determines geometry-aware shortest paths and provides the means to define statistical models accordingly. However, these operations are typically computationally demanding. To ease this computational burden, we advocate probabilistic numerical methods for Riemannian statistics. In particular, we focus on Bayesian quadrature (BQ) to numerically compute integrals over normal laws on Riemannian manifolds learned from data. In this task, each function evaluation relies on the solution of an expensive initial value problem. We show that by leveraging both prior knowledge and an active exploration scheme, BQ significantly reduces the number of required evaluations and thus outperforms Monte Carlo methods on a wide range of integration problems. As a concrete application, we highlight the merits of adopting Riemannian geometry with our proposed framework on a nonlinear dataset from molecular dynamics.
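For readers unfamiliar with Bayesian quadrature, the following sketch shows the basic computation in the simplest Euclidean setting: estimating the integral of a function against a standard normal law under a zero-mean GP prior with an RBF kernel, for which the kernel mean is available in closed form. Everything here (the function name `bayesian_quadrature`, the fixed node grid, the lengthscale and the test integrand) is an assumption for illustration; the talk's setting of normal laws on learned Riemannian manifolds with active node selection is considerably more involved.

```python
import numpy as np

def bayesian_quadrature(f, nodes, lengthscale=1.0, jitter=1e-10):
    """Posterior mean of the integral of f against N(0, 1) under a zero-mean GP prior with an RBF kernel."""
    x = np.asarray(nodes, dtype=float)
    l2 = lengthscale**2
    # Gram matrix of the RBF kernel at the evaluation nodes.
    K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * l2))
    # Kernel mean z_i = integral of k(x_i, x') N(x'; 0, 1) dx', closed form for the RBF kernel.
    z = np.sqrt(l2 / (l2 + 1.0)) * np.exp(-x**2 / (2 * (l2 + 1.0)))
    # Quadrature weights K^{-1} z; a small jitter keeps the solve numerically stable.
    weights = np.linalg.solve(K + jitter * np.eye(x.size), z)
    return weights @ f(x)

# Toy usage: the second moment of N(0, 1) is exactly 1.
estimate = bayesian_quadrature(lambda x: x**2, np.linspace(-4.0, 4.0, 13))
```

Only 13 function evaluations are used here, which hints at why the approach is attractive when each evaluation is expensive.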

### 28 Apr 11:00

The prior distribution on parameters of a likelihood is the usual starting point for Bayesian uncertainty quantification. In this paper, we present a different perspective. Given a finite data sample of size n from an infinite population, we focus on the missing remainder of the population as the source of statistical uncertainty, with the parameter of interest being known precisely given the entire population. We argue that the foundation of Bayesian inference is to assign a predictive distribution on the remainder of the population conditional on the observed sample, which then induces a distribution on the parameter of interest. Demonstrating an application of martingales, Doob shows that choosing the Bayesian predictive distribution returns the conventional posterior as the distribution of the parameter. Taking this as our cue, we relax the predictive machine, avoiding the need for the predictive to be derived solely from the usual prior to posterior to predictive density formula. We introduce the martingale posterior distribution, which returns Bayesian uncertainty directly on any statistic of interest without the need for the likelihood and prior, and this distribution can be sampled through a computational scheme we name predictive resampling. To that end, we introduce new predictive methodologies for multivariate density estimation, regression and classification that build upon recent work on bivariate copulas.

### 21 Apr 11:00

Computing the expectation of some kernel function is ubiquitous in machine learning, from the classical theory of support vector machines to exploiting kernel embeddings of distributions in applications such as probabilistic modeling, statistical inference, causal discovery, and deep learning. In all these scenarios, we tend to resort to Monte Carlo estimates as expectations of kernels are intractable in general. In this work, we characterize the conditions under which we can compute expected kernels exactly and efficiently, by leveraging recent advances in probabilistic circuit representations. We first construct a circuit representation for kernels and propose an approach to such tractable computation. We then demonstrate possible advancements for kernel embedding frameworks by exploiting tractable expected kernels to derive new algorithms for two challenging scenarios: 1) reasoning under missing data with kernel support vector regressors; 2) devising a collapsed black-box importance sampling scheme. Finally, we empirically evaluate both algorithms and show that they outperform standard baselines on a variety of datasets.

### 31 Mar - 14 Apr

Easter Break

### 24 Mar 11:00

Kernelized Stein discrepancy (KSD), though extensively used in goodness-of-fit tests and model learning, suffers from the curse of dimensionality. We address this issue by proposing the sliced Stein discrepancy and its scalable and kernelized variants, which employ kernel-based test functions defined on the optimal one-dimensional projections instead of the full input in high dimensions. When applied to goodness-of-fit tests, extensive experiments show the proposed discrepancy significantly outperforms KSD and various baselines in high dimensions. For model learning, we show its advantages by training an independent component analysis model and comparing with existing Stein discrepancy baselines. We further propose a novel particle inference method called sliced Stein variational gradient descent (S-SVGD) which alleviates the mode-collapse issue of SVGD in training variational autoencoders.

### 10 Mar 11:00

This article focuses on numerical issues in maximum likelihood parameter estimation for Gaussian process regression (GPR). It investigates the origin of these issues and provides simple but effective improvement strategies. The work targets a basic problem, but a host of studies, particularly in the literature on Bayesian optimization, rely on off-the-shelf GPR implementations. For the conclusions of these studies to be reliable and reproducible, robust GPR implementations are critical.

### 03 Mar 11:00

Zhuo Sun (University College London, UK) [**URL**]

**Amortized Bayesian Prototype Meta-learning: A new probabilistic meta-learning approach to few-shot image classification**

Probabilistic meta-learning methods have recently achieved impressive success in few-shot image classification. However, they introduce a huge number of random variables for neural network weights and thus severe computational and inferential challenges. In this paper, we propose a novel probabilistic meta-learning method called amortized Bayesian prototype meta-learning. In contrast to previous methods, we introduce only a small number of random variables for latent class prototypes rather than a huge number for network weights; we learn to learn the posterior distributions of these latent prototypes in an amortized inference way with no need for an extra amortization network, such that we can easily approximate their posteriors conditional on few labeled samples, whether at the meta-training or meta-testing stage. The proposed method can be trained end-to-end without any pre-training. Compared with other probabilistic meta-learning methods, our proposed approach is more interpretable with far fewer random variables, while still being able to achieve competitive performance for few-shot image classification problems on various benchmark datasets. Its excellent robustness and predictive uncertainty are also demonstrated through ablation studies.

### 24 Feb 11:00

Deep generative models have shown great success when it comes to fitting probabilistic models to complex data. Applications range from computer vision and speech to biogenetic and climate science. Such data is often naturally described on Riemannian manifolds such as spheres, tori, and hyperbolic spaces. Additionally, even when the data live on a Euclidean space, it may have a latent non-Euclidean geometry. Yet, most deep generative models implicitly assume a flat geometry, making them either misspecified or potentially ill-suited to these situations. To tackle such issues, we introduce Poincaré Variational Auto-Encoders and Riemannian Continuous Normalizing Flows respectively modelling data with underlying hierarchical structure, and parametrising probability measures on smooth manifolds.

### 17 Feb 11:00

The Bayesian treatment of neural networks dictates that a prior distribution is specified over their weight and bias parameters. This poses a challenge because modern neural networks are characterized by a large number of parameters, and the choice of these priors has an uncontrolled effect on the induced functional prior, which is the distribution of the functions obtained by sampling the parameters from their prior distribution. We argue that this is a hugely limiting aspect of Bayesian deep learning, and this work tackles this limitation in a practical and effective way. Our proposal is to reason in terms of functional priors, which are easier to elicit, and to "tune" the priors of neural network parameters in a way that they reflect such functional priors. Gaussian processes offer a rigorous framework to define prior distributions over functions, and we propose a novel and robust framework to match their prior with the functional prior of neural networks based on the minimization of their Wasserstein distance. We provide vast experimental evidence that coupling these priors with scalable Markov chain Monte Carlo sampling offers systematically large performance improvements over alternative choices of priors and state-of-the-art approximate Bayesian deep learning approaches. We consider this work a considerable step in the direction of making the long-standing challenge of carrying out a fully Bayesian treatment of neural networks, including convolutional neural networks, a concrete possibility.

### 03 Feb 14:00

Jean Honorio (Purdue University, US) [**URL**]

**Theoretical Foundations of Combinatorial Problems in Machine Learning**

Structured prediction can be thought of as a simultaneous prediction of multiple labels. This is often done by maximizing a score function on the space of labels, which decomposes as a sum of pairwise and unary potentials. The above is naturally modeled with a graph, where edges and vertices are related to pairwise and unary potentials, respectively. We consider the generative process proposed by Globerson et al. 2015, and apply it to general connected graphs. We analyze the structural conditions of the graph that allow for the exact recovery of the labels. Our results show that exact recovery is possible and achievable in polynomial time for a large class of graphs. In particular, we show that graphs that are bad expanders can be exactly recovered by adding small edge perturbations coming from the Erdős-Rényi model.

We also extend our results to account for fairness. In contrast to the known trade-offs between fairness and model performance, the addition of the fairness constraint improves the probability of exact recovery. We effectively explain this phenomenon and empirically show how graphs with poor expansion properties, such as grids, are now capable of achieving exact recovery with high probability.

The two results above serve as a gentle introduction to a unifying framework, which uses the power of convex relaxations, Karush-Kuhn-Tucker conditions, primal-dual certificates and concentration inequalities. This framework has allowed us to produce novel algorithms for several NP-hard combinatorial problems, such as learning Bayesian networks, graphical games, learning and inference in structured prediction, and community detection.

### 27 Jan 11:00

Deep Gaussian processes with importance weighted variational inference are a powerful model and inference scheme which can represent complex, non-Gaussian marginal distributions while maintaining many of the advantages of standard GPs. However, we highlight a potential shortcoming of this approach: the signal-to-noise ratio of the gradient estimates of specific variational parameters can degrade during training, leading to a poorer variational approximation and thus worse predictive performance. In this talk I will give background information on deep Gaussian processes and importance weighted variational inference, and discuss why we might be interested in them. I will then present our investigation into the degraded signal-to-noise ratio during training, providing both theoretical and empirical evidence of the issue, and demonstrating how we can solve it.

### 20 Jan 11:00

Calibrating stochastic radio channel models to new measurement data is challenging when the likelihood function is intractable. The standard approach to this problem involves sophisticated algorithms for extraction and clustering of multipath components, after which point estimates of the model parameters can be obtained using specialized estimators. We propose a likelihood-free calibration method using approximate Bayesian computation. The method is based on the maximum mean discrepancy, which is a notion of distance between probability distributions. Our method not only bypasses the need to implement any high-resolution or clustering algorithm, but is also automatic in that it does not require any additional input or manual pre-processing from the user. It also has the advantage of returning an entire posterior distribution on the value of the parameters, rather than a simple point estimate. We evaluate the performance of the proposed method by fitting two different stochastic channel models, namely the Saleh-Valenzuela model and the propagation graph model, to both simulated and measured data. The proposed method is able to estimate the parameters of both models accurately in simulations, as well as when applied to 60 GHz indoor measurement data.
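The combination of approximate Bayesian computation with the maximum mean discrepancy can be sketched on a toy problem as follows. This is a plain rejection sampler with a Gaussian kernel and a one-dimensional Gaussian location model standing in for the simulator; the function names `mmd2` and `abc_mmd`, the kernel lengthscale, the uniform prior and the threshold `eps` are all assumptions for illustration, and the method presented in the talk for radio channel models is more sophisticated.

```python
import numpy as np

def mmd2(x, y, lengthscale=1.0):
    """Biased (V-statistic) estimate of the squared MMD between 1-D samples x and y, Gaussian kernel."""
    def k(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * lengthscale**2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def abc_mmd(observed, simulate, sample_prior, n_draws, eps, rng):
    """Rejection ABC: keep prior draws whose simulated data lie within eps of the observations in MMD^2."""
    accepted = []
    for _ in range(n_draws):
        theta = sample_prior(rng)
        simulated = simulate(theta, observed.size, rng)
        if mmd2(observed, simulated) < eps:
            accepted.append(theta)
    return np.array(accepted)

# Hypothetical toy model: data are N(theta, 1) with unknown location theta.
rng = np.random.default_rng(0)
true_theta = 2.0
observed = rng.normal(true_theta, 1.0, size=50)
posterior = abc_mmd(
    observed,
    simulate=lambda theta, n, r: r.normal(theta, 1.0, size=n),
    sample_prior=lambda r: r.uniform(-5.0, 5.0),
    n_draws=1000,
    eps=0.05,
    rng=rng,
)
```

The accepted values in `posterior` form an approximate posterior sample for the parameter, rather than a single point estimate, and no summary statistics or pre-processing of the data are needed beyond the kernel.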