JST CREST 

Yoshida, Khan Teams 

Joint workshop

Date: September 14th 2023

Registration and location

Physical: Room 123, Graduate School of Mathematical Sciences, University of Tokyo, Komaba campus. Please register at https://forms.gle/nfdvvq9hYGm8WK687

Zoom: https://us06web.zoom.us/meeting/register/tZUvcOCoqjwjEtItXnG_tXXvezbDDKQVcAA2

Program

11:00 - 12:00 Overview

Hiroki Masuda / Emtiyaz Khan


12:00 - 14:00 Lunch & coffee


14:00 - 15:30 Talks - Khan Group


15:30 - 16:00 Coffee break


16:00 - 17:30 Talks - Yoshida Group 


17:30 - 18:00 Wrap-up coffee

Abstracts for Overview 11:00 - 12:00

Hiroki Masuda (University of Tokyo)

Project overview in brief

We present a brief overview of our ongoing project, which consists of five research groups. We promote research in various fields related to time series data. The main, common theme is to create and develop a comprehensive system for the statistical modeling and analysis of massive dependent data, grounded in probability theory and mathematical statistics. Statistical and simulation techniques for stochastic processes, built on rigorous mathematics, make it possible to explore and model forms of dependence that traditional time series analysis cannot address, enabling accurate prediction and stochastic control.

Emtiyaz Khan (RIKEN AIP) 

Overview of the Bayes-duality project

Humans and animals have a natural ability to autonomously learn and quickly adapt to their surroundings. How can we design machines that do the same? We present the Bayes-duality principle to solve this problem. The principle uses the dual perspective of the Bayesian learning rule and yields new mechanisms for knowledge transfer in learning machines.
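For reference, one common way to write the Bayesian learning rule underlying this dual perspective (the exact form used in the project may differ) is, for an exponential-family posterior $q_{\lambda}$ with natural parameter $\lambda$ and loss $\ell$,

\[
\lambda \;\leftarrow\; \lambda \;-\; \rho\,\tilde{\nabla}_{\lambda}\Big(\mathbb{E}_{q_{\lambda}}\!\big[\ell(\theta)\big] \;-\; \mathcal{H}(q_{\lambda})\Big),
\]

where $\tilde{\nabla}_{\lambda}$ denotes the natural gradient and $\mathcal{H}$ the entropy; different choices of $q_{\lambda}$ and of the natural-gradient approximation recover many standard learning algorithms.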

Abstracts for Khan Group Session 14:00 - 15:30

Thomas Möllenhoff (RIKEN AIP)


Sharpness-Aware Minimization as an Optimal Relaxation of Bayes


Sharpness-aware minimization (SAM) and related adversarial deep-learning methods can drastically improve generalization, but their underlying mechanisms are not yet fully understood. In this talk, I will show how SAM can be interpreted as a relaxation of the Bayes objective in which the expected negative loss is replaced by its optimal convex lower bound, obtained using the so-called Fenchel biconjugate. This connection enables a new Adam-like extension of SAM that automatically obtains reasonable uncertainty estimates, while sometimes also improving accuracy.
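As a concrete illustration of the basic SAM update, here is a minimal sketch on a toy least-squares problem, assuming plain NumPy; the names sam_step, rho, and lr are illustrative only and not taken from the paper:

```python
# A minimal sketch of one SAM step on a toy quadratic loss (illustrative only).
import numpy as np

def loss_grad(w, X, y):
    """Gradient of the least-squares loss 0.5 * ||X w - y||^2."""
    return X.T @ (X @ w - y)

def sam_step(w, X, y, lr=0.1, rho=0.05):
    g = loss_grad(w, X, y)                        # gradient at the current weights
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # ascent step towards the "sharp" neighbour
    g_sharp = loss_grad(w + eps, X, y)            # gradient at the perturbed weights
    return w - lr * g_sharp                       # descend using the perturbed gradient

rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 4)), rng.normal(size=32)
w = np.zeros(4)
for _ in range(100):
    w = sam_step(w, X, y)
print(w)
```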



Gian Maria Marconi (RIKEN AIP)


Second-order Optimization via Bayes (SOBA)


Second-order methods struggle to perform efficiently in deep learning: they are expensive, unstable, and difficult to implement. We propose SOBA, a second-order optimizer for deep learning that minimizes a Bayesian objective, improving accuracy over Adam and SGD at no additional cost. SOBA is efficient and easy to implement and use. By using fast Hessian estimates, SOBA reveals the fundamental connection between second-order information and the covariance of the model parameters, which can be used to estimate Bayesian uncertainty for free. We show that the uncertainty obtained by SOBA is of high quality, comparable to dedicated Bayesian methods and much better than that of the commonly used SGD and Adam.
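The abstract does not spell out the update, but a minimal sketch in the spirit it describes, a diagonal variational/online-Newton style step that tracks a mean m and a diagonal precision s, with s doubling as an uncertainty estimate, might look as follows; all names and hyperparameters are assumptions for illustration, not the actual SOBA algorithm:

```python
# Illustrative diagonal second-order (variational online-Newton style) update,
# NOT the published SOBA update: mean m and diagonal precision s are maintained,
# and 1/sqrt(N*s) serves as a free posterior-uncertainty estimate.
import numpy as np

def grad_and_hess_diag(w, X, y):
    """Gradient and diagonal Gauss-Newton Hessian of 0.5 * ||X w - y||^2."""
    r = X @ w - y
    return X.T @ r, np.sum(X * X, axis=0)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(64, 5)), rng.normal(size=64)
m, s = np.zeros(5), np.ones(5)                    # posterior mean and diagonal precision
lr, beta, delta, N = 0.1, 0.1, 1e-3, len(y)

for _ in range(200):
    w = m + rng.normal(size=5) / np.sqrt(N * s)   # sample weights from N(m, (N s)^-1)
    g, h = grad_and_hess_diag(w, X, y)
    s = (1 - beta) * s + beta * (h / N + delta)   # online Hessian (precision) estimate
    m = m - lr * (g / N + delta * m) / s          # preconditioned, Newton-like step

print("mean:", m)
print("posterior std (free uncertainty):", 1.0 / np.sqrt(N * s))
```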



Peter Nickl (RIKEN AIP)


The Memory Perturbation Equation: Understanding Model’s Sensitivity to Data


Understanding a model's sensitivity to its training data is crucial not only for safe and robust operation but also for future adaptations. We present the memory-perturbation equation (MPE), which relates a model's sensitivity to perturbations in its training data. Derived using Bayesian principles, the MPE unifies existing influence measures, generalizes them to a wide variety of models and algorithms, and unravels useful properties regarding sensitivity. Our empirical results show that sensitivity estimates obtained during training can faithfully predict generalization on unseen test data and avoid the need for expensive retraining. The equation is useful for future research on robust and adaptive learning.
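As a point of reference for the influence measures the MPE unifies, the classical leave-one-out (influence-function) approximation estimates the effect of removing example $i$ from training as

\[
\hat{\theta}_{\setminus i} - \hat{\theta} \;\approx\; H(\hat{\theta})^{-1}\,\nabla_{\theta}\,\ell_i(\hat{\theta}),
\]

where $H(\hat{\theta})$ is the Hessian of the full training loss at the estimate $\hat{\theta}$; the MPE recovers estimates of this type as special cases and extends them to posterior (Bayesian) perturbations.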



Geoffrey Wolfer (RIKEN AIP)


Improved Estimation of Relaxation Time in Non-reversible Markov Chains


The pseudo-spectral gap of a non-reversible ergodic Markov chain, introduced by Paulin [2015], is an important parameter measuring the asymptotic rate of convergence to stationarity. We characterize, up to logarithmic factors, the minimax trajectory length for estimating the pseudo-spectral gap of an ergodic Markov chain to within a constant multiplicative error. Our result recovers the known complexity of estimating the absolute spectral gap in the reversible setting [Levin and Peres, 2016; Hsu et al., 2019], and nearly resolves the problem in the general, non-reversible setting. What is more, we strengthen the known empirical procedure by making it fully adaptive to the data, tightening the confidence intervals and reducing the computational complexity. Along the way, we derive new properties of the pseudo-spectral gap and introduce the notion of a reversible dilation of a stochastic matrix.
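For reference, Paulin's pseudo-spectral gap of a kernel $P$ with stationary distribution $\pi$ is

\[
\gamma_{\mathrm{ps}} \;=\; \max_{k \ge 1} \frac{\gamma\big((P^{\ast})^{k} P^{k}\big)}{k},
\]

where $P^{\ast}$ is the adjoint (time reversal) of $P$ in $L^{2}(\pi)$ and $\gamma(\cdot)$ denotes the spectral gap of the resulting reversible kernel; the relaxation time in the title refers to its reciprocal $1/\gamma_{\mathrm{ps}}$.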

Abstracts for Yoshida Group Session 16:00 - 17:30

Kengo Kamatani (ISM)

Scaling limit of Markov chain/process Monte Carlo methods

The scaling limit analysis of Markov chain Monte Carlo methods has been a topic of intensive study in recent decades. The analysis entails determining the rate at which the Markov chain converges to its limiting process, typically a Langevin diffusion, and provides useful guidelines for parameter tuning. Since the seminal work of Roberts et al. in 1997, numerous researchers have generalized the original assumptions and extended the results to more sophisticated methods. Recently, there has been growing interest in piecewise deterministic Markov processes as Monte Carlo integration methods, particularly the Bouncy Particle Sampler and the Zig-Zag Sampler. This talk will focus on determining the scaling limits for both algorithms and provide a criterion for tuning the Bouncy Particle Sampler. This is joint work with J. Bierkens (TU Delft) and G. O. Roberts (Warwick).
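To give a flavour of the piecewise deterministic samplers discussed in the talk, here is a minimal sketch of the one-dimensional Zig-Zag sampler for a standard Gaussian target; the switching-time inversion below is exact only for this toy target, and the scaling-limit results of the talk are not reproduced here:

```python
# Minimal 1D Zig-Zag sampler for pi(x) ~ exp(-x^2/2) (illustrative sketch).
import numpy as np

def switch_time(x, v, rng):
    """First event time of the Poisson process with rate max(0, v*(x + v*t))."""
    e = rng.exponential()            # Exp(1) draw
    a = v * x                        # rate at time t is max(0, a + t), since v^2 = 1
    if a >= 0:
        return -a + np.sqrt(a * a + 2.0 * e)
    return -a + np.sqrt(2.0 * e)     # rate is zero until t = -a, then grows linearly

rng = np.random.default_rng(0)
x, v = 0.0, 1.0
t, t_next_sample, dt = 0.0, 0.0, 0.1   # record the trajectory on a regular time grid
samples = []

while len(samples) < 20000:
    tau = switch_time(x, v, rng)       # time until the next velocity flip
    while t_next_sample <= t + tau and len(samples) < 20000:
        samples.append(x + v * (t_next_sample - t))   # interpolate along the segment
        t_next_sample += dt
    t, x, v = t + tau, x + v * tau, -v                # move to the event and flip velocity

samples = np.array(samples)
print("mean ~ 0:", samples.mean(), " variance ~ 1:", samples.var())
```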


Masayuki Uchida (Osaka University) 

Statistical parametric estimation for a linear parabolic SPDE in two space dimensions  

We consider statistical estimation of the unknown coefficient parameters of a linear parabolic second-order stochastic partial differential equation (SPDE) in two space dimensions driven by a Q-Wiener process, based on high-frequency spatio-temporal data. We introduce minimum contrast estimators (MCEs) for the unknown parameters of the coordinate process of the SPDE based on data thinned with respect to space. Utilizing the MCEs, we approximate the coordinate process of the SPDE and obtain adaptive parametric estimators for the coefficient parameters of the SPDE using the approximate coordinate process and data thinned with respect to time. The asymptotic properties of the proposed estimators are shown under certain regularity conditions. This is joint work with Yozo Tonaki (Osaka University) and Yusuke Kaino (Kobe University).
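The abstract does not state the equation explicitly, but a representative linear parabolic SPDE of the kind described, on a domain $D \subset \mathbb{R}^2$, reads

\[
\mathrm{d}X_t(y) \;=\; \Big(\theta_2\,\Delta X_t(y) + \theta_1 \cdot \nabla X_t(y) + \theta_0\,X_t(y)\Big)\,\mathrm{d}t \;+\; \sigma\,\mathrm{d}W_t^{Q}(y), \qquad y \in D,
\]

with unknown coefficients $(\theta_0, \theta_1, \theta_2, \sigma)$ and a $Q$-Wiener process $W^{Q}$, observed at high frequency on a space-time grid; the exact parametrization used in the talk may differ.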

Taiji Suzuki (University of Tokyo)


Mean field Langevin dynamics: Generalization error analysis and its extensions


Neural networks in the mean-field regime are known to be capable of feature learning, unlike their kernel (NTK) counterparts. Recent works have shown that mean-field neural networks can be globally optimized by a noisy gradient descent update termed the mean-field Langevin dynamics (MFLD). However, all existing guarantees for MFLD consider only optimization efficiency, and it is unclear whether this algorithm leads to improved generalization performance and sample complexity due to the presence of feature learning. To fill this important gap, we study the sample complexity of MFLD in learning a class of binary classification problems. Unlike existing margin bounds for neural networks, we avoid the typical norm control by exploiting the perspective that MFLD optimizes the distribution of parameters rather than the parameters themselves; this leads to an improved analysis of the sample complexity and convergence rate. We apply our general framework to the learning of k-sparse parity functions and show an improvement in the sample complexity.

If time allows, I would like to mention an extension of MFLD to a policy-gradient method.
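For context, the mean-field Langevin dynamics referred to above can be written, in one standard formulation, as

\[
\mathrm{d}X_t \;=\; -\,\nabla_x \frac{\delta F}{\delta \mu}(\mu_t)(X_t)\,\mathrm{d}t \;+\; \sqrt{2\lambda}\,\mathrm{d}B_t, \qquad \mu_t = \operatorname{Law}(X_t),
\]

a noisy gradient descent on the distribution $\mu$ of the parameters that minimizes the entropy-regularized objective $F(\mu) + \lambda\,\mathbb{E}_{\mu}[\log \mu]$, where $F$ is the (assumed convex in $\mu$) training objective of the mean-field neural network.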