Abstracts

Invited talks

Shun-ichi Amari, RIKEN Center for Brain Science, (Monday, 16 March, 10:40 - 11:40)

Information Geometry and Wasserstein Distance

Information geometry studies invariant geometrical structures of a manifold of probability distributions. It has been successful in elucidating statistical inference, machine learning, signal processing, physics, etc., where probability distributions play a fundamental role. On the other hand, the Wasserstein distance reflects the metric of the base space on which probability distributions are defined. It is not invariant, but it is sensitive to the metric structure of the base space. It is formulated as a transportation problem and is useful in various applications, in particular computer vision and machine learning. The present talk studies Wasserstein geometry, in particular its entropy-regularized version, by using the methods of invariant information geometry.
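
For orientation (notation ours, not necessarily the speaker's), the entropy-regularized transportation problem referred to above can be written as

$$ W_\varepsilon(\mu,\nu) \;=\; \min_{\pi \in \Pi(\mu,\nu)} \int c(x,y)\, d\pi(x,y) \;+\; \varepsilon\, \mathrm{KL}(\pi \,\|\, \mu \otimes \nu), $$

where $\Pi(\mu,\nu)$ is the set of couplings with marginals $\mu$ and $\nu$, $c$ is the ground cost on the base space, and setting $\varepsilon = 0$ recovers the unregularized transportation problem.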

Nihat Ay, Max Planck Institute for Mathematics in the Sciences, (Monday, 16 March, 14:20 - 15:20)

On the Natural Gradient for Deep Learning

The natural gradient method is one of the most prominent information-geometric methods in machine learning. It was proposed by Amari in 1998 and uses the Fisher-Rao metric as a Riemannian metric for the definition of a gradient within optimisation tasks. Since then, it has proved to be extremely efficient in the context of neural networks, reinforcement learning, and robotics. In recent years, attempts have been made to apply the natural gradient method to the training of deep neural networks. However, due to the huge number of parameters of such networks, the method is currently not directly applicable in this context. In my presentation, I outline ways to simplify the natural gradient for deep learning. The corresponding simplifications are related to the locality of learning associated with the underlying network structure.
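
For context (our notation, not the speaker's), the natural gradient update preconditions the ordinary gradient with the inverse Fisher information matrix,

$$ \theta_{t+1} \;=\; \theta_t - \eta\, F(\theta_t)^{-1} \nabla_\theta L(\theta_t), \qquad F(\theta) = \mathbb{E}_{p_\theta}\big[\nabla_\theta \log p_\theta \,(\nabla_\theta \log p_\theta)^\top\big], $$

and the simplifications discussed in the talk concern avoiding the explicit inversion of $F$ when the number of parameters is large.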

David Collins, Colorado Mesa University, (Thursday, 19 March, 15:50 - 16:50)

Qubit channel parameter estimation with noisy initial states

Quantum systems can evolve via parameter dependent physical processes, or channels, and estimating such parameters is important in many contexts. Physically estimating the parameter associated with a channel of known type involves subjecting quantum systems to one or more copies of the channel; the choices of initial states for the systems, intermediate parameter-independent operations and final measurements can all affect the eventual estimation accuracy. This accuracy can be quantified in terms of the quantum Fisher information per channel invocation, which can be determined from the system state after the final channel invocation. Much of the work in quantum estimation has focused on absolute optimal estimation protocols for various channels under any physical circumstance. This typically yields absolute bounds on the quantum Fisher information but assumes resources such as pure initial states or arbitrary access to ancilla systems. We describe situations where some of these resources are unavailable and the general results from absolute optimal estimation protocols may no longer apply. Specifically we consider cases of qubit channel estimation where the available initial states are not pure. We compare two protocols: one where the input states into the channel are uncorrelated states generated independently from the individual qubit initial states and the other where the input states are prepared from the same initial states using a particular multi-qubit correlating preparatory unitary. We show that for certain qubit channels the correlated state protocol can yield estimation advantages for noisy initial states, whereas a comparable protocol would not yield advantages for pure initial states. We consider the special case where the channel is unital and is invoked on one of n qubits. Here, for very noisy initial states, the correlated state protocol provides an enhancement of the quantum Fisher information by a factor between n and n-1; this enhancement is not always attained when pure initial states are available. These results indicate that the realm in which estimation is constrained by non-optimal resources is distinct from that with optimal resources and offers many avenues for further exploration.

Marco Cuturi, Google, (Friday, 20 March, 10:00 - 11:00)

Geometry of Regularized Wasserstein Distances

We will present in this talk a survey of recent efforts in the field of optimal transport, and more specifically on its regularized variants that exploit the relative entropy of the Kantorovich coupling. We will show how this regularization can yield several variants of the usual optimal transport metrics.
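
As an illustration only (our sketch, not code from the talk), the entropic regularization above leads to the Sinkhorn iteration, which for two discrete histograms a and b with cost matrix C can be written in a few lines of numpy; all variable names are ours.

    import numpy as np

    def sinkhorn(a, b, C, eps=0.1, n_iter=200):
        """Entropy-regularized OT between histograms a and b with cost matrix C."""
        K = np.exp(-C / eps)             # Gibbs kernel derived from the cost
        u = np.ones_like(a)
        for _ in range(n_iter):
            v = b / (K.T @ u)            # scale columns to match marginal b
            u = a / (K @ v)              # scale rows to match marginal a
        P = u[:, None] * K * v[None, :]  # regularized transport plan
        return np.sum(P * C)             # transport cost (without the entropy term)

    a = np.array([0.5, 0.5]); b = np.array([0.3, 0.7])
    C = np.array([[0.0, 1.0], [1.0, 0.0]])
    print(sinkhorn(a, b, C))

Decreasing eps brings the result closer to the unregularized optimal transport cost, at the price of slower convergence.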

Shinto Eguchi, The Institute of Statistical Mathematics, (Wednesday, 18 March, 10:00 - 11:00)

Information geometry of reinforcement learning

In recent years, the development of reinforcement learning algorithms has enabled the expansion of various applications of AI and has had a great impact on human society. The main purpose of reinforcement learning is to find the optimal decision rule when the relationships among state, action, and reward change stochastically over time. For example, dynamic treatment regimes (DTR) have been actively studied to determine the optimal treatment for individuals in the treatment of chronic diseases with long-term care. It is necessary to adapt the conventional statistical framework for prediction and verification to the setting of reinforcement learning; when an action (treatment) is binary, a discriminant-analysis approach can be applied, and weighted support vector machines have been proposed. In this approach, we consider a semiparametric model of the value function (expected reward) in a framework of information geometry and provide the class of optimal decision rules. Further, we introduce backward inference that does not assume the Markov property in the case of multiple stages.

Akio Fujiwara, Osaka University, (Thursday, 19 March, 10:40 - 11:40)

Recent progress in asymptotic quantum statistics

Suppose that one has $n$ copies of a quantum system, each in the same state depending on an unknown parameter $\theta$, and one wishes to estimate $\theta$ by making some measurement on the $n$ systems together. This yields data whose distribution depends on $\theta$ and on the choice of the measurement. Given the measurement, we therefore have a classical parametric statistical model, though not necessarily an i.i.d. model, since we are allowed to bring the $n$ systems together before measuring the resulting joint system as one quantum object. In that case the resulting data need not consist of (a function of) $n$ i.i.d. observations, and a key quantum feature is that we can generally extract more information about $\theta$ using such "collective" or "joint" measurements than when we measure the systems separately. What is the best we can do as $n$ goes to infinity, when we are allowed to optimize both over the measurement and over the ensuing data processing? The objective of this study is to investigate this question by extending the theory of local asymptotic normality, which is known to form an important part of the classical asymptotic theory, to quantum statistical models. (This is joint work with Koichi Yamagata and Richard D. Gill.)

Minh Ha Quang, RIKEN AIP, (Tuesday, 17 March, 17:00 - 18:00)

Information Geometry and Optimal Transport in Reproducing Kernel Hilbert Spaces

Information Geometry and Optimal Transport are playing increasingly important roles in modern machine learning research. This talk will survey some recent results in Information Geometry and Optimal Transport obtained in the reproducing kernel Hilbert space (RKHS) setting, a major paradigm of machine learning. One particular advantage of the RKHS setting is that many quantities of interest can be expressed in closed form via the kernel Gram matrices. The mathematical formulations will be accompanied by machine learning applications.
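
As a small illustration of the closed-form computations via Gram matrices mentioned above (our example, not necessarily one from the talk), the squared maximum mean discrepancy between two samples reduces to averages of kernel Gram matrix entries.

    import numpy as np

    def rbf_gram(X, Y, gamma=1.0):
        """RBF kernel Gram matrix k(x, y) = exp(-gamma * ||x - y||^2)."""
        sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
        return np.exp(-gamma * sq)

    def mmd2(X, Y, gamma=1.0):
        """Biased estimate of the squared MMD between samples X and Y."""
        return (rbf_gram(X, X, gamma).mean()
                + rbf_gram(Y, Y, gamma).mean()
                - 2 * rbf_gram(X, Y, gamma).mean())

    rng = np.random.default_rng(0)
    X = rng.normal(0, 1, (100, 2)); Y = rng.normal(1, 1, (100, 2))
    print(mmd2(X, Y))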

Hideyuki Ishi, Osaka City University, (Monday, 16 March, 15:50 - 16:50)

On the Berezin-Wallach-Gindikin-Jorgensen set

It is well-known that the Kullback-Leibler divergence for an exponential family is a Bregman divergence. But the converse is not true. In other words, a strictly convex function is not necessarily a cumulant (generating) function of a positive measure. Indeed, a positive multiple of a cumulant function is not necessarily a cumulant function. The Jorgensen set of a positive measure $\mu$ is the set of positive numbers $c$ such that $c$ times the cumulant function of $\mu$ is again a cumulant function of some positive measure. Clearly the Jorgensen set contains all positive integers, while it is often a hard problem to determine the Jorgensen set of a given measure. The Jorgensen set of the classical Wishart distribution is called the Berezin set, the Wallach set, or the Gindikin set, which has been studied in the context of various areas of mathematics such as complex geometry, representation theory, and mathematical physics. In this talk, I discuss stimulating interactions between statistics and these areas through the Berezin-Wallach-Gindikin-Jorgensen set in a general setting.
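
In symbols (restating the definition above; the notation $\Lambda(\mu)$ is ours), writing $\psi_\mu$ for the cumulant function of a positive measure $\mu$, the Jorgensen set is

$$ \Lambda(\mu) \;=\; \{\, c > 0 \;:\; c\,\psi_\mu \text{ is the cumulant function of some positive measure} \,\}, $$

which always contains the positive integers, since $n\psi_\mu$ is the cumulant function of the $n$-fold convolution $\mu^{*n}$.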

Mohammad Emtiyaz Khan, RIKEN AIP, (Wednesday, 18 March, 15:50 - 16:50)

Bayesian Learning Rule: Combining information geometry, optimization, and statistics to improve deep learning

I will present a new learning rule, called the Bayesian learning rule, which enables us to connect a wide variety of learning algorithms. The rule is derived by borrowing and combining ideas from information geometry, continuous optimization, and Bayesian statistics. A striking property of the rule is that it enables us to unify, generalize, and improve a wide range of existing learning algorithms. Not only that, it also enables us to design new algorithms, e.g., to improve several aspects of deep learning.
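
In rough form (our paraphrase of the published rule; the talk may state it differently), for a candidate posterior $q_\lambda$ in an exponential family with natural parameter $\lambda$, the rule performs a natural-gradient step on a variational objective,

$$ \lambda \;\leftarrow\; \lambda - \rho\, \widetilde{\nabla}_\lambda \Big( \mathbb{E}_{q_\lambda}[\ell(\theta)] - \mathcal{H}(q_\lambda) \Big), $$

where $\ell$ is the loss, $\mathcal{H}$ the entropy, and $\widetilde{\nabla}$ the natural gradient; different choices of $q_\lambda$ and different approximations of the expectation recover different known algorithms.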

Fumiyasu Komaki, The University of Tokyo, (Wednesday, 18 March, 14:20 - 15:20)

Information geometry of multivariate Poisson distributions and its applications

Prior densities for constructing Bayesian predictive densities are investigated from the viewpoint of information geometry. The Kullback-Leibler divergence from the true density to a predictive density is adopted as a loss function. There exist shrinkage predictive densities asymptotically dominating Bayesian predictive densities based on the Jeffreys prior or other vague priors if the model manifold endowed with the Fisher metric satisfies certain differential-geometric conditions. By using these information-geometric results, shrinkage priors for a class of Poisson regression models are constructed. Bayesian predictive densities based on the shrinkage priors exactly dominate those based on the Jeffreys priors under the Kullback-Leibler loss.

Takeru Matsuda, The University of Tokyo, (Thursday, 19 March, 17:00 - 18:00)

Minimax estimation of quantum states based on the conditional Holevo mutual information

We investigate minimax estimation of quantum states under the quantum relative entropy loss. We show that the Bayes estimator with respect to a prior that maximizes the conditional Holevo mutual information is minimax. For the one-qubit system, we also provide a class of measurements that is optimal from the viewpoint of minimax state estimation.

Aditya Menon, Google Research, (Tuesday, 17 March, 14:20 - 15:20)

Multilabel retrieval: a loss function perspective

Multilabel retrieval (MLR) is the problem of identifying a small number of relevant labels for an instance, given a large (on the order of millions or billions) set of candidate labels. Several core challenges in MLR boil down to the design of suitable loss functions. For example, how can one design losses that are computationally efficient, yet statistically consistent? How can one ensure that retrieval is “fair” across data subpopulations? And how can one modify losses so as to cope with noise in the training set? In this talk, we present some recent proposals to address each of these problems. These proposals build on simple primitives -- in particular, those of convex risk measures, and the family of proper loss functions -- and yield effective solutions with desirable theoretical and practical performance. We conclude with a discussion of open challenges in MLR, and other potential use-cases for the studied loss functions.

Milan Mosonyi, Budapest University of Technology and Economics, (Thursday, 19 March, 14:20 - 15:20)

Rényi divergence radii in quantum information theory

There are various inequivalent ways to define the entropic capacity of a classical-quantum channel. In this talk we consider the geometric notion of Rényi divergence radius, and show that it gives the operationally relevant quantification of the capacity in constant composition coding.
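
In schematic form (our notation; the talk makes the precise choices), the Rényi divergence radius of a classical-quantum channel $x \mapsto W_x$ is the min-max quantity

$$ R_\alpha(W) \;=\; \min_{\sigma} \max_{x} D_\alpha(W_x \,\|\, \sigma), $$

where the minimum runs over quantum states $\sigma$ and $D_\alpha$ is one of the quantum Rényi divergences; which version of $D_\alpha$ is used, and the restriction to constant composition codes, are where the operational subtleties discussed in the talk arise.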

Hiroshi Nagaoka, The University of Electro-Communications, (Monday, 16 March, 17:00 - 18:00)

T.B.A.


Frank Nielsen, Sony Computer Science Laboratories Inc., (Tuesday, 17 March, 13:10 - 14:10)

From geometric learning machines to the geometry of learning


Sebastian Nowozin, Google Research, (Tuesday, 17 March, 15:50 - 16:50)

T.B.D.


Mahito Sugiyama, National Institute of Informatics, (Tuesday, 17 March, 10:00 - 11:00)

Learning with Dually Flat Structure and Incidence Algebra

Statistical manifolds with dually flat structures, such as an exponential family, appear in various machine learning models. In this talk, I will introduce a close connection between dually flat manifolds and incidence algebras in order theory and present its application to machine learning. This approach allows us to flexibly design hierarchically structured sample spaces, which include a number of machine learning problems such as learning of Gibbs distributions, tensor decomposition, and blind source separation.

Asuka Takatsu, Tokyo Metropolitan University, (Friday, 20 March, 11:10 - 12:10)

Invariant metric under deformed Markov embeddings with overlapped supports

By Čencov's theorem, there exists a unique family of (0,2)-tensors on the n-dimensional probability simplices, indexed by the natural numbers n, that is invariant under Markov embeddings. We prove that this invariance does not depend on the geometry of the probability simplex, but only on the feature of Markov embeddings related to sufficient statistics. We deform Markov embeddings with sufficiency in mind and construct a family of (0,2)-tensors invariant under the deformed embeddings.
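
For orientation (standard form, our notation), the invariant tensor singled out by Čencov's theorem is, up to a constant multiple, the Fisher-Rao metric on the probability simplex,

$$ g_p(u,v) \;=\; \sum_{i} \frac{u_i v_i}{p_i}, \qquad \text{with } p \text{ in the simplex and } \textstyle\sum_i u_i = \sum_i v_i = 0, $$

and the question addressed above is which features of Markov embeddings are actually responsible for this rigidity.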

Ting-Kam Leonard Wong, University of Toronto, (Monday, 16 March, 13:10 - 14:10)

Logarithmic divergences: from finance to optimal transport and information geometry

Divergences, such as the Kullback-Leibler divergence, are distance-like quantities which arise in many applications in probability, statistics and data science. We introduce a family of logarithmic divergences which is a non-linear extension of the celebrated Bregman divergence. It is defined for any exponentially concave function (a function whose exponential is concave). We motivate this divergence by mathematical finance and large deviations of Dirichlet processes. It also arises naturally from the solution to an optimal transport problem. The logarithmic divergence enjoys remarkable mathematical properties including a generalized Pythagorean theorem, and induces a generalized exponential family of probability densities. We also discuss on-going research about novel statistical methodologies arising from our new divergences.
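
For comparison (our notation; the exact normalization follows the speaker's papers), the Bregman divergence of a convex function $\varphi$ is $B_\varphi(p\,\|\,q) = \varphi(p) - \varphi(q) - \nabla\varphi(q)\cdot(p-q)$, while the logarithmic divergence of an $\alpha$-exponentially concave function $\varphi$ replaces the linear comparison term by a logarithm,

$$ L^{(\alpha)}_\varphi(p\,\|\,q) \;=\; \frac{1}{\alpha}\log\big(1 + \alpha\,\nabla\varphi(q)\cdot(p-q)\big) - \big(\varphi(p) - \varphi(q)\big), $$

recovering the Bregman case in the limit $\alpha \to 0$.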

Jun Zhang, University of Michigan, (Wednesday, 18 March, 17:00 - 18:00)

Statistical Mirror Symmetry

A parametric statistical model is a family of probability density functions over a given sample space, where each function is indexed by a parameter taking values in some subset of $\mathbb{R}^n$. Treating the parameterization as a local coordinate chart, the family forms a manifold M endowed with a Riemannian metric g given by the Fisher information (the well-known Fisher-Rao metric). The classical theory of information geometry (which we call the A-model) also prescribes a family of dualistic, torsion-free alpha-connections constructed from the Amari-Chentsov tensor as deformations of the Levi-Civita connection associated with g. Here we prescribe an alternative geometric framework for the manifold M by treating the parameter as an affine parameter of a flat connection and then prescribing its dual connection (with respect to g) \nabla as one that is curvature-free but carries torsion (which we call the B-model). We then investigate properties of the tangent bundle TM based on the Sasaki lift of g and a canonical splitting using data from the base manifold M (i.e., either the A- or the B-model). For the A-model, TM has the structure of an almost Kähler manifold, with an alpha-dependent almost complex structure yet an identical symplectic structure for all alpha-connections, which, when pushed forward to the cotangent bundle T*M, is its canonical symplectic form. For the B-model, TM has the structure of a Hermitian manifold constructed from the flat connection and an almost Kähler structure constructed from \nabla which, when pushed forward to T*M, is also the canonical symplectic structure. Therefore, we establish a "mirror correspondence" between a Hermitian structure on TM and an almost Kähler structure on T*M, each constructed from one of the pair of dual connections in the B-model. In analogy with mirror symmetry in string theory, we call this "statistical mirror-symmetry" and speculate on its meaning in the context of statistical inference. (Joint work with Gabriel Khan.)

Tutorial

Fuyuhiko Tanaka, Osaka University, (Thursday, 19 March, 10:00 - 10:30)

Introduction to Quantum Information

Short talks

Pierre Alquier, RIKEN AIP, (Wednesday, 18 March, 13:10 - 13:25)

A Generalization Bound for Online Variational Inference

Bayesian inference provides an attractive online-learning framework to analyze sequential data, and offers generalization guarantees which hold even with model mismatch and adversaries. Unfortunately, exact Bayesian inference is rarely feasible in practice and approximation methods are usually employed, but do such methods preserve the generalization properties of Bayesian inference? In this paper, we show that this is indeed the case for some variational inference (VI) algorithms. We consider a few existing online, tempered VI algorithms, as well as a new algorithm, and derive their generalization bounds. Our theoretical result relies on the convexity of the variational objective, but we argue that the result should hold more generally and present empirical evidence in support of this. Our work in this paper presents theoretical justifications in favor of online algorithms relying on approximate Bayesian methods.

Goffredo Chirco, Romanian Institute of Science and Technology, (Thursday, 19 March, 13:25 - 13:40)

Bregman-Lagrangian Dynamics on the Non-parametric Statistical Bundle

Building on the notion of the Bregman Lagrangian, Wibisono et al. (2016) recently proposed a variational formulation for higher-order gradient methods. This unifying generative framework for accelerated gradient flows, initially defined in a Euclidean setting, has recently been extended to the space of probability distributions from different perspectives (see, e.g., Wang and Li, 2019; Taghvaei and Mehta, 2019), and has gained growing interest in the machine learning community. Interestingly, from an information-geometric viewpoint, the existence and the convergence properties of a geometric counterpart of accelerated gradient flows on a probability space rely heavily both on the statistical model considered and on the specific form of the Lagrangian.
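
For reference, the Bregman Lagrangian of Wibisono et al. (2016) in the Euclidean setting reads (our transcription of the published formula)

$$ \mathcal{L}(x, v, t) \;=\; e^{\alpha_t + \gamma_t}\Big( D_h\big(x + e^{-\alpha_t} v,\; x\big) - e^{\beta_t} f(x) \Big), $$

where $D_h$ is the Bregman divergence of a convex function $h$, $f$ is the objective, and $\alpha_t, \beta_t, \gamma_t$ are scaling functions subject to ideal-scaling conditions; the work described here transfers this construction to the exponential statistical bundle.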

In our work, we investigate the variational approach to accelerated natural gradient flows on a probability space, framing it within a general study of the Lagrangian and Hamiltonian formalism of classical mechanics in information geometry. In this setting, we start from the essential ingredients of a maximal exponential model, consider a Bregman divergence which endows the model with a Hessian geometry, and focus on the generalization of the recently proposed notion of Bregman Lagrangian to the exponential manifold, in order to derive second-order optimization dynamics in general variational terms.

We review the definitions and constructs coming from non-parametric information geometry for the exponential statistical bundle, the affine atlas endowing it with a manifold structure, and a natural family of parallel transports between the fibers. We present the fiber bundle formalism for the statistical manifold and derive the expression of the score velocity and the associated acceleration for a one-dimensional model in the given affine atlas.

Further, in the definition of the Lagrangian system, we discuss the fundamental role played by retraction maps on the bundle, leading to a tangent bundle formulation for the divergence function on the statistical model, with the peculiarity of having a non-quadratic and non-symmetric Bregman divergence playing the role of kinetic energy.

Finally, we derive the Lagrangian construction and the Euler-Lagrange equations on the exponential manifold for our statistical reformulation of the Bregman-Lagrangian system, and we investigate the symplectic structure of the statistical bundle and the dual Hamiltonian formulation of the variational approach. Here, the natural compatibility between the dually flat geometry of statistical models and Hamiltonian mechanics is apparent. Our general geometric-mechanical analysis is timely for the optimization of functions defined over statistical models, and it paves the way to further generalizations.

Domenico Felice, MPI MiS - Leipzig, (Thursday, 19 March, 13:40 - 13:55)

Towards a canonical divergence within Information Geometry

In Riemannian geometry, geodesics are integral curves of the gradient of the Riemannian distance. This result is achieved via the celebrated Gauss Lemma. In Information Geometry the situation is more complicated, because the general object is a Riemannian manifold endowed with a pair of affine connections which are dual with respect to the metric tensor. Relying on both affine connections, we build up two vector fields and show that their sum generates the rays of level sets defined by a suitable pseudo-distance. Very remarkably, we recover the classical result of Riemannian geometry in the self-dual case. Based on these two vector fields, we define a canonical divergence for a general dualistic structure $(g, \nabla, \nabla^*)$ given on a smooth manifold M. We then exhibit some features of this divergence and show that it reduces to the canonical divergence proposed by Ay and Amari in the cases of: (a) self-duality, (b) dual flatness, (c) the statistical-geometric analogue of the concept of symmetric spaces in Riemannian geometry. Case (c) leads to a further comparison of the recent divergence with the one introduced by Henmi and Kobayashi.

Deepika Kumari, Romanian Institute of Science and Technology, (Wednesday, 18 March, 13:25 - 13:40)

Deformed q-ELBO for Robust Variational Inference

Tsallis introduced the notion of non-extensive entropy, commonly called q-entropy or Tsallis entropy, which is a generalization of the Boltzmann-Gibbs entropy. This led to non-extensive statistical mechanics, which uses q-deformed exponentials as a generalization of the exponential function and motivates the study of generalized exponential families, called the q-exponential family. After recalling the definitions of the probability density function for multivariate q-Gaussian distributions, for different parameterizations, and their $q$-mean and $q$-covariance obtained by computing deformed expectations based on escort distributions, we derive a closed-form expression for the $q$-deformed Kullback-Leibler divergence. The q-deformed Gaussian distribution is often favored for its heavy tails in comparison to the standard Gaussian when $1 < q < 3$. In particular, in this range of $q$, deformed Gaussians are equivalent to Student's $t$-distributions, where the degrees of freedom can be expressed as a function of $q$. Due to the heavy tails, such distributions find applications in fields such as statistical mechanics, economics, information theory, and machine learning, among others. Dually to the deformed exponential function, the deformed logarithm plays an important role in the definition of the $q$-deformed likelihood in the context of robust maximum likelihood estimation. The deformed $q$-MLE indeed often provides more robust estimation in the presence of a small number of observations. In this work we combine the $q$-MLE and the properties of the $q$-deformed Gaussian distribution to derive a novel Evidence Lower Bound (ELBO) for variational inference, called the $q$-ELBO. The novel lower bound for the likelihood can be employed for the training of generative models such as Variational AutoEncoders, in the context of Bayesian deep learning. This is a joint work with Septimia Sarbu and Luigi Malagò.
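
For the reader's convenience (standard definitions, our notation), the deformed logarithm and exponential underlying the abstract are

$$ \ln_q(x) \;=\; \frac{x^{1-q} - 1}{1-q}, \qquad \exp_q(x) \;=\; \big[1 + (1-q)x\big]_+^{\frac{1}{1-q}}, $$

which recover the ordinary logarithm and exponential as $q \to 1$; the $q$-Gaussian is obtained by applying $\exp_q$ to a negative quadratic form and normalizing.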

Luigi Malagò, Romanian Institute of Science and Technology, (Wednesday, 18 March, 13:55 - 14:10)

On the Natural Gradient for the Training of Neural Networks

More than twenty years have passed since the proposal of the natural gradient, first described by Amari in his seminal paper in 1998. In the last two decades the natural gradient has found applications in many different fields in engineering and data science, such as stochastic optimization, statistics and machine learning. From an optimization perspective the natural gradient is the Riemannian gradient of a function defined over a statistical manifold, endowed with the Fisher-Rao information metric. The natural gradient allows the definition of the direction of steepest descent of a function, in a way which is invariant to the choice of the parameterization. Algorithms based on the natural gradient generally show faster convergence rates with respect to the number of iterations, thanks to their capability to avoid plateaux and escape local minima. Such advantages, however, usually come at the expense of a higher computational cost per iteration, since in the worst case a linear system needs to be solved at each time step. However, there exist multiple examples of applications in which the natural gradient takes a simplified form, thanks to the fact that the inverse Fisher information matrix combines with the vector of partial derivatives of the function, resulting in simplified formulae depending on the choice of the parameterization. In this talk we focus in particular on the use of the natural gradient for the training of neural networks, one of the first application fields explored by Amari, which nowadays has become extremely popular thanks to many advances in deep learning. Training large networks poses several challenges for methods based on the natural gradient, and several approaches have been proposed in the literature to overcome the impact on the computational cost in the presence of large networks. We review state-of-the-art algorithms based on the natural gradient for the training of neural networks and present novel results and algorithms. We present experimental results which allow us to evaluate the advantages and disadvantages of different approaches on standard benchmarks.
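
As a purely didactic sketch (ours, not an algorithm from the talk), one natural-gradient step with an empirical Fisher matrix and damping makes the per-iteration linear-system cost mentioned above concrete; the algorithms reviewed in the talk avoid forming the full matrix by exploiting structure.

    import numpy as np

    def natural_gradient_step(theta, grad, per_example_grads, lr=0.1, damping=1e-3):
        """One generic natural-gradient step: precondition the gradient with the
        inverse of an empirical Fisher matrix built from per-example gradients."""
        G = np.stack(per_example_grads)                  # shape (n_samples, n_params)
        F = G.T @ G / G.shape[0] + damping * np.eye(G.shape[1])
        direction = np.linalg.solve(F, grad)             # solve F d = grad instead of inverting F
        return theta - lr * direction

    # toy usage with 3 samples and 2 parameters (illustrative numbers only)
    theta = np.zeros(2)
    grads = [np.array([0.5, -0.2]), np.array([0.3, 0.1]), np.array([0.4, 0.0])]
    print(natural_gradient_step(theta, np.mean(grads, axis=0), grads))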

Hidemasa Oda, The University of Tokyo, (Wednesday, 18 March, 11:20 - 11:35)

Non-informative prior on the Kähler information statistical manifold of complex-valued Gaussian process

We focus on complex-valued Gaussian processes, which are known to have a Kähler structure on their statistical complex manifold. The objective is to construct a good Bayesian predictive distribution with respect to the risk defined by the KL-divergence. Although the Jeffreys prior for the complex-valued autoregressive model of order p is improper, we manage to find a non-informative prior whose Bayesian predictive distribution asymptotically dominates that of the Jeffreys prior.

Tomasz Rutkowski, RIKEN AIP, (Tuesday, 17 March, 11:20 - 11:35)

Riemannian Geometry Machine Learning Methods for EEG and fNIRS Digital Biomarker Development in AI for Aging Societies Application

Dementia, and especially Alzheimer’s disease (AD), is the most frequent sign of cognitive decline in older adults. The rise of cognitive decline and related mental health problems in aging societies is causing a significant economic and healthcare burden in many nations around the world. A recent World Health Organization (WHO) report estimates that about 50 million people worldwide currently live with a dementia spectrum of neurocognitive disorders. This number will triple by 2050, which calls for the application of non-pharmacological or AI-based technologies to support early screening for preventive interventions and subsequent mental wellbeing monitoring, as well as maintenance with so-called digital-pharma (beyond-a-pill) therapeutic approaches. This abstract presents our recent results on Riemannian-geometry machine learning approaches to brainwave (EEG and fNIRS) classification, aimed at establishing a framework of functional digital biomarkers for dementia progress detection and monitoring. The discussed approach is a step forward in advancing AI, and especially machine learning (ML) approaches, for subsequent application to mild cognitive impairment (MCI) and AD diagnostics.
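
As background only (not a description of the authors' specific pipeline), Riemannian classification of EEG typically operates on spatial covariance matrices compared with the affine-invariant distance

$$ d(\Sigma_1, \Sigma_2) \;=\; \big\| \log\big(\Sigma_1^{-1/2}\, \Sigma_2\, \Sigma_1^{-1/2}\big) \big\|_F, $$

for example by assigning a trial to the class whose Riemannian mean covariance is closest.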

Hidetoshi Shimodaira, Kyoto University, (Tuesday, 17 March, 11:05 - 11:20)

Selection bias may be adjusted when sample size is negative

For computing p-values, one should specify hypotheses before looking at the data. However, people tend to use a dataset twice, for hypothesis selection and evaluation, leading to inflated statistical significance and more false positives than expected. Recently, a new statistical method, called selective inference or post-selection inference, has been developed for adjusting this selection bias. In this talk, I present a bootstrap resampling method with “negative sample size” for computing bias-corrected p-values. Geometry plays an important role in the theory: our multiscale bootstrap method estimates the signed distance and the mean curvature of the boundary surface of the hypothesis region. Examples are shown for confidence intervals of regression coefficients after model selection, and significance levels of trees and edges in hierarchical clustering and phylogenetic inference. This is a joint work with Yoshikazu Terada (Osaka University).
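
In rough outline (our paraphrase of the published method), the multiscale bootstrap computes bootstrap probabilities $\mathrm{BP}_{\sigma^2}$ at several scales $\sigma^2 = n/n'$ by varying the resample size $n'$, fits the approximately linear relation

$$ \sigma\,\Phi^{-1}\big(1 - \mathrm{BP}_{\sigma^2}\big) \;\approx\; v + c\,\sigma^2, $$

where $v$ and $c$ estimate the signed distance and mean curvature mentioned above, and then extrapolates to $\sigma^2 = -1$ (a formally negative sample size) to obtain the bias-corrected p-value $\approx 1 - \Phi(v - c)$.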

Junichi Takeuchi, Kyushu University, (Wednesday, 18 March, 11:05 - 11:20)

MDL Estimators Using Fiber Bundles of Local Exponential Families

The MDL estimators for density estimation, which are defined by two-part codes for universal coding, are analyzed. We give a two-part code for non-exponential families whose regret is close to the minimax regret, where the regret of a code with respect to a target family $\mathcal{M}$ is the difference between the codelength of the code and the ideal codelength achieved by an element of $\mathcal{M}$. Our code is constructed using a probability density in a fiber bundle of local exponential families of $\mathcal{M}$. A small regret of the two-part code means that the MDL estimator has a small statistical risk, by the theory introduced by Barron and Cover in 1991. This is a joint work with Kohei Miyamoto and Andrew R. Barron.
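
In symbols (restating the definition above, with our notation), the regret of a code with length function $L$ on data $x^n$ with respect to the family $\mathcal{M} = \{p_\theta\}$ is

$$ \mathrm{reg}(L, x^n) \;=\; L(x^n) - \min_{\theta} \big( -\log p_\theta(x^n) \big), $$

and the minimax regret is obtained by maximizing this quantity over $x^n$ and minimizing over codes.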

Koichi Tojo, RIKEN AIP, (Thursday, 19 March, 13:55 - 14:10)

A method to construct exponential families by representation theory

We give a method to construct "good" exponential families systematically by representation theory. More precisely, we consider a homogeneous space G/H as a sample space and construct an exponential family invariant under the action of the group G by using a representation of G. The method generates widely used exponential families such as normal, gamma, Bernoulli, categorical, Wishart, von Mises, Fisher-Bingham and hyperboloid distributions.

Riccardo Volpi, Romanian Institute of Science and Technology, (Thursday, 19 March, 13:10 - 13:25)

Natural Alpha Embeddings and their Impact on Downstream Tasks

Word embeddings are compact representations of the words of a dictionary, learned from a large corpus of unsupervised data. Skip-Gram (SG) is a well-known model for the conditional probability of the context of a given central word, which has been shown to capture syntactic and semantic information efficiently. SG is at the basis of popular word embedding algorithms, such as Word2Vec (Mikolov et al. 2013) and GloVe, which performs a weighted matrix factorization of global co-occurrences (Pennington and Manning 2014). Levy and Goldberg (2014) showed that Word2Vec SG with negative sampling is effectively performing a matrix factorization of the Shifted Positive PMI, making the similarity between the two methods explicit.
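
For concreteness (standard formulation, our notation), SG models the probability of a context word $c$ given a central word $w$ through a softmax over inner products of two embedding matrices,

$$ p(c \mid w) \;=\; \frac{\exp(u_c^{\top} v_w)}{\sum_{c'} \exp(u_{c'}^{\top} v_w)}, $$

where $v_w$ and $u_c$ are the input (word) and output (context) vectors; negative sampling replaces this softmax with a binary classification objective.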

A word embedding is often the input of another computational model used to solve more complex inference tasks. The quality of a word embedding, which ideally should encode syntactic and semantic information, is not easy to determine, and different approaches have been proposed in the literature. It can be evaluated in terms of performance on word similarity tasks (Bullinaria and Levy 2007, 2012; Pennington et al. 2014; Levy et al. 2015) or by solving word analogies (Mikolov et al. 2013); however, more recent work (Tsvetkov et al. 2015; Schnabel et al. 2015) has shown a low degree of correlation between the quality of embeddings for word similarities and analogies on the one hand, and for downstream tasks (e.g., classification or prediction) to which the embedding is given as input, on the other. Several works have highlighted the effectiveness of post-processing techniques (Bullinaria and Levy 2007, 2012), such as PCA (Raunak 2017; Mu et al. 2017), based on the observation that certain dominant components carry neither semantic nor syntactic information and thus act like noise. A different approach, which still acts on the learned vectors after training, has been recently proposed by Volpi and Malagò (2019). The authors present a geometrical framework in which word embeddings are represented as vectors in the tangent space of a probability simplex. A family of word embeddings called natural alpha embeddings is introduced, where alpha is a deformation parameter for the geometry of the probability simplex, known in Information Geometry in the context of alpha-connections (Amari and Nagaoka 2000; Amari 2016). Noticeably, alpha word embeddings include the classical word embeddings as a special case.

We present an experimental evaluation of natural alpha embeddings over different tasks, showing how the choice of the geometry on the manifold impacts performance on both intrinsic and extrinsic tasks. This is a joint work with Luigi Malagò.

Tatsuaki Wada, Ibaraki University, (Wednesday, 18 March, 13:40 - 13:55)

On the gradient-flow equations in information geometry

The gradient-flow equations in information geometry, which were proposed by Fujiwara and Amari, and by Nakamura, more than two decades ago, are revisited from the viewpoint of Hamiltonian dynamics with a homogeneous Hamiltonian of degree one in the variables $p_i$. The gradient flow is related to a geodesic flow on a Riemannian manifold, in which the flow is driven by the homogeneous Hamiltonian. A relation to replicator equations is also pointed out.

Poster Presentations

Akifumi Okuno, RIKEN AIP

Multi-scale k-nearest neighbour

k-nearest neighbour (k-NN) is one of the simplest and most popular non-parametric methods; it predicts the query's label by considering the observed labels of the k objects nearest to the query. The parameter k ($\in \mathbb{N}$) regulates its bias-variance trade-off; k-NN has smaller variance but larger bias as k increases. In this study, we propose multi-scale k-NN (MS-k-NN), which reduces the bias that appears in conventional k-NN by adaptively predicting the label probability through k-NN estimators equipped with several different values of k. We theoretically prove favorable properties of the proposed MS-k-NN and empirically demonstrate that it outperforms existing methods. (This is a joint work with Professor Shimodaira (Kyoto University/RIKEN AIP).)
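
As a toy illustration only, the sketch below combines k-NN class-probability estimates computed at several values of k by a plain average; the actual MS-k-NN estimator uses a theoretically motivated adaptive combination rather than this uniform weighting, and all names here are ours.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def multi_k_knn_proba(X_train, y_train, X_query, ks=(1, 3, 9, 27)):
        """Average k-NN class-probability estimates over several k
        (a crude stand-in for the adaptive combination used by MS-k-NN).
        Each k in ks must not exceed the number of training points."""
        probs = []
        for k in ks:
            clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
            probs.append(clf.predict_proba(X_query))
        return np.mean(probs, axis=0)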

Csongor-Huba Varady, Romanian Institute of Science and Technology, MPI MiS

Natural Wake-Sleep Algorithm for Helmholtz Machines

While the natural gradient [Amari, 1998, 1997] has been proven to be efficient for first-order optimization in machine learning, its adoption for the training of neural networks has been mostly held back by its computational cost in the presence of large networks, compared to simpler methods using only partial derivatives. Helmholtz Machines [Dayan et al., 1995] are a particular type of generative model composed of two sigmoid belief networks, acting as an encoder and a decoder, commonly trained using the Wake-Sleep algorithm [Hinton et al., 1995]. For sigmoid belief networks, it has been shown that the Fisher information matrix assumes a block-diagonal structure [Ay, 2002], which can be efficiently exploited to further reduce the computational complexity of the matrix inversion associated with the natural gradient, without the need for further assumptions. In this poster, we present the natural Wake-Sleep algorithm, a geometric adaptation of the Wake-Sleep algorithm based on the computation of the natural gradient for the training of Helmholtz Machines. We compare our proposed algorithm with state-of-the-art methods, such as the Reweighted Wake-Sleep algorithm, in terms of convergence speed, both with respect to the number of iterations and time complexity, and convergence to local minima. To further improve the performance of the algorithm we use common techniques, such as delayed updates of the Fisher matrix and adaptive methods for the learning rate and the damping factor. Finally, we explore the possibility of geometric regularization of the weights during learning. This is a joint work with Riccardo Volpi, Luigi Malagò, and Nihat Ay.

Hector Hortua, Romanian Institute of Science and Technology

Correlated Uncertainties for Regression Problems using Bayesian Neural Networks and Generalized Divergences

Bayesian Neural Networks (BNNs) have proven useful for preventing overfitting during training by providing an effective regularization. By learning a probability distribution over the weights, i.e., the approximate posterior, BNNs are able to provide two types of uncertainty at inference time: the aleatoric one, depending on the noise of the data, and the epistemic one, depending on the confidence of the model, determined by how similar the input is to previous observations. Being able to provide reliable estimates of the uncertainty associated with a prediction, together with the correct calibration of the confidence intervals [Kendall and Gal, 2017], is of great importance in statistics and machine learning for several applications, for instance in physics, computer vision, and economics, among others. Recently it has been demonstrated how it is possible to obtain reliable predictions of correlated uncertainties in regression problems [Hortua et al. 2019], which is of particular interest in tasks with correlated outputs.

BNNs are commonly trained to find the best approximation of the posterior distribution over the weights within a well-behaved family of distributions. This is traditionally achieved in a variational inference setting, by minimizing the Kullback-Leibler divergence between the true posterior and an approximate posterior during training. Explicitly computing the true posterior is usually intractable; however, computationally efficient alternatives have been proposed that allow the approximate posterior to be learned efficiently [Graves 2011; Kingma and Welling 2014; Rezende et al. 2014]. More recently, alternative approximate variational inference algorithms have been proposed, in which the KL divergence is replaced by other dissimilarity measures between distributions. This is the case, for instance, of the Rényi divergence [Li and Turner, 2016], the Chi-square divergence [Dieng et al., 2017], as well as the alpha-divergence [Hernández-Lobato et al. 2016] first introduced by [Amari, 1985].
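
For reference (standard definitions, our notation), the Rényi and Amari alpha-divergences between densities $p$ and $q$ are

$$ D^{R}_{\alpha}(p\,\|\,q) = \frac{1}{\alpha - 1}\log \int p^{\alpha} q^{1-\alpha}\,dx, \qquad D^{A}_{\alpha}(p\,\|\,q) = \frac{4}{1-\alpha^{2}}\Big(1 - \int p^{\frac{1-\alpha}{2}} q^{\frac{1+\alpha}{2}}\,dx\Big), $$

both recovering the Kullback-Leibler divergence in appropriate limits of $\alpha$; the variational objectives cited above replace the KL term of the standard ELBO with such measures.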

In this poster we define a methodological framework illustrating how BNNs can benefit from the use of alternative divergence functions during training, with particular focus on estimating correlated uncertainties for regression problems. In particular, we evaluate how the choice of the divergence function impacts the predicted parameters, their uncertainties, and the calibration methods. This is a joint work with Riccardo Volpi and Luigi Malagò.

Petru Hlihor, Romanian Institute of Science and Technology and MPI MiS

Information Geometric Regularizers for Variational AutoEncoders to Improve Robustness Against Adversarial Examples

Adversarial examples represent a serious security concern for deploying machine learning systems in real life. Indeed, it is possible to slightly modify an image such that it looks almost indistinguishable to a human, yet it fools object recognition classifiers into believing that the image belongs to a different class. In this poster, we propose new methods based on notions of information geometry to train classifiers that are robust against adversarial examples. We consider a defense mechanism based on input reconstruction obtained through an autoencoder [Gu and Rigazio, 2014] before the image is classified by a neural network. We introduce new regularizers for training variational autoencoders that are robust against white-box attacks on the combined system obtained by stacking the two networks, i.e., the autoencoder and the classifier. In particular, the first regularizer we consider is a type of contractive regularizer that explicitly takes into account the non-Euclidean geometry of the space of posterior distributions for a variational autoencoder. The penalty term is the norm, induced by the Fisher information matrix for the multivariate Gaussian distribution, of the Jacobian of the mapping learned by the encoder. The second regularizer penalizes models that compute very different latent representations for an input and its reconstruction. We evaluate dissimilarities in the space of posterior distributions by computing the Kullback-Leibler divergence between two multivariate Gaussians generated by the encoder, as well as other types of geodesics. The purpose of these geometric regularizers is to improve the robustness of the combined system given by the reconstruction and the classification, and to make adversarial examples harder to generate. We evaluate the impact of the novel regularizers over a set of standard benchmarks for image classification.
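
For completeness (standard closed form), the Kullback-Leibler divergence between two $k$-dimensional Gaussians used in the second regularizer is

$$ \mathrm{KL}\big(\mathcal{N}(\mu_0,\Sigma_0)\,\|\,\mathcal{N}(\mu_1,\Sigma_1)\big) = \tfrac{1}{2}\Big( \mathrm{tr}(\Sigma_1^{-1}\Sigma_0) + (\mu_1-\mu_0)^{\top}\Sigma_1^{-1}(\mu_1-\mu_0) - k + \log\tfrac{\det\Sigma_1}{\det\Sigma_0} \Big), $$

which is what makes the latent-space comparison between an input and its reconstruction computationally cheap.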

Michiko Okudo, The University of Tokyo

Projection of Bayesian predictive densities onto finite-dimensional exponential families

Bayesian predictive densities are optimal with respect to the Bayes risk under the Kullback-Leibler divergence, and they are often approximated by taking the mean of a finite number of plug-in densities. We consider projecting Bayesian predictive densities onto finite-dimensional full exponential families when the model belongs to a curved exponential family. It is shown that the posterior mean of the expectation parameter of the full exponential family is optimal with respect to the Bayes risk and asymptotically coincides with the projection of the Bayesian predictive density onto the exponential family with respect to the Fisher metric. Several information-geometric results for the posterior mean of the expectation parameter are obtained in parallel with those for Bayesian predictive densities.

Tasuku Soma, The University of Tokyo

Information geometry of operator scaling

Operator scaling is a quantum generalization of matrix scaling with wide applications in combinatorial optimization, the Brascamp-Lieb inequality, and invariant theory. While matrix scaling has rich connections to information geometry, it was not known that operator scaling admits similar information geometric interpretation. In this poster, we show that the operator Sinkhorn algorithm (Gurvits 2006) for operator scaling coincides with alternating e-projection in the symmetric logarithmic derivative metric. Our result generalizes the well-known result that the Sinkhorn algorithm for matrix scaling is alternating e-projection in the Fisher metric. This is joint work with Takeru Matsuda (The University of Tokyo and RIKEN CBS).