Multivariate Gaussians and Algebraic Statistics
The Gaussian distribution plays a central role in statistics and has numerous characterizing properties; for instance, it is the maximum entropy distribution among all distributions over the real numbers with fixed mean and variance. In this talk we look at multivariate Gaussians and their discrete counterparts through the lens of algebraic statistics, introducing intrinsic algebraic varieties associated with them. This point of view has powerful implications for inference and data analysis with such models.
Information Geometry of the Otto Metric
We introduce the dual of the mixture connection with respect to the Otto metric, which represents a new kind of exponential connection. This provides a dual structure, called the Wasserstein dual structure, consisting of the mixture connection, the Otto metric as a Riemannian metric, and the new exponential connection. We derive the geodesic equation of this exponential connection, which coincides with the Kolmogorov forward equation of a gradient flow. We then derive the canonical contrast function of the introduced dual structure.
On Parameterizing Optimal Transport with Elastic Costs
In this talk I will present an overview of optimal transport computation, focusing in particular on the challenge of computing OT maps from two samples drawn from high-dimensional probability measures. After reviewing a few of the popular methods recently explored for this task, including those leveraging neural architectures, I will introduce our recent work on parameterizing OT problems with elastic costs, i.e. ground costs that mix the classic squared Euclidean distance with a regularizer (e.g. the L1 norm). After highlighting the properties of OT maps induced by such costs, I will present a method to compute ground-truth OT maps with elastic costs, as well as a method to learn the parameters of such a regularizer adaptively.
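For orientation, a generic elastic ground cost can be written as follows (notation assumed here for illustration; the talk's exact parameterization may differ):

$$ c_\gamma(x, y) \;=\; \tfrac{1}{2}\lVert x - y \rVert_2^2 \;+\; \gamma\, \tau(x - y), \qquad \text{e.g. } \tau(z) = \lVert z \rVert_1,\ \ \gamma \ge 0. $$

With the $\ell_1$ choice of $\tau$, the induced maps tend to produce sparse displacements $y - x$, one example of the structural properties alluded to above.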
Exploring quantum local asymptotic normality: Insights from information geometry
This talk explores a theory of weak local asymptotic normality in the quantum domain. This framework is applicable to a wide range of quantum statistical models that satisfy some mild regularity conditions, providing a solid foundation for asymptotic quantum statistics. Central to this theory is a novel noncommutative analogue of the Lebesgue decomposition, inspired by the symmetric logarithmic derivative e-connection in quantum information geometry.
Nonequilibrium thermodynamics based on information geometry
We discuss a formulation of nonequilibrium thermodynamics based on information geometry. Dissipation, namely the entropy production rate, is introduced via the Kullback-Leibler divergence, and the information-geometric projection yields a decomposition of the entropy production rate based on the physical role of dissipation. We also discuss a connection between optimal transport theory and information geometry from the viewpoint of nonequilibrium thermodynamics.
Information Geometry of Chemical Reaction Dynamics and Thermodynamics
Extending information geometry to dynamical systems is a long-standing challenge. Focusing on chemical reaction dynamics, a class of dynamical systems on graphs and hypergraphs, we introduce its information geometric structure.
This structure, naturally derived from the thermodynamic properties of these systems, comprises doubly dual flat structures defined respectively on the state space and the tangent-cotangent space, connected by the system's topology. We demonstrate the applicability of this structure in analyzing non-equilibrium reaction systems and controlling stochastic chemical reactions.
Information Geometry of Lévy Measures for Subordinators and Bayesian Prediction
Under certain natural requirements, the Kullback-Leibler divergence serves as the unique criterion for evaluating the performance of a predictive density. In the contexts of the normal distribution model and the Poisson distribution model, there exists a fundamental connection between Bayesian prediction and Bayesian parameter estimation under this criterion, which plays a central role in the decision theory of prediction.
Extending beyond the normal and Poisson models, in certain infinitely divisible models known as subordinators, including the gamma distribution model, Bayesian prediction corresponds not to Bayesian estimation of the model parameters but to Bayesian estimation of the Lévy measure. This presentation explores the relationship between predictive theory and the information geometry of the space of Lévy measures.
Scaling Limits of the Wasserstein information matrix on Gaussian Mixture Models
We consider the Wasserstein metric on Gaussian mixture models (GMMs), defined as the pullback of the full Wasserstein metric on the space of smooth probability distributions with finite second moment. This construction yields a class of Wasserstein metrics on probability simplices over one-dimensional bounded homogeneous lattices via a scaling limit of the Wasserstein metric on GMMs. Specifically, for a sequence of GMMs whose variances tend to zero, we prove that the limit of the Wasserstein metric exists after a suitable renormalization. Generalizations of this metric to more general GMMs are established, including inhomogeneous lattice models whose lattice gaps are not all equal, extended GMMs whose mean parameters of the Gaussian components may also vary, and a second-order metric containing higher-order information on the scaling limit. We further study Wasserstein gradient flows on GMMs for three typical functionals: potential, internal, and interaction energies. Numerical examples demonstrate the effectiveness of the proposed GMM framework for approximating Wasserstein gradient flows. This is based on joint work with Jiaxi Zhao (NUS).
Information geometry of Wasserstein statistics
Whereas the Kullback--Leibler divergence plays a central role in statistical inference and information geometry, the Wasserstein distance induces another geometric structure in statistical models through optimal transport. We examine the statistical properties of parameter estimation and prediction when the Wasserstein distance is used in place of the Kullback--Leibler divergence. We also explore the statistical interpretation of Wasserstein counterparts of information geometric concepts, such as the Wasserstein information matrix, Wasserstein score function, and the Wasserstein--Cramer--Rao inequality.
Information Geometry and Asymptotics for Kronecker Covariance Matrices
We explore the information geometry and asymptotic behavior of estimators for Kronecker-structured covariances, in both growing-n and growing-p scenarios, with a focus on the estimator proposed by Linton and Tang, which we refer to as the partial trace estimator. It is shown that the partial trace estimator is asymptotically inefficient. An explanation for this inefficiency is that the partial trace estimator does not scale sub-blocks of the sample covariance matrix optimally. To correct for this, an asymptotically efficient, rescaled partial trace estimator is introduced. Motivated by this rescaling, we introduce an orthogonal parameterization for the set of Kronecker covariances. High-dimensional consistency results for the partial trace estimator are obtained that demonstrate a blessing of dimensionality. In settings where an array has at least order three, it is shown that as the array dimensions jointly increase, it is possible to consistently estimate the Kronecker covariance matrix, even when the sample size is one.
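For concreteness (standard matrix-variate conventions assumed here, not quoted from the talk), a Kronecker-structured covariance for an observation $X \in \mathbb{R}^{p_1 \times p_2}$ takes the form

$$ \mathrm{Cov}(\mathrm{vec}\, X) \;=\; \Sigma_2 \otimes \Sigma_1, \qquad \Sigma_1 \in \mathbb{R}^{p_1 \times p_1},\ \Sigma_2 \in \mathbb{R}^{p_2 \times p_2}, $$

so the full $p_1 p_2 \times p_1 p_2$ covariance is determined by two much smaller factors; arrays of higher order replace the single Kronecker product by a product with one factor per mode.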
Distance Profiles for Random Objects and Applications to Metric Data Analysis and Conformal Inference
Under mild regularity conditions, the underlying probability measure of random objects situated in a metric measure space can be characterized by distance profiles, which correspond to the one-dimensional distributions of probability mass falling into balls of increasing radius.
Harvesting pairwise optimal transports between distance profiles leads to a measure of centrality for random objects that is useful for data analysis in metric spaces, including a novel profile MDS. In the presence of Euclidean (vector) predictors, conditional average transport costs to transport a given distance profile to all other distance profiles can serve as conditional conformity scores. In conjunction with the split conformal algorithm these scores lead to conditional prediction sets with asymptotic conditional validity. This presentation is based on joint work with Yaqing Chen (Rutgers) and Paromita Dubey (USC), and with Hang Zhou (Davis).
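A minimal computational sketch of these two ingredients, under assumptions made only for illustration (profiles are approximated by the empirical distances from each object to all others, and the pairwise transport cost is the one-dimensional 2-Wasserstein distance between equally sized sorted samples):

```python
import numpy as np

def distance_profiles(D):
    """Empirical distance profile of each object: the distances from
    object i to all other objects, viewed as a one-dimensional sample."""
    n = D.shape[0]
    return [np.sort(np.delete(D[i], i)) for i in range(n)]  # drop self-distance

def w2_profile_cost(p, q):
    """2-Wasserstein distance between two equally sized 1-D samples:
    the optimal coupling in one dimension matches sorted values."""
    return np.sqrt(np.mean((np.sort(p) - np.sort(q)) ** 2))

# toy example: random points in the plane with the Euclidean metric
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
profiles = distance_profiles(D)
avg_cost_0 = np.mean([w2_profile_cost(profiles[0], p) for p in profiles[1:]])
print(avg_cost_0)  # a small average cost would indicate a central object
```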
Including Symmetries, Geometry, PDE and complex knowledge into Learning Systems with applications
Complex real-world problems, in particular in the sciences and medicine, often come with systematically limited data. This makes training large learning systems such as deep learning models difficult, and suboptimal results may occur. Over the years, different fields have developed various techniques for embodying prior information into models, e.g. the inclusion of symmetries, equivariance, geometry, PDEs, or complex knowledge. The talk will give a selective introduction, present recent developments, and show applications to quantum chemistry and computer vision.
On the autoparallelity in classical and quantum information geometry
Autoparallelity is the condition that a submanifold is "closed" with respect to an affine connection. It is well known that autoparallelity plays important roles in the classical information geometry of the space of probability distributions; for instance, it characterizes the notions of exponential family and mixture family. We first review autoparallelity in classical information geometry from several viewpoints, and then turn our attention to the quantum information geometry of the space of density operators, where autoparallelity with respect to a non-flat connection turns out to be important due to the presence of non-vanishing torsion. We discuss some results and problems on this nontrivial autoparallelity. The content of this talk is partially based on joint work with Akio Fujiwara.
Toward Information Geometry of singular models
The dually flat structure on a Riemannian manifold introduced by Amari-Nagaoka plays a central role in information geometry. In practical applications, however, the metric is often degenerate, as with the Fisher-Rao metrics of deep neural networks, Gaussian mixtures, and hidden Markov models, and then the dually flat structure cannot be defined on the entire space. Our purpose is to establish a theory for such singular models based on the dually flat structure. In this talk, I would like to introduce our generalization of the dually flat structure, which admits a degenerate metric. In particular, the generalized Pythagorean theorem and the projection theorem are suitably reformulated in this general setup. The key to our theory is Legendre duality, which still holds between the graphs of non-convex or multi-valued potential functions.
A Unified Information Geometric Perspective on Machine Learning in Structured Spaces
I will present a unified information geometric perspective on various machine learning and data analysis approaches in structured spaces. These include matrix and tensor decomposition, mode interaction selection, data augmentation, and pattern (itemset) mining. By embedding incidence algebras from order theory into the log-linear model, we can construct flexible statistical models over partially ordered sets (posets). This approach can be viewed as a generalization of Boltzmann machines and provides a way to interpret and optimize complex data structures across a wide range of applications.
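One common way to write such a poset model is sketched below (notation assumed; the talk's formulation may differ in details): the log-probability of an outcome $x$ is a sum of parameters over its lower set,

$$ \log p(x) \;=\; \sum_{s \in S,\ s \le x} \theta(s) \;-\; \psi(\theta), $$

where $(S, \le)$ is the poset and $\psi(\theta)$ is the normalizing constant; when $S$ is the Boolean lattice of subsets of binary variables, this reduces to a fully visible, higher-order Boltzmann-machine-type log-linear model.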
Information geometry of Markov chains: structure arising from Markov-centric properties
In this talk, we will discuss the information geometry structure associated with families of stochastic matrices in the context of Markov chains. In the first part, we will provide a brief overview of the geometric framework developed in the literature, and will highlight key parallels with the established framework over distributions, particularly in terms of statistical properties such as large deviations, parameter estimation, and hypothesis testing. In the second part of the talk, we will shift our focus to properties that are not pertinent to iid processes but are well-established for Markov chains. Specifically, Markov processes enjoy a far richer structure than their iid counterparts, and we will explore how additional Markov-centric properties translate into geometric features of the corresponding families of stochastic matrices. In particular, we will emphasize two properties of key interest to statisticians: first, time-reversibility, which is assumed in many physical processes and used in various computer science algorithms; second, lumpability, which is a natural method for compressing the state space of a Markov chain.
Bregman-Wasserstein divergence: geometry and application
The Bregman-Wasserstein divergence is the optimal transport cost between probability distributions when the cost function is a Bregman divergence. In the first part of the talk we study the generalized dualistic structure it induces on the space of probability distributions, thus showing some deep connections between information geometry and optimal transport. In the second part we turn to numerical methods. We introduce the Bregman-Wasserstein JKO scheme and apply it to Riemannian Wasserstein gradient flows and distributionally robust optimization. Based on joint works with Cale Rankin and Amanjit Kainth.
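In symbols (restating the first sentence above, with notation assumed), for a convex potential $\phi$ with Bregman divergence $D_\phi$, the Bregman-Wasserstein divergence between $\mu$ and $\nu$ is the transport cost

$$ B_\phi(\mu, \nu) \;=\; \inf_{\pi \in \Pi(\mu, \nu)} \int D_\phi(x, y)\, d\pi(x, y), \qquad D_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla \phi(y),\, x - y \rangle, $$

which recovers the squared 2-Wasserstein distance when $\phi(x) = \tfrac12 \lVert x \rVert^2$.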
Transfer learning for Gaussian Processes
We explore transfer learning in the context of Gaussian processes (GPs), focusing on a scenario with one target GP and multiple source GPs. The target GP, which may lack precision, is projected onto the space spanned by the source GPs to enhance its accuracy by leveraging the common features present in the source GPs. To achieve this, we first introduce a definition of the Kullback-Leibler (KL) divergence between two GPs, calculated as the average KL divergence between their respective Gaussians over N sample points, estimated using Monte Carlo integration. We then propose an algorithm for projecting the target GP onto the space of source GPs by finding a mixture of the source GPs that minimizes the KL divergence with the target GP. From an information-geometrical perspective, the resulting mixture is consistent only when using an m-mixture. Therefore, it is natural to employ the e-projection from the target GP.
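A minimal sketch of the divergence just described, under assumptions made for illustration (each GP is summarized at the sample points by its pointwise posterior mean and variance, and the univariate Gaussian KL is averaged over those points; the projection onto mixtures of source GPs is not reproduced here):

```python
import numpy as np

def kl_gauss1d(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for univariate Gaussians."""
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def kl_between_gps(mean_f, var_f, mean_g, var_g, xs):
    """Monte Carlo-style KL between two GPs: average the pointwise
    Gaussian KL over the sample locations xs."""
    return np.mean(kl_gauss1d(mean_f(xs), var_f(xs), mean_g(xs), var_g(xs)))

# toy usage with hypothetical posterior mean/variance functions
xs = np.random.default_rng(1).uniform(-3.0, 3.0, size=200)
target = (lambda x: np.sin(x), lambda x: 0.1 + 0.0 * x)
source = (lambda x: 0.9 * np.sin(x), lambda x: 0.2 + 0.0 * x)
print(kl_between_gps(*target, *source, xs))
```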
An operator-theoretic dimension reduction of generative models
In a score generative model (Song & Ermon 2019) for a target probability measure, a stochastic process started with samples from the target is approximately reversed using learned ``scores'' (gradients of the log densities) of the former stochastic process. Both the forward and backward processes operate on densities in high dimensions, even if the target measure has low-dimensional support. Score learning and the backward process can exhibit numerical instabilities when the target is singular but has absolutely continuous low-dimensional conditionals. In this work, we view the generating process (the backward/reversed process) as a dynamical system that transforms a probability density (reached by the forward process) into a density close to the target measure. We derive a reduced dynamical system by lifting to an infinite-dimensional feature space and approximating the stochastic Koopman/Kolmogorov backward operator. This approach provides a mechanistic understanding of slowly varying versus fast scales in generative models. Moreover, it leads to a tractable algorithm that replaces unstable score learning schemes with stable reduced deterministic dynamics (of coefficients in a chosen basis). We prove convergence guarantees for such a scheme for a class of targets with absolutely continuous conditionals on an (unknown) lower-dimensional manifold.
Convergence Properties of Natural Gradient Descent for Minimizing KL Divergence
The Kullback-Leibler (KL) divergence is a canonical loss function for training probabilistic machine learning models. The natural gradient is a Riemannian gradient that incorporates the geometry of the parameter space. We analyze the optimization dynamics of minimizing the KL divergence with respect to two common parametrizations of the probability simplex: the exponential family representation ($\theta$-coordinates) and the mixture family representation ($\eta$-coordinates). We compare Euclidean gradient descent (GD) in the dual coordinates with the coordinate-free natural gradient descent (NGD). In continuous time, we prove that the convergence rates of GD in the $\theta$- and $\eta$-coordinates form lower and upper bounds, respectively, on the convergence rate of NGD. Furthermore, affine transformations of the dual coordinates allow for arbitrary scaling of these convergence rates. The fast convergence of NGD typically reported in the literature is thus not immediately evident in continuous time. In contrast, NGD's superiority becomes transparent in discrete time, where it achieves faster convergence and demonstrates greater robustness to noisy gradients, outperforming GD. The analysis hinges on bounding the condition number of the Hessian of the loss function evaluated at the optimum, which in the case of the KL divergence coincides with the Fisher information matrix.
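The comparison lends itself to a small numerical sketch (assumptions: a categorical model in minimal exponential-family $\theta$-coordinates, loss $\mathrm{KL}(q \,\|\, p_\theta)$, and the Fisher matrix taken as the Hessian of the log-partition function), contrasting a Euclidean gradient step in $\theta$ with a natural gradient step:

```python
import numpy as np

def softmax_minimal(theta):
    """Categorical distribution in minimal theta-coordinates (K-1 parameters)."""
    z = np.concatenate([np.exp(theta), [1.0]])
    return z / z.sum()

def grad_and_fisher(theta, q):
    """For KL(q || p_theta) in natural coordinates, the gradient is
    eta(theta) - eta_q and the Fisher matrix is the Hessian of the
    log-partition function: diag(p) - p p^T (first K-1 entries)."""
    p = softmax_minimal(theta)
    g = p[:-1] - q[:-1]
    F = np.diag(p[:-1]) - np.outer(p[:-1], p[:-1])
    return g, F

def kl(q, p):
    return np.sum(q * np.log(q / p))

q = np.array([0.6, 0.3, 0.1])        # target distribution
lr = 0.5
for method in ("euclidean", "natural"):
    theta = np.zeros(2)              # start at the uniform distribution
    for _ in range(50):
        g, F = grad_and_fisher(theta, q)
        theta -= lr * (g if method == "euclidean" else np.linalg.solve(F, g))
    print(method, kl(q, softmax_minimal(theta)))
```

In this toy setting the Hessian of the loss equals the Fisher matrix everywhere, so the natural step behaves like a Newton step in the $\theta$-coordinates, and the discrete-time iteration above reduces the loss markedly faster than the Euclidean one.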
Applications of toric geometry to the Bregman divergence
In this talk, we consider a special class of convex polytopes called Delzant polytopes, each of which carries a natural dually flat structure. There exists a geometric correspondence, called the Delzant construction, which gives a bijection between the set of all Delzant polytopes and toric symplectic manifolds. A toric symplectic manifold is a symplectic manifold equipped with a "nice" torus action and can be seen as a singular Lagrangian fibration. Each toric symplectic manifold carries a compatible torus-invariant complex structure and Riemannian metric, namely a torus-invariant Kähler structure. The Delzant construction can be understood as a variant of Dombrowski's construction and Hsu's construction, which give a correspondence between dually flat manifolds and Kähler manifolds.
We give applications of the Delzant construction to the Bregman divergence of a Delzant polytope. More precisely, we define a natural extension of the divergence to the boundary of the polytope and extend the generalized Pythagorean theorem and the projection theorem to include boundary points. This extension may be useful for handling zero probabilities in parameter estimation and related areas. This talk is based on the paper: Fujita, H., "The generalized Pythagorean theorem on the compactifications of certain dually flat spaces via toric geometry," Info. Geo. 7, 33–58 (2024).
Converse Coding Theorems for Distributed Hypothesis Testing and Its Strong Connection with Information Geometry
Statistical inference under the framework of distributed source coding was posed by Berger in 1979. Although some important partial solutions have been obtained through intensive research, the problem remains generally unresolved. In this paper, we consider the hypothesis testing problem based on output data from two distributed encoders. We deal with the case where one of the two distributed encoders is an identity map. In this case we explicitly derive an upper bound on the optimum power exponent. This upper bound, together with some analytical arguments, yields the converse coding theorem for the multiterminal zero-rate hypothesis testing problem. Information geometry for the joint distribution p_{XY} of the correlated discrete random pair (X,Y) plays a substantial role in deriving the upper bound.
Efficiency of the Method of Generalized Moments from the Viewpoint of Information Geometry
The Generalized Method of Moments (GMM) is a standard econometric method for dealing with the over-identification problems caused by endogeneity of the model. In this method, an appropriate positive definite matrix is employed as a weight matrix, and solving the high-dimensional system of simultaneous equations is replaced by minimizing the quadratic form defined by the weight matrix. Chamberlain (1987) found an optimal weight matrix selection that allows the GMM estimator to achieve the semiparametric efficiency bound.
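In the usual notation (assumed here), with a moment function $g(X, \theta) \in \mathbb{R}^m$ satisfying $\mathbb{E}[g(X, \theta_0)] = 0$ and $m$ exceeding the dimension of $\theta$, the GMM estimator solves

$$ \hat\theta \;=\; \arg\min_{\theta}\ \bar g_n(\theta)^{\top} W\, \bar g_n(\theta), \qquad \bar g_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} g(X_i, \theta), $$

and the optimal weighting referred to above takes $W$ proportional to the inverse of the moment covariance $\Omega = \mathbb{E}\bigl[g(X, \theta_0)\, g(X, \theta_0)^{\top}\bigr]$.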
In this paper, we understand the GMM estimation from the viewpoint of information geometry. Two dually flat structures of the GMM model are proposed: one is based on the metric given by the weight matrix, and the other is based on the co-metric obtained from the asymptotic normality of the GMM estimator. They are generally not equivalent, but this paper shows that they coincide if and only if the optimal weight matrix is chosen.
The result implies that we can illustrate the efficiency of the GMM estimation by a Pythagorean theorem between the two metrics derived from the criterion function of the estimation and from the asymptotic distribution of the estimator. This insight also has some implications for estimation efficiency in other methods, such as least squares or maximum likelihood.
Statistical transformation models and $\alpha$-geodesic flows
This talk deals with the geometrical and dynamical aspects of the geodesic flows of the $\alpha$-connections associated with statistical transformation models. A statistical transformation model consists of a smooth sample manifold with a Lie group action. A family of invariant probability measures induces the Fisher-Rao metric and the Amari-Chentsov cubic tensor over the Lie group, which are both left-invariant. In the talk, the general framework of the geodesic flows associated with the $\alpha$-connections is reviewed. Then, various examples are explained in detail, such as the family of multivariate normal distributions and a class of models parameterized by compact semi-simple Lie groups. Geometrically, certain sub-Riemannian structures are linked to the geodesic flows. From the dynamical point of view, the dynamical properties of equilibrium points of the $\alpha$-geodesic flows are analyzed.
Harmonic exponential families on homogeneous spaces
Exponential families play an important role in information geometry, statistics, and machine learning. By definition, there are infinitely many exponential families; however, only a small number of them are widely used. Our goal is to give a framework for dealing with these "good" families systematically. In light of the observation that the sample spaces of most of them are homogeneous spaces of certain Lie groups, we propose a method to construct exponential families on homogeneous spaces by taking advantage of the representation theory of Lie groups. This method generates widely used exponential families such as the normal, gamma, Bernoulli, categorical, Wishart, von Mises-Fisher, and hyperboloid distributions. In this talk, we will explain the method and its properties. This talk is based on joint work with Taro Yoshino.
The differential structure shared by probability and moment matching priors on non-regular statistical models via the Lie derivative
In Bayesian statistics, the selection of noninformative priors is a crucial issue. There have been various discussions on the theoretical justification of, and problems with, the Jeffreys prior, as well as alternative objective priors. Among them, we focus on two types of matching priors consistent with frequentist theory: the probability matching priors and the moment matching priors. In particular, there seems to be no clear relationship between these two matching priors on non-regular statistical models, even though they have similar objectives.
Considering information geometry on a one-sided truncated exponential family, a typical example of non-regular statistical models, we obtain the result that the Lie derivative along one vector field provides the conditions for the probability and moment matching priors. Note that this Lie derivative does not appear in regular models. This result promotes a unified understanding of probability and moment matching priors on non-regular models. Further, we discuss the relationship between the probability and moment matching priors and the α-parallel priors.
Continual learning on curved statistical manifolds
Continual learning (CL) is a common challenge in machine learning, where the goal is to sequentially learn a stream of tasks while avoiding overwriting previously learned ones. Probabilistic approaches provide a principled framework for CL by balancing prior knowledge retention and new task adaptation. Sequential learning can be improved by maximizing the evidence lower bound, which involves minimizing the Kullback-Leibler (KL) divergence between the previous and current task distributions. In the probabilistic machine learning literature, the family of distributions used to describe the model parameters is typically the exponential family. Exponential families and the KL divergence are associated with dually flat geometric structures. In this work, we go beyond dual flatness and generalize probabilistic CL using the recently developed theory of the $\lambda$-exponential family and the $\lambda$-logarithmic divergence, recovering the exponential family and the KL divergence when $\lambda$ goes to zero. This family of distributions and divergences is closely related to the Rényi and Tsallis divergences and the q-exponential family. The $\lambda$-exponential family has been shown to better capture higher-order interactions, which we hypothesize can occur between tasks in sequential learning. In summary, we analyze the effect of curvature, captured by $\lambda$, in the setting of continual learning.
Exploring unrecognized Markov-invariant flat structures on denormalized state spaces
This talk explores the extension of the Fisher metric and α-connections from the probability simplex to the denormalized state space, while maintaining Markov invariance (as discussed in [1], [2, p.47], and [3]). I will introduce a previously unexamined class of Markov-invariant flat structures on denormalized state spaces, derived from a novel solution to the differential equation characterizing flatness. In these newly defined structures, the α-affine coordinate systems of denormalized state spaces are rescaled by a factor of 1/τ, where τ is the denormalization function introduced in Eq. (2.66) of [2]. This study also addresses a fundamental question: why is the probability simplex classified as a (-1)-autoparallel submanifold instead of a (+1)-autoparallel submanifold in the denormalized state space? Notably, under the proposed structures, the probability simplex emerges as a (+1)-autoparallel submanifold, offering an alternative perspective on the (±1)-dually flat nature of the probability simplex.
[1] L. L. Campbell, Information Sciences 35 (1985) 199-210.
[2] S. Amari and H. Nagaoka, Methods of Information Geometry (AMS and Oxford University Press, 2000).
[3] A. Fujiwara, Lecture notes at Hokkaido University (in Japanese, unpublished, 2016).
Any Kähler metric is a Fisher information metric
The Fisher information metric, or Fisher-Rao metric, is a natural Riemannian metric defined on a parameterized family of probability density functions. As in Riemannian geometry, we can define a distance in terms of the Fisher information metric, called the Fisher-Rao distance. The Fisher information metric has a wide range of applications in estimation and information theory. Indeed, it provides the most informative Cramer-Rao bound for an unbiased estimator. The Goldberg conjecture is a well-known unsolved problem stating that any compact Einstein almost Kähler manifold is necessarily Kähler-Einstein. Note that there is also a known odd-dimensional analog of the Goldberg conjecture in the literature. The main objective of this presentation is to establish a new characterization of coKähler manifolds and Kähler manifolds; our characterization is statistical in nature. We report a generalization of Kobayashi's theorem, and then we confront our statistical characterization of Kähler manifolds with the integrability condition of S. Goldberg.
Finally, we corroborate that every Kähler and coKähler manifold can be viewed as a parametric family of probability density functions, and that Kähler and coKähler metrics can be regarded as Fisher information metrics.
The generalized maximum q-work formulation based on information geometry
The maximum work formulation of the second law of thermodynamics has been generalized to transitions between nonequilibrium states. The generalization involves the Kullback–Leibler divergence between nonequilibrium states and canonical states. The Kullback–Leibler divergence scaled by the temperature of the canonical state quantifies the work available for extraction from the nonequilibrium state. This scaled Kullback–Leibler divergence can be interpreted as an energy-dimensional divergence in information geometry. The generalized Pythagorean theorem relating three energy-scaled divergences, which we interpret as squares of the thermodynamic distance, gives a geometrical interpretation of the generalized maximum work formulation [1]. Our argument can be extended to Tsallis q-statistics: the Amari--Ohara normalized q-divergence [2] is scaled by the temperature of the q-canonical state, and the generalized Pythagorean theorem relating three energy-dimensional normalized q-divergences gives the generalized maximum q-work formulation and its geometric structures [3].
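A minimal sketch of the classical ($q = 1$) relation described above, assuming the standard nonequilibrium free-energy form: for a transition from a nonequilibrium state $\rho_i$ to a state $\rho_f$ in contact with the canonical state $\rho_{\mathrm{eq}}$ at temperature $T$,

$$ W_{\max} \;=\; k_{\mathrm{B}} T \bigl[ D_{\mathrm{KL}}(\rho_i \,\|\, \rho_{\mathrm{eq}}) - D_{\mathrm{KL}}(\rho_f \,\|\, \rho_{\mathrm{eq}}) \bigr], $$

so each temperature-scaled divergence plays the role of a squared thermodynamic distance in the Pythagorean relation mentioned above.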
[1] T. Nakamura, H. Hasegawa, and D. Driebe, J. Phys. Comm. 3 (2019) 015015.
[2] S. Amari and A. Ohara, Entropy 13 (2011) 1170.
[3] T. Nakamura, Ph.D. thesis, 2021.
Fisher Information Degeneracy and Symmetry Breaking in Diffusion Models: A Non-Equilibrium Phase Transition Perspective
Our research focuses on diffusion models, a class of generative AI systems that learn the data generation process by iteratively adding noise to the data and then reversing this diffusion process. Both the forward diffusion and the reverse processes can be described by the corresponding Fokker-Planck equations. We propose a novel connection between the fixed points of these Fokker-Planck equations and the critical moments at which Fisher information degenerates over time. Our work extends the theoretical framework established by Gabriel Raya and Luca Ambrogioni (2023), who linked the fixed points of the Fokker-Planck equations to moments of spontaneous symmetry breaking in potential functions. This research offers a pathway to understanding phase transitions in non-equilibrium systems, analogous to the role of Landau theory in equilibrium systems.
An ordinary differential equation for entropic optimal transport and its linearly constrained variants
It is well known that information geometry provides a connection between the Wasserstein distance and Kullback-Leibler (KL) divergence through entropically regularized optimal transport. In this work, we characterize the solution to the entropically regularized optimal transport problem via a well-posed ordinary differential equation (ODE). This ODE describes a continuous interpolation between the classical optimal transport problem (linked to the Wasserstein distance) and the fully regularized problem (associated with KL divergence). Our approach is broadly applicable to discrete marginals and general cost functions, extending to multi-marginal problems and those with additional linear constraints. The formulation of the ODE also allows one to compute derivatives of the optimal cost when the ODE parameter is 0, corresponding to the fully regularized limit problem in which only the entropy is minimized.
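For orientation (standard formulation, assumed rather than quoted from the talk), the entropically regularized problem for discrete marginals $\mu, \nu$ and cost matrix $C$ reads

$$ \min_{\pi \in \Pi(\mu, \nu)}\ \langle C, \pi \rangle \;+\; \varepsilon\, \mathrm{KL}(\pi \,\|\, \mu \otimes \nu), $$

which approaches the classical OT problem as the regularization vanishes and, in the fully regularized limit, the problem of minimizing the KL term alone; the ODE described above traces the optimal coupling continuously between these two regimes.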
Matrix realizations of transformation exponential families
A transformation family is a statistical model consisting of probability measures generated from a single measure by a group action. In this presentation, we discuss a transformation family that is at the same time an exponential family. We shall realize such a family as a generalization of the Wishart laws.
Statistical manifold with degenerate metric
A statistical manifold is a pseudo-Riemannian manifold endowed with a Codazzi structure. This structure plays an important role in information geometry and related fields; e.g., a statistical model admits this structure with the Fisher--Rao metric. In practical applications, however, the metric may be degenerate, and this geometric structure then no longer applies in full. In this study, for such cases, we introduce the notion of a quasi-Codazzi structure, which consists of a possibly degenerate metric (i.e., a symmetric (0,2)-tensor) and a pair of coherent tangent bundles with connections. This can be thought of as an affine differential geometry of Lagrange subbundles of para-Hermitian vector bundles. As a special case, the quasi-Codazzi structure with flat connections coincides with the quasi-Hessian structure previously studied by Nakajima--Ohmoto.
Moduli spaces of left-Invariant statistical structures, dually-flatness and conjugate symmetries
In the context of information geometry, a concept known as left-invariant statistical structures on Lie groups has been defined by Furuhata–Inoguchi–Kobayashi [Inf.Geom.(2021)]. In this talk, we introduce the classification of important classes called “dually flat” and “conjugate symmetric” within left-invariant statistical structures on commutative Lie groups $\mathbb{R}^n$ and certain two types of almost abelian Lie groups. This talk is based on joint work with Yu Ohno (Hokkaido University), Takayuki Okuda (Hiroshima University), and Hiroshi Tamaru (Osaka Metropolitan University).
Information geometry of operator scaling
Matrix scaling is a classical problem with a wide range of applications. It is known that the Sinkhorn algorithm for matrix scaling can be interpreted as alternating e-projections from the viewpoint of classical information geometry. Recently, a generalization of matrix scaling to completely positive maps, called operator scaling, has been found to appear in various fields of mathematics and computer science, and the Sinkhorn algorithm has been extended to operator scaling. In this study, the operator Sinkhorn algorithm is analyzed from the viewpoint of quantum information geometry through the Choi representation of completely positive maps. The operator Sinkhorn algorithm is shown to coincide with alternating e-projections with respect to the symmetric logarithmic derivative metric, a Riemannian metric on the space of quantum states relevant to quantum estimation theory.
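For reference, a minimal sketch of the classical matrix case that the operator version generalizes (alternating row and column normalizations of a positive matrix toward prescribed marginals; reading each normalization as an e-projection is the information-geometric interpretation mentioned above):

```python
import numpy as np

def sinkhorn_matrix_scaling(A, r, c, n_iter=500):
    """Scale a positive matrix A to diag(u) @ A @ diag(v) whose row sums
    are r and column sums are c, by alternating marginal normalizations."""
    u = np.ones(A.shape[0])
    v = np.ones(A.shape[1])
    for _ in range(n_iter):
        u = r / (A @ v)        # fix the row marginals
        v = c / (A.T @ u)      # fix the column marginals
    return np.diag(u) @ A @ np.diag(v)

A = np.random.default_rng(0).uniform(0.1, 1.0, size=(4, 4))
B = sinkhorn_matrix_scaling(A, r=np.full(4, 0.25), c=np.full(4, 0.25))
print(B.sum(axis=1), B.sum(axis=0))  # both close to the prescribed marginals
```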
Quantum natural gradient without monotonicity
Natural gradient (NG) is an information-geometric optimization method that plays a crucial role, especially in the estimation of parameters for machine learning models such as neural networks. To apply NG to quantum systems, the quantum natural gradient (QNG) was introduced and utilized for noisy intermediate-scale quantum devices. Additionally, a mathematically equivalent approach to QNG, known as the stochastic reconfiguration method, has been implemented to enhance the performance of quantum Monte Carlo methods. It is worth noting that these methods are based on the symmetric logarithmic derivative (SLD) metric, which is one of the monotone metrics. So far, monotonicity has been regarded as a guiding principle for constructing a geometry in physics. In this paper, we propose a generalized QNG by removing the monotonicity condition. We first demonstrate that monotonicity is a crucial condition for the conventional QNG to be optimal. We then provide analytical and numerical evidence showing that non-monotone QNG outperforms the conventional QNG based on the SLD metric in terms of convergence speed.
On Partitioning of Goodness-of-fit statistics for Symmetry in contingency tables from the viewpoint of Information Geometry
Numerical and asymptotic partitioning of goodness-of-fit statistics has been considered for numerous models in contingency tables. In this talk, we focus on partitioning goodness-of-fit statistics for symmetry in contingency tables. The symmetry model is also doubly flat: it can be represented as the intersection of an e-flat submodel (e.g., the conditional symmetry or quasi-symmetry model) and an m-flat submodel (e.g., the global symmetry or marginal homogeneity model) orthogonal to it, so that the Wald test statistic for the symmetry model can be exactly partitioned into the Wald test statistics for these submodels.
On the other hand, there are very few models for which the likelihood ratio test statistic for the symmetry model can be exactly partitioned. We reconsider the conditions on submodels for the exact partitioning of the likelihood ratio test statistic for symmetry from the viewpoint of information geometry.
Geometric Marginal Homogeneity in Compositional Tables Based on Simplicial Geometry
Compositional data represent parts of a whole, constrained to sum to a constant. Square compositional tables arise when a whole is divided by the cross-classification of the same two factors. We introduce geometric marginal homogeneity (GMH) for square compositional tables using Aitchison geometry on a simplex. The study demonstrates that GMH tables form a subspace in a simplex and develops methods for orthogonal projections onto this subspace. These projections enable orthogonal decomposition of compositional tables into GMH and independent-skew-symmetric components. To quantify the departure from GMH, we propose a measure based on the Aitchison distance between the original table and its GMH projection. This measure provides a scalar value representing the overall geometric marginal heterogeneity in the table. For a more detailed analysis, we introduce the concept of a geometric marginal heterogeneity array. This array visualizes cell-wise contributions to the overall heterogeneity, offering insights into specific patterns of departure from the GMH. The geometric marginal heterogeneity array allows for the identification of parts or categories that contribute most significantly to the lack of GMH, enhancing interpretability of the results.
The average distance function as a characterization of probability functions
Probability functions are widely used in data analysis and machine learning. For each value or range of the inspected random variable, they give the probability or relative frequency of an observation of that magnitude. Here we propose the average distance function to further characterize probability distributions. This new function has the same support as the probability function from which it is derived. All points of the same magnitude are piled up in the same bin, and points in different bins are compared in terms of the distance between the bins; characterizing this comparison forms the basis of the average distance function. The average distance function is convex and can be approximated by a second-degree polynomial. The parameters of this polynomial define a new space, and describing probability functions in this space offers several advantages, such as the possibility of comparing distributions efficiently by focusing on the parameters that define the approximating polynomial. We show that our proposal is relevant for grasping aspects of data distributions that are usually overlooked when inspecting probability functions. The results of applying this algorithm to several well-known distributions are presented.
On the attainment of the Wasserstein-Cramer-Rao lower bound
In information geometry, the Fisher information is regarded as a Riemannian metric that defines the local distance structure on the space of probability distributions. It also gives a lower bound on the variance of (unbiased) estimators in the Cramer-Rao inequality. On the other hand, it has recently been reported that the Wasserstein distance, which is the optimal transportation cost between distributions, induces another Riemannian metric, and that an analogous inequality called the Wasserstein-Cramer-Rao inequality holds. The Wasserstein metric is obtained explicitly in statistical models on the real line by parametrizing the bijection that gives the push-forward measure instead of parametrizing the probability density directly. This parametrized bijection also gives the metric in multivariate models whose copulas do not depend on the parameter.
Considering this parametrization, a necessary and sufficient condition for estimators to attain the Wasserstein-Cramer-Rao lower bound is obtained.
Furthermore, if the statistical model is a location-scale family, estimators of the mean and variance asymptotically attain the lower bound.
f-divergence based modeling of asymmetric structures in multi-way ordinal contingency tables
This study presents an f-divergence-based approach for modeling asymmetric structures that capture linear relationships in multi-way ordinal contingency tables. Our framework extends traditional asymmetry models, offering greater flexibility in capturing complex dependence patterns while preserving key symmetry properties. We establish theoretical foundations, including new decomposition theorems for symmetry models and asymptotic properties of test statistics. The methodology's adaptability to various divergence measures allows us to capture previously undetectable asymmetries in multivariate categorical data, enhancing model interpretation and goodness-of-fit.
Shrinkage priors for models with circulant correlation structure
We construct shrinkage priors for Bayesian prediction of multivariate Gaussian models with unknown covariance matrices that have circulant correlation structure. The focus of our method is on shrinkage of non-eigenvalue components of covariance matrices. We propose shrinkage priors that asymptotically dominate Jeffreys prior with respect to the Kullback--Leibler risk.
Passive BCI for Dementia Prediction Using Path Signature and Riemannian Geometry Classifier
Passive brain-computer interface (pBCI) is a neurotechnology application focused on assessing brain health, specifically detecting neurodegenerative processes and monitoring non-pharmacological interventions. Like traditional BCI machine learning applications, pBCI encounters challenges related to EEG noise and non-stationarity. In this initial study, we introduce an application based on the path signature to address the issue of noisy EEG. The path signature is a collection of iterated integrals computed from multidimensional paths, here used to model EEG recordings. It is invariant under translation and time reparametrization, making it a reliable feature for analyzing multichannel EEG time series. Combined with the geometric structure of symmetric positive definite (SPD) matrices and a gold-standard Riemannian classifier from BCI, it offers promising possibilities for further exploration and analysis. Preliminary experiments target mild cognitive impairment (MCI) in elderly individuals, a significant predictor of potential dementia, and aim to create digital biomarkers by modeling the underlying neurodegenerative brain mechanisms and exploring the lead-lag relationships captured by the path signature. The initial results broaden the scope of machine learning for socially relevant applications by exploring geometric features of the negative square of the lead matrices constructed from the second-level signature and by using a regularization term to obtain SPD matrices, which are then supplied to Riemannian classifiers.
Toward Information Geometric Mechanics
The work presented in this talk uses ideas from semidefinite programming and information geometry to efficiently simulate gas dynamics in the presence of shock waves. The latter cause severe numerical challenges for classical and learning-based solvers.
The talk begins by observing that shock formation arises from the deformation map reaching the boundary of the manifold of diffeomorphisms. This motivates using the log-determinant barrier function of semidefinite programming to modify the geometry of the manifold such that the deformation map approaches but never reaches its boundary. This information geometric regularization (IGR) preserves the original long-time behavior without forming singular shocks, greatly simplifying numerical simulation. The modified geometry on the diffeomorphism manifold is also the information geometry of the mass density. I will show how this observation motivates information geometric mechanics that views the solutions of continuum mechanical equations as parameters of probability distributions to be evolved on a suitable information geometry, promising far-reaching extensions of IGR.
On the comparison between the minimum information copulas under fixed rank correlations
Copulas have become very popular as statistical models for representing dependence structures between multiple variables in many applications. Given a finite number of constraints in advance, the minimum information copula is the copula closest to the uniform (independence) copula as measured by the Kullback-Leibler divergence. For these constraints, expectations of moments such as Spearman's rho have mostly been considered in previous research, and the resulting copulas are obtained as optimal solutions of convex programs. In contrast, we present a novel minimum information copula in which Kendall's τ is fixed to a given constant, and we further show that this copula is identical to the Frank copula.
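Schematically (a generic form, assumed for illustration), a minimum information copula density $c$ on $[0,1]^2$ solves

$$ \min_{c}\ \int_{[0,1]^2} c(u, v) \log c(u, v)\, du\, dv \quad \text{subject to} \quad \int \phi(u, v)\, c(u, v)\, du\, dv = \alpha \ \text{ and uniform marginals}, $$

i.e. it minimizes the KL divergence to the independence copula under the given constraint; fixing Kendall's τ, as in the talk, is not a linear constraint of this form.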
An explicit formula for Hessian potentials on warped product manifolds
We study Hessian geometry on warped product manifolds. In particular, we consider the case where the fiber space of the warped product is a Hessian manifold and the base space is an interval. For a given Hessian potential on the fiber space and some warping functions, we find an explicit formula for a Hessian potential and the affine coordinate systems on the warped product manifold. Moreover, we consider the obtained formula from the viewpoint of equiaffine differential geometry.
Information-geometric analysis of human EEG data
Electroencephalography (EEG) is a non-invasive and cost-effective method for recording the brain's electrical activity, making it a valuable tool for studying brain function and diagnosing neurological disorders. However, previous studies have not fully explored the information carried by different interactions between EEG channels. In this study, we performed an analysis utilizing independent component analysis (ICA) and information geometry (IG). After pre-processing the EEG signals by ICA, we binarized them in several ways and computed IG measures. We applied the method to publicly available EEG data recorded from human subjects performing a motor imagery task. By calculating the mutual information between different task states and the IG measures, we confirmed that our method successfully identified the channel pairs carrying a significant amount of information. Currently, we are analyzing the information carried by the third-order IG interactions and examining the temporal dynamics of the mutual information over the duration of the trials.
Statistical manifolds with Divisible Cubic Form
Statistical manifolds with a divisible cubic form appear in the theory of weighted Riemannian manifolds. The terminology comes from affine differential geometry, where statistical manifolds originated. In Riemannian geometry, geodesic connectedness holds on connected complete Riemannian manifolds by the Hopf-Rinow theorem. However, this geodesic connectedness does not hold for general geodesically complete affine connections (even if the affine connection is the Levi-Civita connection of some semi-Riemannian metric!). In this presentation, we will see that on a statistical manifold, if the cubic form is divisible, then geodesic completeness of the affine connection implies geodesic connectedness. The relation between the canonical divergence and geodesics on a statistical manifold with a divisible cubic form will also be revealed, leading to theorems that determine the topology of the manifold.
Point Cloud Registration via Gaussian Mixture Model Embedding in Symmetric Positive Definite Manifolds
Gaussian mixture models (GMMs) are important tools in machine learning, signal processing, computer vision, etc., due to their ability to approximate any smooth density. However, measuring dissimilarity between GMMs poses significant challenges. The Kullback-Leibler (KL) divergence, a standard measure for probability distributions, has no closed-form expression for GMMs. This limitation has led to various approximations, including KL lower and upper bounds, and has motivated the development of novel distances with closed-form expressions for GMMs, such as the statistical squared Euclidean distance, the Jensen-Rényi divergence, and the Cauchy-Schwarz divergence. In an earlier work, we embedded K-component GMMs into the manifold of symmetric positive definite (SPD) matrices and obtained a closed-form formula for the distance. This distance is a lower bound for the Fisher-Rao metric. The Riemannian metric on the embedded manifold is computationally efficient and performs better than many of the existing distances. In this paper, the Riemannian metric on the embedded manifold is applied to point cloud registration. Our procedure involves fitting GMMs to point clouds, embedding them into SPD matrices, computing the Riemannian metric, and iteratively transforming one point cloud for optimal alignment. This method shows better performance than existing techniques, providing efficient tools for GMM-based applications.
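As an illustrative fragment (not necessarily the embedding used in this work), a single Gaussian component $(\mu, \Sigma)$ can be mapped to an SPD matrix, and SPD matrices can be compared with the affine-invariant Riemannian distance:

```python
import numpy as np
from scipy.linalg import fractional_matrix_power, logm

def gaussian_to_spd(mu, sigma):
    """Embed a Gaussian N(mu, sigma) on R^d into a (d+1)x(d+1) SPD matrix.
    (One standard choice; the paper's GMM embedding may differ.)"""
    d = mu.shape[0]
    M = np.empty((d + 1, d + 1))
    M[:d, :d] = sigma + np.outer(mu, mu)
    M[:d, d] = mu
    M[d, :d] = mu
    M[d, d] = 1.0
    return M

def affine_invariant_distance(A, B):
    """d(A, B) = || log(A^{-1/2} B A^{-1/2}) ||_F on the SPD manifold."""
    A_inv_sqrt = fractional_matrix_power(A, -0.5)
    return np.linalg.norm(logm(A_inv_sqrt @ B @ A_inv_sqrt), "fro")

A = gaussian_to_spd(np.array([0.0, 0.0]), np.eye(2))
B = gaussian_to_spd(np.array([1.0, 0.5]), 1.5 * np.eye(2))
print(affine_invariant_distance(A, B))
```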
Weyl’s gauge symmetry on the gradient-flows in information geometry
In our previous studies [1] on gradient flows in information geometry (IG), we have shown that the pre-geodesic equations associated with the gradient flows in IG are related to the general autoparallel equations in Weyl integrable geometry, in which Weyl's gauge symmetry plays a significant role.
In this contribution, starting from an action that is invariant under the Weyl gauge transformation, it is shown that the gradient-flow equations in IG can be derived. As a result, the Weyl gauge transformations relate an α-connection to the Levi-Civita connection on the Riemannian manifold equipped with a conformal metric consisting of the Fisher metric and a scalar function. These results suggest a deep connection to scalar-tensor theories of gravitation, e.g. the Brans-Dicke theory [2].
References
1) T. Wada, A.M. Scarfone, Eur. Phys. J. B 97, 103 (2024); S. Chanda, T. Wada, Int. J. Geom. Methods Mod. Phys. 21, 2450098 (2023); T. Wada, A.M. Scarfone, H. Matsuzoe, Int. J. Geom. Methods Mod. Phys. 20, 2450012 (2023).
2) C. Brans and R. Dicke, Mach's Principle and a Relativistic Theory of Gravitation, Phys. Rev. 124, 925 (1961).
Information geometry from the coarse viewpoint
We introduce a quantitatively weak version of sufficient statistics such that the Fisher metric of the induced parameterized measure model is bi-Lipschitz equivalent to the Fisher metric of the original model. We characterize such statistics in terms of the conditional probability, or by the existence of a certain decomposition of the density function, in a way similar to the characterizations due to Ay-Jost-L\^e-Schwachh\"ofer and Fisher-Neyman for sufficient statistics.
Spectral Rényi divergence: properties in statistics and optimization / Information geometrical structure of determinantal point process
(1) We study a specific category of statistical divergences for spectral densities of time series: the spectral $\alpha$-R\'{e}nyi divergences, which include the Itakura--Saito divergence as a special case. While the spectral R\'{e}nyi divergence has been acknowledged in past works, its statistical attributes have not been thoroughly investigated. The aim of our work is to highlight these properties. We reveal a variational representation of the spectral R\'{e}nyi divergence, from which the minimum spectral R\'{e}nyi divergence estimator is shown to be robust against outliers in the frequency domain, unlike the minimum Itakura--Saito divergence estimator; it thus delivers a more stable estimate, reducing the need for intricate pre-processing. This is joint work with Tetsuya Takabatake.
(2) We investigate the information geometrical structure of a determinantal point process (DPP). We show that a DPP is embedded in the exponential family of log-linear models. The extent of deviation from an exponential family is analyzed using the $\mathrm{e}$-embedding curvature tensor, which identifies partially flat parameters of a DPP. On the basis of this embedding structure, we discover a duality related to the marginal kernel and the $L$-ensemble kernel. This is joint work with Hideitsu Hino.