Bayes : Average of a model over the posterior distribution.
Gibbs : A model with a random parameter from the posterior distribution.
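In symbols, a sketch of the standard definitions (the notation p(w|D) for the posterior distribution of the parameter w given the data D = {X_1, ..., X_n} is my own shorthand, not from the original text):

p_{Bayes}(x) = \int p(x|w) \, p(w|D) \, dw   (the model averaged over the posterior)

p_{Gibbs}(x) = p(x|w^*), \quad w^* \sim p(w|D)   (one model whose parameter is a single draw from the posterior)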
We show that there is a universal law relating them which holds in both regular and singular cases. In general, Bayes estimation gives a smaller generalization loss than Gibbs estimation; however, in deep learning, Gibbs estimation seems easier to realize than Bayes estimation.
(B_g - S) + (CV - S_n) = 2\lambda/n + o_p(1/n),

where B_g is the Bayes generalization loss, CV is the Bayes cross validation loss, S is the entropy of the true distribution, S_n is the empirical entropy, \lambda is the real log canonical threshold, and n is the number of training samples.
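As an illustration only, here is a minimal numerical sketch of this relation, not taken from the papers cited below. It uses a one-dimensional normal model p(x|mu) = N(x|mu,1) with a conjugate prior N(mu|0,100), for which the posterior, the predictive distribution, and the Bayes leave-one-out cross validation loss are available in closed form; since this model is regular and realizable, lambda = d/2 = 1/2 and the right-hand side is 2*lambda/n = 1/n. All names and model choices below are my own assumptions.

import numpy as np

rng = np.random.default_rng(0)
n, trials, tau2 = 50, 5000, 100.0   # sample size, number of simulated data sets, prior variance (assumed)

def predictive(sum_x, m):
    # Posterior predictive N(mean, var) after observing m points whose sum is sum_x,
    # for the model N(x|mu,1) with prior N(mu|0, tau2).
    prec = m + 1.0 / tau2            # posterior precision of mu
    mean = sum_x / prec              # posterior mean of mu
    var = 1.0 + 1.0 / prec           # predictive variance = noise variance + posterior variance
    return mean, var

def neg_log_normal(x, mean, var):
    return 0.5 * np.log(2 * np.pi * var) + (x - mean) ** 2 / (2 * var)

S = 0.5 * np.log(2 * np.pi) + 0.5    # entropy of the true distribution q(x) = N(x|0,1)
vals = []
for _ in range(trials):
    x = rng.normal(0.0, 1.0, size=n)             # data drawn from q(x)

    # Bayes generalization loss Bg = E_{x~q}[-log p(x|D)], available analytically here
    m_n, v_n = predictive(x.sum(), n)
    Bg = 0.5 * np.log(2 * np.pi * v_n) + (1.0 + m_n ** 2) / (2 * v_n)

    # empirical entropy Sn = -(1/n) * sum_i log q(x_i)
    Sn = 0.5 * np.log(2 * np.pi) + np.mean(x ** 2) / 2

    # Bayes cross validation loss CV = -(1/n) * sum_i log p(x_i | D without x_i)
    m_loo, v_loo = predictive(x.sum() - x, n - 1)
    CV = np.mean(neg_log_normal(x, m_loo, v_loo))

    vals.append((Bg - S) + (CV - Sn))

print("average of (Bg - S) + (CV - Sn):", np.mean(vals))
print("2*lambda/n with lambda = 1/2:   ", 1.0 / n)

With n = 50 the averaged left-hand side should come out close to 1/n = 0.02, in agreement with the equation above.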
References
If you are interested in this result, please see
S. Watanabe, Equations of states in singular statistical estimation, Neural Networks, vol. 23, pp. 20-34, 2010. arXiv:0712.0653
For the case in which q(x) is unrealizable by p(x|w) and the model is regular, see
S. Watanabe, Equations of states in statistical learning for an unrealizable and regular case, IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E93-A, pp. 617-626, 2010. arXiv:0906.0211
In this case, the real log canonical threshold and the singular fluctuation are given by d/2 and tr(I J^{-1})/2, respectively, where d is the dimension of the parameter, I is the Fisher information matrix, and J is the Hessian matrix of the log loss.
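For reference, a sketch of the standard definitions of these two matrices, under the assumption (implicit above) that w_0 is the parameter minimizing the averaged log loss L(w) = -E_q[\log p(X|w)]:

I = E_q[ \nabla_w \log p(X|w_0) \, \nabla_w \log p(X|w_0)^T ], \qquad J = \nabla_w^2 L(w_0).

In the realizable and regular case I = J, so tr(I J^{-1})/2 = d/2 and the singular fluctuation coincides with the real log canonical threshold.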