Bayesian Statistics for Unknown Uncertainty

Nowadays, we know that an unknown uncertainty is different from any specific pair of a model and a prior. In other words, we need modern Bayesian statistics for a large world, where uncertainty is unknown and all models are wrong. Can we represent an unknown uncertainty and still find a useful model?


If you are interested in this topic, please see the following paper.

Sumio Watanabe, Mathematical theory of Bayesian statistics for unknown information source. Philosophical Transactions of the Royal Society A, 2023. https://doi.org/10.1098/rsta.2022.0151

There are two ways to interpret an unknown uncertainty and a statistical model in modern Bayesian statistics.

In the first case, both the unknown uncertainty and the statistical model are based on a person's decision. That is to say, a person believes that an unknown uncertainty exists and prepares a statistical model only as a candidate. Note that the person is aware that his or her own model may, in general, be different from the unknown uncertainty. This interpretation is useful for making more rational decisions than blindly believing one's own model and prior.


Note: We know that any statistical model is different from the phenomenon from which the data are generated. In fact, the claim of the following paper is well known to all statisticians.

Box, G. E. P., "Science and statistics". Journal of the American Statistical Association, Vol. 71, pp. 791-799, 1976.


In the second case, the unknown uncertainty is a scientific assumption and the statistical model is prepared as a candidate by a scientist or an engineer. If you are a scientist or an engineer, this interpretation is recommended, because scientists and engineers need to clarify the assumptions that they have made.

Both cases share the same mathematical framework, so the same mathematical theory holds in both. From the mathematical point of view, we do not need to be bound by any particular interpretation.

In the older Bayesianism of the 1920s to the 1950s, the premise was that a statistical model must be believed to be equal to the unknown uncertainty. This premise requires a small world, which is the fatally intolerant assumption of the older Bayesianism. In modern Bayesian statistics and machine learning, a model and an uncertainty can be distinguished; this situation is called a large world. From the theoretical viewpoint, the setting of a large world contains that of a small world as a very special case.

Based on Savage's theorem, a person or an artificial intelligence who makes a decision automatically believes in the existence of some probability distribution. Remark that, if a person believes both in the existence of his or her own specific model and in the non-existence of any unknown uncertainty, then that person believes a mathematical contradiction, because a candidate model is a special example of a general unknown uncertainty. In other words, if a person believes that no unknown uncertainty exists, then that person should also reject the existence of his or her own specific model. In such a case, the person or the artificial intelligence cannot make any decision based on probability theory.

Remark: Modern Bayesian statistics has evolved significantly from the older nonscientific Bayesianism of the 1920s to the 1950s.

If you are interested in the small and large worlds, the following paper and book are recommended. 

(1) Ken Binmore, On the foundations of decision theory. Homo Oeconomicus, Vol. 34, pp. 259-273, 2017.

(2) Richard McElreath, Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press, 2nd edition, 2020.


If you are interested in the modern viewpoints on subjectivity and objectivity, the following paper is recommended.

(3) Andrew Gelman and Christian Hennig, Beyond subjective and objective in statistics. Journal of the Royal Statistical Society, Series A, Vol. 180, pp. 967-1033, 2017.

This paper (3) proposes that thinking about statistics in terms of the virtues of context-sensitivity and transparency provides a better perspective for the future, rather than viewing statistics in terms of the old opposition between subjectivity and objectivity.


In the leading textbook of modern Bayesian statistics,

(4) "Bayesian Data Analysis", 3rd edition, 2013, by Gelman et al. (BDA3),

the authors say (in Chapter 6),

 "A good Bayesian analysis, therefore, should include at least some check of the adequacy of the fit of the model to the data and the plausibility of the model for the purposes for which the model will be used. "

The older Bayesianism of the 1920s to the 1950s, which had the premise that a person must believe in a pair of a model and a prior before sampling, prohibited model checking. Modern Bayesian statistics has evolved significantly and is free from the older nonscientific Bayesianism philosophy. Professor Gelman, a modern Bayesian statistician, repeatedly says "Bayesians are frequentists". Nowadays, we are aware that a pair of a model and a prior is not a belief but a candidate setting that should be checked against a sample. Hence modern Bayesian statistics can be employed in science and engineering based on a clear understanding of the assumptions. The modern Bayesian statisticians say,

(5) Gelman, A. and Robert, C. P., "Not only defended but also applied": The perceived absurdity of Bayesian statistics. The American Statistician, Vol. 67, pp. 1-5, 2013.

Moreover, if you learn singular learning theory, you can understand that Bayesian statistics provides more precise inference than the maximum likelihood method, even asymptotically, when a model contains hidden variables or a hierarchical structure. Bayesian statistics becomes more important as a statistical model becomes larger and more complex. For future study, we should be aware that the importance of Bayesian statistics originates not from subjective philosophy but from mathematical accuracy.

If a sample {X1, X2, ..., Xn} is subject to an unknown uncertainty, the variables are exchangeable. For a set of exchangeable random variables, by the de Finetti-Hewitt-Savage theorem, both an unknown probability distribution q(x) and an unknown functional probability distribution Q(q) exist. Moreover, for an exchangeable sample, the central limit theorem holds. However, the limit value of the sample mean of {Xi} depends on the unknown data-generating distribution q(x); it is not E[X] but E[X|q]. Here q(x) is often referred to as the true distribution, which is a function-valued random variable with the unknown functional distribution Q. We emphasize once more that the limit value of the sample mean is not E[X] but E[X|q].
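As a concrete illustration, here is a minimal simulation. It assumes, only for this example, that Q is the standard normal distribution over a mean m and that q(x) = N(x|m, 1); these choices are illustrative assumptions, not part of the theory above. Each run first draws q from Q and then an i.i.d. sample from q; the sample mean converges to E[X|q] = m, which differs between runs, and not to E[X] = 0.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100000                             # sample size within each run

    for run in range(3):
        m = rng.normal(0.0, 1.0)           # draw q from Q: here q(x) = N(x | m, 1)
        x = rng.normal(m, 1.0, size=n)     # given q, X1, ..., Xn are i.i.d.
        print(f"run {run}: E[X] = 0, E[X|q] = {m:+.3f}, sample mean = {x.mean():+.3f}")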

The posterior distribution is defined by using the candidate model and prior. Since a specific model is in general different from the unknown uncertainty, the posterior distribution is in general only a virtual or formal distribution defined by the model.

The posterior predictive distribution is also defined, and it represents the estimated prediction of the unknown uncertainty; however, we do not know whether it gives a good prediction of the unknown uncertainty or not, because the unknown uncertainty may be different from the specific model.

In order to examine whether the estimated predictive distribution is appropriate for the unknown uncertainty, the generalization loss and the leave-one-out cross validation are introduced. Their expectation values for a given q(x) are equal to each other.
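The following sketch computes both quantities in a simple conjugate setting where everything is tractable. The model N(x|m, 1) with prior m ~ N(0, 1) and the true distribution q(x) = N(x|0.5, 1) are assumptions chosen only for this example. The generalization loss can be evaluated here because q is known to us, while the leave-one-out cross validation uses the sample alone; their averages over repeated samples are close to each other, in accordance with the statement above.

    import numpy as np

    rng = np.random.default_rng(1)
    n, a = 200, 0.5                        # sample size and the mean of the true q
    x = rng.normal(a, 1.0, size=n)

    # posterior m | X^n ~ N(mu_n, s2_n); predictive p(x|X^n) = N(x | mu_n, 1 + s2_n)
    S = x.sum()
    mu_n, s2_n = S / (n + 1), 1.0 / (n + 1)

    # generalization loss G_n = -E_{X~q}[ log p(X|X^n) ], computable because q is known here
    v = 1.0 + s2_n
    G = 0.5 * np.log(2 * np.pi * v) + ((a - mu_n) ** 2 + 1.0) / (2 * v)

    # leave-one-out cross validation, computable from the sample alone
    mu_loo = (S - x) / n                   # posterior mean when x_i is removed
    v_loo = 1.0 + 1.0 / n                  # predictive variance when x_i is removed
    loo = np.mean(0.5 * np.log(2 * np.pi * v_loo) + (x - mu_loo) ** 2 / (2 * v_loo))

    print(f"generalization loss = {G:.4f},  leave-one-out CV = {loo:.4f}")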


Note: The posterior predictive check (PPC) is a good alternative method for the validation of a model and a prior. Gelman and Shalizi (2012) recommend PPC from the viewpoint of falsifiability. Remark that the older Bayesianism cannot allow PPC. Professor Gelman explains that a prior distribution is not a belief but a part of a model. We need a modern Bayesian statistics in which an unknown uncertainty and a statistical model can be distinguished. If the textbooks you read recommend PPC, then they are free from the older Bayesianism philosophy.

A. Gelman and C. R. Shalizi, Philosophy and the practice of Bayesian statistics, British Journal of Mathematical and Statistical Psychology,  2012, https://doi.org/10.1111/j.2044-8317.2011.02037.x 
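Here is a minimal sketch of a posterior predictive check, under the same illustrative conjugate setting as above (model N(x|m, 1), prior m ~ N(0, 1)), but with observed data that are actually skewed. The test statistic (sample skewness) and all numerical choices are assumptions made for the example, not a prescription.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 200
    x = rng.exponential(1.0, size=n)       # observed data: skewed, not normal

    mu_n, s2_n = x.sum() / (n + 1), 1.0 / (n + 1)

    def skewness(y):
        return np.mean((y - y.mean()) ** 3) / y.std() ** 3

    t_obs = skewness(x)
    t_rep = np.empty(2000)
    for r in range(2000):
        m = rng.normal(mu_n, np.sqrt(s2_n))          # draw m from the posterior
        t_rep[r] = skewness(rng.normal(m, 1.0, n))   # replicated data given m

    p_value = np.mean(t_rep >= t_obs)                # posterior predictive p-value
    print(f"observed skewness = {t_obs:.2f}, PPC p-value = {p_value:.3f}")

A posterior predictive p-value close to 0 or 1 indicates that the pair of the model and the prior fails to reproduce this aspect of the data.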

The minus log marginal likelihood is often called the free energy. Both the marginal likelihood and the cross validation are useful, but they are by definition different criteria. This is not a paradox. The free energy measures the generalization loss of the simultaneous distribution p(X^n), whereas the cross validation measures that of the conditional distribution p(X|X^n). Note that the generalization loss is equal to the expected increase of the free energy.
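The last sentence can be written as a one-line identity. In a common notation, where w is the parameter, phi(w) is the prior, Z_n is the marginal likelihood p(X^n), and q is the data-generating distribution (a sketch of the standard argument, not a quotation from the papers cited on this page):

    F_n = -\log Z_n = -\log \int \prod_{i=1}^{n} p(X_i \mid w)\,\phi(w)\,dw,
    \qquad
    F_{n+1} - F_n = -\log \frac{Z_{n+1}}{Z_n} = -\log p(X_{n+1} \mid X^n),

    \mathbb{E}_{X_{n+1}\sim q}\left[\, F_{n+1} - F_n \,\right]
      = -\,\mathbb{E}_{X\sim q}\left[\, \log p(X \mid X^n) \,\right] = G_n .

Hence the free energy accumulates these one-step increments, whereas the cross validation estimates G_n at the current sample size; they are related but different criteria.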

Several information criteria have been created. By using these methods, the marginal likelihood and the generalization loss can be estimated from a sample. Even for an unknown uncertainty, we can examine whether a candidate pair of a model and a prior is appropriate or not.
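As one example, the following sketch computes WAIC from posterior samples, assuming that an array logp of shape (S, n) is available with logp[s, i] = log p(X_i | w_s) for posterior draws w_1, ..., w_S; the conjugate normal example at the end is only an illustration of this input format.

    import numpy as np

    def waic(logp):
        """logp[s, i] = log p(X_i | w_s) for posterior draws w_s; returns WAIC on the loss scale."""
        m = logp.max(axis=0)
        lppd = m + np.log(np.exp(logp - m).mean(axis=0))   # log pointwise predictive density
        p_waic = logp.var(axis=0, ddof=1)                  # functional variance per data point
        return float(np.mean(-lppd + p_waic))              # estimate of the generalization loss

    rng = np.random.default_rng(3)
    n = 200
    x = rng.normal(0.5, 1.0, size=n)
    mu_n, s2_n = x.sum() / (n + 1), 1.0 / (n + 1)
    w = rng.normal(mu_n, np.sqrt(s2_n), size=(4000, 1))             # posterior draws of m
    logp = -0.5 * np.log(2 * np.pi) - 0.5 * (x[None, :] - w) ** 2   # log N(X_i | m_s, 1)
    print(f"WAIC (generalization-loss scale) = {waic(logp):.4f}")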

The hyperparameter that maximizes the marginal likelihood is different from the one that minimizes the leave-one-out cross validation. The former does not minimize E[Gn|q], whereas the latter does. Neither of them minimizes the random variable Gn itself.


If you are interested in the optimization of the hyperparameter, please see

S. Watanabe, Higher Order Equivalence of Bayes Cross Validation and WAIC. Springer Proceedings in Mathematics and Statistics, Information Geometry and Its Application, pp. 47-73, 2018.

The Fisher information matrix I(w) is defined by I(w) = E[V V^T], where V = (d/dw) log p(X|w) is the gradient of the log likelihood and the expectation is taken over X. If the unknown uncertainty were realizable by a statistical model and if the Fisher information matrix were positive definite, then the posterior distribution could be approximated by a normal distribution whose covariance is (nI)^(-1).
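A small numerical check of this normal approximation in a regular case: a Bernoulli model with parameter w and a uniform prior, with data generated at w0 = 0.3. The Fisher information is I(w) = 1/(w(1-w)), and the exact Beta posterior standard deviation is compared with the square root of (n I(w_hat))^(-1). The model and the numbers are assumptions chosen for the illustration.

    import numpy as np

    rng = np.random.default_rng(4)
    n, w0 = 2000, 0.3
    k = rng.binomial(n, w0)                        # number of successes

    a, b = k + 1, n - k + 1                        # Beta(a, b) posterior under a uniform prior
    post_sd = np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

    w_hat = k / n                                  # maximum likelihood estimate
    approx_sd = np.sqrt(w_hat * (1 - w_hat) / n)   # sqrt of (n I(w_hat))^(-1)
    print(f"exact posterior sd = {post_sd:.5f},  normal-approximation sd = {approx_sd:.5f}")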

In deep learning, almost all eigenvalues of the Fisher information matrix are zero. It is easy to check this phenomenon. If you use error backpropagation, you calculate the gradient vector V of the log likelihood, and the Fisher information matrix can be obtained as E[V V^T]. It is easy to compute this matrix on your computer.
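The following sketch carries out this computation for a toy one-input network f(x, w) = sum_k a_k tanh(b_k x + c_k) with a Gaussian noise model of unit variance; the architecture, the parameter values, and the sample size are assumptions made for the illustration only. The parameter is placed at a singular point (three of the output weights a_k are zero), so the corresponding hidden units are not identifiable and several eigenvalues of E[V V^T] are exactly zero.

    import numpy as np

    rng = np.random.default_rng(5)
    K = 5
    a = np.array([1.0, -0.5, 0.0, 0.0, 0.0])       # output weights: three units switched off
    b = rng.normal(size=K)
    c = rng.normal(size=K)

    def grad_f(x):
        """Gradient of f(x, w) with respect to w = (a_1..a_K, b_1..b_K, c_1..c_K)."""
        h = np.tanh(b * x + c)
        dh = 1.0 - h ** 2
        return np.concatenate([h, a * dh * x, a * dh])

    # For the model y ~ N(f(x, w), 1), V = (y - f) grad_f and E_y[(y - f)^2] = 1,
    # so I = E[V V^T] = E_x[ grad_f grad_f^T ]; estimate the outer expectation by Monte Carlo.
    xs = rng.normal(size=10000)
    grads = np.array([grad_f(x) for x in xs])
    I = grads.T @ grads / len(xs)

    eig = np.linalg.eigvalsh(I)
    print("smallest eigenvalues:", np.round(eig[:8], 6))
    print("number of (near) zero eigenvalues:", int(np.sum(eig < 1e-8)))

The same computation with the back-propagated gradient V of a deep network exhibits the phenomenon described above on a much larger scale.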

In classical statistical models, the Fisher information matrix is positive definite. Hence the posterior distribution concentrates in the neighborhood of some parameter as the sample size increases. This process may be understood as a "gradual increase of confidence" of a learning machine.

The parameter space of deep learning contains many local models; some of them correspond to smaller models and others to larger ones.

In deep learning, the learning process is a sequence of jumps from one singularity to another. Even if the sample size is huge, it is far smaller than infinity in deep learning; hence deep neural networks are always in singular states. In other words, they always compare many singularities from the viewpoint of bias and variance (= energy and entropy), and phase transitions are repeated as the sample size increases (= the gradual discovery phenomenon). This is the reason why algebraic geometry is necessary in Bayesian statistics and AI alignment. Nowadays, many excellent researchers are beginning to study the mathematical foundations of AI.


If you are interested in the further mathematics, please see

Sumio Watanabe, Recent advances in algebraic geometry and Bayesian statistics. Information Geometry, Vol. 7, pp. 187-209, 2024. https://doi.org/10.1007/s41884-022-00083-9