Bayesian Statistics for Unknown Uncertainty

The issue with Bayesian statistics was that it could not handle unknown uncertainty. In the real world, where the unknown uncertainty differs from the specific statistical model prepared by a user, neither Bayes' theorem nor Bayesian updating has any meaning. Here let us study a framework for treating unknown uncertainty in Bayesian statistics.

Nowadays, we know that unknown uncertainty is different from any specific pair of a model and a prior. In other words, we need modern Bayesian statistics for a large world, where uncertainty is unknown and all models are wrong. Can we represent unknown uncertainty and find a useful model?

If you are interested in this page, please see the following paper.

Sumio Watanabe, Mathematical theory of Bayesian statistics for unknown information source. Philosophical Transactions of the Royal Society A, 2023. https://doi.org/10.1098/rsta.2022.0151

There are two ways to interpret an unknown uncertainty and a statistical model in modern Bayesian statistics.

In the first case, both the unknown uncertainty and the statistical model are based on a person's decision, and both are epistemic probabilities. That is to say, a person believes both that an unknown uncertainty exists and that the statistical model is prepared only as a candidate. Note that the person is aware that their own model may differ from the unknown uncertainty in general. This interpretation supports more rational decisions than simply believing one's own model and prior.

Note: We know that any statistical model is different from the phenomenon from which data are generated. In fact, the claim of the following paper, "All models are wrong", is well known among statisticians.

Box, G. E. P. "Science and statistics". Journal of the American Statistical Association, Vol. 71, pp. 791–799, 1976.

In the second case, the existence of unknown uncertainty is a scientific or engineering assumption, and a statistical model is prepared as a candidate by a scientist or an engineer. If you are a scientist or an engineer, this interpretation is recommended, because scientists and engineers need to make clear the assumptions and the models that they have set.

Both cases have the same mathematical framework. Therefore, the same mathematical theory holds in both cases. From the mathematical point of view, we do not need to be bound by any particular interpretation or any special philosophy.

In the older Bayesianism of the 1920s to 1950s, it was premised that a statistical model must be believed to be equal to the unknown uncertainty. That is to say, a small world was required, which is the fatally intolerant and restrictive premise of the older Bayesianism. This older Bayesianism is incompatible with the claim of G. E. P. Box. In modern Bayesian statistics and machine learning, a model and an uncertainty can be distinguished. This setting is called a large world. From the theoretical viewpoint, the large world contains the small world as a very special case. Thus, a person who rejects the large world should also reject any small world.

Based on Savage's theorem, a person or an artificial intelligence that makes decisions automatically believes in the existence of some probability distribution. Note that if a person believes both "the existence of the person's own specific model" and "the non-existence of any unknown uncertainty", then the person believes a mathematical contradiction, because a candidate model is a special case of a general unknown uncertainty. In other words, a person who believes that no unknown uncertainty exists should also reject the existence of their own specific model. In such a case, the person or artificial intelligence cannot make any decision based on probability theory.

Note: If a person believes or assumes that the unknown uncertainty is not a probability distribution but nevertheless builds a model and a prior from probability distributions, then the person's Bayesian inference rests on a contradiction, so any statistical result is weak and unreliable (the same holds for non-Bayesian statistics).

Note: If a person cannot judge scientifically whether the unknown uncertainty is a probability distribution or not, then the person had better not make any decision using statistics and should examine the scientific background once again. For example, if an unknown data-generating process is a non-additive measure, it cannot be estimated by either Bayesian or non-Bayesian statistics. In fact, natural human preference relations cannot be handled by Bayesian statistics because they do not satisfy Savage's axioms, so a more general framework than probability theory is needed.

Remark: Modern Bayesian statistics has evolved significantly from the older nonscientific Bayesianism of the 1920s to 1950s.

If you are interested in the small and large worlds, the following paper and book are recommended.

(1) Ken Binmore, On the foundations of decision theory. Homo Oeconomicus, Vol.34, pp.259-273, 2017.

(2) Richard McElreath, Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press, 2nd edition, 2020.

In paper (1), the author says that Savage said the old Bayesianism has no meaning in the large world. Book (2) clearly explains why small-world statistics is useless in scientific research.

If you are interested in the modern viewpoints about subjectivity and objectivity, the following paper is recommended.

(3) Andrew Gelman and Christian Hennig, Beyond subjective and objective in statistics. Journal of the Royal Statistical Society, series A, vol.180, pp.967-1033, 2017.

This paper (3) proposes that thinking about statistics in terms of the virtues of context-sensitivity and transparency provides a better perspective for the future, rather than viewing statistics in terms of the old opposition between subjectivity and objectivity.

In the leading textbook of modern Bayesian statistics,

(4) Gelman et al., "Bayesian Data Analysis", 3rd edition, 2013 (BDA3),

the authors say (in Chapter 6),

"A good Bayesian analysis, therefore, should include at least some check of the adequacy of the fit of the model to the data and the plausibility of the model for the purposes for which the model will be used. "

The older Bayesianism of the 1920s to 1950s, which premised that a person must believe, before sampling, that their own pair of model and prior is exactly equal to the unknown uncertainty, prohibited model checking and model comparison. Modern Bayesian statistics has evolved significantly and is free from the nonscientific Bayesianism philosophy of the 1920s to 1950s. It should be emphasized that Bayes and Laplace, the original creators of Bayesian statistics before the 20th century, were also free from Bayesianism. Professor Andrew Gelman, who is the greatest modern Bayesian statistician, repeatedly says "Bayesians are frequentists". Nowadays, we are aware that a pair of model and prior is not a belief but a candidate setting that should be checked against samples and scientific knowledge. Hence modern Bayesian statistics can be employed in science and engineering, based on a clear understanding of the assumptions set by the scientists and engineers who are responsible for them.

Moreover, if you learn singular learning theory, you can understand that Bayesian statistics provides more precise inference than the maximum likelihood method, even asymptotically, when a model contains hidden variables or hierarchical structure. Bayesian statistics becomes more important as a statistical model becomes larger and more complex. For future study, we should be aware that the importance of Bayesian statistics originates not from subjective philosophy but from mathematical accuracy.

Note: The entry "Bayesian Epistemology" in the Stanford Encyclopedia of Philosophy states that

"In fact, those results have already appeared in standard textbooks on Bayesian statistics, such as the influential one by Gelman et al. (2014: sec. 4.4 and ch. 6). The line between frequentist and Bayesian statistics is blurring. "

If a sample {X1,X2,...,Xn} is subject to an unknown uncertainty, its elements are exchangeable. For a sequence of exchangeable random variables, by the de Finetti-Hewitt-Savage theorem, both an unknown probability distribution q(x) and an unknown functional probability distribution Q(q) exist, such that, given q, the Xi are independently subject to q(x). Moreover, for an exchangeable sample, the central limit theorem holds conditionally on q. However, the limit value of the sample mean of {Xi} depends on the unknown data-generating distribution q(x): it is not E[X] but E[X|q]. Here q(x) is often referred to as the true distribution, and it is a function-valued random variable with the unknown functional distribution Q. We have to emphasize once more that the limit value of the sample mean is not E[X] but E[X|q].
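The difference between E[X] and E[X|q] can be checked with a small simulation. The following is a minimal sketch under assumptions that are not in the text: Q is taken to be a standard normal distribution over a mean mu, and q(x) is Normal(mu, 1), so that E[X] = 0 while E[X|q] = mu. The function name draw_exchangeable_sample is hypothetical.

```python
import numpy as np

def draw_exchangeable_sample(n, rng):
    """Draw q from Q, then draw X_1..X_n independently from q.

    Assumed toy setting: Q picks a mean mu ~ Normal(0, 1), and q = Normal(mu, 1).
    Marginally the X_i are exchangeable but not independent.
    """
    mu = rng.normal(0.0, 1.0)            # one draw of the unknown q (summarised by mu)
    x = rng.normal(mu, 1.0, size=n)      # sample generated from that q
    return mu, x

rng = np.random.default_rng(0)
n = 100_000

# Repeat the experiment: each repetition has its own unknown q.
results = []
for _ in range(5):
    mu, x = draw_exchangeable_sample(n, rng)
    results.append((mu, x.mean()))
    print(f"E[X|q] = {mu:+.3f}   sample mean = {x.mean():+.3f}   E[X] = +0.000")
```

In each repetition the sample mean settles near the mean of the particular q that was drawn, not near the overall mean E[X] = 0.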

The posterior distribution is defined by using the specific candidate model and prior. Since a specific model differs from the unknown uncertainty in general, the posterior distribution is only a virtual or formal one defined by the model.

The posterior predictive distribution is also defined, and it represents the estimated prediction of the unknown uncertainty. However, we do not know whether it gives a good prediction of the unknown uncertainty or not, because the unknown uncertainty may be different from the specific model.
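For reference, the two definitions can be written out as follows. This is a sketch in assumed notation that does not appear in the text: p(x|w) is the candidate model with parameter w, \varphi(w) is the prior, and Z_n is the normalizing constant (the marginal likelihood).

```latex
\begin{align*}
p(w \mid X^n) &= \frac{1}{Z_n}\,\varphi(w)\prod_{i=1}^{n} p(X_i \mid w),
&
Z_n &= \int \varphi(w)\prod_{i=1}^{n} p(X_i \mid w)\,dw, \\
p(x \mid X^n) &= \int p(x \mid w)\,p(w \mid X^n)\,dw .
\end{align*}
```

Neither formula contains q(x): both are built only from the candidate model and prior, which is why they are formal objects until they are compared with the unknown uncertainty.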

In order to examine whether the estimated predictive distribution is appropriate for the unknown uncertainty, the generalization loss and the leave-one-out cross validation are introduced. For a given q(x), the expectation value of the leave-one-out cross validation for sample size n is equal to that of the generalization loss for sample size n-1.
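In assumed notation (G_n for the generalization loss, C_n for the leave-one-out cross validation loss, and X^n \setminus X_i for the sample with X_i removed; these symbols are a sketch, not quoted from the text), the two quantities and the relation between their expectations can be written as:

```latex
\begin{align*}
G_n &= -\int q(x)\,\log p(x \mid X^n)\,dx, \\
C_n &= -\frac{1}{n}\sum_{i=1}^{n} \log p\bigl(X_i \mid X^n \setminus X_i\bigr), \\
\mathbb{E}[\,C_n \mid q\,] &= \mathbb{E}[\,G_{n-1} \mid q\,].
\end{align*}
```

The last line holds because, given q, each term of C_n predicts a point X_i that is independent of the n-1 points used to build the predictive distribution. G_n depends on the unknown q(x), whereas C_n can be computed from the sample alone.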

Note: The posterior predictive check (PPC) is a good alternative method for validating a model and a prior. Gelman and Shalizi (2012) recommend PPC from the viewpoint of falsifiability. Note that the older Bayesianism cannot allow PPC. Professor Gelman explains that a prior distribution is often not a belief but a part of a model. We need modern Bayesian statistics, in which an unknown uncertainty and a statistical model can be distinguished. If the textbooks you read recommend PPC, then they are free from the older Bayesianism philosophy.

A. Gelman and C. R. Shalizi, Philosophy and the practice of Bayesian statistics, British Journal of Mathematical and Statistical Psychology, 2012, https://doi.org/10.1111/j.2044-8317.2011.02037.x

In this paper, the authors say "the most successful forms of Bayesian statistics do not actually support that particular philosophy but rather accord much better with sophisticated forms of hypothetico-deductivism."

The minus log marginal likelihood is often called the free energy. Both the marginal likelihood and cross validation are useful, but they are different criteria by definition. This is not a paradox. The free energy corresponds to the generalization loss for the joint distribution p(X^n), whereas cross validation corresponds to that for the conditional distribution p(X|X^n). Note that the generalization loss is equal to the expected increase of the free energy when one more sample point is added.
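The relation between the generalization loss and the free energy follows in one step from the definition of a conditional density. The derivation below is a sketch in assumed notation: F_n is the free energy, p(X^n) is the marginal likelihood of the candidate model and prior, and G_n is the generalization loss.

```latex
\begin{align*}
F_n &= -\log p(X^n), \\
-\log p(X_{n+1} \mid X^n)
  &= -\log \frac{p(X^{n+1})}{p(X^n)}
   = F_{n+1} - F_n, \\
G_n &= -\int q(x)\,\log p(x \mid X^n)\,dx
   = \mathbb{E}_{X_{n+1}}\bigl[\,F_{n+1} \mid X^n, q\,\bigr] - F_n .
\end{align*}
```

Here the expectation is taken over a new point X_{n+1} drawn from q(x) with X^n fixed. The free energy accumulates the prediction losses of all n points, while the generalization loss is only the next increment, so a model can be ranked differently by the two criteria.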

Several information criteria have been created. By using these methods, the marginal likelihood and the generalization loss can be estimated. Even for an unknown uncertainty, we can examine whether a pair of a candidate model and a prior is appropriate or not.
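As one concrete instance, WAIC and importance-sampling leave-one-out cross validation can both be computed from the log likelihood of each data point at each posterior draw. The following is a minimal sketch under assumptions that are not in the text: the candidate model is Normal(w, 1) with a Normal(0, 10^2) prior (so the posterior can be sampled exactly), and the data come from a wider normal distribution that the model cannot represent. The function names are hypothetical.

```python
import numpy as np

def log_mean_exp(a, axis=0):
    """Numerically stable log of the mean of exp(a) along an axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.mean(np.exp(a - m), axis=axis))

def waic_and_loo(log_lik):
    """log_lik[s, i] = log p(X_i | w_s) for posterior draws w_s."""
    n = log_lik.shape[1]
    # Training loss: minus the mean log posterior predictive density at the data.
    train_loss = -np.mean(log_mean_exp(log_lik, axis=0))
    # Functional variance: posterior variance of the log likelihood, summed over data.
    functional_var = np.sum(np.var(log_lik, axis=0))
    waic = train_loss + functional_var / n
    # Importance-sampling LOO: p(X_i | rest) = 1 / E_w[1 / p(X_i | w)].
    loo = np.mean(log_mean_exp(-log_lik, axis=0))
    return train_loss, waic, loo

rng = np.random.default_rng(1)
n, num_draws, prior_sd = 200, 4000, 10.0

# Data from an "unknown" source that differs from the candidate model.
x = rng.normal(1.0, 1.5, size=n)

# Exact posterior of w for the model Normal(w, 1) with prior Normal(0, prior_sd^2).
post_precision = n + 1.0 / prior_sd**2
post_mean = x.sum() / post_precision
w = rng.normal(post_mean, post_precision**-0.5, size=num_draws)

# Log likelihood matrix of shape (num_draws, n).
log_lik = -0.5 * np.log(2.0 * np.pi) - 0.5 * (x[None, :] - w[:, None]) ** 2

train_loss, waic, loo = waic_and_loo(log_lik)
print(f"training loss = {train_loss:.4f}, WAIC = {waic:.4f}, LOO = {loo:.4f}")
```

Both criteria add a positive correction to the training loss and agree closely in this regular example, even though the candidate model is wrong for the data.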

The hyperparameter that maximizes the marginal likelihood is different from the one that minimizes the leave-one-out cross validation. The former does not minimize E[Gn|q], whereas the latter does. Neither of them minimizes the random variable Gn.

If you are interested in the optimization of the hyperparameter, please see

S. Watanabe, Higher Order Equivalence of Bayes Cross Validation and WAIC. Springer Proceedings in Mathematics and Statistics, Information Geometry and Its Applications, pp.47-73, 2018.

The Fisher information matrix is defined as the expected outer product of the gradient of the log likelihood with respect to the parameter. If the unknown uncertainty were realizable by a statistical model and if the Fisher information matrix were positive definite, then the posterior distribution could be approximated by a normal distribution whose covariance is (nI)^(-1).
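In standard notation, the definition and the normal approximation can be sketched as follows. The symbols are assumptions rather than quotations from the text: p(x|w) is the model, w_0 is a parameter with q(x) = p(x|w_0), and \hat{w} is the maximum likelihood estimator.

```latex
\begin{align*}
I(w) &= \int \nabla_w \log p(x \mid w)\,
        \bigl(\nabla_w \log p(x \mid w)\bigr)^{\top} p(x \mid w)\,dx, \\
p(w \mid X^n) &\approx \mathcal{N}\!\bigl(w \,\big|\, \hat{w},\ (n\,I(w_0))^{-1}\bigr)
\qquad \text{if } I(w_0) \text{ is positive definite.}
\end{align*}
```

The inverse (n I(w_0))^{-1} exists only when I(w_0) has no zero eigenvalue, which is exactly the condition that fails in singular models.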

In deep learning, almost all eigenvalues of the Fisher information matrix are zero. This phenomenon is easy to check. When you run error backpropagation, you compute a gradient vector V, and the Fisher information matrix is obtained as E[VV^T]. This matrix is easy to compute numerically.
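The check described above can be sketched on a toy example. The code below makes several assumptions that are not in the text: the model is a one-hidden-layer regression network f(x) = sum_k a_k tanh(b_k x) with unit Gaussian noise rather than a deep network, the gradient is written by hand instead of being obtained from a backpropagation library, and the parameter is placed at a point where two of the three hidden units are switched off, so the larger network represents a smaller one. The function names are hypothetical.

```python
import numpy as np

def per_sample_gradients(a, b, x, eps):
    """Gradient V of log p(y | x, w) with respect to w = (a, b), one row per sample.

    Model: y = sum_k a_k * tanh(b_k * x) + eps, eps ~ Normal(0, 1),
    so the gradient of the log likelihood is (y - f(x)) * grad f = eps * grad f.
    """
    h = np.tanh(np.outer(x, b))                 # hidden activations, shape (n, H)
    df_da = h                                   # d f / d a_k
    df_db = a * (1.0 - h**2) * x[:, None]       # d f / d b_k
    grad_f = np.hstack([df_da, df_db])          # shape (n, 2H)
    return eps[:, None] * grad_f

def empirical_fisher(V):
    """Monte Carlo estimate of E[V V^T]."""
    return V.T @ V / V.shape[0]

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)        # inputs
eps = rng.normal(size=n)      # residuals y - f(x), with y sampled from the model itself

# Three hidden units, but only the first one is active.
a = np.array([1.0, 0.0, 0.0])
b = np.array([2.0, 0.0, 0.0])

V = per_sample_gradients(a, b, x, eps)
fisher = empirical_fisher(V)
eigenvalues = np.linalg.eigvalsh(fisher)        # ascending order
num_zero = int(np.sum(eigenvalues < 1e-10))

print("eigenvalues:", eigenvalues)
print("number of (numerically) zero eigenvalues:", num_zero, "of", eigenvalues.size)
```

At this parameter four of the six eigenvalues vanish, so the Fisher information matrix is not positive definite and the normal approximation of the posterior is not available.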

In classical statistical models, the Fisher information matrix is positive definite. Hence the posterior distribution concentrates in the neighborhood of some parameter as the sample size increases. This process may be understood as a "gradual increase of confidence" of a learning machine.

The parameter space of deep learning consists of many local models, some of which are smaller models and others larger models.

In deep learning, the learning process is a sequence of jumps from one singularity to another. Even if the sample size is huge, it is still far from infinity in deep learning. Hence deep neural networks are always in singular states. In other words, they always compare many singularities from the viewpoint of bias and variance ( = energy and entropy), and phase transitions are repeated as the sample size increases ( = gradual discovery phenomenon). This is the reason why algebraic geometry is necessary in Bayesian statistics and AI alignment. Many excellent researchers have now begun to study the mathematical foundations of AI.

If you are interested in the further mathematics, please see

Sumio Watanabe, Recent advances in algebraic geometry and Bayesian statistics. Information Geometry, Vol. 7, pp. 187–209, 2024. https://doi.org/10.1007/s41884-022-00083-9
