Bruno de Finetti famously wrote “probability does not exist”.
This was his provocative entrée into his theory of subjective probability and Bayesian statistics. When I unwittingly joined the Bayesian cult by going to Duke for grad school, I made it my business to try to figure out what he meant by that gnomic statement. And here is my re-interpretation: probability is a thought experiment.
Here’s what I mean. When I say the probability that a roulette wheel turns up red is 18/38, that’s a shorthand statement about a long imagined sequence of spins. It’s a thought experiment about the roulette setup. When I say the probability of a third World War in the next decade is 1%, that’s a shorthand statement about a long imagined sequence of possible worlds that could unfold.
If you’re like me, you think there is something importantly different about those two examples of thought experiments. In particular, the roulette example is “realizable” in the sense that we could in fact sit down and play roulette for many, many spins and “realize” the probability (or prove it wrong, perhaps). What I mean is that if our thought experiment was accurate, we would be able to take advantage of it by doing the thing the thought experiment was about. The second example, not so much. In my opinion, de Finetti (and also Jimmie Savage) made a mistake in insisting that these cases were fundamentally the same; I think they are both thought experiments, but that the difference between them (realizable vs unrealizable) is critical and shouldn’t be ignored.
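If you want to actually run that first thought experiment, here is a minimal sketch (Python with NumPy; the 18/38 figure is just the American-wheel setup from the example above) that spins the wheel a million times and checks the long-run frequency. There is no analogous script for the World War example, which is exactly the point about realizability.

```python
import numpy as np

rng = np.random.default_rng(0)
n_spins = 1_000_000

# American wheel: 18 red pockets out of 38.
red = rng.random(n_spins) < 18 / 38

print(f"empirical frequency of red: {red.mean():.4f}")
print(f"thought-experiment value:   {18 / 38:.4f}")
```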
How does this distinction matter in practice? Well, it’s relevant for the Frequentist vs Bayesian divide that we still see chatter about from time to time on LinkedIn and elsewhere. I can’t believe I’m wading into this, but here goes.
Firstly, I think it is unprofitable to talk of Frequentist methods and Bayesian methods. Rather, any data analysis method has Frequentist properties and Bayesian properties. A Frequentist property is a probability statement about a method that holds averaging over possible data X, FOR ANY underlying distribution (indexed by a parameter theta). A Bayesian property is a probability statement about a method that holds averaging over both the possible data X and the parameter theta, with theta drawn from some prior distribution. So they are both thought experiments, but they are different thought experiments.
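To make the two thought experiments concrete, here is a rough sketch (Python with NumPy; the known-variance normal-mean setup and the N(0, 2^2) prior are placeholders I chose for illustration) that checks the coverage of the textbook 95% interval both ways: at a handful of fixed thetas, and averaged over draws of theta from a prior.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, n_rep = 25, 1.0, 200_000
half_width = 1.96 * sigma / np.sqrt(n)   # the usual 95% interval: xbar +/- half_width

def coverage_at(theta):
    # Frequentist thought experiment: fix theta, average over repeated samples X.
    xbar = rng.normal(theta, sigma / np.sqrt(n), size=n_rep)
    return round(float(np.mean(np.abs(xbar - theta) <= half_width)), 3)

# Frequentist property: roughly 0.95 coverage FOR ANY theta we care to try.
print([coverage_at(t) for t in (-3.0, 0.0, 7.5)])

# Bayesian property: draw theta from a prior, then the data, and average over both.
thetas = rng.normal(0.0, 2.0, size=n_rep)          # a hypothetical N(0, 2^2) prior
xbars = rng.normal(thetas, sigma / np.sqrt(n))
print(round(float(np.mean(np.abs(xbars - thetas) <= half_width)), 3))
```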
Now, on the one hand a Frequentist property is stronger than a Bayesian property in that if something holds for every theta, it will obviously hold on average with respect to any distribution over theta. So when a Frequentist property is available, great! That’s a Good Thing. Case closed, right?
Well, the wrinkle is that the *quality* of the available Frequentist property might be worse than the quality of the Bayesian property (with respect to a given prior). If we have an “inductive bias”, that can be super helpful. Maybe we think a regression function is smooth, or a certain coefficient has a restricted sign. Bayesian methods are a natural way (but not the only way, see below) to incorporate such “hints”. (Using hints is smart in the real world, and refusing to do so in a stubborn bid for aesthetic purity is silly IMO.) Another issue is that the available Frequentist property might only be provable asymptotically, which may say little about finite-sample operating characteristics. Moreover, in practice we are not doing Frequentist thought experiments, in the sense that we don’t get repeated size-n samples from a fixed theta. Rather, we get a theta, we get a sample, repeat; that’s the Bayesian thought experiment, integrating over X’s AND thetas. So if we knew the “true” prior, Bayesian methods would be the way to go! Case closed, right?
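Here is a toy sketch of that trade-off (illustrative numbers only; the N(0, tau^2) hint and the simple shrinkage rule are stand-ins I picked, not a recommendation): the hinted estimator has smaller average error over the prior, but does worse at a fixed theta far from the hint.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma, tau, n_rep = 10, 1.0, 0.5, 100_000
se2 = sigma**2 / n                    # variance of the sample mean
shrink = tau**2 / (tau**2 + se2)      # weight the "hinted" estimator puts on the data

def risks(theta_draws, label):
    xbar = rng.normal(theta_draws, np.sqrt(se2), size=n_rep)
    mse_plain = np.mean((xbar - theta_draws) ** 2)            # plain sample mean
    mse_shrunk = np.mean((shrink * xbar - theta_draws) ** 2)  # shrunk toward the hint (0)
    print(f"{label}: sample mean {mse_plain:.4f}, shrunken {mse_shrunk:.4f}")

# Averaged over the prior (the Bayesian thought experiment): shrinkage wins.
risks(rng.normal(0.0, tau, size=n_rep), "over the prior ")
# At a fixed theta far from the hint (a Frequentist thought experiment): shrinkage loses.
risks(np.full(n_rep, 3.0), "fixed theta = 3")
```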
Well, no, because we don’t usually know the prior. Still, I have found myself in the past saying to colleagues: I trust my approximation of the prior more than I trust your asymptotic approximation of the sampling distribution. Or maybe you do have finite-sample guarantees, but the intervals are a lot larger because you weren’t willing to hedge a little with prior information. It’s a trade-off that should be made on a case-by-case basis.
In any event, after many years of thinking about these issues, I think there are two under-appreciated facts that help bridge the divide between the Frequentist and Bayesian perspectives. One is that just as Bayesian methods have Frequentist properties, any method can have Bayesian properties. What makes a method Bayesian is just that it is theoretically optimal with respect to a given prior. But with Bayesian methods being a bear to fit in some cases (especially Bayesian non-parametric models), I think it is worth evaluating tractable methods relative to their so-called Bayes risk. That is, I’m a proponent of Bayesian evaluation of non-Bayesian estimators/models/methods. This amounts to doing Monte Carlo simulations with respect to a particular prior, which is usually a lot easier than fitting a full Bayesian model for that same prior. And the big benefit is that you can make priors more realistic and focus on estimators that have better computational properties. Lots of people have studied the Frequentist properties of Bayesian estimators, but I’m walking in the other direction, wanting to investigate the Bayesian properties of tractable estimators with respect to REALISTIC priors. When you aren’t chained to using a Bayesian estimator, you are no longer tempted to fudge your model/prior for reasons of convenience.

This Monte Carlo Bayes risk idea should, IMO, be the default instead of over-used industry benchmarks or toy datasets. (I think this is true for evaluating LLMs too: using fixed benchmarks just invites over-fitting.) Interestingly, I like to think of this approach as "computational de Finetti", because the big thing de Finetti got RIGHT, IMO, is the focus on the prior predictive distribution: the range of possible data sets we believe we might see. So my prescriptive advice: do your modeling at the level of the prior predictive, basing it in creative and flexible ways on historical data sets and whatever modeling tools are available. Then do your estimation/inference with whatever tools are flexible and practical, and tune them so that they work well with respect to the prior predictive you specified, which can and should be as gnarly as you want it to be.
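Here is a bare-bones sketch of the Monte Carlo Bayes risk idea (the sparse mixture prior, the soft-thresholding rule, and the tuning constant are all hypothetical choices for illustration): simulate (theta, data) pairs from the prior predictive, run a cheap non-Bayesian estimator, and average the loss. No posterior computation is needed.

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_rep = 20, 200_000
se = 1.0 / np.sqrt(n)   # sampling sd of the sample mean for unit-variance noise

# A "gnarly" prior predictive (hypothetical): theta is exactly 0 for 90% of draws,
# otherwise drawn from N(0, 3^2). Sampling from it is trivial even though a full
# Bayesian fit under this prior would be more work.
is_signal = rng.random(n_rep) < 0.1
theta = np.where(is_signal, rng.normal(0.0, 3.0, size=n_rep), 0.0)
xbar = rng.normal(theta, se)   # the sample mean of each simulated data set

def bayes_risk(estimate):
    # Monte Carlo estimate of the Bayes risk E[(estimate - theta)^2] under this prior.
    return float(np.mean((estimate - theta) ** 2))

lam = 0.25   # tuning knob; in practice, pick it by minimizing this same Monte Carlo risk
soft = np.sign(xbar) * np.maximum(np.abs(xbar) - lam, 0.0)   # a cheap, non-Bayesian rule

print(f"sample mean    : {bayes_risk(xbar):.4f}")
print(f"soft-threshold : {bayes_risk(soft):.4f}")   # smaller here, under this prior
```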
The second under-appreciated fact is that it is sometimes possible, and often interesting, to pursue a Frequentist analysis without ignoring relevant real-world information, i.e., constraints on the data generating process. In other words, you can get closer to evaluating the realizable performance of a method, without having to swallow an entire prior, by doing a Frequentist-style “FOR ALL theta” analysis subject to some constraints. Restrict your regressions to monotone functions, or unimodal densities, or particular signal-to-noise ratios, or a restricted support: you don’t think the stock market will grow 10x in a single day? Then your statistical method doesn’t have to accommodate that possibility. This idea of restricting the "hypothesis class" is a cornerstone of statistical learning theory, but in that context the restrictions are rather abstract (smoothness classes in function space and so forth). An interesting area for future work is to figure out ways to bake in substantive assumptions without fully committing to a parametric model or a specific prior.
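For a flavor of what a constrained “FOR ALL theta” analysis can buy, here is a small sketch (illustrative numbers; the bound B and the linear shrinkage rule are my choices for the example): restricting theta to [-1, 1] lets a simple shrinkage estimator beat the sample mean in worst-case MSE over that set.

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_rep = 4, 100_000
se = 1.0 / np.sqrt(n)            # sampling sd of the sample mean
B = 1.0                          # the substantive constraint: |theta| <= B
c = B**2 / (B**2 + se**2)        # linear shrinkage factor tuned to that constraint

def worst_case_mse(estimator):
    # Constrained Frequentist thought experiment: the max over a grid of thetas
    # in [-B, B] of the Monte Carlo MSE at each fixed theta.
    risks = []
    for theta in np.linspace(-B, B, 21):
        xbar = rng.normal(theta, se, size=n_rep)
        risks.append(np.mean((estimator(xbar) - theta) ** 2))
    return max(risks)

print(f"sample mean : {worst_case_mse(lambda x: x):.4f}")      # about se^2 = 0.25
print(f"shrunk mean : {worst_case_mse(lambda x: c * x):.4f}")  # about 0.20
```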