Bayes’ rule is
P(A|B) = P(A) P(B|A) / P(B)
where A is some event and B is new evidence relevant to it. P(A|B) is the posterior, P(A) is the prior, and P(B|A) is the likelihood. P(B) is called the normalization factor and, by the law of total probability, is often computed as
P(B) = P(A) × P(B|A) + (1 − P(A)) × (1 − P(not B|not A))
where 1 − P(A) is P(not A) and 1 − P(not B|not A) is P(B|not A).
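To make the scalar arithmetic concrete, here is a minimal sketch in Python with made-up numbers (a prevalence of 1%, a sensitivity of 90%, and a specificity of 95%, all hypothetical):

# Scalar Bayes' rule for a diagnostic test (all numbers hypothetical)
prevalence  = 0.01    # P(A), prior probability of disease
sensitivity = 0.90    # P(B|A), probability of a positive test given disease
specificity = 0.95    # P(not B|not A), probability of a negative test given no disease

normalization = prevalence * sensitivity + (1 - prevalence) * (1 - specificity)
posterior = prevalence * sensitivity / normalization
print(posterior)      # about 0.154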
Many textbook introductions to Bayesian analysis seem to say that Bayes’ rule is extended to distributions by applying it to many possible values of some parameter θ, as in the figure below. In this case, the prior gives the probability density that θ takes any particular value along the axis. The likelihood is the function that gives the probability of seeing the observed data if the value of θ were actually the value given on the axis. Although the prior and the resulting posterior are both distributions of θ, note that the likelihood here is not a distribution (it doesn’t integrate to unity).
In this scheme, Bayes’ rule is applied to each value of θ along its axis by multiplying the probability density of the prior at a given value of θ by the corresponding likelihood at that value of θ. We call this operation zippering because it associates the prior density function and the likelihood function at corresponding values of θ. The product function that results from this zippering is proportional to the posterior but must be rescaled to have unit area by dividing it by the normalization factor. The normalization factor is often hard to compute analytically, but in this scheme that might not matter much in practice, because we can often characterize the posterior well enough by numerically integrating the unnormalized product to find the value needed to rescale it. This handy trick is not available to analysts when they use Bayes’ rule with total probabilities (i.e., scalar values), nor in the alternative distributional version of Bayes’ rule that I’ll now describe.
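As a sketch of this zippering scheme, here is a small grid-based calculation in Python (using NumPy and SciPy); the beta(2, 2) prior, the binomial likelihood, and the data (7 successes in 20 trials) are all hypothetical choices made only for illustration:

import numpy as np
from scipy import stats

# Hypothetical setup: theta is an unknown success probability
theta = np.linspace(0.001, 0.999, 999)       # grid of possible values of theta
dtheta = theta[1] - theta[0]                 # grid spacing for numerical integration

prior = stats.beta.pdf(theta, 2, 2)          # prior density at each grid value (made-up beta(2, 2) prior)
likelihood = stats.binom.pmf(7, 20, theta)   # likelihood at each grid value; not a distribution in theta

# Zippering: multiply prior and likelihood at corresponding values of theta
unnormalized = prior * likelihood

# The normalization factor is obtained numerically so the posterior has unit area
normalization = (unnormalized * dtheta).sum()
posterior = unnormalized / normalization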
Another possible way to extend Bayes’ rule to distributions is to characterize our uncertainty about the terms that define the likelihood as random variables that have distributions themselves. Evaluating Bayes’ rule in this situation requires generalized convolutions to combine the prior and likelihood, rather than mere zippering. For instance, if the sensitivity and specificity of a medical test are modeled as probability distributions (which we might use to represent sampling uncertainty about them), then we might express Bayes’ rule as
posterior = 1 / (1 + ((1/prevalence − 1) × (1 − specificity)) / sensitivity)
where the prior is the prevalence, that is, the frequency of disease in the population [see Bayes’ rule with epistemic uncertainty and without independence assumptions for an explanation of this expression]. If specificity and sensitivity are probability distributions, then the arithmetic that combines them, including the division, must be a generalized convolution rather than simple operations on real-valued numbers. Although there are other ways to do it, this convolution is usually evaluated by Monte Carlo simulation, in which possible values drawn from each distribution are combined to build up the distribution of the resulting posterior. This is a different scheme for extending Bayes’ rule to distributions, and we can call the operation a convolution to distinguish it from zippering.
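A minimal Monte Carlo sketch of this second scheme in Python, assuming hypothetical beta sampling distributions for sensitivity and specificity (chosen only for illustration) and drawing them independently purely for simplicity, might look like this:

import numpy as np

rng = np.random.default_rng(1)
n = 100_000                          # number of Monte Carlo replicates

prevalence = 0.01                    # prior: frequency of disease (hypothetical scalar)

# Hypothetical sampling distributions for the test characteristics
sensitivity = rng.beta(90, 10, n)    # uncertain sensitivity, centered near 0.90
specificity = rng.beta(95, 5, n)     # uncertain specificity, centered near 0.95

# Each replicate applies the scalar expression; together the replicates
# build up the distribution of the posterior (the generalized convolution)
posterior = 1 / (1 + ((1 / prevalence - 1) * (1 - specificity)) / sensitivity)

print(posterior.mean(), np.quantile(posterior, [0.025, 0.975]))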
Note that we’re using the word ‘convolution’ in a slightly more general sense than many people do, for whom the term by itself refers only to the operation that obtains the distribution of sums under independence. We are referring to any operation that obtains the distribution of any (usually basic) arithmetic function of the operands, under any dependence, from their marginal distributions. For us, a ‘convolution’ is any operation that in principle convolves each element of one operand with every element of the other operand. So Yager’s Cartesian product is a convolution in this sense, and operations that evaluate differences, products, quotients and other functions are convolutions as well. Note, however, that zippering is not a convolution, because the arithmetic function is applied only to specific paired values from the respective operands, not to all possible pairs. So this distributional version of Bayes’ rule is decidedly different from the one popularly described in textbooks.
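To make the contrast concrete, here is a toy sketch in Python with two made-up discrete operands: the convolution (shown here for products under independence, only as an example) combines every value of one operand with every value of the other, whereas zippering combines only corresponding values.

import numpy as np

# Two made-up discrete operands and their probability masses
x_vals, x_probs = np.array([1.0, 2.0, 3.0]), np.array([0.2, 0.5, 0.3])
y_vals, y_probs = np.array([0.5, 1.0, 1.5]), np.array([0.3, 0.4, 0.3])

# Convolution: every pair of values is combined; the 9 value/mass pairs describe the product's distribution
conv_vals = np.multiply.outer(x_vals, y_vals).ravel()
conv_probs = np.multiply.outer(x_probs, y_probs).ravel()

# Zippering: only corresponding values are combined (3 values), so no distribution over all pairs is formed
zip_vals = x_vals * y_vals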
Is this view of a distributional version of Bayes’ rule sound? Is it compatible with standard Bayesian views? Winkler and Smith (2004) seem to argue that the distributional approach evaluated with Monte Carlo simulation, such as the one outlined here, is significantly flawed, calling it “incorrect” (page 656, left column) and an instance of “confusion in the medical decision-making literature” (page 654, abstract). Our goal is to be able to generalize the distributional version of Bayes’ rule to handle p-boxes. Is there some tweak or correction addressing the flaw identified by Winkler and Smith that we should make to our view before we generalize it to p-boxes and other structures representing epistemic uncertainty?