Computing with Confidence: Imprecise Posteriors and Predictive Distributions
Scott Ferson1, Jason O’Rawe1,2, and Michael Balch3
1Applied Biomathematics, 100 North Country Road, Setauket, New York 11733 USA; 1-631-751-4350; fax 1-631-751-3435; scott@ramas.com
2Genetics Program, Stony Brook University, Stony Brook, New York 11794 USA; 1-631-632-8812; fax 631-632-6900; jazon33y@gmail.com
3Arctan, Inc., 2200 Wilson Boulevard, Suite 102-150, Arlington, Virginia 22201 USA; 1-540-357-0324; michael.balch@arctan-group.com
Confidence structures (c-boxes) are imprecise generalizations of confidence distributions. They encode frequentist confidence intervals, at every confidence level, for parameters of interest and thereby characterize the inferential uncertainty about distribution parameters estimated from sparse or imprecise sample data. They have a purely frequentist interpretation that makes them useful in engineering because they offer a guarantee of statistical performance through repeated use. Unlike traditional confidence intervals, which cannot usually be propagated through mathematical calculations, c-boxes can be used in calculations using the standard methods of probability bounds analysis and yield results that also admit the same confidence interpretation. This means that analysts using them can now literally compute with confidence. We provide formulas for c-boxes in several important problems including parametric and nonparametric statistical estimation from random sample data. The results are imprecise characterizations analogous to posterior distributions and posterior predictive distributions. We contrast this c-box approach to statistical estimation using traditional maximum likelihood and Bayesian methods.
Keywords: confidence structure, c-box, confidence interval, probability bounds analysis, maximum likelihood, Bayesian estimation.
What scheme should be employed to characterize input variables from sample data in the context of imprecise probabilities? Several traditional methods of statistical inference have been extended for imprecise probabilities when data are imprecise or priors are uncertain, including the method of matching moments, maximum likelihood, and robust Bayes methods (Walley 1991; Ferson et al. 2007). Methods are needed that allow analysts to translate random sample data directly into appropriate characterizations of input distributions.
A confidence interval (Neyman 1937) for parameter θ with coverage probability α has the property that, among all confidence intervals independently computed by the same method, at least a proportion α will contain the true value of θ. A confidence interval can serve as an estimate of the parameter that is more comprehensive than any point estimate because it encodes not only the available data but also the sampling uncertainty they imply. Valid confidence intervals are more than merely subjective characterizations of uncertainty; they represent rigorous claims and their use establishes a standard of statistical performance that in principle can be checked empirically. Credible intervals (sometimes called Bayesian confidence intervals) are often considered to be the Bayesian analogs of confidence intervals (Lee 1997), but credible intervals have no general accompanying guarantee like that of the frequentist notion of confidence intervals.
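The repeated-use guarantee described above can be checked empirically. The following is a minimal simulation sketch (with hypothetical parameter values, using only the Python standard library) for the textbook known-σ interval for a normal mean:

```python
from statistics import NormalDist, mean

def coverage_simulation(mu=5.0, sigma=2.0, n=10, level=0.95,
                        trials=2000, seed=1):
    """Fraction of simulated samples whose known-sigma interval for the
    mean, m +/- z * sigma / sqrt(n), contains the true mu."""
    z = NormalDist().inv_cdf((1 + level) / 2)  # about 1.96 for a 95% level
    half = z * sigma / n ** 0.5
    dist = NormalDist(mu, sigma)
    hits = 0
    for t in range(trials):
        m = mean(dist.samples(n, seed=seed + t))
        if m - half <= mu <= m + half:
            hits += 1
    return hits / trials
```

Across repeated runs the returned fraction hovers near the nominal level, which is exactly the performance claim that distinguishes confidence intervals from merely subjective characterizations of uncertainty.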
Confidence distributions were introduced by Cox (1958), but received little attention in the literature until a recent renaissance of interest (Efron 1998; Schweder and Hjort 2002; Singh et al. 2005; Xie et al. 2011; Xie and Singh 2013; inter alia). A confidence distribution is a distributional estimate for a parameter, in contrast with a point estimate like a sample mean or an interval estimate such as a confidence interval. It has the form of a distribution function on the space of possible parameter values that depends on a statistical sample in a way that encodes confidence intervals at all possible confidence levels. A confidence distribution for a parameter θ∈Θ is a function C: Θ→(0,1) such that, for every α in (0,1), (−∞, C−1(α)] is an exact lower-sided 100α% confidence interval for θ, where the inverse function C−1(α) = Cn−1(x1, …, xn, α) is increasing in α. This definition implies that [C−1(α), C−1(β)] is a 100(β−α)% confidence interval for the parameter θ whenever α < β. Although related to many other ideas in statistical inference (Singh et al. 2005; Xie et al. 2011), a confidence distribution can be considered a purely frequentist concept (Schweder and Hjort 2002; Singh et al. 2005). Although a confidence distribution has the form of a probability distribution, it is not a probability distribution. It corresponds to no randomly varying quantity; the parameter it describes is presumed to be fixed and nonrandom. The value of the function C is not the probability of θ, but rather confidence about θ (Cox 2006; cf. Lindley 1958). A confidence distribution is merely a ciphering device that encodes confidence intervals for each possible confidence level.
Confidence distributions are not widely known in statistics, but Efron (1998) characterized bootstrap distributions as approximate confidence distributions, and so the essential ideas are familiar and widely used, albeit under the guise of bootstrap distributions. Efron (2013) suggested that, because they can be thought of as a way to ground, in frequentist theory, objective Bayesian analyses that use uninformative priors, confidence distributions may be useful in resolving the most important problem in statistical inference: how to use Bayes’ theorem without prior information. There are two significant limitations that might prevent such a resolution. The first is that confidence distributions do not exist for many basic and important inferential problems; notably, there is no confidence distribution for the binomial probability. Likewise, it is not clear how they could work in a nonparametric setting. The second limitation is that, although they have the form of probability distributions, they cannot be propagated in calculations. Distributions derived from confidence distributions via the probability calculus are not in general confidence distributions themselves (Schweder and Hjort 2013; Cox 2006).
Balch (2012) introduced the notion of confidence structures, which we have taken to calling confidence boxes, or c-boxes for short, as an imprecise generalization of confidence distributions that redresses some of their limitations. They encode frequentist confidence intervals, at every confidence level, for parameters of interest. If a c-box for a parameter θ has the form of a p-box specified by its left and right bounding cumulative distribution functions B1 and B2, then every interval [B1−1(α), B2−1(β)] is a 100(β−α)% confidence interval whenever α < β. They are analogous to Bayesian posterior distributions in that they characterize the inferential uncertainty about distribution parameters estimated from sparse or imprecise sample data, but they have a purely frequentist interpretation that makes them useful in engineering because they offer a guarantee of statistical performance through repeated use. Unlike traditional confidence intervals, which cannot usually be propagated through mathematical calculations, c-boxes can be used in calculations using the standard methods of probability bounds analysis and yield results that also admit the same confidence interpretation. This means that analysts using them can now literally compute with confidence.
Balch (2012) described various ways to derive c-boxes, and proved that independent c-boxes characterizing different parameters can be combined in mathematical expressions using the conventional technology of probability bounds analysis (Ferson et al. 2003) and random-set convolutions via Cartesian products (Yager 1986), and that the results also have the confidence interpretation. Ferson et al. (2013) reviewed the properties of c-boxes, provided algorithms to compute c-boxes for some special cases and confirm their coverage properties, and compared the c-box for the binomial probability to the Imprecise Beta Model (Walley 1991; Walley et al. 1996).
Table 1 is a compendium of formulas for several important c-box cases. For each of these cases, the first line defines the sampling model, and specifies summary statistics if needed. The second line describes the associated p-box estimator for the distribution of next observable values. This p-box is an imprecise generalization of a frequentist prediction distribution, and it is analogous to a Bayesian’s posterior predictive distribution. If its left and right edges are B1 and B2, the interval [B1−1(α), B2−1(β)] is a prediction interval for Xn+1 enclosing a fraction β−α of the observable values on average. Subsequent lines give formulas for c-boxes for the parameters. In the table, the env function denotes the envelope operation which forms a p-box from two bounding distribution functions. Note that the parameters of some of the named distributions in the table may be given as intervals denoted in square brackets, which of course also induce p-boxes (Ferson et al. 2003). For the sake of notational simplicity, we have generalized the tilde beyond its conventional use in frequentist statistics. An expression of the form X ~ F is understood to mean that the uncertainty about the quantity X is characterized by F. This tilde can still be read as “has the distribution”, or maybe better as “has uncertainty like”, but it obviously does not suggest that the left-hand side is necessarily a random variable. When the left-hand side is a parameter, it is after all a value that is fixed albeit unknown.
Table 1. C-boxes for distributions and parameters for various sampling models.
The c-box for the Bernoulli and (first) binomial probability is equivalent in form to the Imprecise Beta Model, as explained in Ferson et al. (2013). The c-box for the second binomial case is computed via bootstrap simulation, which has been implemented as an R function accessible online at https://sites.google.com/site/cboxbinomialnp/.
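For readers who want to experiment, the first binomial c-box, env(beta(k, n−k+1), beta(k+1, n−k)) as given in Ferson et al. (2013), can be sketched with only the Python standard library; the beta quantiles for integer parameters are computed from the identity relating the beta CDF to binomial tail sums. The particular values k = 2 and n = 10 used below are hypothetical.

```python
from math import comb

def beta_cdf(x, a, b):
    """CDF of beta(a, b) for positive integers a, b, via the identity
    I_x(a, b) = P(Binomial(a+b-1, x) >= a)."""
    m = a + b - 1
    return sum(comb(m, j) * x**j * (1 - x)**(m - j) for j in range(a, m + 1))

def beta_quantile(q, a, b, tol=1e-9):
    """Quantile of beta(a, b) by bisection on the CDF."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def binomial_cbox_interval(k, n, alpha, beta):
    """[B1^-1(alpha), B2^-1(beta)] from the c-box edges
    B1 = beta(k, n-k+1) and B2 = beta(k+1, n-k); this is a
    100(beta-alpha)% confidence interval for the binomial probability."""
    lower = 0.0 if k == 0 else beta_quantile(alpha, k, n - k + 1)
    upper = 1.0 if k == n else beta_quantile(beta, k + 1, n - k)
    return lower, upper
```

For k = 2 and n = 10, the interval [B1−1(0.025), B2−1(0.975)] is roughly [0.025, 0.556], reproducing the classical Clopper–Pearson interval, as expected from the equivalence noted above.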
For all the parametric cases in which shape assumptions are made about the distribution from which the data are randomly sampled, the p-box for the next observation Xn+1 is generally a continuous mixture distribution generated by composing the c-box for the parameter through the distribution specified in the sampling model. Thus, for the first binomial case (with known N), this is a continuous mixture of binomial distributions for infinitely many binomial rates, which are themselves distributed according to a beta distribution, yielding a beta-binomial distribution. For the exponential case, it is a continuous mixture of exponential distributions with parameter values distributed as a gamma distribution. The resulting distribution function for gamma-exponential(b, c) is 1 − (1/(bx + 1))^c. For the Poisson case, the shape of the p-box for the next observation is likewise a gamma-Poisson mixture distribution, but this shape is more commonly known as a negative binomial distribution. In both the Bernoulli and normal cases, the mixture distributions resulting from the compositions degenerate to simple Bernoulli and (shifted and scaled) Student distributions respectively. Whether or not the mixture distribution for the next observable value Xn+1 has a named shape, it can be computed by numerical composition (Ferson et al. 2003, §3.2.1.6) from the parameter c-boxes. In the case of the Bernoulli or binomial model, for instance, the c-box for the probability p can be discretized into a collection of intervals by horizontally slicing it into intervals of equal thickness. These intervals are then used to define a collection of Bernoulli or binomial p-boxes which are then combined in a stochastic mixture with weights corresponding to the thicknesses of the respective intervals.
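The gamma-exponential form can be checked numerically. The sketch below integrates the mixture ∫(1 − e^(−λx)) dG(λ) over a gamma distribution G, reading gamma-exponential(b, c) as a gamma with shape c and scale b (our reading of the parameterization), and compares it with the closed form 1 − (1/(bx + 1))^c; the particular values of b, c and x are arbitrary.

```python
from math import exp, gamma

def gamma_pdf(lam, shape, scale):
    """Density of the gamma distribution with the given shape and scale."""
    return lam ** (shape - 1) * exp(-lam / scale) / (gamma(shape) * scale ** shape)

def exponential_mixture_cdf(x, shape, scale, steps=100_000):
    """P(X <= x) for X exponential with rate lam, where lam ~ gamma(shape,
    scale), computed by midpoint-rule integration over lam."""
    upper = scale * (shape + 40 * shape ** 0.5 + 40)  # well past the right tail
    h = upper / steps
    total = 0.0
    for i in range(steps):
        lam = (i + 0.5) * h
        total += (1 - exp(-lam * x)) * gamma_pdf(lam, shape, scale)
    return total * h

# Arbitrary demonstration values: the numerical mixture matches the
# closed form 1 - (1/(b*x + 1))**c to within integration error.
b, c, x = 0.2, 5, 3.0
assert abs(exponential_mixture_cdf(x, c, b) - (1 - (1 / (b * x + 1)) ** c)) < 1e-4
```

The agreement reflects the fact that the gamma's Laplace transform evaluated at x is (1 + bx)^(−c), which is what the continuous mixture of exponential survival functions produces.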
The c-box in the nonparametric case, at the bottom of Table 1, where no shape assumption is made about the sampling distribution can be computed by forming empirical distribution functions from the data augmented by either plus or minus infinity. This c-box results from what we have called the “relaxed sample rule” (Ferson et al. 2005; cf. Solana and Lind 1990) which is based on the idea that n random deviates from a distribution divide the real line up into n+1 segments of equal probability. An intuition for this rule is that we do not know how the probabilities are distributed within each of those segments. The c-box is essentially an equiprobable mixture of little intervals over each of the n+1 segments of the real line.
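The relaxed sample rule is straightforward to implement. Under our reading of the description above, the left (upper) bound at any x is the empirical fraction of observations at or below x with −∞ counted as one extra observation, and the right (lower) bound counts +∞ instead; a minimal sketch:

```python
def nonparametric_cbox(data):
    """Pointwise bounds on the predictive distribution for the next
    observation, from the ECDFs of the data augmented with -inf
    (left/upper bound) and +inf (right/lower bound)."""
    xs = sorted(data)
    n = len(xs)
    def left(x):
        # -inf counts as an extra observation already at or below x
        return (sum(1 for v in xs if v <= x) + 1) / (n + 1)
    def right(x):
        # +inf counts as an extra observation never at or below x
        return sum(1 for v in xs if v <= x) / (n + 1)
    return left, right
```

With n = 12 data points, each bound steps through probability levels at multiples of 1/13, which is why the imprecise structure in Figure 3 has 13 rather than 12 steps.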
The c-boxes given in Table 1 are not the only possible solutions for the sampling models. In principle, there could be other c-boxes that demonstrate the required performance features but differ in detail from these. Likewise, confidence distributions are not unique for a particular problem, which has been a point of concern and criticism (Xie and Singh 2013; Robert 2013). Balch (2012) considers this non-uniqueness to be a deficiency of the c-box approach, but this need not be so. Confidence intervals are not unique either. In some cases, one may be clearly better than another if, for instance, one is a proper subset of the other. More generally, however, it may be hard to compare two confidence intervals when neither contains the other, even if one is narrower. But this diversity of confidence intervals is not considered a problem in practical applications. Indeed, the diversity may allow users to solve nuanced problems or take advantage of possible windfalls. In general, it can be useful to know multiple ways to skin a cat.
As part of a larger review (https://sites.google.com/site/niharrachallengeproblems/) of statistical and constraint-based methods for characterizing input distributions for use in risk analysis and general uncertainty modeling, we compared the c-box approach against traditional maximum likelihood and Bayesian methods. Space limits preclude a full description of the comparisons, so we focus on three cases with example data sets: the binomial probability, the normal mean, and nonparametric estimation of an unknown distribution.
Users of Bayesian inference are well aware that the results of analyses depend on the prior distributions assumed for parameters. In fact, the ability to account for prior information is a key feature of the approach. Many users may not appreciate, however, that this dependence on priors extends to situations in which analysts specifically disclaim any prior information. Like The Honeymooners character Ed Norton, who was unable to put down a piece of paper, Bayesians cannot escape the prior. Their results will always be affected by their choice of a prior, and they cannot decline to choose one. When the data are sparse, as they often are in practical situations, these effects can be substantial.
For this reason, there have been many attempts in the Bayesian literature to find priors, or ways to select priors, that have minimal influence on the analysis, for use when the analyst has no substantive information about the parameters’ possible values. Several authors have reviewed the general conundrum (Bernardo 1979; Tibshirani 1989; Kass and Wasserman 1996; Syversveen 1998; Walley 1991; Tuyl et al. 2009). The original “uninformative” prior was the uniform distribution in which all possibilities are assigned equal probability, an idea which arose from the Principle of Indifference dating back to Laplace himself. Difficulties and inconsistencies with uniform priors led Jeffreys (1946) to suggest a class of priors that are invariant under reparameterization of the parameter vector. Several other approaches have also been suggested, including reference priors (Bernardo 1979; Berger and Bernardo 1989), maximal data information priors (Zellner 1998), and consensus priors, among others.
Bayesians do not agree about what the prior ought to be when analysts have essentially no prior information, even for the simplest and most fundamental problems of statistical inference. Thus, practitioners are left with many inconsistent choices. In our binomial example, Figure 1 depicts five possible Bayesian solutions (Walley 1991, p. 228f; Winkler et al. 2002; Tuyl et al. 2009). The associated posteriors (from leftmost to rightmost) arise from Haldane’s └┘ prior, Jeffreys’ ⋃, Zellner’s saucer, the Bayes-Laplace uniform ┌┐, and the ⋂ prior suggested by Walley (1996).
Given a sequence of n binary (Bernoulli) trials resulting in k successes, the maximum likelihood estimator for the binomial probability is the ratio k/n. For our example data, in which no successes were observed, this yields the same result as the Bayesian analysis with the Haldane prior, namely the degenerate estimate p = 0. Figure 1 also shows the c-box for p. The maximum likelihood estimate and four of the five Bayesian posteriors fall inside the c-box for the binomial probability.
Figure 1. Multiple uninformative Bayesian priors (left graph) and associated posterior distributions (right graph) for the binomial probability p. The maximum likelihood estimate corresponds to the degenerate distribution at zero. The c-box for p is shown in gray.
There are also many solutions for the normal inference problem, even in the case where the analyst professes no specific prior knowledge about either parameter. For instance, Yang and Berger (1998, page 25) suggest three possible posteriors for μ:
μuniform ~ m + s × Student(n − 3) × √((1 − 1/n) / (n − 3)) (uniform prior)
μreference ~ m + s × Student(n − 1) / √n (reference/MDIP prior)
μJeffreys ~ m + s × Student(n + 1) × √((1 − 1/n) / (n + 1)) (Jeffreys prior)
where m = (∑ Xi)/n and s² = ∑(Xi − m)²/(n − 1), computed from the sample Xi ~ normal(μ, σ), i = 1, …, n. These variant solutions are depicted for the normal example data set in Figure 2 as cumulative distributions. The differences among them are substantial when the sample size is small. This issue can be critical in disciplines such as risk analysis or extreme engineering in new environments where data are perennially in very short supply.
The maximum likelihood estimate for the normal mean is the sample mean, which is the point on the μ axis in Figure 2 where the three lines cross.
The c-box solution for the normal mean has the same shape as the middle curve in Figure 2, which is the Bayesian posterior based on the reference prior (Bernardo 1979), which also agrees with the maximal data information prior (Zellner 1998). But of course the c-box has a totally different interpretation than the Bayesian posterior distribution, which, although a characterization of rational uncertainty about the parameter μ in light of the sample data, makes no guarantee whatever about whether or how often it will be a reasonable estimate.
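Because the c-box for μ shares the shape m + s × Student(n − 1)/√n, its confidence limits can be computed by inverting a Student t distribution. The following is a self-contained sketch; the data set is hypothetical, and in practice a library routine such as a t quantile function would replace the crude numerical inversion used here.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, lo=-60.0, steps=10_000):
    """Midpoint-rule integration of the t density from the far left tail."""
    h = (x - lo) / steps
    return h * sum(t_pdf(lo + (i + 0.5) * h, df) for i in range(steps))

def t_quantile(q, df, tol=1e-5):
    """Quantile of Student's t by bisection on the numerical CDF."""
    lo, hi = -60.0, 60.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def normal_mean_quantile(q, data):
    """Quantile of m + s * Student(n-1) / sqrt(n), the distributional
    estimate whose shape the c-box for mu shares."""
    n = len(data)
    m = sum(data) / n
    s = sqrt(sum((x - m) ** 2 for x in data) / (n - 1))
    return m + s * t_quantile(q, n - 1) / sqrt(n)
```

Evaluating the 0.025 and 0.975 quantiles reproduces the usual two-sided 95% t-interval for the mean, consistent with the confidence interpretation described above.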
Figure 2. Multiple Bayesian posterior distributions for the normal mean based on different uninformative priors. The middle curve, based on the reference prior, has the same shape as the c-box for μ. The maximum likelihood estimate is the point at which the curves cross.
The third example is the nonparametric problem. The maximum likelihood estimate of the distribution is the empirical distribution function of the data, which is shown as a black step function in Figure 3. It turns out that the classical Bayesian solution to the nonparametric problem based on an uninformative Dirichlet prior is the same as the empirical distribution function (Ferguson 1973, page 223). The figure also depicts in gray the c-box analog, which is a prediction structure for the next value Xn+1. Notice that the imprecise structure has 13 steps rather than 12.
Figure 3. C-box (gray) for the nonparametric case, compared with the empirical distribution function (black), which is both the maximum likelihood estimate and the Bayesian estimate.
Because they offer a guarantee of statistical performance through repeated use, traditional (Neyman) confidence intervals are attractive to engineers who seek such assurance. In the past, analysts could not use confidence intervals in subsequent analyses and assessments because they cannot generally be propagated through mathematical calculations. Confidence structures (c-boxes) generalize confidence distributions and provide an interpretation by which confidence intervals at any confidence level can be specified for a parameter of interest. More importantly, c-boxes can be used in calculations using the standard methods of probability bounds analysis, and these calculations yield results that also admit the confidence interpretation. This means that analysts using them can now compute with confidence, both figuratively and literally.
We have presented formulas to compute c-boxes and the associated p-boxes for distributions of observable values for several important cases, including parametric and nonparametric statistical estimation from random sample data. The results of these formulas include characterizations of inferential uncertainty analogous to both posterior distributions and posterior predictive distributions but with fundamentally different interpretations that do not depend on prior assumptions.
We contrasted this c-box approach to statistical estimation using traditional maximum likelihood and Bayesian methods and compared the results graphically. Maximum likelihood methods can characterize inferential imprecision arising from sampling uncertainty, but they do so in a dead-end way. Although confidence intervals can usually be computed for maximum likelihood estimators, these confidence intervals cannot be easily incorporated into subsequent calculations that typify risk assessments and uncertainty modeling. Bayesian analysis, on the other hand, allows follow-on calculations but does not support any guarantee of statistical performance. The diversity of Bayesian solutions to these inference problems, arising from disagreements among Bayesians about the characterization of the uninformative prior, turns out in our numerical examples to be greater than the imprecision of the c-box, sometimes massively so.
Maximum likelihood estimators, predicated on an optimality criterion, are well known to be unreliable for small sample sizes. Bayesian methods, which are predicated on coherence conditions, produce a variety of answers to the same question, even under epistemically identical conditions. In contrast, the c-box approach, which is built on the establishment and maintenance of statistical performance, can sustain its guarantees even for very small sample sizes and, because it is not focused on finding the answer for a problem, it creates a methodology that can nevertheless achieve mission-wide performance at any desired level of surety.
We thank Kevin Shoemaker, Masatoshi Sugeno and Jimmie Goode at Applied Biomathematics, Stephan Munch of NOAA, and Keith Hayes of CSIRO for helpful discussions. Support was provided by the National Library of Medicine, a component of the National Institutes of Health (NIH) within the United States Department of Health and Human Services, through a Small Business Innovation Research grant (RC3LM010794) to Applied Biomathematics funded under the American Recovery and Reinvestment Act. The views and opinions expressed herein are solely those of the authors and not those of the National Library of Medicine or NIH.
Balch, M.S. (2012). “Mathematical foundations for a theory of confidence structures.” International Journal of Approximate Reasoning 53: 1003−1019.
Berger, J.O. and Bernardo, J.M. (1989). “Estimating a product of means: Bayesian analysis with reference priors.” Journal of the American Statistical Association 84: 200−207. http://www.uv.es/~bernardo/1989JASA.pdf
Bernardo, J.M. (1979). “Reference posterior distributions for Bayesian inference.” Journal of the Royal Statistical Society B 41: 113−147 (with discussion).
Cox, D.R. (1958). “Some problems connected with statistical inference.” The Annals of Mathematical Statistics 29: 357−372.
Cox, D.R. (2006). Principles of Statistical Inference. Cambridge University Press.
Efron, B. (1998). “R.A. Fisher in the 21st century.” Statistical Science 13: 95−122.
Efron, B. (2013). International Statistical Review 81: 41−42.
Ferguson, T.S. (1973). “A Bayesian analysis of some nonparametric problems.” The Annals of Statistics 1: 209−230. http://projecteuclid.org/euclid.aos/1176342360
Ferson, S., V. Kreinovich, L. Ginzburg, K. Sentz and D.S. Myers (2003). Constructing Probability Boxes and Dempster-Shafer Structures. SAND2002-4015, Sandia National Laboratories, Albuquerque, NM. http://www.ramas.com/unabridged.zip
Ferson, S., J. Hajagos, D.S. Meyers and W.T. Tucker (2005). Constructor: Synthesizing Information about Uncertain Variables. SAND2005-3769, Sandia National Laboratories, Albuquerque, NM. www.ramas.com/constructor.pdf
Ferson, S., V. Kreinovich, J. Hajagos, W.L. Oberkampf and L. Ginzburg (2007). Experimental Uncertainty Estimation and Statistics for Data Having Interval Uncertainty. SAND2007-0939, Sandia National Laboratories, Albuquerque, NM. http://www.ramas.com/intstats.pdf
Ferson, S., M. Balch, K. Sentz, and J. Siegrist (2013). “Computing with confidence.” Proceedings of the 8th International Symposium on Imprecise Probability: Theories and Applications, F. Cozman, T. Denœux, S. Destercke and T. Seidenfeld (eds.). SIPTA, Compiègne, France. https://sites.google.com/site/confidenceboxes/isipta
Jeffreys, H. (1946). “An invariant form for the prior probability in estimation problems.” Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences 186: 453−461.
Kass, R., and L. Wasserman (1996). “The selection of prior distributions by formal rules.” Journal of the American Statistical Association 91: 1343−1370.
Lee, P.M. (1997). Bayesian Statistics: An Introduction. Arnold.
Lindley, D.V. (1958). “Fiducial distributions and Bayes’ theorem.” Journal of the Royal Statistical Society, Series B 20: 102−107.
Neyman, J. (1937). “Outline of a theory of statistical estimation based on the classical theory of probability.” Philosophical Transactions of the Royal Society A237: 333−380.
Schweder, T., and N.L. Hjort (2002). “Confidence and likelihood.” Scandinavian Journal of Statistics 29: 309−332.
Schweder, T., and N.L. Hjort (2013). International Statistical Review 81: 56−68.
Singh, K., M. Xie and W.E. Strawderman (2005). “Combining information from independent sources through confidence distributions.” The Annals of Statistics 33: 159−183.
Solana, V., and N.C. Lind (1990). “Two principles for data based on probabilistic system analysis.” Proceedings of ICOSSAR '89, 5th International Conference on Structural Safety and Reliability. American Society of Civil Engineers, New York.
Syversveen, A.R. (1998). “Noninformative Bayesian priors, interpretation and problems with construction and applications.” Preprint Statistics 3, Mathematical Sciences, NTNU, Trondheim. www.ime.unicamp.br/~veronica/ME705/paper2.pdf
Tibshirani, R. (1989). “Noninformative priors for one parameter of many.” Biometrika 76: 604−608.
Tuyl, F., R. Gerlach and K. Mengersen (2009). “Posterior predictive arguments in favor of the Bayes-Laplace prior as the consensus prior for binomial and multinomial parameters.” Bayesian Analysis 4: 151−158.
Walley, P. (1991). Statistical Reasoning with Imprecise Probabilities. Chapman & Hall.
Walley, P. (1996). “Inferences from multinomial data: learning about a bag of marbles.” Journal of the Royal Statistical Society, Series B 58: 3−57.
Walley, P., L. Gurrin and P. Barton (1996). “Analysis of clinical data using imprecise prior probabilities.” The Statistician 45: 457−485.
Winkler, R.L., J.E. Smith, and D.G. Fryback (2002). “The role of informative priors in zero-numerator problems: being conservative versus being candid.” The American Statistician 56: 1−4, with discussion. See also Comments by Browne and Eddings and Reply, The American Statistician 56: 252−253.
Xie, M., and K. Singh (2013). “Confidence distribution, the frequentist distribution estimator of a parameter—a review.” International Statistical Review 81: 3−39.
Xie, M., K. Singh and W.E. Strawderman (2011). “Confidence distributions and a unifying framework for meta-analysis.” Journal of the American Statistical Association 106(493): 320−333.
Yager, R.R. (1986). “Arithmetic and other operations on Dempster−Shafer structures.” International Journal of Man-Machine Studies 25: 357−366.
Yang, R., and J.O. Berger (1998). “A catalog of noninformative priors”, http://www.stats.org.uk/priors/noninformative/YangBerger1998.pdf
Zellner, A. (1998). “Past and recent results on maximal data information priors.” Journal of Statistical Research 32: 1−22.