How data informs us when we can’t make the usual assumptions
Risk analysts often need to estimate parameters such as the prevalence of a disease in a population, the failure rate of assembly-line products, the political support for a candidate from polling information, or the abundance of an endangered species among more common congeners. These estimates are often inputs to subsequent calculations in a risk assessment. Various statistical schools of thought have devised schemes for making such estimates, characterizing their uncertainties, and propagating them through computations.
This on-line collaboration is a virtual round-table discussion on the respective advantages and disadvantages of these approaches in the face of complexities commonly encountered in risk analysis, such as small or very small sample sizes, non-random sampling, data that may be non-randomly missing, ambiguous sample observations that cannot easily be classified into one category or the other, and nonstationarity of the underlying probability rates. Champions of the major approaches (classical, frequentist, Bayesian, imprecise probability, and confidence-structure methods) have presented their solutions to several simple but non-standard binomial estimation problems involving combinations of these complexities. The problems emphasize the need to make subsequent computations that account for the uncertainties of the estimates.
The purpose of this exercise is to highlight the practical differences among the various approaches in difficult but common problems, including the computational (and conceptual) burdens each approach places on the risk analyst. The collaboration began with an informal symposium and interactive panel discussion at the annual meeting of the Society for Risk Analysis in 2011 in Charleston, South Carolina. The discussions in Charleston were lively and multifaceted. The collaboration continues here with an on-going discussion about how binary data informs us in estimating underlying probability rates and making calculations with these estimates. Everyone is invited to contribute to this discussion about the practicality of these methods for the challenges we meet in risk analysis when data are scarce and poorly structured. Terse contributions will typically be appended as comments at the bottom of the relevant webpage, but more elaborate contributions, including alternative solutions to the challenge problems, may merit their own affiliated page. Contributors retain copyright to their submissions. Contributions will be moderated for relevance.
Keywords: bag of marbles, binomial proportion, binomial rate, percentage, k out of n, binary data, confidence interval, credible interval, confidence distribution, c-box, confidence structure, uncertainty, risk analysis, National Library of Medicine, Applied Biomathematics
The challenge problems are based on three different tasks:
A. Evaluating rates
The A problems are to characterize the underlying probability p, from k successes in n independent trials. This problem arises when we want to assess the success rate for a medical procedure, estimate the probability of failure for some manufactured component, or evaluate the performance of a process, employee or policy. This is the classic binomial inference problem, where the sample data are the outcomes of the previous trials, classified into successes and failures.
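As a baseline for the discussion, the standard textbook treatment of this task in R might look like the following sketch; the counts are arbitrary placeholders, and the exact (Clopper-Pearson) interval is only one of many interval estimators that have been proposed.
# Illustrative sketch only: conventional point estimate and exact
# (Clopper-Pearson) 95% confidence interval for a binomial rate,
# with k and n standing in for the counts of a particular problem
k <- 7; n <- 10                     # arbitrary placeholder counts
k / n                               # maximum-likelihood point estimate
binom.test(k, n)$conf.int           # exact two-sided 95% confidence interval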
B. Comparing rates
The B problems are to determine which of two such probabilities is larger, based on an observed k1 successes in n1 independent trials in the first process and k2 successes in n2 independent trials in the second process. This problem arises when a patient must decide between different medical treatments based on outcomes from the procedure experienced by other patients, or when a factory manager must select a vendor based on previous deliveries, or when an employer must decide which employees to retain and which to lay off based on their past performances. The problem is to determine which is better and, in some cases, also to determine by how much. Note that this does not necessarily translate directly into a decision problem. There may be other qualitative factors involved in the decision as well as the average success rate.
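For reference, the conventional frequentist phrasing of this comparison is a test on a 2x2 table; the sketch below uses hypothetical counts, and Fisher's exact test and prop.test are only two of several standard choices.
# Illustrative sketch only: comparing two binomial rates via a 2x2 table,
# using hypothetical counts and two conventional frequentist tools
k1 <- 3; n1 <- 10                   # successes and trials, first process
k2 <- 6; n2 <- 12                   # successes and trials, second process
tab <- matrix(c(k1, n1 - k1, k2, n2 - k2), nrow = 2,
              dimnames = list(c("success", "failure"), c("first", "second")))
fisher.test(tab)                    # exact test and odds-ratio interval
prop.test(c(k1, k2), c(n1, n2))     # approximate test and interval for p1 - p2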
C. Propagating rate information
The C problems are to characterize the joint probability of success that depends on success in each of two separate processes, based again on the same k-out-of-n data for the two processes. This problem can arise when we try to estimate the risk of successful operation or failure of a two-part assembly, or in the evaluation of a manufacturing plan that depends on two vendors supplying needed parts.
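One simple way to see what propagation involves is to push the sampling uncertainty about each rate through the product by Monte Carlo simulation. The sketch below represents each rate with a Jeffreys-style beta distribution, a Bayesian-flavoured choice made here purely for illustration on hypothetical counts; other schools of thought handle this step quite differently.
# Illustrative sketch only: Monte Carlo propagation of the product p1*p2,
# with uncertainty about each rate represented by a Jeffreys beta(k+0.5, n-k+0.5)
# distribution (one convenient choice, not the only one) and independence assumed
set.seed(1)
k1 <- 3; n1 <- 10; k2 <- 6; n2 <- 12           # hypothetical placeholder counts
p1 <- rbeta(100000, k1 + 0.5, n1 - k1 + 0.5)   # plausible values of the first rate
p2 <- rbeta(100000, k2 + 0.5, n2 - k2 + 0.5)   # plausible values of the second rate
p12 <- p1 * p2                                 # propagated product
quantile(p12, c(0.025, 0.5, 0.975))            # summary of the resulting uncertainty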
In the real world, each of these kinds of problems can be, and usually is, complicated by one or several factors:
Small sample size
Small sample size is a classical problem of statistics in which we try to infer broad patterns from a sometimes very narrow glimpse of information. The issue arises in novel situations common in risk analysis. For example, a medical treatment might have just been introduced so it has been used only a few times before, or is new to a hospital. Or a manufactured assembly to be tested might be a new design so only a few have so far been built. Sample sizes may also be low when sampling is expensive or when information is private or proprietary, or simply has not been the focus of previous attention.
Non-independence
The sequential samples from a single process may not be independent. This complexity can arise in several ways. For instance, because E. coli contamination in slaughterhouses typically affects the knives and grinders used in meat processing, sequentially produced units of ground beef are not likely to be independent in terms of the probability of being sanitary versus contaminated. Contamination is highly clustered within production lots. In contrast, a policy that depends on humans for implementation may tend to yield an over-dispersed pattern of performance if a recent failure makes people work more attentively over the next several cycles.
Missing data
Observations are occasionally missing from data sets. For instance, it may be impossible to find samples at all the factorial combinations demanded in a complex study design. Sometimes experimental subjects drop out of protocols. Sometimes experiments must be terminated before all outcomes can be resolved. Sometimes collected data are physically lost because of human error or corruption of data storage media. Statistical methods such as imputation have been devised to handle missing data that can be assumed to be missing at random. But these methods are of limited utility when the reason the data are missing is related to the data values or their interactions with other data, or in cases where the reason the data are missing cannot be discerned.
Ambiguous observations
In some situations, data may be only partially missing, so there is still partial information about an outcome. Statisticians say that such data are censored. In other cases, the collected data may be precisely observed yet still be ambiguous. For instance, some manufactured assemblies may be on the border between being acceptable and unacceptable, and it is impossible to tell conclusively whether or not they are to specification. Likewise, it may be hard for physicians to decide whether a medical treatment resulted in a positive or negative outcome because of the complexity of biological functions, or on account of confounding factors that obscure the outcome. In the case of binary data (true/false, healthy/sick, etc.), ambiguous outcome data are degenerate in that they become vacuous ("true or false", "healthy or sick", etc.). In the case of multinomial data, however, ambiguous observations may still contain some information. For example, if the outcomes are categorical, we may be able to rule out some categories while still not being able to confidently specify a single category for the outcome.
Non-stationarity
Sometimes there is not a single process at play, but instead multiple processes or an evolving series of processes that produce successes or failures at different rates through time. For instance, if you're assessing the performance of a newly hired employee on an assembly line, the success rate may improve with the employee's experience. In such cases, the binomial rate parameter is not a constant, but a changing quantity.
Non-representativeness
Almost all statistical methods assume that the data collected are representative of the underlying process we are trying to measure. But this assumption may not always be true. An employer may have inadvertently chosen to evaluate an employee's productivity at a stressful time for the employee, such as during a period of domestic strife. Or the employer may have deliberately chosen to conduct an evaluation at a stressful time, such as after an announcement of impending layoffs. In either case, the evidence from counting the employee's successes and failures may not be representative of the employee's performance in normal times. Structured sampling can also be non-representative: making observations only in the early morning may not yield data that properly represent the average performance rates over the course of an entire day.
A1. Quantitatively evaluate a policy that has been in place for 60 months, of which 15 months are considered successes and 45 months are considered failures. The observations each month were 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, where 0 denotes a failure and 1 a success. How should we characterize the probability of success and our uncertainty about this probability?
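For concreteness, a conventional calculation with these counts might look like the sketch below; it reports the sample proportion together with two standard 95% intervals, without taking a position on which of the competing schools handles the problem best.
# Illustrative sketch only: conventional summaries for problem A1 (15 of 60)
k <- 15; n <- 60
k / n                                          # sample proportion, 0.25
binom.test(k, n)$conf.int                      # exact (Clopper-Pearson) 95% interval
qbeta(c(0.025, 0.975), k + 0.5, n - k + 0.5)   # Jeffreys-interval alternative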
B1. Two alternative processes are to be compared. The observations for the first process were 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, which represent 12 successes and 48 failures out of 60 trials. The observations for the second process were 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, which represent 22 successes and 38 failures out of 60 trials. Given that we prefer success to failure, which process should be used for the next trial? Which process yields the better performance? How large is the difference between the two processes? Is this difference statistically significant, i.e., beyond what might be reasonably attributed to chance alone?
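A conventional frequentist comparison of these two sets of counts, offered only as a baseline sketch, might run as follows.
# Illustrative sketch only: baseline frequentist comparison for problem B1
k1 <- 12; n1 <- 60                             # first process
k2 <- 22; n2 <- 60                             # second process
k2 / n2 - k1 / n1                              # observed difference in proportions
prop.test(c(k1, k2), c(n1, n2))                # approximate test and CI for the difference
fisher.test(matrix(c(k1, n1 - k1, k2, n2 - k2), nrow = 2))  # exact test on the 2x2 table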
C1. Using the data of problem B1, how should we characterize the product of the probabilities of success of the two processes, and its uncertainty?
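One crude way to get a feel for the propagation problem, sketched below, is simply to multiply the endpoints of separate 95% confidence intervals for the two rates. By a Bonferroni argument the resulting interval covers the true product with probability at least 90%, but its exact coverage is not controlled, which is part of what makes this question interesting.
# Illustrative sketch only: a crude interval for the product of the two B1 rates,
# formed by multiplying the endpoints of separate exact 95% confidence intervals
# (coverage is at least 90% by Bonferroni, but is not exactly 95%)
ci1 <- binom.test(12, 60)$conf.int             # first process, 12 successes in 60
ci2 <- binom.test(22, 60)$conf.int             # second process, 22 successes in 60
(12/60) * (22/60)                              # naive point estimate of the product
c(ci1[1] * ci2[1], ci1[2] * ci2[2])            # crude bounds on p1 * p2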
A2. In another case, the policy evaluation period was short and only 8 trials could be accumulated, of which 6 were successes. The observations were 1, 0, 1, 1, 1, 1, 0, 1. What is the probability of success, and what is its uncertainty?
B2. We want to quantitatively compare two new processes, for which only 8 random trials each have so far been run. The observations for the first process were 0, 0, 0, 1, 0, 1, 1, 1, and the observations for the second process were 1, 1, 1, 1, 1, 1, 1, 1. How large is the difference, and does it represent reliable evidence that the second process is actually better than the first?
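With only eight trials per process, the conventional machinery gives notably equivocal answers. The sketch below applies Fisher's exact test and exact intervals to the B2 counts, purely as a baseline against which the other approaches can be compared.
# Illustrative sketch only: small-sample comparison for problem B2
k1 <- 4; n1 <- 8                               # first process, 4 successes in 8 trials
k2 <- 8; n2 <- 8                               # second process, 8 successes in 8 trials
fisher.test(matrix(c(k1, n1 - k1, k2, n2 - k2), nrow = 2))  # exact two-sided test
binom.test(k1, n1)$conf.int                    # wide 95% interval for the first rate
binom.test(k2, n2)$conf.int                    # even 8 of 8 leaves substantial uncertainty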
C2. Using the data of problem B2, what can be said about the product of the probabilities of the two processes?
Using the data of problem B2, what can be said about the probability of joint success, which requires success first with the first process and then with the second process, if we cannot be sure that the two processes are independent?
A3. The observations for a new process were 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, which are 38 successes and 52 failures in 90 trials. These observations do not pass a runs test for randomness, which suggests that sequential values are not independent. What can be said about the probability of success of the next trial drawn from this process?
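The runs test mentioned here can be computed directly in base R. The sketch below, which uses the vector a.3 from the data listing further down this page, counts the observed runs, compares them with what an independent sequence would be expected to produce, and shows the naive estimate that would be reported if the dependence were simply ignored.
# Illustrative sketch only: Wald-Wolfowitz runs test for problem A3
# (requires the vector a.3 from the R data listing below)
n1 <- sum(a.3); n0 <- sum(1 - a.3); n <- n1 + n0   # 38 ones and 52 zeros
runs <- sum(diff(a.3) != 0) + 1                    # observed number of runs
mu <- 1 + 2 * n1 * n0 / n                          # expected runs under independence
v  <- 2 * n1 * n0 * (2 * n1 * n0 - n) / (n^2 * (n - 1))
z  <- (runs - mu) / sqrt(v)                        # far below zero: values are clustered
2 * pnorm(-abs(z))                                 # two-sided p-value of the runs test
n1 / n                                             # naive estimate that ignores the dependence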
A4. Twenty trial units of a complex assembly were ordered and produced. The evaluation and testing procedure involves long-term storage of the assemblies, during which 5 of them went missing. We are not sure what might have happened. The assemblies might be missing because the employee constructed them badly and does not want anyone to know. In this situation, the missing assemblies are likely unacceptable. The assemblies might also be missing because they were stolen by an opportunistic burglar. In this situation, acceptable assemblies are no more or less likely to be missing than unacceptable ones. The assemblies might also have been stolen by an inspector secretly involved in industrial espionage. In this situation, the missing assemblies were likely of acceptable quality. An investigation is underway, but in the meantime, because they are missing, we cannot know what their quality is. The observations were 1, NA, 0, 0, 0, 1, 1, 1, NA, NA, 1, 1, 0, NA, 1, 1, 1, 1, 1, NA, which are 11 successes and 4 failures, and 5 missing values, in 20 trials. What can we say about the acceptability rate for these assemblies?
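Because the reason the five assemblies are missing is unknown, one simple bounding exercise, sketched below, treats them first as all unacceptable and then as all acceptable; any characterization consistent with the data must lie within the resulting range.
# Illustrative sketch only: bounding the acceptability rate for problem A4
# by counting the 5 missing assemblies first as failures, then as successes
k_low <- 11; k_high <- 11 + 5; n <- 20
c(k_low / n, k_high / n)                       # point estimate bounded by [0.55, 0.80]
binom.test(k_low,  n)$conf.int                 # 95% interval if all missing units failed
binom.test(k_high, n)$conf.int                 # 95% interval if all missing units passed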
A5. In 20 trials of a new medical procedure, there were 7 cases with clearly positive outcomes and 8 cases with clearly adverse outcomes. In the remaining five trials, the result was ambiguous and we cannot really tell yet whether the outcome was good or bad. The observations were ?, 1, 0, ?, 0, 1, 1, 0, 0, 0, 1, 0, ?, 0, ?, 1, ?, 0, 1, 1, where ? denotes an ambiguous outcome. What is the probability of the next case yielding a positive outcome? What is the uncertainty about this probability?
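The same bounding idea applies to the ambiguous outcomes. The sketch below, which uses the vector a.5 from the data listing further down this page (where NaN encodes an ambiguous result), brackets the estimate by counting the five ambiguous cases first as negative and then as positive.
# Illustrative sketch only: bracketing the success rate for problem A5
# (requires the vector a.5 from the R data listing below; NaN = ambiguous)
k_low  <- sum(a.5 == 1, na.rm = TRUE)          # 7 clearly positive outcomes
k_high <- k_low + sum(is.nan(a.5))             # plus the 5 ambiguous ones, 12 in all
n <- length(a.5)                               # 20 trials
c(k_low / n, k_high / n)                       # point estimate bracketed by [0.35, 0.60]
binom.test(k_low,  n)$conf.int                 # 95% interval at the pessimistic extreme
binom.test(k_high, n)$conf.int                 # 95% interval at the optimistic extreme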
A6. The process that generated the observations 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1 (32 successes and 58 failures in 90 trials) seems to be improving. The number of successes in the second half of the data is more than twice that of the first half. Given that the process may not be stationary, how should we characterize the probability of success for the next trial? What is the uncertainty about that probability?
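One simple way to let the data speak about the apparent trend, sketched below with the vector a.6 from the data listing that follows, is to fit a logistic regression of the outcome on the trial number and extrapolate one step ahead; whether such an extrapolation is warranted is, of course, part of what the challenge asks.
# Illustrative sketch only: a logistic-regression trend fit for problem A6
# (requires the vector a.6 from the R data listing below)
trial <- seq_along(a.6)                        # trial index 1, 2, ..., 90
fit <- glm(a.6 ~ trial, family = binomial)     # lets the success rate drift with time
summary(fit)$coefficients                      # is the apparent improvement distinguishable from noise?
predict(fit, data.frame(trial = 91), type = "response")  # extrapolated rate for the next trial
mean(a.6)                                      # naive estimate that assumes stationarity, about 0.36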
# Data sets for the challenge problems encoded for the R language and environment for statistical computing
a.1=c(0,0,0,0,0,1,1,0,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,1,0,0,0,1,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1)
a.2=c(1,0,1,1,1,1,0,1)
a.3=c(0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,1,1,1,1,1,1,0,0,0,0,0,0,1,1,0,0,1,1,1,1,1,0,0,0,0,0,1,0,0,0,0,1,1,1,1,0,0,0,1,1,1,1,0,0,1,1,1,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0)
a.4=c(1,NA,0,0,0,1,1,1,NA,NA,1,1,0,NA,1,1,1,1,1,NA)
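# note: NA in a.4 encodes a missing observation (problem A4); NaN in a.5 encodes an ambiguous observation (problem A5)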
a.5=c(NaN,1,0,NaN,0,1,1,0,0,0,1,0,NaN,0,NaN,1,NaN,0,1,1)
a.6=c(0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0,1,1,0,0,1,0,1,1,0,1,0,0,1,1,1,0,0,1,1,0,0,0,1,1,0,0,1,1,0,0,0,0,1,0,1,1,0,0,1,1,1,1,1,0,0,0,0,0,1)
b.1=c(0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,1,1,0,0)
c.1=c(0,0,0,0,1,1,1,0,1,1,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,1,0,1,0,1,1,1,0,0,0,0,0,0,1,0,1,1,1,0,1,0,0,1,0,0,0,0,1,0,0,1,1,0,0)
b.2=c(0,0,0,1,0,1,1,1)
c.2=c(1,1,1,1,1,1,1,1)
# Key
#
# a.x   data for problem Ax: probability of the next sample being 1
# b.x   data for the first process in problems Bx (which process is better) and Cx
# c.x   data for the second process in problems Bx and Cx (product of the two processes)
#
# .1 no special complications
# .2 small n
# .3 non-independent
# .4 non-random missing data
# .5 ambiguous observations
# .6 non-stationary
Everyone is welcome to join this discussion on risk and uncertainty analysis. Email your contribution to Scott Ferson at sandp8(at)gmail(dot)com. It may be in the form of simple text, PDF file, slide show or other document. Your contribution will be added to this site as you direct, perhaps appended to the bottom of this page, or maybe made into a new affiliated solution page. Your email address will not be published unless you indicate you want it to be used to sign your contribution.
This website was originated by Applied Biomathematics, a research and software company specializing in environmental and ecological risk analysis. For the last three decades, its focus has been to translate theoretical developments in ecology and statistics into practical methods for addressing environmental and ecological problems. Support for this project was provided by the National Library of Medicine, a component of the National Institutes of Health (NIH), through a Small Business grant (award number RC3LM010794) to Applied Biomathematics funded under the American Recovery and Reinvestment Act.
The views and opinions expressed herein and in comments below are those of the individual contributors and commenters, and should not be considered those of any of the other authors or collaborators, nor of Applied Biomathematics, the Society for Risk Analysis, the National Library of Medicine, National Institutes of Health, or other sponsors or affiliates. Copyrights for the contributed material and commentary remain with their respective authors.
Wikipedia article on confidence intervals for binomial proportions
Wolfram article on confidence intervals for the binomial distribution
R packages for confidence intervals for binomial processes: PropCIs, binom, BlakerCI, binCI
The presentations by the invited panel discussants at the SRA symposium can be seen below.
Everyone is welcome to join the discussion. Send us comments or criticisms about the challenge or related issues, or your solutions to any or all of the problems. You can send a simple email, or a URL link to a webpage you are hosting, which we will reference here, or you can send a PDF or Word file suitable for posting on the web, which we will upload and link to. Your email address will not be published unless you specifically request that it be. Send your email, link or document to Scott Ferson at sandp8(at)gmail(dot)com.
Scott Ferson
1:16 PM May 22
Sanderson (2020a,b) discusses "probabilities of probabilities" from a Bayesian's perspective. Sanderson (2020a) introduces the fundamental considerations involved in interpreting product rankings and mentions (but does not explain) Laplace's rule of succession, which estimates the probability of success as the frequency of success after two further hypothetical trials, one of each outcome, are added to the observed data, that is, (k+1)/(n+2) after observing k successes in n trials. He acknowledges "Crucial is that we assume each review is independent of the last..." (t=465s). Sanderson (2020b) also motivates the question of estimating a probability from trial observations but digresses into continuous probability distributions and the need for measure theory to make any sense of them.
As yet, no part 3 to this series has been posted, although Sanderson pinned the comment "I have to imagine it's frustrating to follow this channel. I believe this is the third video in a row (excluding those on epidemics) that I ended by saying something like "we'll look at Bayesian updating in a continuous context in the next part". But whenever I think hard about the setup/prerequisite section of that video there's always something interesting enough to pull out to stand as its own video; there are just so many interesting topics here! Thanks for your patience, and hopefully, everyone gets that the goal here is to just hit as many fundamental ideas in probability as is reasonable. Also, in parallel with making these probability videos, I'll be trying a very different sort of experiment on the channel soon...stay tuned."
References
Sanderson, G. (3Blue1Brown, https://www.3blue1brown.com/) 2020a. “Binomial distributions | Probabilities of probabilities, part 1” https://youtu.be/8idr1WZ1A7Q
Sanderson, G. (3Blue1Brown, https://www.3blue1brown.com/) 2020b. “Why ‘probability of 0’ does not mean ‘impossible’ | Probabilities of probabilities, part 2” https://youtu.be/ZA4JkHKZM50
Tony Cox
Jul 4, 2012
Here is a book that deals with sequence prediction when the data-generating process is unknown and probabilities are not available: http://www.ii.uni.wroc.pl/~lukstafi/pmwiki/uploads/AGT/Prediction_Learning_and_Games.pdf.
Best,
--Tony
Einar Snekkenes
Jul 4, 2012
I very much enjoyed [the] panel at the SRA meeting. With reference to our discussion at the meeting, here are some comments relevant to the panel discussions:
Although it is interesting to see how the various methods approach the issue of uncertainty, it would be even more interesting to see what kind of underlying (potentially hidden) assumptions the various methods make. So in addition to formulating/solving the problems, each approach should include ALL underlying assumptions. By assumption I mean 'whatever has to be assumed' in order for a strictly formal (in a formal logic sense) proof of the solutions to the exercises. The list of assumptions must also include all 'axioms'. That is, if the exercises were to be completed using, say, higher-order logic (e.g. HOL/Isabelle) with the proposed method, the set of assumptions must suffice for the computation/proof(s) to carry through.
Also, from a practical perspective, each solution under each method should specify why the chosen assumption set is in fact the 'strongest assumption set' permitted by the problem description/domain in a 'real-life setting'. I would suggest that this requirement only adds value if the problems are designed in such a way that they are fair representatives of real (although very simplified) risk assessment problems.
I would be very interested in feedback on the above relating to any uncertainty capturing/propagating method/framework/theory.
Best regards,
Einar