Getting beyond the statistician’s bag of marbles:
How sampling binary data informs us when we can’t make the usual assumptions
Panel Discussion M2-F
Society for Risk Analysis Annual Meeting
10:30 am–noon, Monday, 5 December 2011
Room R8/9, Embassy Suites North Charleston - Airport/Hotel & Convention Center
North Charleston, South Carolina
This panel discussion is an interactive symposium on how binary sampling data informs us in risk analysis and uncertainty propagation when we cannot make the usual assumptions about the data such as independence, stationarity, and large sample sizes without censoring or missing values.
Risk analysts often need to estimate parameters such as the prevalence of a disease in a population, the failure rate of assembly-line products, the political support for a candidate from polling information, or the abundance of an endangered species among more common congeners. Analysts must use these estimates in the subsequent calculations of a risk assessment. Various statistical schools of thought have devised schemes for making such estimates, characterizing their uncertainties, and propagating them through computations. This symposium will be a round-table discussion on the respective advantages and disadvantages of these approaches in the face of the kinds of complexities commonly faced in risk analysis, such as small or very small sample sizes, non-random sampling, data that may be non-randomly missing, ambiguous sample observations that cannot be easily classified into one or the other category, and nonstationarity of the underlying rates. Champions for the major approaches (classical frequentist, Bayesian, and imprecise probability methods) will present their solutions to several simple but non-standard binomial estimation problems involving combinations of these complexities. The problems emphasize the need to make subsequent computations that account for the uncertainties of the estimations.
This symposium will be an informal and interactive discussion about how binary data informs us in estimating underlying probability rates and making calculations with these estimates. It will highlight the practical differences among the various approaches in difficult but common problems. The champions have shared their solutions with each other before their presentations in order to focus on the differences among their answers. The computational (and conceptual) burdens on the risk analyst of each approach will be highlighted. Presentations will be arranged by problem, rather than by approach. Commentary, including alternative solutions to challenge problems, is invited from the audience. The final fifteen minutes of the symposium will be devoted to discussion and questions from the audience about the practicality of the methods to challenges we meet in risk analysis where data is scarce and poorly structured.
The discussants will be
- Michael Balch, Applied Biomathematics,
- William Huber, Quantitative Decisions,
- Dan Rozell, New York State Department of Environmental Conservation, and
- Kari Sentz, Los Alamos National Laboratory.
The panel discussion will be chaired by Scott Ferson (sandp8 at gmail dot com), who can be contacted for further information.
The challenge problems are based on three different tasks:
A. Evaluating rates
The A problems are to characterize the underlying probability p, from k successes in n independent trials. This problem arises when we want to assess the success rate for a medical procedure, estimate the probability of failure for some manufactured component, or evaluate the performance of a process, employee or policy. This is the classic binomial inference problem, where the sample data are the outcomes of the previous trials, classified into successes and failures.
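As a concrete baseline for the A problems, a frequentist analyst might report the observed proportion together with a Wilson score interval. The sketch below is illustrative only, not any panelist's method; it assumes independent trials and the conventional z = 1.96 for roughly 95% coverage, and uses the 15-successes-in-60-trials counts from the first challenge problem below.

```python
from math import sqrt

def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    Assumes n independent trials; z = 1.96 gives approximately 95% coverage.
    """
    p_hat = k / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half, center + half

# 15 successes in 60 independent trials
lo, hi = wilson_interval(15, 60)
```

The interval is asymmetric around the observed proportion 0.25, which is typical for rates away from one half; the panel's interest is in what happens when the independence and sample-size assumptions behind such intervals break down.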
B. Comparing rates
The B problems are to determine which of two such probabilities is larger, based on an observed k1 successes in n1 independent trials in the first process and k2 successes in n2 independent trials in the second process. This problem arises when a patient must decide between different medical treatments based on outcomes from the procedure experienced by other patients, or when a factory manager must select a vendor based on previous deliveries, or when an employer must decide which employees to retain and which to lay off based on their past performances. The problem is to determine which is better and, in some cases, also to determine by how much. Note that this does not necessarily translate directly into a decision problem. There may be other qualitative factors involved in the decision as well as the average success rate.
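For orientation on the B problems, a textbook frequentist answer is the pooled two-proportion z-test. The sketch below is just one of the approaches the panel will contrast; it assumes independent trials and large-sample normality, and uses the counts of problem B1 (12 of 60 versus 22 of 60) for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(k1, n1, k2, n2):
    """Two-sided pooled z-test for equality of two binomial proportions.

    Assumes independent trials and sample sizes large enough for the
    normal approximation to hold.
    """
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p2 - p1, z, p_value

# Data of problem B1: 12 of 60 vs. 22 of 60
diff, z, p = two_proportion_z(12, 60, 22, 60)
```

For these counts the observed difference is one sixth and the p-value falls just under 0.05, illustrating how borderline such comparisons can be even with 60 trials apiece.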
C. Propagating rate information
The C problems are to characterize the joint probability of success that depends on success in each of two separate processes, based again on the same k-out-of-n data for the two processes. This problem can arise when we try to estimate the risk of successful operation or failure of a two-part assembly, or in the evaluation of a manufacturing plan that depends on two vendors supplying needed parts.
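One common way to propagate rate uncertainty through a product, sketched here purely for illustration, is Bayesian Monte Carlo: draw each rate from its Beta posterior and multiply the draws. The uniform priors and the independence of the two processes are both assumptions (and contested ones); the counts 12 of 60 and 22 of 60 from problem B1 serve as the example.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

def product_posterior_sample(k1, n1, k2, n2, trials=100_000):
    """Monte Carlo summary of p1*p2 under independent Beta(k+1, n-k+1)
    posteriors, i.e. uniform priors on each rate (an assumption)."""
    draws = [random.betavariate(k1 + 1, n1 - k1 + 1) *
             random.betavariate(k2 + 1, n2 - k2 + 1)
             for _ in range(trials)]
    draws.sort()
    mean = sum(draws) / trials
    lo, hi = draws[int(0.025 * trials)], draws[int(0.975 * trials)]
    return mean, lo, hi  # posterior mean and central 95% interval

# Data of problem B1: 12 of 60 and 22 of 60
mean, lo, hi = product_posterior_sample(12, 60, 22, 60)
```

The C problems below probe exactly the choices this sketch glosses over: the prior, the independence assumption, and how to report the resulting uncertainty.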
In the real world, each of these kinds of problems can be, and usually is, complicated by one or several factors:
Small sample size
Small sample size is a classical problem of statistics in which we try to infer broad patterns from a sometimes very narrow glimpse of information. The issue arises in novel situations common in risk analysis. For example, a medical treatment might have just been introduced so it has been used only a few times before, or it may be new to a hospital. Or a manufactured assembly to be tested might be a new design of which only a few have so far been built. Sample sizes may also be low when sampling is expensive, when information is private or proprietary, or when the quantity simply has not been the focus of previous attention.
Non-independence
The sequential samples from a single process may not be independent. This complexity can arise in several ways. For instance, because E. coli contamination in slaughterhouses typically affects the knives and grinders used in meat processing, sequentially produced units of ground beef are not likely to be independent in terms of the probability of being sanitary versus contaminated; contamination is highly clustered within production lots. In contrast, the efficacy of a policy that depends on humans for implementation may tend to yield an under-dispersed pattern in performance if a recent failure tends to make the humans work more attentively over the next several cycles.
Missing data
Observations are occasionally missing from data sets. For instance, it may be impossible to find samples at all the factorial combinations demanded in a complex study design. Sometimes experimental subjects drop out of protocols. Sometimes experiments must be terminated before all outcomes can be resolved. Sometimes collected data are physically lost because of human error or corruption of data storage media. Statistical methods such as imputation have been devised to handle missing data that can be assumed to be missing at random. But these methods are of limited utility when the reason the data are missing is related to the data values or their interactions with other data, or in cases where the reason the data are missing cannot be discerned.
Ambiguous observations
In some situations, data may not be completely missing, so there may still be partial information about an outcome. Statisticians say that such data are censored. In other cases, the collected data may be precisely observed yet still be ambiguous. For instance, some manufactured assemblies may be on the border between being acceptable and unacceptable, and it is impossible to tell conclusively whether or not they are to specification. Likewise, it may be hard for physicians to decide whether a medical treatment resulted in a positive or negative outcome because of the complexity of biological functions, or on account of confounding factors that obscure the outcome. In the case of binary data (true/false, healthy/sick, etc.), ambiguous outcome data are degenerate in that they become vacuous ("true or false", "healthy or sick", etc.). In the case of multinomial data, however, ambiguous observations may still contain some information. For example, if the outcomes are categorical, we may be able to rule out some categories while still not being able to confidently specify a single category for the outcome.
Non-stationarity
Sometimes there is not a single process at play, but instead multiple processes or an evolving series of processes that produce successes or failures at different rates through time. For instance, if you're assessing the performance of a newly hired employee on an assembly line, the success rate may improve with the employee's experience. In such cases, the binomial rate parameter is not a constant, but a changing quantity.
Non-representativeness
Almost all statistical methods assume that the data collected are representative of the underlying process we are trying to measure. But this assumption may not always be true. An employer may have inadvertently chosen to evaluate an employee's productivity at a stressful time for the employee, such as during a period of domestic strife. Or the employer may have deliberately chosen to conduct an evaluation at a stressful time, such as after an announcement of impending layoffs. In both cases, counting the employee's successes and failures may not yield evidence representative of the employee's performance in normal times. Structured sampling can also be non-representative. Making observations only in the early morning may not yield data that properly represent the average performance rates over the course of an entire day.
A1. Quantitatively evaluate a policy that has been in place for 60 months, which has led to 15 months that are considered successes but 45 months that are considered failures. The observations each month were 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, where 0 denotes a failure and 1 a success. How should we characterize the probability of success and our uncertainty about this probability?
B1. Two alternative processes are to be compared. The observations for the first process were 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, which represent 12 successes and 48 failures out of 60 trials. The observations for the second process were 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, which represent 22 successes and 38 failures out of 60 trials. Given that we prefer success to failure, which process should be used for the next trial? Which process yields the better performance? How large is the difference between the two processes? Is this difference statistically significant, i.e., beyond what might be reasonably attributed to chance alone?
C1. Using the data of problem B1, how should we characterize the product of the probabilities of success of the two processes, and the uncertainty of that product?
A2. In another case, the policy evaluation period was short and only 8 trials could be accumulated, of which 6 were successes. The observations were 1, 0, 1, 1, 1, 1, 0, 1. What is the probability of success, and what is its uncertainty?
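For tiny samples like this one, a Bayesian analyst using a uniform prior would report the posterior mean given by Laplace's rule of succession, (k+1)/(n+2), rather than the raw proportion k/n. This is just one convention among those the panel will debate, sketched here for orientation:

```python
def laplace_estimate(k, n):
    """Posterior mean of a binomial rate under a uniform Beta(1, 1) prior
    (Laplace's rule of succession)."""
    return (k + 1) / (n + 2)

# 6 successes in 8 trials
mle = 6 / 8                      # raw observed proportion: 0.75
bayes = laplace_estimate(6, 8)   # 0.7, pulled toward 1/2 by the prior
```

The gap between the two numbers (0.75 versus 0.7) is itself a small illustration of why the choice of method matters most when data are scarce.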
B2. We want to quantitatively evaluate the comparison between two new processes, for which only 8 random trials have so far been run. The observations for the first process were 0, 0, 0, 1, 0, 1, 1, 1, and the observations for the second process were 1, 1, 1, 1, 1, 1, 1, 1. How large is the difference, and does it represent reliable evidence that the second process is actually better than the first?
C2. Using the data of problem B2, what can be said about the product of the probabilities of the two processes?
C3. Using the data of problem B2, what can be said about the probability of joint success, which depends on success first with the first process and then with the second process, if we cannot be sure that the two processes are independent?
A3. The observations for a new process were 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, which are 38 successes and 52 failures in 90 trials. These observations do not pass a runs test for randomness, which suggests that sequential values are not independent. What can be said about the probability of success of the next trial drawn from this process?
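The runs test mentioned above takes only a few lines of code to reproduce. This sketch implements the standard Wald–Wolfowitz runs test with its normal approximation; the strongly negative z-statistic for these data reflects the clustering visible in the sequence (fewer runs than a random ordering would produce).

```python
from math import sqrt
from statistics import NormalDist

def runs_test(xs):
    """Wald-Wolfowitz runs test for a 0/1 sequence: returns the z-statistic
    and two-sided p-value for the null hypothesis of a random ordering.
    Uses the normal approximation, which assumes moderately large counts."""
    n1, n0 = xs.count(1), xs.count(0)
    n = n1 + n0
    runs = 1 + sum(a != b for a, b in zip(xs, xs[1:]))  # count of runs
    mu = 1 + 2 * n1 * n0 / n                            # expected runs
    var = 2 * n1 * n0 * (2 * n1 * n0 - n) / (n**2 * (n - 1))
    z = (runs - mu) / sqrt(var)
    return z, 2 * (1 - NormalDist().cdf(abs(z)))

# The 90 observations of the non-independence problem above
data = [0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,1,1,1,1,1,1,0,0,
        0,0,0,0,1,1,0,0,1,1,1,1,1,0,0,0,0,0,1,0,0,0,0,1,1,1,1,0,0,0,
        1,1,1,1,0,0,1,1,1,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0]
z, p = runs_test(data)
```

A negative z means fewer runs than expected, i.e. positive serial correlation, which is why the binomial machinery of the earlier problems cannot be applied here without modification.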
A4. Twenty trial units of a complex assembly were ordered and produced. The evaluation and testing procedure involves long-term storage of the assemblies, during which 5 of them went missing. We are not sure what might have happened. The assemblies might be missing because the employee constructed them badly and does not want anyone to know; in this situation, the missing assemblies are likely unacceptable. The assemblies might also be missing because they were stolen by an opportunistic burglar; in this situation, acceptable assemblies are no more or less likely to be missing than unacceptable ones. The assemblies might also have been stolen by an inspector secretly involved in industrial espionage; in this situation, the missing assemblies were likely of acceptable quality. An investigation is underway, but in the meantime, because they are missing, we cannot know what their quality is. The observations were 1, NA, 0, 0, 0, 1, 1, 1, NA, NA, 1, 1, 0, NA, 1, 1, 1, 1, 1, NA, which are 11 successes, 4 failures, and 5 missing values in 20 trials. What can we say about the acceptability rate for these assemblies?
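A back-of-the-envelope way to see what is at stake, not a substitute for a proper analysis of the missingness mechanism, is to compute the point estimate of the acceptability rate implied by each of the three scenarios described above:

```python
def missing_data_scenarios(good, bad, missing):
    """Point estimates of the acceptability rate under three assumptions
    about why the units are missing (good/bad are the observed counts)."""
    n = good + bad + missing
    return {
        "all missing bad":   good / n,               # sabotage scenario
        "missing at random": good / (good + bad),    # burglary scenario
        "all missing good":  (good + missing) / n,   # espionage scenario
    }

# 11 acceptable, 4 unacceptable, 5 missing out of 20
est = missing_data_scenarios(11, 4, 5)
```

The estimates range from 0.55 to 0.8 depending on the unverifiable assumption, which is precisely the sensitivity the challenge problem asks the panelists to characterize.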
A5. In 20 trials of a new medical procedure, there were 7 cases that were clearly positive outcomes and 8 cases that were clearly adverse outcomes. In five of the trials, the result was ambiguous and we cannot really tell yet whether the outcome was good or bad. The observations were ?, 1, 0, ?, 0, 1, 1, 0, 0, 0, 1, 0, ?, 0, ?, 1, ?, 0, 1, 1, where ? denotes an ambiguous outcome. What is the probability of the next case yielding a positive outcome? What is the uncertainty about this probability?
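One simple way to acknowledge the ambiguous outcomes, in the spirit of imprecise probabilities, is to bound the observed rate by classifying all ambiguous cases first as failures and then as successes. This bracketing is only a sketch: it bounds the observed proportion and ignores sampling uncertainty, which would widen the interval further.

```python
def rate_bounds(successes, failures, ambiguous):
    """Bounds on the observed success rate when some outcomes are ambiguous:
    the lower bound treats every ambiguous case as a failure, the upper
    bound treats every ambiguous case as a success."""
    n = successes + failures + ambiguous
    return successes / n, (successes + ambiguous) / n

# 7 clear positives, 8 clear negatives, 5 ambiguous out of 20
lo, hi = rate_bounds(7, 8, 5)
```

For these data the observed rate can only be located in the interval [0.35, 0.6]; how to combine that interval with sampling uncertainty is where the panelists' approaches diverge.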
A6. The process that generated the observations 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1 (32 successes and 58 failures in 90 trials) seems to be improving: the number of successes in the second half of the data is more than twice that of the first half. Given that the process may not be stationary, how should we characterize the probability of success for the next trial? What is the uncertainty about that probability?
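A quick diagnostic for drift, again illustrative rather than any panelist's method, is to compare the success rate in the two halves of the record; the 45/45 split used here is an arbitrary choice, and a real analysis would have to justify any such windowing.

```python
# The 90 observations of the non-stationarity problem above
data = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,1,1,0,0,1,0,
        0,0,0,0,0,0,1,1,0,0,0,1,1,0,0,1,0,1,1,0,1,0,0,1,1,1,0,0,1,1,
        0,0,0,1,1,0,0,1,1,0,0,0,0,1,0,1,1,0,0,1,1,1,1,1,0,0,0,0,0,1]

overall     = sum(data) / len(data)   # pooled estimate, about 0.356
first_half  = sum(data[:45]) / 45     # about 0.222
second_half = sum(data[45:]) / 45     # about 0.489: the rate appears to drift
```

If the process really is improving, the pooled estimate understates the probability of success on the next trial, but by how much depends on an explicit model of the drift, which is what the challenge problem asks for.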
This event is sponsored by the Society for Risk Analysis and its Decision Analysis and Risk Specialty Group (DARSG) and will be held at the Society's 2011 annual meeting with financial support provided by the National Library of Medicine, a component of the National Institutes of Health, through a Small Business Innovation Research grant (award number RC3LM010794) to Applied Biomathematics funded under the American Recovery and Reinvestment Act. The panel discussion will be held at the Embassy Suites in North Charleston, South Carolina, in December 2011. This event is one of many Society for Risk Analysis and Applied Biomathematics workshops.
Continuing discussion at Beyond the statistician's bag of marbles