Post date: Feb 23, 2014 9:11:10 PM
Sebastien, Mohamed:
Okay, regarding stopping rules, here was our bullet on the subject:
The Bernoulli rate c-box can be tighter if the stopping rule was "test until a specified number of successes"
Below are some exchanges we've had on the subject, with numerical examples.
The subject is perhaps not critical for the planned paper, but it may come up, and, as I said this afternoon, I wanted to let you know that c-boxes are in principle sensitive to the stopping rule.
Cheers,
Scott
posted Sep 10, 2012, 5:21 PM by Scott Ferson
Here's another example, similar to Jack's, where the experimental design itself induces a need for imprecise probabilities. Consider the example of the effect of stopping rules on probabilities in significance tests given in the Wikipedia article on the likelihood principle, quoted below. If we don't know what stopping rule Adam used, it would seem to make the most sense to admit that the probability in the significance test is imprecise. So in this case we'd know that the p-value lies in an interval at least as wide as [3.27%, 7.3%]. I am not sure, however, that this is the full range over all possible stopping rules in this example. Is there a way to figure that out?
...a significance test depends on the probability of a result as extreme or more extreme than the observation, and that probability may depend on the design of the experiment. ...Suppose I tell you that I tossed a coin 12 times and in the process observed 3 heads. You might make some inference about the probability of heads and whether the coin was fair. Suppose now I tell you that I tossed the coin until I observed 3 heads, and I tossed it 12 times. Will you now make some different inference? ...
Suppose...Adam, a scientist, conducted 12 trials, obtained 3 successes and 9 failures, and then left the lab. Bill, a colleague in the same lab, continued Adam's work and published Adam's results, along with a significance test. He tested the null hypothesis that p, the success probability, is equal to one half, versus p < 0.5. The probability of the observed result, that 3 or fewer (i.e., equally or more extreme) of the 12 trials were successes, if H0 is true, is
P(X <= 3 | H0) = [C(12,0) + C(12,1) + C(12,2) + C(12,3)] (1/2)^12 = (1 + 12 + 66 + 220)/4096,
which is 299/4096 = 7.3%. Thus the null hypothesis is not rejected at the 5% significance level.
Charlotte, another scientist, reads Bill's paper and writes a letter, saying that it is possible that Adam kept trying until he obtained 3 successes, in which case the probability of needing to conduct 12 or more experiments is given by
P(N >= 12 | H0) = P(at most 2 successes in the first 11 trials) = [C(11,0) + C(11,1) + C(11,2)] (1/2)^11 = (1 + 11 + 55)/2048 = 67/2048,
which is 134/4096 = 3.27%. Now the result is statistically significant at the 5% level. Note that there is no contradiction between these two results; both computations are correct.
To these scientists, whether a result is significant or not depends on the design of the experiment ...
Similar themes appear when comparing Fisher's exact test with Pearson's chi-squared test.
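Both of the quoted tail probabilities can be checked directly in R; here's a quick sketch using base R's pbinom and pnbinom (the pnbinom call counts failures before the 3rd success, so "12 or more trials" means 9 or more failures):
pbinom(3, size = 12, prob = 0.5)                      # 299/4096, about 0.0730 (stop after 12 flips)
pnbinom(8, size = 3, prob = 0.5, lower.tail = FALSE)  # 134/4096, about 0.0327 (stop at the 3rd head)
# equivalently pbinom(2, size = 11, prob = 0.5): at most 2 heads in the first 11 flips
# not knowing which stopping rule was used leaves the p-value somewhere in [0.0327, 0.0730]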
Shortly before he left, Michael described a really cool example of this with c-boxes for the binomial rate. If we don't know which of two stopping rules was used, the c-box is wider than it would be if we knew the actual stopping rule. The example was incomplete and not fully explored, because there could apparently be other stopping rules that would change things considerably.
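For reference, here is a minimal sketch of the usual c-box for a binomial rate under the fixed-n design (k successes in n trials); this is just the standard construction, not Michael's example, and the test-until-k-successes version it would be compared against is not shown:
# c-box for a binomial probability p after k successes in n trials (fixed-n design);
# its bounding distributions are Beta(k, n - k + 1) and Beta(k + 1, n - k)
k <- 3; n <- 12
qbeta(0.025, k, n - k + 1)    # lower endpoint of the 95% confidence interval, from the left edge
qbeta(0.975, k + 1, n - k)    # upper endpoint, from the right edge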
Reply to doubt concerning stopping rules
posted Sep 11, 2012, 2:02 PM by Jack Siegrist
I don't see why each of the probability calculations given in the example should apply only to its respective stopping rule. It seems to me that each of those calculations could be useful for both stopping rules, and neither one should be favored for inference under a given stopping rule. Following is an example doing the simulation equivalents in R code:
Imagine the data were (0 is failure, 1 is success):
0 0 0 0 0 0 0 0 1 0 1 1,
for 12 total experiments.
myData <- c(0,0,0,0,0,0,0,0,1,0,1,1)   # 3 successes in 12 trials
(n <- length(myData))                   # 12 experiments
nReps <- 1000                           # number of simulation replicates
In the example in the previous post, the first probability calculation would be consistent with the following simulation:
# TEST 1: fixed design, n = 12 flips; how often are 3 or fewer successes seen?
results1 <- matrix(rep(NA, nReps * n), ncol = n)
for(i in 1:nReps){results1[i, ] <- rbinom(n = n, size = 1, prob = .5)}
(test1 <- sum(apply(X = results1, MARGIN = 1, FUN = sum) <= 3) / nReps)  # estimates 299/4096 = 0.073
The second probability calculation would be consistent with the next simulation:
# TEST 2: flip until the 3rd success; how often are 12 or more flips needed?
results2 <- rep(NA, times = nReps)
for(i in 1:nReps){
  firstDraws <- rbinom(n = 3, size = 1, prob = .5)   # at least 3 flips are always needed
  cumDraws <- firstDraws
  while(sum(cumDraws) < 3){                          # keep flipping until the 3rd success appears
    thisDraw <- rbinom(n = 1, size = 1, prob = .5)
    cumDraws <- append(cumDraws, thisDraw)
  }
  results2[i] <- length(cumDraws)                    # number of flips needed
}
(test2 <- sum(results2 >= n) / nReps)                # estimates 134/4096 = 0.0327
But what if Jack comes along and says that he thinks the data could have been collected using the stopping rule "stop after you see two successes in a row"? This stopping rule could be paired with either of the two previous statistics (3 or fewer successes, and 12 or more observations before stopping).
# TESTS 3 AND 4: flip until two successes in a row are seen
results3 <- rep(NA, times = nReps)
results4 <- rep(NA, times = nReps)
for(i in 1:nReps){
  firstDraws <- rbinom(n = 2, size = 1, prob = .5)   # the rule needs at least 2 flips before it can stop
  cumDraws <- firstDraws
  while(sum(cumDraws[(length(cumDraws) - 1):length(cumDraws)]) < 2){   # until the last two flips are both successes
    thisDraw <- rbinom(n = 1, size = 1, prob = .5)
    cumDraws <- append(cumDraws, thisDraw)
  }
  results3[i] <- sum(cumDraws)      # total number of successes observed
  results4[i] <- length(cumDraws)   # total number of flips used
}
(test3 <- sum(results3 <= 3) / nReps) # analogous to test1: 3 or fewer successes
(test4 <- sum(results4 >= n) / nReps) # analogous to test2: 12 or more flips
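The tail probabilities under this two-in-a-row rule can also be worked out exactly; here is a rough dynamic-programming check of test3 and test4 (my own sketch, assuming the rule means "stop at the first time two consecutive flips are both successes"):
# Exact check by dynamic programming over the states (successes so far, was the last flip a success?),
# truncated at a horizon long enough that the leftover probability is negligible
exactTwoInARow <- function(horizon = 200, p = 0.5){
  pHeadsAtStop <- numeric(horizon + 2)   # pHeadsAtStop[h + 1] = P(stop with h successes in total)
  pStopAt <- numeric(horizon)            # pStopAt[m] = P(stop at exactly m flips)
  alive <- matrix(0, nrow = horizon + 2, ncol = 2)  # rows: successes + 1; cols: last flip (0/1) + 1
  alive[1, 1] <- 1                       # before any flips: 0 successes, "last flip" treated as a failure
  for(m in 1:horizon){
    nxt <- matrix(0, nrow = horizon + 2, ncol = 2)
    for(h in 0:(m - 1)) for(last in 0:1){
      pr <- alive[h + 1, last + 1]
      if(pr == 0) next
      nxt[h + 1, 1] <- nxt[h + 1, 1] + pr * (1 - p)   # failure: keep going
      if(last == 1){                                  # success right after a success: stop now
        pStopAt[m] <- pStopAt[m] + pr * p
        pHeadsAtStop[h + 2] <- pHeadsAtStop[h + 2] + pr * p
      } else {
        nxt[h + 2, 2] <- nxt[h + 2, 2] + pr * p       # isolated success: keep going
      }
    }
    alive <- nxt
  }
  c(pSuccessesLE3 = sum(pHeadsAtStop[1:4]),   # compare with test3 (works out to 0.75)
    pFlipsGE12 = 1 - sum(pStopAt[1:11]))      # compare with test4 (233/2048, about 0.114)
}
exactTwoInARow()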
So what is it about the stopping rule that determines which probability is the "correct" one to compute for inference?
Reply to doubt concerning stopping rules
posted Sep 12, 2012, 5:13 PM by James Mickley
While the stopping rule stuff makes a lot of sense, I wonder how much it comes into play in practice. It reminds me of power analysis, which everyone probably should be doing before designing experiments but which, in practice, doesn't seem to get much attention (or so it appears to me). Convenience, cost, and time seem to drive a lot of experimental design up front, so anything that accounts for stopping rules is probably more useful if it's post hoc.
It also seems like there's potentially an infinite number of stopping rules. The Wikipedia article mentions suddenly having your funding pulled as a potential frequentist problem, and if you're going to account for things like that post hoc, then there's all sorts of stuff that you'd need to account for. Organisms could die, your funding could suddenly be extended or pulled, the experimental design might not work without subsequent modifications, etc.
Going back to the coin example, wouldn't you have to account for stopping after getting [1, infinity] successes, as well as stopping after [1, infinity] trials? What if Adam loses the coin on flip #5, so that neither the stopping rule for the number of successes nor the one for the number of trials applies? What if the coin breaks in half; do you count the outcomes of both halves?
Also, a response to dependent p-values:
Doing meta-analysis on p-values makes me uneasy; isn't meta-analysis normally done on effect sizes? Does a study with a higher p-value have a larger effect? It seems like using p-values means you're testing whether the significance in a group of experiments is real, rather than whether the perceived effect in those experiments is real. That seems more analogous to Bonferroni corrections to me. Also, p-values aren't normally distributed, and I think that needs to be accounted for, at least when you're working with metrics of the effects of whatever you're measuring.
posted Sep 13, 2012, 10:08 AM by Scott Ferson
Perhaps the discussion of stopping rules deserves its own Google Site, as it's not really all that germane to the topic of mixing good data with bad. In answer to James' post, I think the issue actually does come up a lot in practice, and it generally becomes an issue after the experiments have been done, when there has to be a post hoc accounting of the fact that the planned design didn't quite pan out, or that the details of the stopping rule have been lost and no one can quite say anymore what exactly stopped the experiment. I think James is also right that there could be infinitely many stopping rules, but we only care about the classes of these rules that have different consequences for the calculation of p. If two rules, i.e., two reasons the experiment stopped, give the same p-value, then they're not different to me. The question is whether these equivalence classes are all over the map, and can lead to vacuous imprecision about p. And, even if so, can some qualitative knowledge about why we stopped somehow rein in that imprecision?
I'm a little confused about why Jack finds the significance calculation not well determined by the stopping rule. Of course there are many statistics that you could define that don't allow you to see any significance in a given data set. (It's never been hard to ignore information.) The question is what should you look at that could reveal any significance if there does happen to be non-randomness present, or in the case of the coin, bias. I don't think that this situation is quite on the order of the Bertrand paradox. So long as you know what the stopping rule is, what's extreme in either case is pretty well defined. When you "stop after 12 flips", the information is how many heads you got. Or, equivalently, how many tails you got. So figuring out significance demands that you ask about the probability of getting as few heads as you actually got, or fewer (equivalently, as many tails or more). When you "stop after getting 3 heads", it's not interesting after all that you got 3 heads. The information is that it took you 12 trials to get them. Significance depends on how likely it is you might have had to flip the coin more than 12 times. Jack's other stopping rule ("stop after 2 heads in a row") merely introduces yet another specification for what would make the observations 'extreme'. And, insofar as he suggests that the actual stopping rule that was used is ambiguous, it also poses our original question again: what can we say about the p-values if we don't know the stopping rule?
More broadly, Jack seems to be having trouble with the notion of stopping rules in general and why they should influence probabilities and our statistical inferences. In this regard, Jack is like many Bayesians who look down on significance tests as a bit goofy, in part because they have problems with stopping rules. But, unreconstructed, I continue to defend the legitimacy of asking questions in this classical way. These are inference problems, not decision problems, and not estimation problems. We want to know whether it is reasonable to believe something is true. We don't currently have any decision context in mind, so nothing's riding on the belief at the moment. Wald's demonstrations notwithstanding, it still seems kosher to me to ask such questions. There are limitations to this approach, sure. And neglecting Type II error completely is surely goofy in a decision problem, but that's not what we're talking about here.