Do p Values Lose Their Meaning in Exploratory Analyses?

In Rubin (2017), I consider the idea that p values lose their meaning (become invalid) in exploratory analyses (i.e., non-preregistered analyses). I argue that this view is correct if researchers aim to control a familywise error rate that includes all of the hypotheses that they have tested, or could have tested, in their study (i.e., a universal, experimentwise, or studywise error rate). In this case, it is not possible to compute the required familywise error rate because the number of post hoc hypotheses that have been tested, or could have been tested, during exploratory analyses in the study is unknown. However, following numerous others (see Appendix A), I argue that researchers are rarely interested in a studywise error rate because they are rarely interested in testing the joint studywise hypothesis to which this error rate refers. 
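To see why this rate cannot be computed, note that, assuming independent tests, the studywise error rate for m tests is 1 − (1 − α)^m, which has no determinate value when m is unknown. A minimal sketch of this calculation (the values of m are arbitrary illustrations):

```python
# Studywise (familywise) error rate for m independent tests at alpha = .05:
# FWER(m) = 1 - (1 - alpha)**m. If m is unknown, the rate is unknown too.
alpha = 0.05
for m in (1, 5, 20, 100):
    print(f"m = {m:3d} tests -> studywise error rate = {1 - (1 - alpha)**m:.3f}")
# m =   1 -> 0.050;  m =   5 -> 0.226;  m =  20 -> 0.642;  m = 100 -> 0.994
```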

For example, imagine that a researcher conducted a study in which they explored the associations between body weight and (1) gender, (2) age, (3) ethnicity, and (4) social class. This researcher is unlikely to be interested in a studywise null hypothesis that can be rejected following a significant result for any of their four tests, because this joint null hypothesis is unlikely to relate to any meaningful theory. Which theory proposes that gender, age, ethnicity, and social class all predict body weight for the same theoretical reason? And, if the researcher is not interested in making a decision about the studywise null hypothesis, then there is no need for them to lower the alpha level (α; the significance threshold) for each of their four tests (e.g., from α = .050 to α = .050/4 = .0125) in order to maintain the Type I error rate for their decision about the studywise hypothesis at α = .050. Instead, the researcher can test each of the four associations individually (i.e., each at α = .050) in order to make a separate claim about each of four theoretically independent hypotheses (e.g., "men weigh more than women, p = .021"; "older people weigh more than younger people, p = .004"; etc.; see Appendix B). By analogy, a woman who takes a pregnancy test does not need to worry about the familywise error rate for the joint possibility that her pregnancy test, her fire alarm, or her email spam filter will yield a false positive result, because the associated joint hypothesis is nonsensical.
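To illustrate the difference, here is a minimal simulation sketch (variable names are hypothetical and the data are simulated): when four true null hypotheses are each tested once at α = .050, the error rate for each individual claim stays at about .05; it is only the joint "at least one effect" claim, which the researcher is not making, whose error rate inflates to about .185.

```python
# A minimal sketch (hypothetical names, simulated data): four independent
# true null hypotheses, each tested once at alpha = .05, so every
# rejection is a Type I error.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, k, n, reps = 0.05, 4, 50, 10_000
rejections, any_rejection = np.zeros(k), 0

for _ in range(reps):
    # Two groups with identical population means on all four outcomes.
    p = np.array([stats.ttest_ind(rng.normal(size=n),
                                  rng.normal(size=n)).pvalue
                  for _ in range(k)])
    rejections += p < alpha
    any_rejection += p.min() < alpha

print(rejections / reps)     # ~0.05 for each individual hypothesis
print(any_rejection / reps)  # ~0.185 for the joint "any effect" hypothesis
```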

The experimentwise error rate sometimes doesn't make sense

Sometimes it doesn't make sense to combine different hypotheses as part of the same family!

Researchers should only be concerned about the familywise error rate of a set of tests when that set refers to the same theoretically meaningful joint hypothesis. For example, a researcher who undertakes exploratory analyses should be concerned about the familywise error rate for the hypothesis that men weigh more than women if they use four different measures of weight and are prepared to accept a single significant difference on any of those four measures as grounds for rejecting the associated joint null hypothesis. In this case, they should reduce their alpha level for each constituent test (e.g., to α/4) in order to maintain their nominal Type I error rate for the joint hypothesis at α. Based on this reasoning, I argue that p values do not lose their meaning in exploratory analyses because (a) researchers are not usually interested in the studywise error rate, and (b) they are able to transparently and verifiably specify and control the familywise error rates for any theoretically meaningful post hoc joint hypotheses about which they make claims.
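As a sketch of the arithmetic behind this adjustment (assuming, for simplicity, four independent tests):

```python
# Union-intersection test of the joint null hypothesis (a sketch):
# reject "men weigh more than women on at least one of four weight
# measures" if any constituent test is significant at alpha / 4.
alpha, k = 0.05, 4
adjusted = alpha / k                # 0.0125 per constituent test
fwer = 1 - (1 - adjusted) ** k      # Type I error rate for the joint claim
print(adjusted, round(fwer, 4))     # 0.0125, 0.0491 (back near alpha = .05)
```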

I also recommend that researchers undertake a few basic open science practices during exploratory analyses in order to alleviate concerns about potential p-hacking: (1) List all of the variables in the research study. (2) Undertake a sensitivity analysis to demonstrate that the research results are robust to alternative analytical approaches. (3) Make the research data and materials publicly available to allow readers to check whether the results for any relevant measures have been omitted from the research report.
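For point (2), a sensitivity analysis can be as simple as re-running a key test under a different analytical approach and checking that the conclusion survives. A minimal sketch, using the gender and body weight example above (the data, parameters, and choice of alternative test are all hypothetical):

```python
# Hypothetical sensitivity check: test the gender-weight association with
# two different procedures and confirm the conclusions agree.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
men = rng.normal(loc=85, scale=12, size=100)    # simulated weights (kg)
women = rng.normal(loc=72, scale=11, size=100)

t = stats.ttest_ind(men, women)        # parametric approach
u = stats.mannwhitneyu(men, women)     # nonparametric alternative
print(f"t test p = {t.pvalue:.4f}; Mann-Whitney p = {u.pvalue:.4f}")
```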

Further Information

Article

Rubin, M. (2017). Do p values lose their meaning in exploratory analyses? It depends how you define the familywise error rate. Review of General Psychology, 21(3), 269-275. https://doi.org/10.1037/gpr0000123    *Self-archived version*


Review and Response

Members of the Everything is Fucked group discussed my paper here, and I provided a response here.


Twitter Discussions

There are also Twitter discussions about this paper here, here, and here.

Appendix A: Some Quotes About the Irrelevance of Studywise Hypotheses

(1) Fisher (1937, p. 217, my emphasis):

Where a number of independent tests of significance have been made, on data from the same experiment, each test allowing of the rejection of the true hypotheses in 5 per cent. of trials, it follows that a hypothesis specifying all the differences in yield between the treatments tested will, although true, be rejected with a higher frequency. If, therefore, it were desired to examine the possible variations of any hypothesis which specified all these differences simultaneously while maintaining the 5 per cent. level of significance, a different procedure should be adopted. Actually, in biology or in agriculture, it is seldom that the hypothetical background is so fully elaborated that this is necessary. It is, therefore, usually preferable to consider the experiment, as we have done above, as throwing light upon a number of theoretically independent questions.

 

(2) Cox (1965, p. 223):

A probability of error referring to the simultaneous correctness of a set of statements seems relevant only if a certain conclusion depends directly on the simultaneous correctness of a set of the individual statements.

 

(3) Hochberg and Tamhane (1987, pp. 6-7):

If…inferences are unrelated in terms of their content or intended use (although they may be statistically dependent), then they should be treated separately and not jointly....

 

Not all inferences made in a given experiment may constitute a single family.

 

(4) Hancock and Klockars (1996, p. 270):

An entire experiment can consist of a vast collection of comparisons and contrasts, not all of which conceptually belong to the same family.

 

(5) Perneger (1998, p. 1236):

The first problem is that Bonferroni adjustments are concerned with the wrong hypothesis.4–6 The study-wide error rate applies only to the hypothesis that the two groups are identical on all 20 variables (the universal null hypothesis). If one or more of the 20 P values is less than 0.00256, the universal null hypothesis is rejected. We can say that the two groups are not equal for all 20 variables, but we cannot say which, or even how many, variables differ. Such information is usually of no interest to the researcher, who wants to assess each variable in its own right.

 

(6) Bender and Lange (2001, p. 343):

Frequently, the global null hypothesis, that all individual null hypotheses are true simultaneously, is of limited interest to the researcher.

 

(7) Hewes (2003, p. 450):

Experiment-wise error is a useful concept only if there is something about the experiment as a whole that transcends the information in the individual hypothesis tests....Although we can compute experiment-wise error rates and control them for a collection of randomly selected hypotheses, there is simply no point in doing so. Uncorrected αs are enough.

 

(8) Schulz and Grimes (2005, p. 1592):

Bonferroni adjustment, however, usually addresses the wrong hypothesis.1,6 It assumes the universal null hypothesis which, simply defined, tests that two groups are identical for all the primary endpoints investigated versus the alternative hypothesis of an effect in one or more of those endpoints. That usually poses an irrelevant question in medical research. Clinically, a similar idea would be: “. . . the case of a doctor who orders 20 different laboratory tests for a patient, only to be told that some are abnormal, without further detail."

 

(9) Morgan (2007, p. 34):

The Bonferroni adjustment attempts to remove the study-wide error rate across a wide range of independent tests. If significance is detected, and the ‘‘null hypothesis’’ seemingly rejected, this leaves us ignorant as to which of the individual tests are significant and which are not. We can only conclude that the ‘‘universal’’ null hypothesis is rejected. But the universal null hypothesis—that some of the individual tests are significant in some way—is of no clinical interest. Clinical interest is served by knowing which test is significant and in what way.

 

(10) Rothman, Greenland, and Lash (2008, pp. 236-237):

A large health survey or cohort study may collect data pertaining to many possible associations, including data on diet and cancer, on exercise and heart disease, and perhaps many other distinct topics. A researcher can legitimately deny interest in any joint hypothesis regarding all of these diverse topics, instead wanting to focus on those few (or even one) pertinent to his or her specialities. In such situations, multiple-inference procedures…are irrelevant, inappropriate, and wasteful of information.

 

(11) ME! - Rubin (2017, p. 271):

Universal null hypotheses are unlikely to be associated with theoretically meaningful alternative hypotheses.

 

(12) Oberauer and Lewandowsky (2019, p. 1609):

Case 2 is the situation where a researcher tests multiple hypotheses, testing each of them with only one analytical approach (e.g., running a standard significance test on each of 100 correlation coefficients). This scenario does not lead to an inflated Type I error rate for each individual hypothesis. It does increase the chance of committing at least one Type I error among all hypotheses tested, and as such it increases the Type I error rate for the “joint null hypothesis” (de Groot, 1956/2014), which states that all individual hypotheses tested are false. But the joint null hypothesis—or its negation, the claim that “at least one of the n alternative hypotheses tested is true”—is rarely of scientific interest.


(13) Parker and Weir (2020, p. 2):

If treatments are distinct and we are interested in individual treatment versus control comparisons,…then it is difficult to see how the concept of formulating a global intersection null hypothesis could be relevant. If the global intersection null hypothesis is not relevant, then neither is the FWER [familywise error rate].

 

(14) ME! - Rubin (2020, p. 382):

The familywise error rate for the entire set of tests in an exploratory data analysis is only relevant if researchers are interested in testing a joint null hypothesis that may be rejected following at least one significant result in this analysis. In practice, researchers are unlikely to be interested in this studywise error rate, because the associated studywise hypothesis is not likely to be theoretically meaningful (Rubin, 2017a, 2017b, 2019a).


(15) ME! - Rubin (2021, p. 10991):

In general then, researchers should not be concerned about erroneous answers to questions that they are not asking. In other words, they should not be concerned about the familywise error rate for a joint studywise null hypothesis that they are not, in fact, testing.


(16) García-Pérez (2023, p. 10):

Corrections for multiple testing serve the only purpose of ensuring that the Type-I error rate stays at α when testing omnibus joint intersection nulls via surrogates. This limits the definition of multiple testing to the context of testing nulls of that type, not to be mistakenly generalized to testing any diversity of unconnected nulls over the course of a study.

Appendix B: Some Quotes About the Appropriateness of Individual Testing

(1) Tukey (1953, p. 82): Tukey believed that the per-determination error rate was “entirely appropriate” for some research questions (i.e., individual testing). He illustrated his point with reference to the example of a doctor diagnosing potentially diabetic patients based on their blood sugar levels:

The doctor’s action on John Jones would not depend on the other 19 determinations made at the same time by the same technician or on the other 47 determinations on samples from patients in Smithville.  Each determination is an individual matter, and it is appropriate to set error rates accordingly.

 

(2) Wilson (1962, p. 299):

The per-experiment approach seems to discourage extensive studies because the more extensive the study the less the likelihood of being able to accept any given hypothesis as correct….There are strong advantages to the per-hypothesis solution. The basic question is, what is the most meaningful unit in which to evaluate research? Traditional practice apparently has chosen the hypothesis as the unit and this paper maintains that this is the correct choice. It seems that the hypothesis is psychologically the more logical unit.

 

(3) Rothman (1990, p. 45):

Without a firm basis for posing a universal null hypothesis, the adjustments based on it are counterproductive. Instead, it is always reasonable to consider each association on its own for the information it conveys.

 

(4) Savitz and Olshan (1995, p. 906):

Abandonment of the universal null hypothesis as a benchmark for testing requires an alternative approach to the analysis of data that can address many hypotheses. The alternative approach is simply to focus on each specific hypothesis, even if there are many, and to evaluate the quality of the results of the study and their compatibility with other evidence only with respect to that specific hypothesis

 

(5) Cook and Farewell (1996, pp. 96-97):

In clinical trial designs formally based on two or more responses, multiplicity adjustments may not be necessary if marginal, or separate, test results are interpreted marginally and have implications in different aspects of the prescription of the treatments (i.e. response-specific effects are of interest and separate statements regarding them are desired). Thus there may be contexts in which multiple tests of significance should be performed with reference to marginal rather than experimental error rates. If hypothesis tests are primarily directed at marginal inferences, it is then reasonable to specify a maximum tolerable error rate for each specific hypothesis test.

 

(6) Hewes (2003, p. 450):

The αs for the individual hypotheses, before or without error correction, accurately represent the risks of Type I error for each statistical test of significance. If we are interested in the knowledge acquired from each hypothesis taken one at a time, the use of experiment-wise error control is entirely irrelevant. 

 

(7) Turkheimer, Aston, and Cunningham (2004, p. 727):

If before or after testing one wishes to consider the individual result on its own individual merit, then the multiple comparison correction becomes not only incorrect but also meaningless.

 

(8) Veazie (2006, p. 809):

If a conclusion would follow from a single hypothesis fully developed, tested, and reported in isolation from other hypotheses, then a single hypothesis test is warranted.

 

(9) Matsunaga (2007, p. 255):

If multiple H0s are tested, inflation is of no concern because Type I errors are partitioned per H0, each of which entails distinct alphas. If multiple tests are carried out within one H0, however, overall Type I error rate for that H0 becomes inflated and adjustment needs to be made.

 

(10) Senn (2007; pp. 150-151):

However many tests one carries out, the probability of making at least one type I error per test is not increased. Therefore, it can be claimed that, if all tests conducted are reported and the trialist takes the rough with the smooth, considering not only significant but nonsignificant results, then there should be no problem.

 

(11) Hurlbert and Lombardi (2012, p. 30):

Whatever statistical tests are dictated by the objectives and design of a study are best carried out one-by-one without any adjustments for multiplicities, whether these derive from there being multiple treatments, multiple monitoring dates or multiple response variables.

 

(12) Sinclair, Taylor, and Hobbs (2013, p. 19):

It could be argued that a researcher should be most interested in examining individual hypotheses, and that examining the so called composite hypothesis is rarely of practical or scientific concern.

 

(13) Armstrong (2014, p. 505):

No correction would be advised in the following circumstances…if multiple usage of a simple test such as ‘t’ or ‘r’ is envisaged, if it is the results of the individual tests that are important. Instead, the exact p values for each individual test should be quoted and discussed appropriately.

 

(14) ME! - Rubin (2017, p. 272):

A researcher who undertakes single tests of 100 different null hypotheses will have a relatively high probability of incorrectly rejecting one of those hypotheses, but she will not increase the probability of incorrectly rejecting each hypothesis.

 

(15) Parker and Weir (2020, p. 564):

If a type I error rate is of great concern for a given treatment, then this should be addressed by the individual α-levels themselves – not indirectly via controlling the overall FWER [familywise error rate].

 

(16) ME! - Rubin (2020, p. 380):

It is also important to distinguish between multiple testing and multiple cases of individual testing. Multiple testing occurs when several tests are used to make a single claim about a joint hypothesis. In contrast, individual testing occurs when a single test is used to make a single claim about an individual hypothesis (Rubin, 2017b, 2019a; Tukey, 1953, pp. 82-83). Familywise error rates and alpha adjustments are only required in the case of multiple testing. They are not required in the case of individual testing, even if multiple cases of individual testing occur within the same study.

 

(17) Greenland (2020, p. 5):

Suppose we have a family of K individual (single) hypotheses HT1, …, HTK, with the joint hypothesis Hjoint that all K of the single hypotheses HTk are correct; that is, Hjoint is “HT1 and … and HTK.” Some authors insist incorrectly (albeit often implicitly and unknowingly) on imposing MCAs [multiple comparison adjustments] which are devoted to keeping the probability of rejecting Hjoint below an individually determined error rate αsingle, regardless of the actual study goals. Such insistence misdirects attention from the entire set of K hypotheses onto the one joint hypothesis Hjoint.

 

By undertaking an exploration, one must accept that some error will be nearly inevitable. A 5% per individual hypothesis false rejection rate (setting αsingle = 0.05) may be perfectly acceptable depending on the study goals and background expectations, especially if the cost of false rejections (“false positives”) is not more than the cost of false negatives. As an example, suppose K = 20 and we proceed with αsingle = 0.05 as the maximum acceptable false rejection (type I) error rate for individual hypotheses. Then, if the joint null Hjoint is correct we should expect at least one false rejection among the single α-level tests 1 − 0.95²⁰ = 64% of the time. This 64% is indeed the false rejection rate for testing Hjoint based on this naïve test: “Reject Hjoint if at least one of the single tests rejects at the 0.05 level.”

 

This 64% sounds like a high overall error rate; that impression is, however, an illusion created by thinking the single-hypothesis αsingle = 0.05 is targeting the multiple hypothesis Hjoint or should be equated to the maximum acceptable error rate αjoint for Hjoint. The question αsingle is targeting, however, is not whether the joint null Hjoint is false, but rather which if any of the HTk are false. Because αsingle is the individual false rejection rate, it follows that, if all the tested hypotheses are correct and thus the joint null is correct, using αsingle = 0.05 we should expect only one false rejection among the 20 hypotheses tested, which is precisely the acceptable error rate implied by setting αsingle = 0.05 in the first place. The question of whether Hjoint is correct need not enter and is indeed irrelevant if our concern is to simply screen the HTk while keeping the individual false rejection rate within this family at 5% or less.
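The arithmetic in this quote is easy to verify; a minimal sketch, assuming the same independence and binomial model that Greenland describes:

```python
alpha_single, K = 0.05, 20
print(1 - (1 - alpha_single) ** K)  # ~0.64: P(at least one false rejection)
print(K * alpha_single)             # 1.0: expected number of false rejections
```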


(18) ME! - Rubin (2021a, p. 3):

It is only necessary to lower the significance threshold when undertaking multiple tests of a single joint null hypothesis using a union-intersection approach. It is not necessary to lower the significance threshold when undertaking single tests of multiple individual null hypotheses.


(19) ME! - Rubin (2021b, p. 10991):

If each decision to reject each individual null hypothesis depends on no more than one significance test, then none of the individual tests constitute a “family” with respect to any single hypothesis. Consequently, it is not necessary to adjust alpha levels on the basis of any family-based error rate (e.g., familywise error rate, per family error rate, etc.; Hurlbert & Lombardi, 2012, p. 30).


(20) Molloy et al. (2022, p. 2):

If drugs with different mechanisms of action are evaluated in the same trial, we believe that control for multiplicity is not required, just as if they were evaluated in separate trials.

 

(21) Parker and Weir (2022, p. 2):

The “per-comparisonwise error rate” (PCWER) does not increase with multiple testing [2, 21, 22]. That means that if we adopt a precise, focused interpretation of the individual results, then there is no need to either apply a multiplicity adjustment or downgrade our interpretation to “exploratory.”


(22) García-Pérez (2023, p. 15):

In general, avoid corrections for multiple testing if statistical claims are to be made for each individual test in the absence of an omnibus null hypothesis about which all of the tests speak collectively and for which the Type-I error rate will be α by these corrections.