p-hacking occurs when researchers conduct multiple significance tests and then selectively report tests that yield desired, usually significant, results without correcting for multiple testing. Hence, p-hacking represents a type of undisclosed cherry-picking or fishing for specific (significant) results.
Although p-hacking is often described as merely a “questionable research practice” (John et al., 2012), it is widely regarded as statistically and ethically problematic (e.g., Miller et al., 2025; Pickett & Roche, 2018). It is also believed to be a major contributor to the replication crisis (e.g., Bishop, 2019). In particular, p-hacking is thought to inflate Type I error rates above their conventional nominal level, resulting in a larger proportion of false positive results in the literature than would otherwise be expected. This excess of false positives is then thought to cause unexpectedly low replication rates.
In this recent preprint, I aim to add some nuance to this view by distinguishing between two philosophies of significance testing — the error statistical approach (Mayo, 1996, 2018) and the formal inference approach (Rubin, 2021, 2024a, 2024b). I argue that, although p-hacking inflates Type I error rates in the error statistical approach, it does not inflate them in the formal inference approach.
To illustrate, imagine the following example of p-hacking: A researcher conducts a two-sided independent samples t test using a conventional alpha level of 0.05. They fail to find a significant result for their first null hypothesis H0,1: t(326) = 1.88, p = 0.061. They therefore remove an outlier from their sample and conduct a second test. The second test has fewer degrees of freedom than the first, and it is formally specified by a different test procedure (i.e., one with outliers removed). Hence, it tests a different statistical null hypothesis, H0,2. This time, the researcher finds a significant result, t(325) = 2.16, p = 0.032, which they report without disclosing the nonsignificant result of the first test.
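As a quick check of these numbers, the two-sided p-values can be recovered from the reported t statistics and degrees of freedom; a minimal sketch using scipy (not part of the preprint):

```python
from scipy import stats

# Recover the example's two-sided p-values from the reported t statistics.
p1 = 2 * stats.t.sf(1.88, df=326)  # test 1, full sample: ~0.061
p2 = 2 * stats.t.sf(2.16, df=325)  # test 2, outlier removed: ~0.032
print(f"p1 = {p1:.3f}, p2 = {p2:.3f}")
```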
From an error statistical perspective, the researcher has conducted two significance tests (p1;H0,1 and p2;H0,2) and then selectively reported whichever test(s) yielded a significant result (Mayo, 1996, pp. 303–304, 348; Mayo, 2018, pp. 274–275). In this case, the “actual” sampling distribution is given under the “actual” test procedure’s “global” or “universal” intersection null hypothesis, H0,1 ∩ H0,2 (Mayo, 2018, p. 276). Consequently, the “actual” error rate is the familywise error rate of 0.098 (i.e., 1 − [1 − 0.05]²), which is inflated above the nominal error rate of 0.05.
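This familywise calculation treats the two tests as independent. Under that simplifying assumption it can be checked with a small simulation: under a true null hypothesis a valid p-value is uniform on [0, 1], so we can draw pairs of uniform p-values and count how often at least one falls below .05. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, n_sims = 0.05, 1_000_000

# Under true nulls, valid p-values are uniform on [0, 1].
p1 = rng.uniform(size=n_sims)
p2 = rng.uniform(size=n_sims)

# Familywise error rate: at least one of the two tests is significant.
fwer = np.mean((p1 <= alpha) | (p2 <= alpha))
print(f"simulated: {fwer:.3f}; analytic: {1 - (1 - alpha) ** 2:.4f}")  # ~0.098; 0.0975
```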
In contrast, from a formal inference perspective, the relevant sampling distribution is given under H0,2, not H0,1 ∩ H0,2, because the researcher only reported an individual inference about H0,2, not a union-intersection inference about H0,1 ∩ H0,2. Consequently, the relevant test procedure is the formally reported procedure for H0,2, in which any outliers are removed, and the relevant Type I error rate is the nominal error rate (Rubin, 2021, p. 10991; Rubin, 2025, p. 10).
Importantly, from a formal inference perspective, it is not appropriate to compare the “actual” familywise error rate for a decision about H0,1 ∩ H0,2 with the nominal error rate for a decision about H0,2 and argue that the former represents an “inflated” version of the latter. These two error rates are incommensurate with one another because they refer to two separate decisions about two different statistical null hypotheses based on two different statistical models. Arguing that a test’s Type I error rate has been inflated because another test has a larger error rate is like arguing that a person’s height has been inflated because their friend is taller than them!
It is also logically inconsistent to use a familywise error rate to license an individual inference. Specifically, we would be committing a fallacy of division or an ecological fallacy in this case because we would be misapplying an aggregate-level union probability to an individual member of the aggregate (Selvin, 1958; Waller, 2018). In particular, the union probability of obtaining at least one significant result given H0,1 ∩ H0,2 (i.e., the familywise error rate) does not represent the individual probability of obtaining a significant result given H0,2 alone (García-Pérez, 2023; Rubin, 2021, pp. 10978–10983; Rubin, 2024a, p. 3; Rubin, 2024b, p. 51).
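To put numbers on the division fallacy: the aggregate-level union probability and the individual-level probability answer different questions, and (for independent tests at α = .05) they differ by nearly a factor of two. A minimal sketch:

```python
alpha = 0.05

# Aggregate-level union probability: P(at least one significant | H0,1 ∩ H0,2),
# i.e., the familywise error rate for two independent tests.
p_union = 1 - (1 - alpha) ** 2  # 0.0975

# Individual-level probability: P(p2 significant | H0,2), regardless of test 1.
p_individual = alpha  # 0.05

# The fallacy of division assigns p_union to the individual test,
# nearly doubling its apparent Type I error rate.
print(p_union, p_individual)
```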
The familywise error rate would only be relevant if H0,2 were treated as a logically exchangeable constituent of H0,1 ∩ H0,2, rather than as a distinct individual hypothesis. Here, the researcher would make a union-intersection inference about H0,1 ∩ H0,2 as a whole, rather than an individual inference about H0,2 alone, and this inference would be based on at least one significant result among p1;H0,1 and p2;H0,2 (García-Pérez, 2023, p. 2; Rubin, 2021, p. 10981; Rubin, 2024a, p. 2). In this case, they would conclude that "either H0,1 or H0,2 or both are false" rather than "H0,2 is false."
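The two inferences can be written out as two different decision rules. In the hypothetical sketch below (the function names are mine, not the preprint’s), the union-intersection rule rejects the joint null when at least one constituent test is significant, so at an unadjusted per-test α of .05 its Type I error rate is the familywise rate of ~.098:

```python
def union_intersection_decision(p_values, alpha=0.05):
    # Test of H0,1 ∩ ... ∩ H0,k: reject if at least one test is significant.
    # Warranted conclusion: "at least one H0 is false" -- not which one.
    return min(p_values) <= alpha

def individual_decision(p, alpha=0.05):
    # Test of a single H0: reject based on that hypothesis's own p-value.
    # Warranted conclusion: "this particular H0 is false."
    return p <= alpha

p_values = [0.061, 0.032]  # the example's two tests
print(union_intersection_decision(p_values))  # True: "H0,1 or H0,2 (or both) is false"
print(individual_decision(p_values[1]))       # True: "H0,2 is false"
```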
In summary, from an error statistical perspective, (a) the “actual” test procedure includes two significance tests: p1;H0,1 and p2;H0,2; (b) the “actual” sampling distribution is given under the intersection null hypothesis H0,1 ∩ H0,2; and so (c) the “actual” Type I error rate is the familywise error rate of 0.098 (i.e., 1 − [1 − 0.05]²), which is “inflated” relative to the nominal error rate of 0.05. In contrast, from a formal inference perspective, (a) the formally reported inference is an individual inference about the individual null hypothesis H0,2; (b) the relevant sampling distribution for this inference is given under H0,2; and so (c) the relevant Type I error rate is p2;H0,2’s nominal error rate of 0.05, which is incommensurate with the familywise error rate for a decision about H0,1 ∩ H0,2. Figure 1 illustrates these differences.
Figure 1. p-Hacking in the Formal Inference and Error Statistical Approaches
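For completeness, the example’s full procedure can also be simulated end to end. The sketch below uses my own operationalization of “remove an outlier” (drop the observation farthest from the grand mean); both groups are drawn from one population, so every null hypothesis is true, and we count how often the procedure ends in a reported significant result. The rate lands above the nominal .05; its exact value depends on how strongly the two tests are dependent, with 1 − 0.95² ≈ .098 corresponding to the limiting case of independent tests.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha, n, n_sims = 0.05, 164, 20_000  # n = 164 per group gives df = 326

hits = 0  # runs that end with a reported significant result
for _ in range(n_sims):
    a = rng.normal(size=n)  # both groups come from the same population,
    b = rng.normal(size=n)  # so H0,1 and H0,2 are both true
    if stats.ttest_ind(a, b).pvalue <= alpha:
        hits += 1
        continue
    # p-hack: drop the observation farthest from the grand mean, then retest.
    pooled = np.concatenate([a, b])
    i = int(np.argmax(np.abs(pooled - pooled.mean())))
    a2 = np.delete(a, i) if i < n else a
    b2 = b if i < n else np.delete(b, i - n)
    if stats.ttest_ind(a2, b2).pvalue <= alpha:
        hits += 1

print(f"rate of reporting a significant result: {hits / n_sims:.3f}")  # > 0.05
```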
The contrast between the error statistical and formal inference approaches has implications for how we conceptualize p-hacking. In particular, p-hacking may be viewed as more or less problematic for Type I error rates depending on one’s philosophy of significance testing. Accordingly, discussions of the potential dangers of p-hacking would benefit from being situated more clearly within a specific philosophy of significance testing, so that otherwise implicit assumptions become explicit.
Article
Rubin, M. (2026, February 23). p-hacking inflates Type I error rates in the error statistical approach but not in the formal inference approach. PsyArXiv. https://doi.org/10.31234/osf.io/qr685_v2
References
Bishop, D. V. (2019). Rein in the four horsemen of irreproducibility. Nature, 568(7753), 435–436. https://www.nature.com/articles/d41586-019-01307-2
García-Pérez, M. A. (2023). Use and misuse of corrections for multiple testing. Methods in Psychology, 8, 100120. https://doi.org/10.1016/j.metip.2023.100120
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5), 524–532. https://doi.org/10.1177/0956797611430953
Mayo, D. G. (1996). Error and the growth of experimental knowledge. University of Chicago Press.
Mayo, D. G. (2018). Statistical inference as severe testing: How to get beyond the statistics wars. Cambridge University Press.
Miller, J. D., Phillips, N. L., & Lynam, D. R. (2025). Questionable research practices violate the American Psychological Association’s Code of Ethics. Journal of Psychopathology and Clinical Science, 134(2), 113–114. https://doi.org/10.1037/abn0000974
Pickett, J. T., & Roche, S. P. (2018). Questionable, objectionable or criminal? Public opinion on data fraud and selective reporting in science. Science and Engineering Ethics, 24, 151–171. https://doi.org/10.1007/s11948-017-9886-2
Rubin, M. (2021). When to adjust alpha during multiple testing: A consideration of disjunction, conjunction, and individual testing. Synthese, 199, 10969–11000. https://doi.org/10.1007/s11229-021-03276-4
Rubin, M. (2024a). Inconsistent multiple testing corrections: The fallacy of using family-based error rates to make inferences about individual hypotheses. Methods in Psychology, 10, 100140. https://doi.org/10.1016/j.metip.2024.100140
Rubin, M. (2024b). Type I error rates are not usually inflated. Journal of Trial and Error, 4(2), 46–71. https://doi.org/10.36850/4d35-44bd
Rubin, M. (2025). Preregistration does not improve the transparent evaluation of severity in Popper’s philosophy of science or when deviations are allowed. Synthese, 206, 111. https://doi.org/10.1007/s11229-025-05191-4
Selvin, H. C. (1958). Durkheim’s suicide and problems of empirical research. American Journal of Sociology, 63(6), 607–619. https://doi.org/10.1086/222356
Waller, J. (2018). Division. In R. Arp, S. Barbone, & M. Bruce (Eds.), Bad arguments (pp. 259–260). Wiley. https://doi.org/10.1002/9781119165811.ch56