The Costs of HARKing: 

Does it Matter if Researchers Engage in Undisclosed Hypothesizing After the Results are Known?

While no one's looking, a Texas sharpshooter fires his gun at a barn wall, walks up to his bullet holes, and paints targets around them. When his friends arrive, he points at the targets and claims he’s a good shot (de Groot, 2014; Rubin, 2017b). In 1998, Norbert Kerr discussed an analogous situation in which researchers engage in undisclosed hypothesizing after the results are known, or HARKing. In this case, researchers conduct statistical tests, observe their results (bullet holes), and then construct post hoc hypotheses (paint targets) to fit these results. In their research reports, they then pretend that their post hoc hypotheses are actually a priori hypotheses. This questionable research practice is thought to have contributed to the replication crisis in science (e.g., Shrout & Rodgers, 2018), and it provides part of the rationale for researchers to publicly preregister their hypotheses before they conduct their analyses (Wagenmakers et al., 2012). In a recent BJPS article (Rubin, 2022), I discuss the concept of HARKing from a philosophical standpoint and then undertake a critical analysis of Kerr’s 12 potential costs of HARKing.

I begin my article by noting that scientists do not make absolute, dichotomous judgements about theories and hypotheses being “true” or “false.” Instead, they make relative judgements about theories and hypotheses being more or less true than other theories and hypotheses in accounting for certain phenomena. These judgements can be described as estimates of relative verisimilitude (Cevolani & Festa, 2018).

I then note that a HARKer is obliged to provide a theoretical rationale for their secretly post hoc hypothesis in the Introduction section of their research report. Despite being secretly post hoc, this theoretical rationale provides a result-independent basis for an initial estimate of the relative verisimilitude of the HARKed hypothesis. (The rationale is "result-independent" because it doesn't formally refer to the current result. If it did, then the rationale's post hoc status would no longer be a secret!) The current result can then provide a second, epistemically independent basis for adjusting this initial estimate of relative verisimilitude upwards or downwards (for a similar view, see Lewandowsky, 2019; Oberauer & Lewandowsky, 2019). Hence, readers can estimate the relative verisimilitude of a HARKed hypothesis (a) without taking the current result into account and (b) after taking the current result into account, even if they have been misled about when the researcher deduced the hypothesis. Consequently, readers can undertake a valid updating of the estimated relative verisimilitude of a hypothesis even though, unbeknownst to them, it has been HARKed. Importantly, there's no “double-counting” (Mayo, 2008), “circular reasoning” (Nosek et al., 2018, p. 2600), or violation of the use novelty principle (Worrall, 1985, 2014) here, because the current result has not been used in the formal theoretical rationale for the HARKed hypothesis. It is therefore legitimate to use the current result to change (increase or decrease) the initial estimate of the relative verisimilitude of that hypothesis.
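The article frames these judgements as estimates of relative verisimilitude rather than probabilities, but the two-step logic (a result-independent initial estimate, then an adjustment based on the current result) can be loosely caricatured as Bayesian updating. The sketch below is my illustration, not the article's formalism, and all of the numbers are invented:

```python
def update_credence(prior, p_result_given_h, p_result_given_not_h):
    """Bayes' rule: combine a result-independent prior with the
    evidential weight of the current result."""
    numerator = prior * p_result_given_h
    return numerator / (numerator + (1 - prior) * p_result_given_not_h)

# Step 1: an initial estimate based only on the theoretical rationale.
# Because the rationale is result-independent, this estimate is
# untouched by the secrecy about when the hypothesis was constructed.
prior = 0.6  # invented number

# Step 2: the current result provides a second, independent basis.
# If the result is more likely under the hypothesis, support goes up...
supportive = update_credence(prior, 0.8, 0.2)  # roughly 0.86
# ...and if it is less likely, support goes down.
contrary = update_credence(prior, 0.2, 0.8)    # roughly 0.27
```

There is no double-counting in this sketch: the result enters only at Step 2, never in the rationale that fixes the prior.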

To translate this reasoning to the Texas sharpshooter analogy, it's necessary to distinguish HARKing from p-hacking. If our sharpshooter painted a new target around his stray bullet hole but retained his substantive claim that he's “a good shot,” then he'd be similar to a researcher who conducted multiple statistical tests and then selectively reported only those results that supported their original a priori substantive hypothesis. Frequentist researchers would call this researcher a “p-hacker” rather than a HARKer (Rubin, 2017b, p. 325; Stefan & Schönbrodt, 2023, p. 3). To be a HARKer, researchers must also change their original a priori hypothesis or create a totally new one. Hence, a more appropriate analogy is to consider a sharpshooter who changes both their statistical hypothesis (i.e., paints a new target around their stray bullet hole) and their broader substantive hypothesis (their claim). Let's call the sharpshooter in this revised analogy Jane!

Jane initially believes “I’m a good shot” (H1). However, after missing the target that she was aiming for (T1), she secretly paints a new target (T2) around her bullet hole and declares to her friends: "I'm a good shot, but I can't adjust for windy conditions. I aimed at T1, but there was a 30 mph easterly cross-wind, and that's how I knew in advance that I'd hit T2 instead." In this case, Jane has generated a new, post hoc hypothesis (H2) and passed it off as an a priori hypothesis. Note that Jane isn't being deceptive about her procedure here (i.e., what she actually did): It's true that she aimed her gun at T1, and it's true that there was a cross-wind. She's only being deceptive about the a priori status of H2, which she secretly developed after she missed T1 (i.e., she's HARKing). Importantly, however, Jane's deception doesn't prevent her friends from making a valid initial estimate of the verisimilitude of her HARKed hypothesis and then updating this estimate based on the location of her bullet hole:

"We know that Jane's always trained indoors. So, it makes sense that she hasn't learned to adjust for windy conditions. We also know that (a) Jane was aiming at T1, and (b) there was a 30 mph easterly cross-wind. Our calculations show that, if someone was a good shot, and they were aiming at T1, but they didn't adjust for an easterly 30 mph cross-wind, then their bullet would hit T2's location. So, our initial estimated verismilitude for H2 is relatively high. The evidence shows that Jane's bullet did, in fact, hit T2. Consequently, we can tentatively increase our support for H2: Jane appears to be a good shot who can't adjust for windy conditions. Of course, we'd also want to test H2 again by asking Jane to hit targets on both windy and non-windy days!"

The Texas Sharpshooter Fallacy

We can predict the location of the sharpshooter's bullet hole on the basis of her (secretly HARKed) hypothesis that she is a good shot but cannot adjust for windy conditions. We can then use the location of the bullet hole to increase or decrease our estimated relative verisimilitude for this prediction. 


Source: https://pixabay.com/photos/woman-rifle-shoot-gun-weapon-2577104/ 

The second part of my paper provides a critical analysis of Kerr’s (1998) 12 costs of HARKing. Kerr’s costs and my responses are summarised below. The text is sourced from my article.

1. Translating Type I errors into hard-to-eradicate theory.
Response: The overfitting of post hoc hypotheses to Type I errors is not possible when those hypotheses are deduced from a priori theory and evidence. Flexible theorizing is possible, but it can be identified and taken into account in estimates of relative verisimilitude.

2. Propounding theories that cannot (pending replication) pass Popper’s disconfirmability test.
Response: Popper’s disconfirmability test refers to the content of hypotheses, not the timing of the construction of those hypotheses. HARKed hypotheses can pass this test.

3. Disguising post hoc explanations as a priori explanations (when the former tend also be [sic] more ad hoc, and consequently, less useful).
Response: Deducing post hoc explanations from a priori theory and evidence provides an epistemically independent basis for estimating relative verisimilitude that prevents overfitting and validates post hoc predictions.

4. Not communicating valuable information about what did not work.
Response: The potential costs associated with biased research conclusions, missing null results, and/or unreported failed methods are alleviated via pre- and post-publication peer review and the public availability of research materials and data.

5. Taking unjustified statistical licence.
Response: Post hoc theoretical rationales that are deduced from a priori theory and evidence can provide valid justifications for directional statistical tests.

6. Presenting an inaccurate model of science to students.
Response: Publication bias presents an inaccurate model of science to students. HARKing is only a response to publication bias.

7. Encouraging “fudging” in other grey areas.
Response: There is limited evidence that HARKing encourages the use of inappropriate research practices.

8. Making us less receptive to serendipitous findings.
Response: Lack of receptiveness to serendipitous findings is caused by the scientific community valuing a priori prediction over post hoc prediction. HARKing solves this problem rather than causes it.

9. Encouraging adoption of narrow, context-bound new theory.
Response: Even if HARKing is associated with low quality theorizing, this problem is taken into account in readers’ estimates of relative verisimilitude.

10. Encouraging retention of too-broad, disconfirmable old theory.
Response: Too-broad, disconfirmable theory will receive low ratings of relative verisimilitude.

11. Inhibiting identification of plausible alternative hypotheses.
Response: The scientific publication system requires researchers to consider alternative explanations in their research reports. HARKing facilitates this process in a system that discourages the publication of post hoc hypothesis testing.

12. Implicitly violating basic ethical principles.
Response: The ethical principle of honesty and openness applies to information that forms a useful part of the truth. The timing of the construction of a hypothesis does not form a useful part of the truth.

In summary, I argue that HARKing conceals the timing of a researcher’s personal hypothesizing, but it doesn't conceal the quality of (a) the hypothesizing, (b) the research methodology, or (c) the statistical analysis. Readers can make judgements about the quality of each of these aspects of the research without knowing the timing of the researcher’s hypothesizing. So, even if readers are unaware that a hypothesis has been HARKed, they are still able to criticize (a) the theoretical quality of the HARKed hypothesis (e.g., too-broad theorizing, too-narrow theorizing, or the neglect of alternative explanations; Costs 9, 10, & 11), (b) the appropriateness of the methodology for testing that hypothesis (Rubin, 2017b, p. 312), and (c) the appropriateness of the statistical analyses (e.g., lack of correction for multiple testing or lack of justification for directional tests; Costs 1 & 5). Hence, I argue that Costs 1, 5, 9, 10, and 11 are misattributed to HARKing when they are actually criticisms of the quality of the hypothesizing and data analysis. Similarly, Costs 6 and 8 are misattributed to HARKing when they are actually due to publication bias (Cost 6) and the scientific community’s preference for a priori prediction over post hoc prediction (Cost 8).

I propose that Cost 2 is misconceived: HARKed hypotheses can be constructed to be disconfirmed, as well as confirmed, by the research results, and they do not necessarily fail Popper’s disconfirmability test. Costs 1 and 3 fail to recognize that the act of disguising ad hoc accommodation as prediction by deducing hypotheses from a priori theory and evidence effectively turns accommodation into post hoc prediction and, in doing so, precludes overfitting. 

The suppression of a priori hypotheses (Cost 4) may lead to biased research conclusions, missing null results, and/or unreported failed methods. However, biased conclusions can be addressed through pre- and post-publication peer review, and missing information can be addressed by making research materials, data, and coding information publicly available.

Cost 7 lacks empirical evidence: As I explained in my paper, there is no clear support for Kerr’s (1998) slippery slope argument that HARKing encourages fudging in other areas. Finally, HARKing cannot be considered to be unethical if it conceals information that is uninformative, and I argue that the timing of the hypothesizing may be considered to be scientifically uninformative (Cost 12).

Given the potentially limited costs of HARKing to the scientific process, I argue that it is premature to conclude that HARKing is an important contributor to the replication crisis in science.

References

Cevolani, G., & Festa, R. (2018). A partial consequence account of truthlikeness. Synthese. http://dx.doi.org/10.1007/s11229-018-01947-3 

de Groot, A. D. (2014). The meaning of “significance” for different types of research (E. J. Wagenmakers, D. Borsboom, J. Verhagen, R. Kievit, M. Bakker, A. Cramer, . . . H. L. J. van der Maas, Trans.). Acta Psychologica, 148, 188–194. http://dx.doi.org/10.1016/j.actpsy.2014.02.001

Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2, 196–217. http://dx.doi.org/10.1207/s15327957pspr0203_4

Lewandowsky, S. (2019). Avoiding Nimitz Hill with more than a little red book: Summing up #PSprereg. https://featuredcontent.psychonomic.org/avoiding-nimitz-hill-with-more-than-a-little-red-book-summing-up-psprereg/ 

Mayo, D. G. (2008). How to discount double-counting when it counts: Some clarifications. The British Journal for the Philosophy of Science, 59, 857–879.

Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2018). The preregistration revolution. Proceedings of the National Academy of Sciences, 115, 2600–2606. http://dx.doi.org/10.1073/pnas.1708274114

Oberauer, K., & Lewandowsky, S., (2019). Addressing the theory crisis in psychology. Psychonomic Bulletin & Review. http://dx.doi.org/10.3758/s13423-019-01645-2 

Rubin, M. (2017a). An evaluation of four solutions to the forking paths problem: Adjusted alpha, preregistration, sensitivity analyses, and abandoning the Neyman-Pearson approach. Review of General Psychology, 21, 321–329. http://dx.doi.org/10.1037/gpr0000135

Rubin, M. (2017b). When does HARKing hurt? Identifying when different types of undisclosed post hoc hypothesizing harm scientific progress. Review of General Psychology, 21, 308–320. http://dx.doi.org/10.1037/gpr0000128

Shrout, P. E., & Rodgers, J. L. (2018). Psychology, science, and knowledge construction: Broadening perspectives from the replication crisis. Annual Review of Psychology, 69, 487–510. http://dx.doi.org/10.1146/annurev-psych-122216-011845

Stefan, A. M., & Schönbrodt, F. D. (2023). Big little lies: A compendium and simulation of p-hacking strategies. Royal Society Open Science, 10(2), Article 220346. https://doi.org/10.1098/rsos.220346 

Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7, 632–638. http://dx.doi.org/10.1177/1745691612463078

Worrall, J. (1985). Scientific discovery and theory-confirmation. In J. C. Pitt (Ed.), Change and progress in modern science: Papers related to and arising from the Fourth International Conference on History and Philosophy of Science (pp. 301–331). Dordrecht, the Netherlands: Reidel. http://dx.doi.org/10.1007/978-94-009-6525-6_11 

Worrall, J. (2014). Prediction and accommodation revisited. Studies in History and Philosophy of Science, 45, 54–61. http://dx.doi.org/10.1016/j.shpsa.2013.10.001 

Further Information

Article

Rubin, M. (2022). The costs of HARKing. The British Journal for the Philosophy of Science, 73(2), 535–560. https://doi.org/10.1093/bjps/axz050


Twitter Thread

There is also a Twitter thread about this issue here.


Reviews

Brian Haig commented, "Mark Rubin's, 'The costs of HARKing', is just the sort of nuanced philosophical analysis that we need of our understanding of questionable research practices, and other methodological concepts, more generally." https://twitter.com/BrianHaig/status/1338324169621004288