Post date: Mar 19, 2014 8:34:38 AM
The blog post below just passed initial review as a comment submitted to a journal. The e-mail notifying me of this decision stated that I will learn the outcome of the review process in 6 to 8 weeks. I know that’s relatively fast, but Twitter and blog posts have spoiled me. The internet reminds me every day that my goal is to communicate with my fellow researchers, and journals are only one way, not necessarily the best way, to do this. It might still appear in a journal, or not; it might be improved by peer review, or not; and you may like it as it is, or not (for comments, talk to me @Lakens).
Readers are likely more familiar with articles that criticize null hypothesis significance testing (NHST) than with articles in support of NHST (e.g., Frick, 1996; Mogie, 2004; Wainer & Robinson, 2003). Articles that question the status quo are bound to receive more attention than more nuanced calls for a unified approach to statistical inferences (e.g., Berger, 2003). This paints a biased picture of disagreement, with a focus on those aspects of statistical techniques in which one approach outperforms another, instead of stressing the relative benefits of using multiple procedures and teaching researchers how to improve the inferences they draw. For example, two major criticisms of NHST (that the null is never true, and that NHST promotes dichotomous thinking) are easily addressed by acknowledging that, even though an effect is often trivially small, it is never exactly 0. Therefore, a statistical test allows for three possible conclusions (e.g., Jones & Tukey, 2000): a positive difference, a negative difference, or that the direction of the effect remains undetermined:
1. µ1 - µ2 > 0
2. µ1 - µ2 < 0
3. the direction of µ1 - µ2 is undetermined
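For readers who like to see this in code, the sketch below (in Python, with a hypothetical helper and made-up data) shows how an ordinary two-sided t-test maps onto these three conclusions, assuming a conventional alpha level.

```python
from scipy import stats

def three_way_conclusion(group1, group2, alpha=0.05):
    """Map a two-sided t-test onto the three conclusions of Jones & Tukey (2000)."""
    t, p = stats.ttest_ind(group1, group2)
    if p < alpha and t > 0:
        return "mu1 - mu2 > 0"
    if p < alpha and t < 0:
        return "mu1 - mu2 < 0"
    return "the direction of mu1 - mu2 remains undetermined"

# Example with made-up ratings for two groups
print(three_way_conclusion([5.1, 6.0, 5.8, 6.2, 5.5], [4.2, 4.9, 5.0, 4.4, 4.8]))
```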
Other examples of how NHST can be improved are testing hypotheses against minimal (instead of null) effects (see Murphy & Myors, 1999), or using sequential analyses to repeatedly analyze accumulating data (while controlling Type 1 error rates) until the results are sufficiently informative (see Lakens, in press). The lack of attention to such straightforward improvements is problematic, especially since neither confidence intervals nor Bayesian statistics provide fool-proof alternatives.
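To make the first of these suggestions concrete, here is a minimal sketch (with made-up data and an assumed smallest effect of interest) of a minimum-effect test for a mean difference: instead of testing against a difference of exactly zero, the null value is shifted to the smallest raw difference we would still consider meaningful.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 50
treatment = rng.normal(0.6, 1.0, n)   # made-up data
control = rng.normal(0.0, 1.0, n)
minimal_effect = 0.2                  # assumed smallest raw difference of interest

diff = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / n + control.var(ddof=1) / n)  # equals the pooled SE for equal n
t_min = (diff - minimal_effect) / se      # null hypothesis: true difference <= minimal_effect
p_min = stats.t.sf(t_min, df=2 * n - 2)   # one-sided p-value
print(f"difference = {diff:.2f}, t = {t_min:.2f}, one-sided p = {p_min:.3f}")
```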
Researchers should always report confidence intervals (CIs). As Kelley and Rausch (2006) explain, it is misleading to report point estimates without illustrating the uncertainty surrounding the parameter estimate. However, the information expressed by a CI is perhaps even less intuitive than the use of conditional probabilities such as p-values, and might even be more widely misunderstood (see Hoekstra, Morey, Rouder, & Wagenmakers, in press).
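Reporting the interval alongside the point estimate takes only a few extra lines of code; a minimal sketch with made-up scores:

```python
import numpy as np
from scipy import stats

scores = np.array([7.2, 8.1, 6.9, 7.8, 8.4, 7.5, 7.1, 8.0])  # made-up ratings
m = scores.mean()
se = scores.std(ddof=1) / np.sqrt(len(scores))
lower, upper = stats.t.interval(0.95, df=len(scores) - 1, loc=m, scale=se)
print(f"M = {m:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```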
As long as selective reporting of performed experiments persists (both through publication bias and through the selection of ‘successful’ studies by individual researchers), confidence intervals in the published literature will be difficult to interpret. For example, although 83.4% (roughly 5 out of 6) of replication studies will give a value that falls within the 95% CI of the original study, this is only true if the original study was one of an infinite sequence of unbiased studies. Given the strong indications of publication bias in psychology (Fanelli, 2010), the correct interpretation of confidence intervals from the published literature is always uncertain. Researchers have proposed a ban on p-values for less problematic issues.
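The 83.4% figure, and how selective reporting breaks it, is easy to check with a small simulation. The sketch below (all numbers are made up) draws an original study and a replication from the same population, and records how often the replication mean falls inside the original 95% CI, once for all original studies and once only for original studies that happened to be significant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2014)
n, mu, sigma = 20, 0.3, 1.0                  # assumed true effect and per-study sample size
capture_all, capture_significant = [], []

for _ in range(20000):
    original = rng.normal(mu, sigma, n)
    m, se = original.mean(), original.std(ddof=1) / np.sqrt(n)
    lower, upper = stats.t.interval(0.95, df=n - 1, loc=m, scale=se)
    replication_mean = rng.normal(mu, sigma, n).mean()
    hit = lower <= replication_mean <= upper
    capture_all.append(hit)
    _, p = stats.ttest_1samp(original, 0)
    if p < 0.05:                             # 'publication bias': only significant originals get published
        capture_significant.append(hit)

print(f"capture rate, all original studies: {np.mean(capture_all):.1%}")                # roughly 5 out of 6
print(f"capture rate, significant originals only: {np.mean(capture_significant):.1%}")  # noticeably lower
```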
Using Bayesian statistics has many benefits (see Morey, Rouder, Verhagen, & Wagenmakers, in press). Researchers can make statements about the probability that a hypothesis is true given the data (instead of the probability of the observed or more extreme data given a hypothesis and an alpha level), provide support for the null hypothesis (Dienes, 2011), and analyze data repeatedly as the data come in. These are important benefits, and they justify a more widespread use of Bayesian statistics in psychological research. Bayesian statistics are less interesting when Bayes factors are used merely as a replacement for p-values. When a uniform prior is used, the differences between Frequentist and Bayesian inferences are not mathematical, but philosophical in nature (Simonsohn, 2014).
Whenever an informative prior is used, the assumptions about the theory that is tested will practically always leave room for subjective interpretation. For example, Dienes (2011) and Wetzels et al. (2011) made different assumptions about the same theory under test and calculated Bayes factors of 4 (substantial evidence for the theory over the null) and 1.56 (barely any evidence for the theory over the null), respectively. Based on the psychological literature, we should expect these subjective assumptions to be biased by researchers’ attitudes. Addressing this challenge is not easy, and researchers have proposed a ban on p-values for less problematic issues.
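To see how much room these assumptions leave, the sketch below computes a simple Bayes factor for the same (made-up) summary result under two different priors that could both be defended as translations of one theory. The numbers and priors are hypothetical, and the calculation follows the general logic of comparing the marginal likelihood under H1 against the likelihood under the point null; it is not a reconstruction of the actual analyses discussed above.

```python
from scipy import stats
from scipy.integrate import quad

obs, se = 0.35, 0.18   # made-up observed effect and its standard error

def bayes_factor(prior_pdf):
    """BF10: marginal likelihood of the data under H1 (prior on the effect) vs the point null."""
    marginal_h1, _ = quad(lambda theta: stats.norm.pdf(obs, theta, se) * prior_pdf(theta), -5, 5)
    return marginal_h1 / stats.norm.pdf(obs, 0, se)

# Two defensible translations of 'the theory predicts a positive effect'
wide_prior = lambda t: stats.norm.pdf(t, 0, 1.0)                    # vague about the size of the effect
narrow_prior = lambda t: 2 * stats.norm.pdf(t, 0, 0.3) * (t >= 0)   # half-normal: small positive effects expected

print(f"BF10 with wide prior:   {bayes_factor(wide_prior):.2f}")
print(f"BF10 with narrow prior: {bayes_factor(narrow_prior):.2f}")
```

With these made-up numbers the two priors give clearly different Bayes factors for exactly the same data, which is the point: the subjective translation of a theory into a prior matters.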
Neither reliance on Bayesian statistics, confidence intervals, nor p-values will be sufficient to prevent unwise statistical inferences. As an imaginary example, let’s pretend the evaluation of this blog by 10 of my Bayesian colleagues was substantially less positive (M = 7.7, SD = 0.95) than the evaluation by 10 of my Frequentist colleagues (M = 8.7, SD = 0.82). This difference is statistically significant, t(18) = 2.58, p = .02, and neither the 95% CI around the effect size (dunb = 1.08, [0.16, 2.06], see Cumming, 2014) nor the 95% highest density interval ([0.05, 1.99], see Kruschke, 2011) includes 0. Nevertheless, concluding there is something going on would be premature. The v-statistic (Davis-Stober & Dana, 2014), which compares a model based on the data against a model based on random guessing, reveals that, due to the extremely small sample size, random guessing will outperform a model based on the data 68% of the time (for details, see Lakens & Evers, in press). There will never be a single statistical procedure that tells us everything we want to know with adequate certainty.
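For those who want to check the frequentist half of this toy example, the sketch below reproduces the t-test and the unbiased effect size from the summary statistics (because the means and SDs above are rounded, the t-value can differ slightly in the second decimal).

```python
from scipy import stats

m_freq, sd_freq, n_freq = 8.7, 0.82, 10      # Frequentist colleagues (made-up data)
m_bayes, sd_bayes, n_bayes = 7.7, 0.95, 10   # Bayesian colleagues (made-up data)

t, p = stats.ttest_ind_from_stats(m_freq, sd_freq, n_freq, m_bayes, sd_bayes, n_bayes)
pooled_sd = (((n_freq - 1) * sd_freq**2 + (n_bayes - 1) * sd_bayes**2) / (n_freq + n_bayes - 2)) ** 0.5
d = (m_freq - m_bayes) / pooled_sd                  # Cohen's d
d_unb = d * (1 - 3 / (4 * (n_freq + n_bayes) - 9))  # approximate small-sample (Hedges) correction
print(f"t({n_freq + n_bayes - 2}) = {t:.2f}, p = {p:.3f}, d_unb = {d_unb:.2f}")
```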
If statisticians had intentionally tried to induce learned helplessness and an escape to dichotomous conclusions based on oversimplified statistical inferences, they could not have done a better job than through the continued disagreement about how to draw statistical inferences from observed data. One might wonder what the practical significance of statisticians is, if they fail to provide “a concerted professional effort to provide the scientific world with a unified testing methodology” (Berger, 2003, p. 4). At the same time, any researcher who unquestioningly believes a p < .05 indicates an effect is likely to be true should be blamed for not spending more time learning statistics. In the end, improving the way we work will only succeed as a collaborative effort relying on a multi-perspective approach.
Cumming, G. (2014). The new statistics: Why and how. Psychological Science. doi:10.1177/0956797613504966
Davis-Stober, C. P., & Dana, J. (2014). Comparing the accuracy of experimental estimates to guessing: A new perspective on replication and the “Crisis of Confidence” in psychology. Behavior Research Methods. doi:10.3758/s13428-013-0342-1
Dienes, Z. (2011). Bayesian versus orthodox statistics: Which side are you on? Perspectives on Psychological Science, 6, 274–290. doi:10.1177/1745691611406920
Fanelli, D. (2010). “Positive” results increase down the hierarchy of the sciences. PLoS ONE, 5, e10068.
Frick, R. W. (1996). The appropriate use of null hypothesis testing. Psychological Methods, 1, 379-390.
Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E.-J. (in press). Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review.
Jones, L. V., & Tukey, J. W. (2000). A sensible formulation of the significance test. Psychological Methods, 5, 411-414.
Kelley, K., & Rausch, J. R. (2006). Sample size planning for the standardized mean difference: accuracy in parameter estimation via narrow confidence intervals. Psychological Methods, 11, 363-385.
Kruschke, J. K. (2011). Bayesian assessment of null values via parameter estimation and model comparison. Perspectives on Psychological Science, 6, 299–312. doi:10.1177/1745691611406925
Lakens, D. (in press). Performing high-powered studies efficiently with sequential analyses. European Journal of Social Psychology.
Lakens, D., & Evers, E. (in press). Sailing from the seas of chaos into the corridor of stability: Practical recommendations to increase the informational value of studies. Perspectives on Psychological Science.
Mogie, M. (2004). In support of null hypothesis significance testing. Proceedings of the Royal Society of London Series B: Biological Sciences, 271, S82–S84.
Morey, R. D., Rouder, J. N., Verhagen, J., & Wagenmakers, E.-J. (in press). Why hypothesis tests are essential for psychological science: A comment on Cumming. Psychological Science.
Murphy, K. R., & Myors, B. (1999). Testing the hypothesis that treatments have negligible effects: Minimum-effect tests in the general linear model. Journal of Applied Psychology, 84, 234-248.
Simonsohn, U. (2014). Posterior-hacking: Selective reporting invalidates Bayesian results also. Available at SSRN: http://ssrn.com/abstract=2374040
Wainer, H., & Robinson, D. H. (2003). Shaping up the practice of null hypothesis significance testing. Educational Researcher, 32, 22-30.
Wetzels, R., Matzke, D., Lee, M. D., Rouder, J. N., Iverson, G. J., & Wagenmakers, E.-J. (2011). Statistical evidence in experimental psychology. Perspectives on Psychological Science, 6, 291–298.