Working Papers

Lakens, D., McLatchie, N., Isager, P. M., Scheel, A. M., & Dienes, Z. (2018). Improving Inferences about Null Effects with Bayes Factors and Equivalence Tests. PsyArXiv.
Delacre, M., Lakens, D., Mora, Y., & Leys, C. (submitted). Why Psychologists Should Always Report the W-test Instead of the F-Test ANOVA. Pre-print.
Morey, R. D., & Lakens, D. (revise and resubmit). Why most of psychology is statistically unfalsifiable. Pre-print.


Zhang, C., Smolders, K., Lakens, D., & IJsselsteijn, W. (2018). Two experience sampling studies examining the variation of self-control capacity and its relationship with core affect in daily life. Journal of Research in Personality. Preprint.

To facilitate a better understanding of the role of self-control capacity in self-control processes, we examined its variation at intraindividual and interindividual levels, and positioned it in a nomological network with core affect. In two experience sampling studies, 286 university students reported their self-control capacity and core affect for a week. Results revealed larger person-to-person than day-to-day variation in self-control capacity, while its moment-to-moment variation could be weakly modeled as a diurnal pattern. Interindividually, participants with higher self-control capacity were happier and less stressed, but intraindividually higher self-control capacity was mainly associated with higher alertness and energetic arousal. Our results imply that self-control capacity is better conceptualized as a composition of interrelated sub-constructs rather than as a unified resource.

Lakens, D., Scheel, A. M., & Isager, P. M. (in press). Equivalence Testing for Psychological Research: A Tutorial. Advances in Methods and Practices in Psychological Science. Preprint.

Psychologists must be able to test both for the presence of an effect and for the absence of an effect. In addition to testing against zero, researchers can use the Two One-Sided Tests (TOST) procedure to test for equivalence and reject the presence of a smallest effect size of interest (SESOI). The two one-sided tests procedure can be used to determine if an observed effect is surprisingly small, given that a true effect at least as large as the SESOI exists. We explain a range of approaches to determine the SESOI in psychological science, and provide detailed examples of how equivalence tests should be performed and reported. Equivalence tests are an important extension of statistical tools psychologists currently use, and enable researchers to falsify predictions about the presence, and declare the absence, of meaningful effects.
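The TOST logic described above can be sketched in a few lines of Python (an illustrative one-sample version of my own, not the TOSTER R package the authors provide): equivalence is declared only when the mean is significantly above the lower bound AND significantly below the upper bound.

```python
import numpy as np
from scipy import stats

def tost_one_sample(x, low, high, alpha=0.05):
    """Two one-sided t-tests against raw-score equivalence bounds."""
    n = len(x)
    m, se = np.mean(x), stats.sem(x)
    p_lower = stats.t.sf((m - low) / se, df=n - 1)    # H1: mean > low
    p_upper = stats.t.cdf((m - high) / se, df=n - 1)  # H1: mean < high
    p_tost = max(p_lower, p_upper)  # TOST p-value is the larger of the two
    return p_tost, p_tost < alpha

# 50 observations centred on zero; equivalence bounds of +/- 0.2 raw units
x = np.linspace(-0.5, 0.5, 50)
p, equivalent = tost_one_sample(x, low=-0.2, high=0.2)
```

Because the data cluster tightly around zero relative to the bounds, both one-sided tests are significant and equivalence is declared.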

Coles, N. A., Tiokhin, L., Scheel, A. M., Isager, P. M., & Lakens, D. (in press). The costs and benefits of replication studies. Commentary on 'Making Replication Mainstream'. Behavioral and Brain Sciences. Pre-print.

The debate about whether replication studies should become mainstream is essentially driven by disagreements about their costs and benefits, and the best ways to allocate limited resources. Determining when replications are worthwhile requires quantifying their expected utility. We argue that a formalized framework for such evaluations can be useful for both individual decision-making and collective discussions about replication.

Lakens, D., Adolfi, F.G., Albers, C., Anvari, F., Apps, M.A.J., Argamon, S.E., Assen, M.A.L.M. van, Baguley, T., Becker, R., Benning, S.D., Bradford, D.E., Buchanan, E.M., Caldwell, A., Calster, B. van, Carlsson, R., Chen, S.-C., Chung, B., Colling, L., Collins, G., Crook, Z., Cross, E.S., Daniels, S., Danielsson, H., DeBruine, L., Dunleavy, D., Earp, B.D., Feist, M., Ferrell, J.D., Field, J.G., Fox, N., Friesen, A., Gomes, C., Grange, J.A., Grieve, A., Guggenberger, R., Harmelen, A.-L.V., Hasselman, F., Hochard, K.D., Hoffarth, M.R., Holmes, N.P., Ingre, M., Isager, P., Isotalus, H., Johansson, C., Juszczyk, K., Kenny, D., Khalil, A.A., Konat, B., Lao, J., Larsen, E.G., Lodder, G.M.A., Lukavsky, J., Madan, C., Manheim, D., Gonzalez-Marquez, M., Martin, S.R., Martin, A.E., Mayo, D., McCarthy, R.J., McConway, K., McFarland, C., Nilsonne, G., Nio, A.Q.X., Oliveira, C.L. de, Parsons, S., Pfuhl, G., Quinn, K., Sakon, J., Saribay, S.A., Schneider, I., Selvaraju, M., Sjoerds, Z., Smith, S., Smits, T., Spies, J.R., Sreekumar, V., Steltenpohl, C., Stenhouse, N., Świątkowski, W., Vadillo, M.A., Williams, M., Williams, S., Williams, D.R., Xivry, J.-J.O. de, Yarkoni, T., Ziano, I., Zwaan, R. (in press). Justify Your Alpha. Nature Human Behavior.

In response to recommendations to redefine statistical significance to p ≤ .005, we propose that researchers should transparently report and justify all choices they make when designing a study, including the alpha level.


Jansen, R. S., Lakens, D., & IJsselsteijn, W. (in press). An integrative review of the cognitive costs and benefits of note-taking. Educational Research Review. Pre-print.

Students frequently engage in note-taking to improve the amount of information they remember from lectures. One beneficial effect of note-taking is known as the encoding effect, which refers to deeper processing of information as a consequence of taking notes. This review consists of two parts. In the first part, four lines of research on the encoding effect are summarized: 1) manipulation of the lecture material, 2) manipulation of the method of note-taking, 3) the importance of individual differences, and 4) the testing procedure used in the empirical studies. This review highlights the fragmented nature of the current literature. In the second part of this review five forms of cognitive load that are induced by note-taking are distinguished. Cognitive load theory is used to integrate the divergent results in the literature. Based on the review, it is concluded that cognitive load theory provides a useful framework for future theory development and experimental work.

Albers, C. & Lakens, D. (2017). Biased sample size estimates in a-priori power analysis due to the choice of the effect size index and follow-up bias. Journal of Experimental Social Psychology. Pre-Print.

When designing a study, the planned sample size is often based on power analyses. One way to choose an effect size for power analyses is by relying on pilot data. A-priori power analyses are only accurate when the effect size estimate is accurate. In this paper we highlight two sources of bias when performing a-priori power analyses for between-subject designs based on pilot data. First, we examine how the choice of the effect size index (η², ω², and ε²) affects the sample size and power of the main study. Based on our observations, we recommend against the use of η² in a-priori power analyses. Second, we examine how the maximum sample size researchers are willing to collect in a main study (e.g., due to time or financial constraints) leads to overestimated effect size estimates in the studies that are performed. Determining the required sample size exclusively based on the effect size estimates from pilot data, and following up on pilot studies only when the sample size estimate for the main study is considered feasible, creates what we term follow-up bias. We explain how follow-up bias leads to underpowered main studies. Our simulations show that designing main studies based on effect sizes estimated from small pilot studies does not yield desired levels of power due to accuracy bias and follow-up bias, even when publication bias is not an issue. We urge researchers to consider alternative approaches to determining the sample size of their studies, and discuss several options.
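For concreteness, the three indices can be computed from the one-way ANOVA sums of squares as follows (a sketch of the standard formulas, not the authors' simulation code). ω² and ε² subtract the between-groups variability expected under the null, which is why η² is biased upward in small samples.

```python
import numpy as np

def anova_effect_sizes(*groups):
    """Return eta^2, omega^2 and epsilon^2 for a one-way between-subjects ANOVA."""
    all_data = np.concatenate(groups)
    grand_mean = all_data.mean()
    k, n_total = len(groups), len(all_data)
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = ((all_data - grand_mean) ** 2).sum()
    df_between = k - 1
    ms_within = (ss_total - ss_between) / (n_total - k)  # mean square error
    eta2 = ss_between / ss_total
    omega2 = (ss_between - df_between * ms_within) / (ss_total + ms_within)
    epsilon2 = (ss_between - df_between * ms_within) / ss_total
    return eta2, omega2, epsilon2

# Toy data: the group difference equals its null expectation exactly
g1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
g2 = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
eta2, omega2, epsilon2 = anova_effect_sizes(g1, g2)
```

In this toy example η² ≈ .11 while ω² and ε² are exactly 0, because the between-groups sum of squares equals what the error variance alone would produce: a small-sample illustration of η²'s upward bias.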

Mirabella, G., Fragola, M., Giannini, G., Modugno, N., & Lakens, D. (2017). Inhibitory control is not lateralized in Parkinson’s patients. Neuropsychologia, 102, 177–189.

Parkinson's disease (PD) is often characterized by asymmetrical symptoms, which are more prominent on the side of the body contralateral to the most extensively affected brain hemisphere. Therefore, lateralized PD presents an opportunity to examine the effects of asymmetric subcortical dopamine deficiencies on cognitive functioning. As it has been hypothesized that inhibitory control relies upon a right-lateralized pathway, we tested whether left-dominant PD (LPD) patients suffered from a more severe deficit in this key executive function than right-dominant PD (RPD) patients. To this end, via a countermanding task, we assessed both proactive and reactive inhibition in 20 LPD and 20 RPD patients, and in 20 age-matched healthy subjects. As expected, we found that PD patients were significantly more impaired in both forms of inhibitory control than healthy subjects. However, there were no differences in either reactive or proactive inhibition between LPD and RPD patients. All in all, these data support the idea that brain regions affected by PD play a fundamental role in subserving inhibitory function, but they do not support the hypothesis that this executive function is predominantly or solely computed by regions of the right hemisphere.

Delacre, M., Lakens, D., & Leys, C. (2017). Why psychologists should by default use Welch’s t-test instead of Student’s t-test. International Review of Social Psychology. Pre-Print.

When comparing two independent groups, researchers in psychology commonly use Student's t-test, which assumes normality and homogeneity of variance. When these assumptions are not met, Student's t-test can be severely biased and lead to invalid statistical inferences. Moreover, we argue that the assumption of equal variances will seldom hold in psychological research, and that choosing between Student's t-test and Welch's t-test based on the outcome of a test of the equality of variances often fails to provide an appropriate answer. We show that Welch's t-test provides better control of Type 1 error rates when the assumption of homogeneity of variance is not met, and loses little robustness compared to Student's t-test when the assumptions are met. We argue that Welch's t-test should be used as a default strategy.
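A minimal SciPy illustration of the point above: with unequal variances and unequal group sizes, the two tests can disagree markedly. `equal_var=False` selects Welch's t-test; the simulated group sizes and standard deviations below are assumptions chosen for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
small_noisy = rng.normal(loc=0.0, scale=5.0, size=20)   # small group, SD = 5
large_quiet = rng.normal(loc=0.0, scale=1.0, size=200)  # large group, SD = 1

t_student, p_student = stats.ttest_ind(small_noisy, large_quiet, equal_var=True)
t_welch, p_welch = stats.ttest_ind(small_noisy, large_quiet, equal_var=False)
# Student's pooled error term is dominated by the large, quiet group, so it
# understates the uncertainty of the small, noisy group's mean; Welch's
# standard error is larger and its p-value more conservative here.
```

Running this repeatedly under the null shows Student's test rejecting far more often than the nominal 5% in this configuration, which is the Type 1 error inflation the paper describes.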

Lakens, D. (2017). Equivalence tests: A practical primer for t-tests, correlations, and meta-analyses. Social Psychological and Personality Science. DOI: 10.1177/1948550617697177 Pre-Print. R-package TOSTER. Easy-to-use Spreadsheet. Example Vignettes.

Scientists should be able to provide support for the absence of a meaningful effect. Currently, researchers often incorrectly conclude an effect is absent based on a non-significant result. A widely recommended approach within a Frequentist framework is to test for equivalence. In equivalence tests, such as the Two One-Sided Tests (TOST) procedure discussed in this article, an upper and lower equivalence bound is specified based on the smallest effect size of interest. The TOST procedure can be used to statistically reject the presence of effects large enough to be considered worthwhile. This practical primer, with accompanying spreadsheet and R package, enables psychologists to easily perform equivalence tests (and power analyses) by setting equivalence bounds based on standardized effect sizes, and provides recommendations on how to pre-specify equivalence bounds. Extending your statistical toolkit with equivalence tests is an easy way to improve your statistical and theoretical inferences.

Lakens, D., & Etz, A. J. (2017). Too true to be bad: When sets of studies with significant and non-significant findings are probably true. Social Psychological and Personality Science. DOI: 10.1177/1948550617693058. Pre-Print.

Psychology journals rarely publish non-significant results. At the same time, it is often very unlikely (or ‘too good to be true’) that a set of studies yields exclusively significant results. Here, we use likelihood ratios to explain when sets of studies that contain a mix of significant and non-significant results are likely to be true, or ‘too true to be bad’. As we show, mixed results are not only likely to be observed in lines of research, but when observed, mixed results often provide evidence for the alternative hypothesis, given reasonable levels of statistical power and an adequately controlled low Type 1 error rate. Researchers should feel comfortable submitting such lines of research with an internal meta-analysis for publication. A better understanding of probabilities, accompanied by more realistic expectations of what real lines of studies look like, might be an important step in mitigating publication bias in the scientific literature.
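The likelihood-ratio logic above can be sketched in a few lines (an illustration of the general idea, not the authors' code; the power and alpha values are assumptions for the example): how much more likely are 2 significant results out of 3 studies if all effects are real (power = .80) than if all effects are null (alpha = .05)?

```python
from math import comb

def likelihood_ratio(k, n, power=0.80, alpha=0.05):
    """Binomial likelihood of k significant results out of n studies,
    under H1 (success prob = power) versus H0 (success prob = alpha)."""
    lik_h1 = comb(n, k) * power**k * (1 - power)**(n - k)  # all effects true
    lik_h0 = comb(n, k) * alpha**k * (1 - alpha)**(n - k)  # all effects null
    return lik_h1 / lik_h0

lr = likelihood_ratio(2, 3)  # mixed results: 2 of 3 studies significant
```

Here the mixed set is roughly 54 times more likely under the alternative than under the null, which is the sense in which mixed results can be "too true to be bad".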

Mirabella, G., Del Signore, S., Lakens, D., Averna, R., Penge, R., and Capozzi, F. (2017). Developmental coordination disorder affects the processing of action-related verbs. Frontiers in Human Neuroscience. Open Access.

Processing action-language affects the planning and execution of motor acts, which suggests that the motor system might be involved in action-language understanding. However, this claim is hotly debated. For the first time, we compared the processing of action-verbs in children with Developmental Coordination Disorder (DCD), a disorder that specifically affects the motor system, with typically developing (TD) children. We administered two versions of a go/no-go task in which verbs expressing either hand, foot, or abstract actions were presented. We found that only when the semantic content of a verb had to be retrieved did TD children show an increase in reaction times when the verb involved the same effector used to give the response. In contrast, DCD patients did not show any difference between verb categories, irrespective of the task. These findings suggest that the pathological functioning of the motor system in individuals with DCD also affects language processing.


Jostmann, N. B., Lakens, D., & Schubert, T. W. (2016). A short history of the weight-importance effect and a recommendation for pre-testing: Commentary on Ebersole et al. (2016). Journal of Experimental Social Psychology. Read.

Morey, R. D., Chambers, C. D., Etchells, P. J., Harris, C. R., Hoekstra, R., Lakens, D., Lewandowsky, S., Morey, C. C., Newman, D. P., Schönbrodt, F., Vanpaemel, W., Wagenmakers, E. J., & Zwaan, R., A. (2016). The Peer Reviewers’ Openness Initiative: Incentivising Open Research Practices through Peer Review. Royal Society Open Science, 3(1), 150547. Read.

Openness is one of the central values of science. Open scientific practices such as sharing data, materials, and analysis scripts alongside published articles have many benefits, including easier replication and extension studies, increased availability of data for theory-building and meta-analysis, and increased possibility of review and collaboration even after a paper has been published. Although modern information technology makes sharing easier than ever before, uptake of open practices has been slow. We suggest this might be in part due to a social dilemma arising from misaligned incentives, and propose a specific, concrete mechanism – reviewers withholding comprehensive review – to achieve the goal of creating the expectation of open practices as a matter of scientific principle.

Lakens, D., Hilgard, J., & Staaks, J. (2016). On the reproducibility of meta-analyses: Six practical recommendations. BMC Psychology, 4, 24. Read.

Meta-analyses play an important role in cumulative science by combining information across multiple studies and attempting to provide effect size estimates corrected for publication bias. Research on the reproducibility of meta-analyses reveals that errors are common, and the percentage of effect size calculations that cannot be reproduced is much higher than is desirable. Furthermore, the flexibility in inclusion criteria when performing a meta-analysis, combined with the many conflicting conclusions drawn by meta-analyses of the same set of studies performed by different researchers, has led some people to doubt whether meta-analyses can provide objective conclusions. The present article highlights the need to improve the reproducibility of meta-analyses to facilitate the identification of errors, allow researchers to examine the impact of subjective choices such as inclusion criteria, and update the meta-analysis after several years. Reproducibility can be improved by applying standardized reporting guidelines and sharing all meta-analytic data underlying the meta-analysis, including quotes from articles to specify how effect sizes were calculated. Pre-registration of the research protocol (which can be peer-reviewed using novel ‘registered report’ formats) can be used to distinguish a-priori analysis plans from data-driven choices, and reduce the amount of criticism after the results are known. The recommendations put forward in this article aim to improve the reproducibility of meta-analyses. In addition, they have the benefit of “future-proofing” meta-analyses by allowing the shared data to be re-analyzed as new theoretical viewpoints emerge or as novel statistical techniques are developed. Adoption of these practices will lead to increased credibility of meta-analytic conclusions, and facilitate cumulative scientific knowledge.

Lakens, D., Schubert, T. W., & Palladino, M. (in press). Social consequences of behavioral synchrony. In S. D. Obhi & E. S. Cross (Eds.), Shared representations: Sensorimotor foundations of social life. Cambridge University Press. Read.


Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251). doi: 10.1126/science.aac4716. The article and 100 data and stimulus sets are here.

Reproducibility is a defining feature of science, but the extent to which it characterizes current research is unknown. We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. Replication effects were half the magnitude of original effects, representing a substantial decline. Ninety-seven percent of original studies had statistically significant results. Thirty-six percent of replications had statistically significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.

Lakens, D. (2015). Increased depression after the MH17 crash: How convincing is the data? American Journal of Epidemiology. Read on my blog.

Lakens, D. (2015). On the challenges of drawing conclusions from p-values just below 0.05. PeerJ, 3, e1142. Read. Materials.

In recent years researchers have attempted to provide an indication of the prevalence of inflated Type 1 error rates by analyzing the distribution of p-values in the published literature. De Winter and Dodou (2015) analyzed the distribution (and its change over time) of a large number of p-values automatically extracted from abstracts in the scientific literature. They concluded there is a ‘surge of p-values between 0.041-0.049 in recent decades’ which ‘suggests (but does not prove) questionable research practices have increased over the past 25 years’. I show the changes in the ratio of fractions of p-values between 0.041-0.049 over the years are better explained by assuming the average power has decreased over time. Furthermore, I propose that their observation that p-values just below 0.05 increase more strongly than p-values above 0.05 can be explained by an increase in publication bias (or the file drawer effect) over the years (cf. Fanelli, 2012; Pautasso, 2010), which has led to a relative decrease of 'marginally significant' p-values in abstracts in the literature (instead of an increase in p-values just below 0.05). I explain why researchers analyzing large numbers of p-values need to relate their assumptions to a model of p-value distributions that takes the average power of the performed studies, the ratio of true positives to false positives in the literature, the effects of publication bias, and the Type 1 error rate (and possible mechanisms through which it has inflated) into account. Finally, I discuss why publication bias and underpowered studies might be a bigger problem for science than inflated Type 1 error rates, and explain the challenges when attempting to draw conclusions about inflated Type 1 error rates from a large heterogeneous set of p-values.
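A small simulation in the spirit of the argument above (my own sketch, not the paper's code): the shape of the p-value distribution depends on statistical power, so the frequency of p-values just below .05 is only interpretable relative to a model of the studies that produced them.

```python
import numpy as np
from scipy import stats

def p_value_fractions(effect, n, n_sims=20000, seed=0):
    """Simulate two-sided one-sample t-tests with true standardized effect
    `effect`; return (fraction of p in (0.04, 0.05], power at alpha = .05)."""
    rng = np.random.default_rng(seed)
    data = rng.normal(effect, 1.0, size=(n_sims, n))
    se = data.std(axis=1, ddof=1) / np.sqrt(n)
    t = data.mean(axis=1) / se
    p = 2 * stats.t.sf(np.abs(t), df=n - 1)
    return float(np.mean((p > 0.04) & (p <= 0.05))), float(np.mean(p <= 0.05))

frac_null, power_null = p_value_fractions(effect=0.0, n=50)  # H0: p uniform
frac_alt, power_alt = p_value_fractions(effect=0.5, n=50)    # high power
```

Under H0 about 1% of p-values fall in (0.04, 0.05] (the uniform baseline), while under high power most p-values fall far below .04, so the relative fraction just below .05 shifts with average power, publication bias, and the true/false positive ratio, as the paper argues.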

Zhang, C., Lakens, D., IJsselsteijn, W. A. (2015). The illusion of nonmediation in telecommunication: Voice intensity biases distance judgments to a communication partner. Acta Psychologica, 157, 101-105. doi:10.1016/j.actpsy.2015.02.011. For all materials and data, see

The illusion of nonmediation is an experience frequently associated with advanced media technology, such as virtual environments and teleconference systems. In this paper, we investigate whether people experience an illusion of nonmediation during interactions as simple as making a phone call. In three experiments, participants were asked to listen to someone's voice on a mobile phone (Experiment 1) or through VoIP software (Experiment 2 & 3) before guessing the location of the person, using a map. Results consistently demonstrated that louder voices were judged to be nearer, as if the technical mediation was ignored. In addition, we found that this "louder as closer" effect was stronger among more experienced users of VoIP (Experiment 2), but this moderation was not replicated in the third experiment. Implications of the results and suggestions for future research are discussed.

Santiago, J. & Lakens, D. (2015). Can conceptual congruency effects between number, time, and space be accounted for by polarity correspondence? Acta Psychologica, 156, 179-191. doi:10.1016/j.actpsy.2014.09.016

Conceptual congruency effects have been interpreted as evidence for the idea that the representations of abstract conceptual dimensions (e.g., power, affective valence, time, number, importance) rest on more concrete dimensions (e.g., space, brightness, weight). However, an alternative theoretical explanation based on the notion of polarity correspondence has recently received empirical support in the domains of valence and morality, which are related to vertical space (e.g., good things are up). In the present study we provide empirical arguments against the applicability of the polarity correspondence account to congruency effects in two conceptual domains related to lateral space: number and time. Following earlier research, we varied the polarity of the response dimension (left-right) by manipulating keyboard eccentricity. In a first experiment we successfully replicated the congruency effect between vertical and lateral space and its interaction with response eccentricity. We then examined whether this modulation of a concrete-concrete congruency effect can be extended to two types of concrete-abstract effects, those between left-right space and number (in both parity and magnitude judgment tasks), and temporal reference. In all three tasks response eccentricity failed to modulate the congruency effects. We conclude that polarity correspondence does not provide an adequate explanation of conceptual congruency effects in the domains of number and time.

Smits, T., Lakens, D., Ritchie, S. J., & Laws, K. R. (2015). Correcting errors in Turkington et al. (2014): Taking criticism seriously. Journal of Nervous and Mental Disease, 203, 302-303. doi: 10.1097/NMD.0000000000000278.

Our continued effort to correct a table with effect sizes and their confidence intervals in Turkington et al. (2014). We now provide the correct statistics based on the original data, and discuss the importance of taking criticism seriously if we want science to be self-correcting.

Lakens, D. (2015). What p-hacking really looks like: A comment on Masicampo & Lalande (2012). Quarterly Journal of Experimental Psychology, 68, 829-832. doi: 10.1080/17470218.2014.982664. For a pre-print of the article, supplementary materials, and R scripts, see

In this comment, I take a critical look at the idea that Masicampo and Lalande provide support for a prevalence of p-values just below .05. Although I do not doubt the presence of p-hacking in the literature, I do not think M&L provide convincing support for the idea that it can be observed in a large heterogeneous set of studies, nor that it should yield a large number of p-values just below .05 (i.e., in the 0.045-0.05 range). I provide a better model of the p-value distribution in the literature, and show how it leads to a (regrettably) often invisible increase in p-values from .00 to .05.


Evers, R. K. & Lakens, D. (2014). Revisiting Tversky’s Diagnosticity Principle. Frontiers in Psychology, 5:875. doi: 10.3389/fpsyg.2014.00875. The article is Open Access and available here. For all materials and data, see

Similarity is a fundamental concept in cognition. In 1977 Amos Tversky published a highly influential feature-based model of how people judge the similarity between objects. The model highlights the context-dependence of similarity judgments, and challenged geometric models of similarity. One of the context-dependent effects Tversky describes is the diagnosticity principle. The diagnosticity principle determines which features are used to cluster multiple objects into subgroups. Perceived similarity between items within clusters is expected to increase, while similarity between items in different clusters decreases. Here, we present two pre-registered replications of the studies on the diagnosticity effect reported in Tversky (1977). Additionally, we examine one alternative mechanism that has been proposed to play a role in the original studies: an increase in the choice for distractor items (a substitution effect; see Medin, Goldstone, & Markman, 1995). Our results replicate those found by Tversky (1977), with an average diagnosticity effect of 4.75%. However, when we eliminate the possibility of substitution effects confounding the results, the data provide no indication of any remaining effect of diagnosticity (0.41%, n.s.).

Lakens, D. (2014). Grounding social embodiment. Social Cognition, 32, 168-183, DOI: 10.1521/soco.2014.32.supp.168. Download.

Social embodiment research examines how thoughts, affect, and behavior are influenced by sensory, motor, and perceptual cues in the environment. It has repeatedly received criticism due to a focus on demonstration studies. Here, I aim to identify some of the possible reasons underlying the lack of theoretical progress. First, I warn against relying too strongly on inductive inferences due to the weak empirical support for social embodiment findings. Second, I discuss two dominant theoretical frameworks in social embodiment research (conceptual metaphor theory and perceptual symbol systems theory) in light of their potential to inspire empirically testable hypotheses. Finally, I propose that one way to turn social embodiment research into a progressive research line is to integrate it more firmly with past theoretical work in social cognition, and to focus on understanding the contexts in which concrete cues in the environment are salient and accessible enough to influence social inferences.

Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses. European Journal of Social Psychology, 44, 701-710. DOI: 10.1002/ejsp.2023. Download.

Designing an experiment with high statistical power is a practical challenge when effect size estimates are inaccurate, as they often are in psychology. This challenge can be addressed by performing sequential analyses while data collection is still in progress. At an interim analysis, data collection can be stopped whenever the results are convincing enough to conclude an effect is present, more data can be collected, or the study can be terminated whenever it is extremely unlikely the predicted effect would be observed if data collection were continued. Such interim analyses can be performed while controlling the Type 1 error rate. Sequential analyses can greatly improve the efficiency with which data are collected. Additional flexibility is provided by adaptive designs, where sample sizes are increased based on the observed effect size. The need for pre-registration, ways to prevent experimenter bias, and a comparison between Bayesian approaches and NHST are discussed. Sequential analyses, which are widely used in large-scale medical trials, provide an efficient way to perform high-powered, informative experiments. I hope this introduction will provide a practical primer that allows researchers to incorporate sequential analyses in their research.
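An illustrative two-look sequential design (a sketch under assumptions of my own, not the article's code). With a Pocock correction for two looks, each analysis is tested at alpha = .0294, which keeps the overall Type 1 error rate near .05 despite peeking at the data partway through.

```python
import numpy as np
from scipy import stats

def sequential_study(rng, effect=0.0, looks=(50, 100), alpha_look=0.0294):
    """Stop at the first (interim or final) look with p <= alpha_look."""
    data = rng.normal(effect, 1.0, size=max(looks))
    for n in looks:
        if stats.ttest_1samp(data[:n], 0.0).pvalue <= alpha_look:
            return True, n   # effect declared present after n observations
    return False, max(looks)

# Under H0 the overall rejection rate stays close to the nominal .05 level,
# even though the data are inspected twice.
rng = np.random.default_rng(7)
type1 = float(np.mean([sequential_study(rng)[0] for _ in range(10000)]))
```

Testing each look at an unadjusted .05 instead would inflate the overall error rate well above .05, which is exactly what the boundary correction prevents.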

Nosek, B. A., & Lakens, D. (2014). Registered reports: A method to increase the credibility of published results. Social Psychology, 43, 137-141. DOI: 10.1027/1864-9335/a000192. This article is Open Access and available here.

Ignoring replications and negative results is bad for science.  This special issue presents a novel publishing format – Registered Reports – as a partial solution.  Peer review occurs prior to data collection, design and analysis plans are preregistered, and results are reported regardless of outcome. Fourteen Registered Reports of replications of important published results in social psychology are reported with strong confirmatory tests. Further, the articles demonstrate open science practices such as open data, open materials, and disclosure of research process, conflicts of interest, and contributions. The credibility of published science will increase with cultural shifts that accept replications and negative results as viable research outcomes, and when transparency and reproducibility are part of standard research practice. 

Lakens, D. & Evers, E. (2014). Sailing from the seas of chaos into the corridor of stability: Practical recommendations to increase the informational value of studies. Perspectives on Psychological Science, 9, 278-292. DOI: 10.1177/1745691614528520. The article is Open Access and available here. The supplementary material is available from:

Recent events have led psychologists to acknowledge that the inherent uncertainty encapsulated in an inductive science is amplified by problematic research practices. This article provides a practical introduction to recently developed statistical tools that can be used to deal with these uncertainties when performing and evaluating research. In part 1, we discuss the importance of accurate and stable effect size estimates, and how to design studies to reach a corridor of stability around effect size estimates. In part 2, we explain how, given uncertain effect size estimates, well-powered studies can be designed using sequential analyses. In part 3, we explain what p-values convey about the likelihood that an effect is true, illustrate how the v statistic can be used to evaluate the accuracy of individual studies, and how the evidential value of multiple studies can be examined using a p-curve analysis. We end by discussing the consequences of incorporating our recommendations in terms of a reduced quantity, but increased quality of the research output. We hope the practical recommendations discussed in this article will provide researchers with the tools to make important steps towards a psychological science that allows us to differentiate between all possible truths based on their likelihood.

Smits, T., Lakens, D., Ritchie, S. J., & Laws, K. R. (2014). Statistical errors and omissions in a trial of cognitive behavior techniques for psychosis: Commentary on Turkington et al. Journal of Nervous and Mental Disease, 202, 566. doi: 10.1097/NMD.0000000000000161. You can download a pre-print HERE. See the authors' response HERE.

Lakens, D. (2014). De waarschijnlijkheid van observaties (The probability of observations). Algemeen Nederlands Tijdschrift voor Wijsbegeerte, 106, 49-53. DOI: 10.5117/ANTW2014.1.LAKE.


Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4:863. doi:10.3389/fpsyg.2013.00863. This article is Open Access, and you can download it here. The spreadsheet that you can use to calculate effect sizes can be downloaded from:

Effect sizes are the most important outcome of empirical studies. Most articles on effect sizes highlight their importance to communicate the practical significance of results. For scientists themselves, effect sizes are most useful because they facilitate cumulative science. Effect sizes can be used to determine the sample size for follow-up studies, or to examine effects across studies. This article aims to provide a practical primer on how to calculate and report effect sizes for t-tests and ANOVAs such that effect sizes can be used in a priori power analyses and meta-analyses. Whereas many articles about effect sizes focus on between-subjects designs and address within-subjects designs only briefly, I provide a detailed overview of the similarities and differences between within- and between-subjects designs. I suggest that some research questions in experimental psychology examine inherently intra-individual effects, which makes effect sizes that incorporate the correlation between measures the best summary of the results. Finally, a supplementary spreadsheet is provided to make it as easy as possible for researchers to incorporate effect size calculations into their workflow.
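As a minimal illustration of the within-subjects point, Cohen's d_z standardizes the mean difference by the standard deviation of the difference scores, which shrinks (so d_z grows) as the correlation between the paired measures increases. The numbers below are invented for illustration; the supplementary spreadsheet accompanying the article is the intended tool for real analyses:

```python
import statistics

def cohens_dz(pre, post):
    """Paired-design Cohen's d_z: mean difference / SD of the differences."""
    diffs = [b - a for a, b in zip(pre, post)]
    return statistics.mean(diffs) / statistics.stdev(diffs)

def cohens_ds(g1, g2):
    """Between-subjects Cohen's d_s with a pooled SD, shown for contrast."""
    n1, n2 = len(g1), len(g2)
    v1, v2 = statistics.variance(g1), statistics.variance(g2)
    pooled = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.mean(g2) - statistics.mean(g1)) / pooled

# Hypothetical paired scores; pre and post are strongly correlated
pre  = [10, 12, 11, 14, 13, 12, 11, 13]
post = [12, 13, 13, 16, 14, 14, 12, 15]

print(round(cohens_dz(pre, post), 2))  # → 3.14 (differences are very consistent)
print(round(cohens_ds(pre, post), 2))  # → 1.2  (smaller: ignores the pairing)
```

The same raw data yield very different standardized effects depending on whether the correlation between measures enters the standardizer, which is why the choice of effect size matters for power analysis and meta-analysis.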

Lakens, D., Fockenberg, D. A., Lemmens, K. P. H., Ham, J. & Midden, C. J. H. (2013). The evaluation of affective pictures depends on their brightness. Cognition & Emotion, 27, 1225-1246. DOI:10.1080/02699931.2013.781501 Download article. All data reported in this article can be downloaded from:

We explored the possibility of a general brightness bias: brighter pictures are evaluated more positively, while darker pictures are evaluated more negatively. In Study 1 we found that positive pictures are brighter than negative pictures in two affective picture databases (the IAPS and the GAPED). Study 2 revealed that because researchers select affective pictures on the extremity of their affective rating without controlling for brightness differences, pictures used in positive conditions of experiments were on average brighter than those used in negative conditions. Going beyond correlational support for our hypothesis, Studies 3 and 4 showed that brighter versions of neutral pictures were evaluated more positively than darker versions of the same picture. Study 5 revealed that people categorized positive words more quickly than negative words after a bright prime picture, and vice versa for negative pictures. Together, these studies provide strong support for the hypothesis that picture brightness influences evaluations.

Lakens, D. (2013). Using a smartphone to measure heart rate changes during relived happiness and anger. IEEE Transactions on Affective Computing, 4, 238-241. DOI:10.1109/T-AFFC.2013.3. You can download the article, the data, and the exercise for first-year students that was the basis for this article. 
This study demonstrates the feasibility of measuring heart rate differences associated with emotional states such as anger and happiness with a smartphone. Novice experimenters measured higher heart rates during relived anger and happiness (replicating findings in the literature) outside a laboratory environment with a smartphone app that relied on photoplethysmography.

Open Science Collaboration. (2013). The Reproducibility Project: A model of large-scale collaboration for empirical research on reproducibility. In V. Stodden, F. Leisch, & R. Peng (Eds.), Implementing Reproducible Computational Research (A Volume in The R Series). NY, NY: Taylor & Francis.

Fockenberg, D. A., Koole, S. L., Lakens, D., & Semin, G. R. (2013). Shifting Evaluation Windows: Predictable Forward Primes with Long SOAs Eliminate the Impact of Backward Primes. PLoS ONE, 8(1): e54739. doi:10.1371/journal.pone.0054739. Download.

Recent work suggests that people evaluate target stimuli within short and flexible time periods called evaluation windows. Stimuli that briefly precede a target (forward primes) or briefly succeed a target (backward primes) are often included in the target’s evaluation. In this article, the authors propose that predictable forward primes act as “go” signals that prepare target processing, such that earlier forward primes pull the evaluation windows forward in time. Earlier forward primes may thus reduce the impact of backward primes. This shifting evaluation windows hypothesis was tested in two experiments using an evaluative decision task with predictable (vs. unpredictable) forward and backward primes. As expected, a longer time interval between a predictable forward prime and a target eliminated backward priming. In contrast, the time interval between an unpredictable forward prime and a target had no effect on backward priming. These findings suggest that predictable features of dynamic stimuli can shape target extraction by determining which information is included (or excluded) in rapid evaluation processes.


Koole, S. L., & Lakens, D. (2012). Rewarding replications: A sure and simple way to improve psychological science. Perspectives on Psychological Science, 7, 608-614. doi: 10.1177/1745691612462586. Download.

Although replications are vital to scientific progress, psychologists rarely engage in systematic replication efforts. The present article considers psychologists’ narrative approach to scientific publications as an underlying reason for this neglect, and proposes an incentive structure for replications within psychology. First, researchers need accessible outlets for publishing replications. To accomplish this, psychology journals could publish replication reports, in files that are electronically linked to reports of the original research. Second, replications should get cited. This can be achieved by co-citing replications along with original research reports. Third, replications should become a valued collaborative effort. This can be realized by incorporating replications in teaching programs and by stimulating adversarial collaborations. The proposed incentive structure for replications can be developed in a relatively simple and cost-effective manner. By promoting replications, this incentive structure may greatly enhance the dependability of psychology’s knowledge base.

Open Science Collaboration. (2012). An open, large-scale, collaborative effort to estimate the reproducibility of psychological science. Perspectives on Psychological Science, 7, 657-660. doi: 10.1177/1745691612462588 Download.   

Reproducibility is a defining feature of science.  However, because of strong incentives for innovation and weak incentives for confirmation, direct replication is rarely practiced or published.  The Reproducibility Project is an open, large-scale, collaborative effort to systematically examine the rate and predictors of reproducibility in psychological science. So far, 72 volunteer researchers from 41 institutions have organized to openly and transparently replicate studies published in three prominent psychological journals from 2008. Multiple methods will be used to evaluate the findings, calculate an empirical rate of replication, and investigate factors that predict reproducibility. Whatever the result, a better understanding of reproducibility will ultimately improve confidence in scientific methodology and findings.

Lakens, D., Haans, A. & Koole, S. L. (2012). Één onderzoek is géén onderzoek: Het belang van replicaties voor de psychologische wetenschap (One study is no study: The importance of replication for psychological science). De Psycholoog, September, 10-18.  Read

Recent criticisms on the way psychologists analyze their data, as well as cases of scientific fraud, have led both researchers and the general public to question the reliability of psychological research. At the same time, researchers have an excellent tool at their disposal to guarantee the robustness of scientific findings: replication studies. Why do researchers rarely perform replication studies? We explain why p-values for single studies fail to provide any indication of whether observed effects are real or not. Only cumulative science, where important effects are demonstrated repeatedly, is able to address the challenge to guarantee the reliability of psychological findings. We highlight some novel initiatives, such as the Open Science Framework, that aim to underline the importance of replication studies.

Lakens, D., Semin, G. R., & Foroni, F. (2012). But for the bad, there would not be good: Grounding valence in brightness through structural similarity. Journal of Experimental Psychology: General, 141, 584-594. doi: 10.1037/a0026468. Download.

Light and dark are used pervasively to represent positive and negative concepts. Recent studies suggest that black and white stimuli are automatically associated with negativity and positivity. However, structural factors in experimental designs, such as the shared opposition in the valence (good vs. bad) and brightness (light vs. dark) dimensions might play an important role in the valence-brightness association. In six experiments, we show that while black ideographs are consistently judged to represent negative words, white ideographs represent positivity only when the negativity of black is co-activated. The positivity of white emerged only when brightness and valence were manipulated within participants (but not between participants), or when the negativity of black was perceptually activated by presenting positive and white stimuli against a black (vs. grey) background. These findings add to an emerging literature on how structural overlap between dimensions creates associations, and highlight the inherently contextualized construction of meaning structures.

Lakens, D. (2012). Polarity Correspondence in Metaphor Congruency Effects: Structural Overlap Predicts Categorization Times for Bi-Polar Concepts Presented in Vertical Space. Journal of Experimental Psychology: Learning, Memory, and Cognition, 38, 726-736. doi: 10.1037/a0024955. Download.

This article contains direct replications of Meier, B. P., & Robinson, M. D. (2004). Why the sunny side is up: Associations between affect and vertical position. Psychological Science, 15, 243–247. doi:10.1111/j.0956-7976.2004.00659.x, and Meier, B. P., Sellbom, M., & Wygant, D. B. (2007). Failing to take the moral high ground: Psychopathy and the vertical representation of morality. Personality and Individual Differences, 43, 757–767. doi:10.1016/j.paid.2007.02.001

Previous research has shown that words presented on metaphor congruent locations (e.g., positive words UP on the screen and negative words DOWN on the screen) are categorized faster than words presented on metaphor incongruent locations (e.g., positive words DOWN and negative words UP). These findings have been explained in terms of an interference effect: The meaning associated with UP and DOWN vertical space can automatically interfere with the categorization of words with a metaphorically incongruent meaning. The current studies test an alternative explanation for the interaction between the vertical position of abstract concepts and the speed with which these stimuli are categorized. Research on polarity differences (basic asymmetries in the way dimensions are processed) predicts that +polar endpoints of dimensions (e.g., positive, moral, UP) are categorized faster than –polar endpoints of dimensions (e.g., negative, immoral, DOWN). Furthermore, the polarity correspondence principle predicts that stimuli where polarities correspond (e.g., positive words presented UP) provide an additional processing benefit compared to stimuli where polarities do not correspond (e.g., negative words presented UP). A meta-analysis (Study 1) shows that a polarity account provides a better explanation of reaction time patterns in previous studies than an interference explanation. An experiment (Study 2) reveals that controlling for the polarity benefit of +polar words compared to –polar words removed not only the main effect of word polarity, but also the interaction between word meaning and vertical position due to polarity correspondence. These results reveal that metaphor congruency effects should not be interpreted as automatic associations between vertical locations and word meaning, but are more parsimoniously explained by their structural overlap in polarities.


Lakens, D. (2011). Orange as a perceptual representation of the Dutch nation: Effects on perceived national identification and color evaluation. European Journal of Social Psychology, 41, 924-929.  DOI: 10.1002/ejsp.848. Read.

Although it is generally accepted that colors carry meaning, experimental research about individual, situational and cultural differences in the meaning of colors is scarce. The current research examines whether the Dutch national color functions as a perceptual representation of The Netherlands. A person dressed in orange clothing was judged to identify more with his nation compared to the same person dressed in blue (Study 1). When national identification was salient, such as during (vs. before/after) the European soccer championship, or when participants recalled an experience in which they identified (vs. did not identify) with The Netherlands, people were more aware of the use of the color orange as a perceptual representation of The Netherlands, and evaluated orange more positively (Studies 2 and 3). Furthermore, orange evaluations correlated with self-reported national identification. These results support the hypothesis that national colors carry psychological meaning, which can influence person perception and color evaluations.

Lakens, D., Schneider, I. K., Jostmann, N. B., & Schubert, T. W. (2011). Telling things apart: The distance between response keys influences categorization times. Psychological Science, 22, 887-890. DOI:10.1177/0956797611412391. Download or Read.

For a direct replication, see Proctor, R. W., & Chen, J. (in press). Dissociating Influences of Key and Hand Separation on the Stroop Color-Identification Effect. Acta Psychologica.

People use spatial distance to talk and think about differences between concepts. Using space to think about different categories is argued to scaffold the categorization process. In the current study we investigated the possibility that the distance between response keys can influence categorization times in binary classification tasks. In line with the hypothesis that the distance between response keys can facilitate response selection in a key-press version of the Stroop task, we found that the Stroop interference effect was significantly reduced when participants performed a Stroop task with response keys far apart, compared to when participants performed a Stroop task with response keys located close together. These results support the assumption that the spatial structuring of response options facilitates categorizations that require cognitive effort, and that people can incorporate environmental structures such as spatial distance in their thought processes. Keeping your hands apart might actually help to keep things apart.

Schneider, I. K., Rutjens, B., Jostmann, N. B., & Lakens, D. (2011). Weighty matters: Importance literally feels heavy. Social Psychological and Personality Science, 2, 474-478. DOI: 10.1177/1948550610397895.

Previous work showed that concrete experiences of weight influence people’s judgments of how important certain issues are. In line with an embodied simulation account, but contrary to a metaphor enriched perspective, this work shows that the perceived importance of an object influences perceptions of weight. Two studies manipulated information about a book’s importance after which participants estimated its weight. Importance information caused participants to perceive the book to be heavier. This was not merely a semantic association, as weight perceptions were only affected when participants physically held the book. Furthermore, importance information influenced weight perceptions but not perceptions of monetary value. These findings extend previous research by showing that the activation direction from weight to importance can be reversed, suggesting that the connection between importance and weight goes beyond metaphorical mappings. Implications for the debate on interpretation of findings on the interplay between bodily states and abstract information processing are discussed.

Lakens, D., Semin, G. R., & Foroni, F. (2011). Why your highness needs the people: Comparing the absolute and relative representation of power in vertical space. Social Psychology, 42, 205-213. DOI: 10.1027/1864-9335/a000064. Download.

Earlier research (Schubert, 2005) has shown that power is represented in vertical space: powerful = up and powerless = down. We propose that power is not simply structured in space in absolute terms, but that relational differences in power moderate the vertical representation of the powerful above the powerless. Two studies reveal that when power differences are present (vs. absent), the vertical representation of power increases reliably. Power-related words were positioned higher in vertical space (Experiments 1A and 1B), and the more powerful of two Chinese ideographs was placed in the upper position above guessing average (Experiments 2A and 2B) when power was manipulated within rather than between participants in an experimental task. These studies support the view that power relations constitute an important aspect of the vertical representation of power.

Lakens, D. (2011). High Skies and Oceans Deep: Polarity Benefits or Mental Simulation? Frontiers in Psychology, 2, 21. DOI: 10.3389/fpsyg.2011.00021. Read. See also the original article by Pecher et al (2010), the commentary by Van Dantzig & Pecher (2011), and a response by Louwerse (2011).

Pecher, Van Dantzig, Boot, Zanolie, and Huber (2010) presented targets (e.g., helicopter, submarine) up and down on a computer screen. Participants were either asked to indicate whether these objects were typically found in the ocean, or typically found in the sky. The authors examined whether congruency effects between the vertical position of words and their meaning were best accounted for by mental simulations or by polarity benefits (default asymmetries in the way people process dimensions). I believe their conclusion that polarity benefits cannot account for the interaction in reaction times between the meaning and the position of words is at best premature. Moreover, instead of explaining language understanding in terms of either simulation processes or linguistic input, a more fruitful approach might be to examine when meaning emerges from simulation processes, and when meaning is extracted from linguistic information (see Louwerse & Jeuniaux, 2010).

Lakens, D., Semin, G. R., & Garrido, M. (2011). The sound of time: Cross-modal convergence in the spatial structuring of time. Consciousness and Cognition, 20, 437-443. doi: 10.1016/j.concog.2010.09.020. Download PDF or Read.

For a replication of this study (but without the neutral words) see my thesis.

In a new integration, we show that the visual-spatial structuring of time converges with auditory-spatial left-right judgments for time-related words. In Experiment 1, participants placed past- and future-related words respectively to the left and right of the midpoint on a horizontal line, reproducing earlier findings. In Experiment 2, neutral and time-related words were presented over headphones. Participants were asked to indicate whether words were louder on the left or right channel. On critical experimental trials, words were presented equally loud binaurally. As predicted, participants judged future words to be louder on the right channel more often than past-related words. Furthermore, there was a significant cross-modal overlap between the visual-spatial ordering (Experiment 1) and the auditory judgments (Experiment 2), which were continuously related. These findings provide support for the assumption that space and time have certain invariant properties that share a common structure across modalities.

Van Dillen, L. F., Lakens, D., & Van Den Bos, K. (2011). At face value: Categorization goals modulate vigilance for angry faces. Journal of Experimental Social Psychology, 47, 235-240. doi: 10.1016/j.jesp.2010.10.002. Download PDF or Read.

The present research demonstrates that the attention bias to angry faces is modulated by how people categorize these faces. Since facial expressions contain psychologically meaningful information for social categorizations (i.e., gender, personality) but not for nonsocial categorizations (i.e., eye-color), angry facial expressions should especially capture attention during social categorization tasks. Indeed, in three studies, participants were slower to name the gender of angry compared to happy or neutral faces, but not their color (blue or green; Study 1) or eye-color (blue or brown; Study 2). Furthermore, when different eye-colors were linked to a personality trait (introversion, extraversion) versus sensitivity to light frequencies (high, low), angry faces only slowed down categorizations when eye-color was indicative of a social characteristic (Study 3). Thus, vigilance for angry facial expressions is contingent on people’s categorization goals, supporting the perspective that even basic attentional processes are moderated by social influences. 

Lakens, D., & Stel, M. (2011). If they move in sync, they must feel in sync: Movement synchrony leads to attributed feelings of rapport. Social Cognition, 29, 1-14. doi: 10.1521/soco.2011.29.1.1. Read.

Coordinated behavior patterns are one of the pillars of social interaction. Researchers have recently shown that movement synchrony influences ratings of rapport, and the extent to which groups are judged to be a unit. The current experiments investigated the hypothesis that observers infer a shared psychological state from synchronized movement rhythms, influencing attributions of rapport and entitativity judgments. Movement rhythms of observed individuals are manipulated between participants (Experiment 1) or kept constant while the source of the emerging movement synchrony is manipulated (Experiment 2), and both rapport and perceived entitativity are measured. The findings support the assumption that movement synchrony increases attributed rapport and perceived entitativity. Furthermore, mediational analyses reveal that the effects of movement synchrony on perceived unity are not purely perceptual in nature, but caused by psychological inferences. Observers infer the degree to which individuals are a social unit from their movement rhythms.


Lakens, D., & Ruys, K. I. (2010). The dynamic interaction of conceptual and embodied knowledge. Behavioral and Brain Sciences, 33, 449-450.

We propose the SIMS model can be strengthened by detailing the dynamic interaction between sensorimotor activation and contextual conceptual information. Rapidly activated evaluations and contextual knowledge can guide and constrain embodied simulations. In addition, we stress the potential importance of extending the SIMS model to dynamic social interactions that go beyond the passive observer.

Lakens, D. (2010). Movement synchrony and perceived entitativity. Journal of Experimental Social Psychology, 46, 701-708. doi: 10.1016/j.jesp.2010.03.015. Download PDF or Read. Download the stick figure materials used in Study 1-3 here. For the movie clips used in Study 4, send me an e-mail.

For a replication and extension, see Lakens & Stel (2011), above.

Movement synchrony has been theoretically linked to the emergence of a social unit. To empirically investigate whether similar movement rhythms are an antecedent of perceived entitativity, movement rhythms were experimentally manipulated in four studies. Using this novel approach, stick figures waving in synchrony were found to be rated higher on entitativity than stick figures waving in different rhythms (Study 1), and this effect was extended to interactional synchrony, where different movements are performed in the same rhythm (Study 2). Objective differences in movement rhythms are linearly related to ratings of perceived entitativity, and this relationship is partially mediated by the subjectively perceived similarity of movement rhythms (Study 3). These results also held for entitativity judgments for videotaped individuals waving rhythmically (Study 4). These results support the hypothesis that movement rhythms are an important source of information which observers use to infer the extent to which individuals are a social unit.


Jostmann, N., Lakens, D., & Schubert, T. W. (2009). Weight as an Embodiment of Importance. Psychological Science, 20, 1169-1174. Download PDF or Read.

Four studies show that the abstract concept of importance is grounded in bodily experiences of weight. Participants provided judgments of importance while they held either a heavy or a light clipboard. Holding a heavy clipboard increased judgments of monetary value (Study 1) and made participants consider fair decision-making procedures to be more important (Study 2). It also caused more elaborate thinking, as indicated by higher consistency between related judgments (Study 3) and by greater polarization of agreement ratings for strong versus weak arguments (Study 4). In line with an embodied perspective on cognition, these findings suggest that, much as weight makes people invest more physical effort in dealing with concrete objects, it also makes people invest more cognitive effort in dealing with abstract issues.