What kind of delusions are these, that you could, damn it, 'write things off your chest' in a poem?

Jean Pierre Rawie - No Second Troy

Moved Blog

posted May 26, 2014, 10:57 PM by Daniel Lakens   [ updated May 26, 2014, 10:57 PM ]

I've moved my blog. Old posts will remain here, but if you want to read up on more recent blogs, see:

Characteristics of Reliable Research in Social Sciences (Hanson, 1958).

posted Apr 22, 2014, 12:07 PM by Daniel Lakens   [ updated Apr 26, 2014, 11:19 PM ]

The question of when research is likely to be true is interesting, relevant, and challenging for any researcher. This is as true today, when researchers in psychology are working on the Reproducibility Project, as it was in 1958. In that year, Robert C. Hanson examined this question by looking at 60 replication articles (several consisting of more than a single replication study), and analyzed whether they successfully or unsuccessfully replicated the original 60 studies, as a function of a range of characteristics of the original study. You might call it a Reproducibility Project avant la lettre. You can read the paper here (free if you register a JSTOR account). It's cited 13 times, according to Google Scholar. Give me a chance to correct this historical oversight in this blog post.

It is noteworthy that he found 44 replications that did not confirm the hypothesis of the original study. Given current-day concerns that replications should be published more frequently (e.g., Koole & Lakens, 2012) and the disappearance of non-significant results (e.g., Fanelli, 2010), Hanson seemed to have more research to work with in his analyses. He made seven observations which are just as relevant today as they were more than half a century ago. It’s only one of the reasons I love reading older articles, and this one is without a doubt a hidden gem that should be known more widely.

1)      Original propositions advanced with relevant evidence are more frequently confirmed in independent tests than propositions advanced lacking relevant evidence. This might seem too trivial to mention – obviously any claim (or proposition) that is supported by data is more likely to be replicated than statements that lack data. What Hanson has noticed is nevertheless still relevant today: Authors sometimes make statements that are not supported by data, but assumed to be an underlying mechanism. An example that comes to my mind is the idea that primed concepts make a construct more accessible, and the increased accessibility subsequently influences behavior. Authors might conclude that a prime has influenced the accessibility of a construct (and indeed, primes often do influence the accessibility of constructs), but if this is not demonstrated, the authors advance a proposition lacking relevant evidence. I would like to add another example of findings that I think might fit under this category, namely studies that predict a crossover interaction consisting of two simple effects, where authors observe a significant interaction, but with only one significant simple effect (while the other is not significant), and interpret the data as support for their hypothesis (sometimes both simple effects are only marginally significant, or not significant at all). This happens more than you’d like, and I believe this is also a situation where propositions are advanced while evidence is lacking (for an example where such crossover interactions that never yielded 2 significant simple effects seem to provide better support for an alternative hypothesis, see Lakens, 2012).

2)      Original propositions based on a large amount of evidence are more frequently confirmed in independent tests than propositions based on a small amount of evidence. This one, we know. More data is more reliable (see Lakens & Evers, 2014, for an accessible introduction to why, and for an explanation of how to calculate the v-statistic by Davis-Stober and Dana, 2014, which can tell you when you have too little data to beat random guessing in your conclusions). What I like is that Hanson presents this fact as an empirical reality. Nowadays, it would be impossible not to follow such a statement with the (hopefully well-understood) statistical fact that small studies are underpowered (Cohen, 1962, or Cohen, 1988). Note that by ‘small’ Hanson means studies with fewer than one hundred units of observation. If we assume between-subject comparisons, that is a fair classification.

3)      Source of data. Hanson’s article is published in the American Journal of Sociology. Here, he distinguishes between ‘given data’, or data already existing in databases (e.g., marriage license information), ‘contrived data’ (questionnaires, paper and pencil tests) and ‘observed data’, such as field notes. Although Hanson did not have an a priori hypothesis, an interesting pattern was that contrived data were most reliable, followed by given data, followed by observed data. I found this interesting. It’s almost as if given data, especially if it can be accessed without too much effort, affords an easy way to test a hypothesis, but if 20 people test a hypothesis on an easily available dataset, there is a higher risk of Type 1 errors.

4)      Initial organization of data. He refers to data that is more or less precise, for each of the categories under point 3. For example, in addition to field notes, observations can be collected in a structured manner in the lab. Data that is already organized is more reliable. It's a slightly less clear and thus less interesting point, I think.

5)      Original propositions based on data collected under a systematic rule of selection are more frequently confirmed in independent tests than propositions based on data collected under a non-systematic selection procedure. Under ‘systematic’ selection rules, Hanson categorizes samples that were representative samples from the population. Non-systematic selection rules involve studies with convenience samples, ‘typically in the use of subjects available in college classes or in the local community.’ There might be confounds here, such as the type of research question that you would address in huge representative samples, and the questions you try to address in studies with college students, which are less risky to run. That is, this might be due to the prior probability that the examined effect is true (the lower, the more likely published findings are Type 1 errors, see Lakens & Evers, 2014). Still interesting, and deserves to be explored more, given our huge reliance on convenience samples in psychology.

6)      Original propositions formulated as a result of quantitative analysis of data are more frequently confirmed in independent tests than propositions formulated as a result of qualitative analysis of data. Quantitative data, with test statistics, or qualitative data with numbers (!), were more likely to replicate than qualitative data without numbers.

7)      Original propositions advanced with explicit confirmation criteria are more frequently confirmed in independent tests than propositions advanced without explicit confirmation criteria. The question here is whether the results can be expected to generalize, either because all examined instances show the proposed relation with no contradictory evidence, or (more likely) because a statistical technique is used to reject a null hypothesis at the 5 percent level of significance. Such studies are more likely to replicate (over 70%) while studies without such criteria were less likely to replicate (only 46%). This is a great reminder that you can criticize null-hypothesis significance testing all you want, and we can definitely make some improvements, but not using significance testing led to many more conclusions that were not reliable.


Overall, I think these conclusions are interesting to examine in more detail, or even replicate (!), for example in the Reproducibility Project. They might not be too surprising, but worth keeping in mind when you evaluate the likelihood that published research is true. 

Why Can’t We All Just Get Along? Comment on Cumming (2014) and Morey et al. (2014)

posted Mar 19, 2014, 1:34 AM by Daniel Lakens   [ updated Apr 26, 2014, 11:19 PM ]

The blog post below just passed initial review as a comment submitted to a journal. The e-mail that notified me of this decision stated I will learn the results of the review process in 6 to 8 weeks. I know that’s relatively fast, but Twitter and blog posts have spoiled me. The internet reminds me every day that my goal is to communicate with my fellow researchers, and published journals are only one way, and not necessarily the best way, to do this. It might still appear in a journal, or not, it might be improved by peer review, or not, and you may like it as it is, or not (for comments, talk to me @Lakens).


Why Can’t We All Just Get Along? Comment on Cumming (2014) and Morey et al. (2014)

Readers are likely more familiar with articles that criticize null hypothesis significance testing (NHST) than with articles in support of NHST (e.g., Frick, 1996; Mogie, 2004; Wainer & Robinson, 2003). Articles that question the status quo are bound to receive more attention than more nuanced calls for a unified approach to statistical inferences (e.g., Berger, 2003). This paints a biased picture of disagreement, with a focus on those aspects of statistical techniques in which one approach outperforms another, instead of stressing the relative benefits of using multiple procedures, and teaching individuals how to improve the inferences they draw. For example, two major criticisms against NHST (that the null is never true, and that NHST promotes dichotomous thinking) are easily solved by acknowledging that, even though an effect is often trivially small, it is never exactly 0. Therefore, a statistical test has three possible interpretations (e.g., Jones & Tukey, 2000) by indicating a positive difference, a negative difference, or by indicating the direction of the effect remains undetermined:

1.      µ1 - µ2 > 0

2.      µ1 - µ2 < 0

3.      µ1 - µ2 is undetermined

Other examples of how NHST can be improved are testing hypotheses against minimal (instead of null) effects (see Murphy & Myors, 1999), or using sequential analyses to repeatedly analyze accumulating data (while controlling Type 1 error rates) until the results are sufficiently informative (see Lakens, in press). The lack of attention for such straightforward improvements is problematic, especially since neither confidence intervals nor Bayesian statistics provide foolproof alternatives.

Confidence Intervals

Researchers should always report confidence intervals (CIs). As Kelley and Rausch (2006) explain, it is misleading to report point estimates without illustrating the uncertainty surrounding the parameter estimate. However, the information expressed by a CI is perhaps even less intuitive than the use of conditional probabilities such as p-values, and might even be more widely misunderstood (see Hoekstra, Morey, Rouder, & Wagenmakers, in press).

As long as selective reporting of performed experiments persists (both through publication bias and through the selection of ‘successful’ studies by individual researchers), confidence intervals in the published literature will be difficult to interpret. For example, although 83.4% (roughly 5 out of 6) of replication studies will give a value that falls within the 95% CI of the original study, this is only true if the study was one of an infinite sequence of unbiased studies. Given the strong indications of publication bias in psychology (Fanelli, 2010), the correct interpretation of confidence intervals from the published literature is always uncertain. Researchers have proposed a ban on p-values for less problematic issues.
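The 83.4% capture percentage can be checked with a small simulation. The sketch below is only an illustration under simplifying assumptions that are mine, not the comment's: normally distributed data with a known standard deviation of 1, equal sample sizes in the original and the replication, and no publication bias. The function name `capture_rate` and its parameters are hypothetical.

```python
import random
from math import erf

def capture_rate(n=50, sims=20000, seed=1):
    """Share of replication means that fall inside the original study's 95% CI.

    Assumes a true mean of 0, a known SD of 1, and n observations per study,
    so every sample mean is drawn from N(0, 1 / sqrt(n)).
    """
    rng = random.Random(seed)
    se = 1 / n ** 0.5
    half_width = 1.96 * se  # half-width of the original study's 95% CI
    captured = 0
    for _ in range(sims):
        original_mean = rng.gauss(0.0, se)
        replication_mean = rng.gauss(0.0, se)
        if abs(replication_mean - original_mean) <= half_width:
            captured += 1
    return captured / sims

# Analytic value under the same assumptions: the difference between two
# independent means has SD sqrt(2) * se, so the capture probability is
# P(|Z| < 1.96 / sqrt(2)), which equals erf(1.96 / 2).
analytic = erf(1.96 / 2)

print(round(capture_rate(), 3), round(analytic, 3))
```

The point of the sketch is the condition it makes explicit: the figure only holds when both studies are unbiased draws, which is exactly what selective reporting violates.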

Bayesian Statistics

Using Bayesian statistics has many benefits (see Morey, Rouder, Verhagen, & Wagenmakers, 2014). Researchers can make statements about the probability a hypothesis is true given the data (instead of the probability of the observed or more extreme data given a hypothesis and alpha level), provide support for the null-hypothesis (Dienes, 2011), and analyze data repeatedly as the data comes in. These are important benefits, and justify a more widespread use of Bayesian statistics in psychological research. Bayesian statistics are less interesting when Bayes factors are used as a replacement of p-values. When a uniform prior is used, differences between Frequentist inferences and Bayesian inferences are not mathematical, but philosophical in nature (Simonsohn, 2014).

Whenever an informative prior is used, the assumptions about the theory that is tested will practically always leave room for subjective interpretations. For example, Dienes (2011) and Wetzels et al. (2011) made different assumptions about the same theory that was tested and calculated Bayes factors of 4 (substantial evidence for the theory over the null) and 1.56 (barely evidence for the theory over the null), respectively. Based on the psychological literature, we should expect these subjective assumptions to be biased by researchers’ attitudes. Addressing this challenge is not easy, and researchers have proposed a ban on p-values for less problematic issues.

One Statistic To Rule Them All?

Neither Bayesian statistics, nor confidence intervals, nor p-values will on their own be sufficient to prevent unwise statistical inferences. As an imaginary example, let’s pretend the evaluation of this blog by 10 of my Bayesian colleagues was substantially less positive (M = 7.7, SD = 0.95) than the evaluation by 10 of my Frequentist colleagues (M = 8.7, SD = 0.82). This difference is statistically significant, t(18) = 2.58, p = .02, and neither the 95% CI around the effect size (dunb = 1.08, [0.16, 2.06], see Cumming, 2014) nor the 95% highest density interval ([0.05, 1.99], see Kruschke, 2011) included 0. Nevertheless, concluding there is something going on would be premature. The v-statistic (Davis-Stober & Dana, 2014), which compares a model based on the data against a model based on random guessing, reveals that due to the extremely small sample size, random guessing will outperform a model based on the data 68% of the time (for details, see Lakens & Evers, in press). There will never be a single statistical procedure that will tell us everything we want to know with adequate certainty.
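For readers who want to retrace the imaginary example, here is a minimal sketch that recomputes the t-value and the unbiased effect size from the summary statistics. It assumes a pooled-variance independent-samples t-test and Hedges' approximate small-sample correction; the function name is hypothetical. Because the means and SDs in the text are rounded, the recomputed t (about 2.52) is slightly lower than the reported 2.58, while the effect size matches the reported 1.08.

```python
from math import sqrt

def summary_t_and_d(m1, sd1, n1, m2, sd2, n2):
    """Independent-samples t, Cohen's d_s, and bias-corrected d from summary statistics."""
    df = n1 + n2 - 2
    # Pooled standard deviation across the two groups
    pooled_sd = sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    t = (m1 - m2) / (pooled_sd * sqrt(1 / n1 + 1 / n2))
    d = (m1 - m2) / pooled_sd
    # Hedges' approximate correction for small-sample bias
    d_unb = d * (1 - 3 / (4 * df - 1))
    return t, d, d_unb

# Frequentist colleagues first, Bayesian colleagues second (values from the text)
t, d, d_unb = summary_t_and_d(8.7, 0.82, 10, 7.7, 0.95, 10)
print(round(t, 2), round(d, 2), round(d_unb, 2))
```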

If statisticians had intentionally tried to induce learned helplessness and an escape to dichotomous conclusions based on oversimplified statistical inferences, they could not have done a better job than through the continued disagreement about how to draw statistical inferences from observed data. One might wonder what the practical significance of statisticians is, if they fail to provide “a concerted professional effort to provide the scientific world with a unified testing methodology” (Berger, 2003, p. 4). At the same time, any researcher who unquestioningly believes a p < .05 indicates an effect is likely to be true should be blamed for not spending more time learning statistics. In the end, improving the way we work will only succeed as a collaborative effort relying on a multi-perspective approach.


Cumming, G. (2014). The new statistics: Why and how. Psychological Science, doi: 10.1177/0956797613504966

Davis-Stober, C. P., & Dana, J. (2014). Comparing the accuracy of experimental estimates to guessing: a new perspective on replication and the “Crisis of Confidence” in psychology. Behavior Research Methods. DOI 10.3758/s13428-013-0342-1

Dienes, Z. (2011). Bayesian versus orthodox statistics: Which side are you on? Perspectives on Psychological Science, 6, 274–290. doi:10.1177/1745691611406920

Fanelli, D. (2010). “Positive” results increase down the hierarchy of the sciences. PloS one, 5, e10068.

Frick, R. W. (1996). The appropriate use of null hypothesis testing. Psychological Methods, 1, 379-390.

Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E.-J. (in press). Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review.

Jones, L. V., & Tukey, J. W. (2000). A sensible formulation of the significance test. Psychological Methods, 5, 411-414.

Kelley, K., & Rausch, J. R. (2006). Sample size planning for the standardized mean difference: accuracy in parameter estimation via narrow confidence intervals. Psychological Methods, 11, 363-385.

Kruschke, J. K. (2011). Bayesian assessment of null values via parameter estimation and model comparison. Perspectives on Psychological Science, 6, 299–312. doi:10.1177/1745691611406925

Lakens, D. (in press). Performing high-powered studies efficiently with sequential analyses. European Journal of Social Psychology.

Lakens, D. & Evers, E. (in press). Sailing from the seas of chaos into the corridor of stability: Practical recommendations to increase the informational value of studies. Perspectives on Psychological Science.

Mogie, M. (2004). In support of null hypothesis significance testing. Proceedings of the Royal Society of London Series B-Biological Sciences, 271: S82–S84.

Morey, R. D., Rouder, J. N., Verhagen, J., & Wagenmakers, E.-J. (in press). Why hypothesis tests are essential for psychological science: A comment on Cumming. Psychological Science.

Murphy, K. R., & Myors, B. (1999). Testing the hypothesis that treatments have negligible effects: Minimum-effect tests in the general linear model. Journal of Applied Psychology, 84, 234-248.

Simonsohn, U. (2014). Posterior-hacking: Selective reporting invalidates Bayesian results also. Available at SSRN:

Wainer, H., & Robinson, D. H. (2003). Shaping up the practice of null hypothesis significance testing. Educational Researcher, 32, 22-30.

Wetzels, R., Matzke, D., Lee, M. D., Rouder, J. N., Iverson, G. J., & Wagenmakers, E.-J. (2011). Statistical evidence in experimental psychology. Perspectives on Psychological Science, 6, 291–298.

More Data Is Always Better, But Enough Is Enough

posted Mar 8, 2014, 12:39 AM by Daniel Lakens   [ updated Mar 8, 2014, 1:32 AM ]

Several people have been reminding us that we need to perform well powered studies. It’s true this is a problem, because low power reduces the informational value of studies (a paper Ellen Evers and I wrote about this, to appear in Perspectives on Psychological Science, is available here). If you happen to have a very large sample, good for you. But here I want to prevent people from drawing the incorrect reverse inference that the larger the sample size you collect, the better. Instead, I want to discuss when it’s good enough.

I believe we should not let statisticians define the word ‘better’. The larger the sample size, the more accurate the parameter estimates (such as means and effect sizes in a sample). Although accurate parameter estimates are always a goal when you perform a study, they might not always be your most important goal. I should admit, I’m a member of an almost extinct species that still dares to publicly admit that I think Null-Hypothesis Significance Tests have their use. Another (deceased) member of this species was Cohen, but for some reason, 2682 people cite his paper where he argues against NHST, and only 20 people have ever cited his rejoinder where he admits NHST has its use.

Let’s say I want to examine whether I’m violating a cultural norm if I walk around naked when I do my grocery shopping. My null hypothesis is no one will mind. By the time I reach the fruit section, I’ve received 25 distressed, surprised, and slightly disgusted glances (and perhaps two appreciative nods, I would like to imagine). Now beyond the rather empty statement that more data is always better, I think it would be wise if I get dressed at this point. My question is answered. I don’t exactly know how strong the cultural norm is, but I know I shouldn’t walk around naked.

Even if you are not too fond of NHST, there are times when your ethical board will stop you from collecting too much data (and rightly so). We can expect our participants to volunteer (or perhaps receive a modest compensation) to participate in scientific research because they want to contribute to science, but their contribution should be worthwhile, and balanced against their suffering. Let’s say you want to know whether fear increases or decreases depending on the brightness of the room. You put people in a room with either 100 or 1000 lux, and show them 100 movie clips from the greatest horror films of all time. Your ethical board will probably tell you that the mild suffering you are inducing is worth it, in terms of statistical power, for participants 50 to 100, but not so much for participants 700 to 750, and will ask you to stop when your data is convincing enough.

Finally, imagine a taxpayer who walks up to you, hands you enough money to collect data from 1000 participants, and tells you: “give me some knowledge”. You can either spend all the money to perform one very accurate study, or 4 or 5 less accurate (but still pretty informational) studies. What should you do? I think it would be a waste of the taxpayer’s money if you spent all the money on a single experiment.

So, when are studies informational (or convincing) enough? And how do you know how many participants you need to collect, if you have almost no idea about the size of the effect you are investigating?

Here’s what you need to do. First, determine your SESOI (Smallest Effect Size Of Interest). Perhaps you know you can never (or are simply not willing to) collect more than 300 people in individual sessions. Perhaps your research is more applied, and allows for a cost-benefit analysis that requires the effect to be larger than some value. Perhaps you are working in a field that does not simply consist of directional predictions (X > Y) but allows for stronger predictions (e.g., your theoretical model predicts the effect size should lie between r = .6 and r = .7).

After you have determined this value, collect data. After you have a reasonable number of observations (say 50 in each condition) analyze the data. If it’s not significant, but still above your SESOI, collect some more data. If (say after 120 participants in each condition) the data is significant, and your question is suited for a NHST framework, stop the data collection, write up your results, and share them. Make sure that, when performing the analyses and writing up the results, you control the Type 1 error rate. That’s very easy, and is often done in other research areas such as medicine. I’ve explained how to do it, and provide step-by-step guides, here (the paper will very likely appear in the European Journal of Social Psychology – just submitted some final corrections). If you prefer to reach a specific width of a confidence interval, or really like Bayesian statistics, determine alternative reasons to stop the data collection, and continue looking at your data until your goal is reached.
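To see why controlling the Type 1 error rate matters when you look at the data more than once, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than the procedure from the linked paper: two equally spaced looks (50 and then 100 participants per condition), a two-sided z-test with a known SD of 1, and a Pocock-type correction, which for two looks uses a nominal alpha of about .0294 at each look. The function name and its parameters are hypothetical.

```python
import random
from statistics import NormalDist

def type1_error_rate(alpha_per_look, n_per_look=50, looks=2, sims=20000, seed=1):
    """Share of simulated null studies that reject at any look (two-sided z-test)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha_per_look / 2)
    rejections = 0
    for _ in range(sims):
        total, n = 0.0, 0
        for _ in range(looks):
            # Sum of n_per_look new difference scores (x_i - y_i). With both
            # groups drawn from N(0, 1), each difference has variance 2, so
            # the block sum is N(0, 2 * n_per_look). This shortcut avoids
            # simulating every participant separately.
            total += rng.gauss(0.0, (2 * n_per_look) ** 0.5)
            n += n_per_look
            z = total / (2 * n) ** 0.5  # z-statistic for the mean difference so far
            if abs(z) > z_crit:
                rejections += 1
                break  # stop collecting data once the test is significant
    return rejections / sims

naive = type1_error_rate(0.05)     # test at .05 at both looks
pocock = type1_error_rate(0.0294)  # Pocock-type nominal alpha for two looks
print(round(naive, 3), round(pocock, 3))
```

With uncorrected looks the false positive rate ends up clearly above 5%, whereas the lowered per-look alpha keeps it close to 5% overall. Other boundaries exist, and the right choice depends on how many looks you plan and when.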

The recent surge of interest in things like effect sizes, confidence intervals, and power is great. But we need to be aware, especially when communicating this to researchers who’ve spent less time reading up on statistics, that we tell them they should change the way they work, without telling them exactly how they should change the way they work. Saying more data is always better might be a little demotivating for people to hear, because it means it is never good enough. Instead, we need to help people to make it as easy as possible to improve the way they work, by giving advice that is as concrete as possible.



How a Twitter HIBAR ends up as a published letter to the editor

posted Feb 16, 2014, 10:53 PM by Daniel Lakens   [ updated Apr 26, 2014, 11:20 PM ]

I’d like to share a fun Twitter discussion I had a month ago, which led to a letter to the editor that will be published in the near future in the Journal of Nervous and Mental Disease. On the 14th of January, Keith Laws posted the following question on Twitter:

My response, underneath Keith’s message in the same screenshot, was that there was clearly something wrong with this table:

Keith found this table in a paper by Douglas Turkington and colleagues which reported an exploratory trial for cognitive behavioral techniques. At that moment, I didn’t even follow Keith on Twitter, but my Twitter buddy Åse (who wrote a blog post about the paper) did, and if she shares something, it’s almost always interesting (#FF @asehelene). I don’t know much about the topic of cognitive behavioral therapy, but I have some interest in calculating and reporting effect sizes, so I thought I’d take a look.

Pretty soon, Tim Smits joined in:

as well as Stuart Ritchie:

As you can see, we were joking around as well as being amazed at how this level of statistics reporting made it through peer review. In more formal terms, what we did was post-publication peer review (PPPR), or talking about the paper as a HIBAR (Had I Been A Reviewer). Note that Table 2 is all we have: there are no test statistics, means can only be gauged from tables (that sometimes have no labels), and the 95% CI and error bars that are reported cannot possibly be correct.

Tim Smits (who had written a similar letter to the editor for another paper) took the lead and drafted a first letter. After we all added our comments, we submitted the letter (check out Tim’s blog for more about the content of the letter, as well as some things that we excluded from the letter for brevity’s sake), only to receive the message that our letter was accepted for publication a few days later. The original authors will probably publish a rejoinder (obviously, we are all very curious).

Now I understand that getting criticism on your work is never fun. In my personal experience, it very often takes a dinner conversation with my wife before I’m convinced that if people took the effort to criticize my work, there must be something that can be improved. What I like about this commentary is that it shows how Twitter is making post-publication reviews possible. It’s easy to get in contact with other researchers to discuss any concerns you might have (as Keith did in his first Tweet). Note that I have never met any of my co-authors in real life, demonstrating how Twitter can greatly extend your network and allows you to meet interesting and smart people who share your interests. Twitter provides a first test bed for your criticisms to see if they hold up (or if the problem lies in your own interpretation), and if a criticism is widely shared, can make it fun to actually take the effort to do something about a paper that contains errors.

It might be slightly weird that Tim, Stuart, and I publish a comment in the Journal of Nervous and Mental Disease, a journal I guess none of us had ever read before. It also shows how Twitter crosses the boundaries between scientific disciplines. This can bring new insights about reporting standards from one discipline to the next. Perhaps our comment has made researchers, reviewers, and editors who do research on cognitive behavioral therapy aware of the need to raise the bar on how they report statistics (if only so pesky researchers on Twitter leave them alone!). I think this would be great, and I can’t wait until researchers from another discipline point out statistical errors in my own articles that I and my closer peers did not recognize, because anything that improves the way we do science (such as Twitter!) is a good thing.


posted Jan 28, 2014, 12:21 AM by Daniel Lakens   [ updated May 12, 2014, 6:11 AM ]

When we want to run well-powered experiments, we need to estimate the effect size we expect to observe. One source of information can be related articles published in the literature. Sometimes you need to convert between a reported effect size (e.g., r) and the effect size you need for your power analysis (e.g., Cohen’s d). Since not everyone reports effect sizes, sometimes you’ll need to know how to calculate effect sizes based on the test statistics (t and F values) or perhaps even only based on the p-value and N. I created an effect size conversion spreadsheet, From_R2D2 that can help you with this. You can download it from the Open Science Framework: There are several other resources to convert effect sizes online, but I still thought I could add a little to existing options for the following reasons.


1)      It’s called From_R2D2. Nuff said.

2)      I tried to make a spreadsheet that clearly tells you what you are getting out, depending on what you are putting in, and to prevent people from making conversions that are not correct. Some spreadsheets do not clearly differentiate between Cohen’s ds and Cohen’s dpop which annoys me, even though the difference is often small. I like adjustments for bias (such as r_adjusted instead of r), and if you are using a spreadsheet anyway, it’s not any more work, so it makes sense to use adjusted effect sizes.

3)      You don’t have to know what you can calculate based on what you have. You just put in the information you have available, and the spreadsheet will turn green if it could calculate anything. So, it requires less knowledge of what you can calculate: you just enter as much info as possible, and the program will let you know what you can get out of it.

4)      From_R2D2 helps you to calculate the most accurate effect sizes. For example, if you have the number of participants in each condition, it will give back a more accurate conversion than when you only have the total N, and provides clear pointers what to use in the form of tooltips.

5)      From_R2D2 is also useful if you have a within design. Although you need the correlation or SDs to calculate Cohen’s d_av or d_rm (see Lakens, 2013), it’s relatively straightforward to calculate r (based on the fact that in both within and between designs, the relation between a t-value and F-value is F = t*t). At the same time, it will help to prevent conversions that do not make sense (e.g., converting r to dpop in a within design).

6)      From_R2D2 also provides the common language effect size. It’s a pretty interesting way to interpret Cohen’s d, and I’ll continue to try to make it more accessible for people (see also Lakens, 2013).

7)      You read point 1, right?
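To give an idea of the kind of conversions involved, here is a minimal Python sketch of a few textbook formulas. These are not taken from the spreadsheet itself; the function names are hypothetical, the formulas assume a between-subjects design (and, where noted, equal group sizes), and no bias adjustments are applied.

```python
from math import sqrt
from statistics import NormalDist

def d_from_t(t, n1, n2):
    """Cohen's d_s from an independent-samples t-value and both group sizes."""
    return t * sqrt(1 / n1 + 1 / n2)

def d_from_t_total_n(t, n_total):
    """Approximation when only the total N is known (assumes equal group sizes)."""
    return 2 * t / sqrt(n_total)

def r_from_t(t, df):
    """Correlation-type effect size from a t-value and its degrees of freedom."""
    return t / sqrt(t ** 2 + df)

def r_from_f(f, df_error):
    """Same conversion via F = t * t; only valid for F-tests with 1 numerator df.

    The sign of the effect is lost because F is always positive.
    """
    return sqrt(f / (f + df_error))

def d_from_r(r):
    """Cohen's d from r, assuming two equally sized groups."""
    return 2 * r / sqrt(1 - r ** 2)

def common_language_es(d):
    """Probability that a random score from one group exceeds one from the other.

    Assumes normal distributions with equal variances (between-subjects design).
    """
    return NormalDist().cdf(d / sqrt(2))

print(round(d_from_t(2.52, 10, 10), 2))     # uses the group sizes
print(round(r_from_t(2.0, 18), 3))          # same value as r_from_f(4.0, 18)
print(round(common_language_es(1.08), 2))
```

With unequal groups, `d_from_t` and `d_from_t_total_n` give different answers, which illustrates why knowing the number of participants in each condition yields a more accurate conversion than knowing only the total N.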

Perhaps you already have a conversion tool that works for you, but perhaps you don’t and you find my spreadsheet useful. It complements my other spreadsheet and article on calculating effect sizes if you have the raw data (and which allows you to calculate cool things such as generalized eta squared). Together with Paul Turchan and Andy Woods, I am working on a free iPhone and Android app that will allow you to perform these and other calculations, but this is a weekend project so it will take some weeks before that is ready.

I recently wrote a blog post about the replicability of published research, and how it is a characteristic of the data, and not a characteristic of the researcher. In that post, I hinted there are ways you can get an idea of the likelihood a study will replicate. To perform the calculations necessary to get an idea of this likelihood, we’ll need r_adjusted, which you can calculate with this spreadsheet. Yes, there is an order to this madness.

Failed Replications, Probabilities, and Tarnished Reputations

posted Jan 24, 2014, 1:05 AM by Daniel Lakens   [ updated Apr 26, 2014, 11:22 PM ]

What is the likelihood a novel published finding will replicate in a high-powered replication study? And what does success or failure tell us about the researcher who performed the original experiment?

Obviously, we would love to know the probability that novel findings will replicate (i.e., if they are true in the population), but this probability is very difficult to quantify based on a single study. A formally correct answer is: “We won’t know until we try”. Even though we can never know for sure, the probability that a published finding will replicate differs depending on characteristics of the study and the prior likelihood that the hypothesis is true. An important characteristic is the sample size of the original study. In general (but not necessarily!) the larger the sample, the more reliable the effect size estimate. This means that results from smaller studies are more likely to be false positives (see Ioannidis, 2005). This is a basic statistical fact.
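The link between sample size and false positives runs through statistical power, and can be made concrete with the positive predictive value in the spirit of Ioannidis (2005). The sketch below is a simplified version that ignores bias; the prior probability of .25 and the power levels are made-up illustrative values, and the function name is hypothetical.

```python
def positive_predictive_value(prior, power, alpha=0.05):
    """Probability that a significant finding reflects a true effect.

    prior: probability that the tested hypothesis is true
    power: probability of a significant result when the effect is real
    alpha: probability of a significant result when there is no effect
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

# Hypothetical scenario: 1 in 4 tested hypotheses is true
for power in (0.2, 0.5, 0.8):
    ppv = positive_predictive_value(prior=0.25, power=power)
    print(f"power = {power:.1f}: PPV = {ppv:.2f}")
```

Holding the prior and alpha constant, lower power (as in smaller studies) means a larger share of the significant results are false positives.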

There are some people who think that when researcher B fails to replicate researcher A, the reputation of researcher A is tarnished. How could this be true, if the likelihood that a finding replicates is a characteristic of the data, and not of the researcher? I have been pondering this question, and can see two logical flaws people might make.

First, people who think failed replications affect the reputation of the researcher fail to understand basic statistics. Not every finding that is significant should replicate (even though some people still believe that if a study is significant at the .05 level, it should replicate 95% of the time). If you live in a magical world where all significant findings should replicate, a failed replication must mean Researcher A has done something that gave the data cooties, since we all know that only research that has cooties does not replicate, and giving your data cooties is bad for your reputation.
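How often should a true effect replicate? Roughly as often as the replication study has power, which a short Python sketch can illustrate. It uses a normal approximation to the two-sample t-test rather than the exact noncentral t distribution, and the function name `approx_power` and the example values (d = 0.5, with 20 or 100 participants per condition) are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison.

    Normal approximation: the test statistic is treated as
    Normal(d * sqrt(n / 2), 1) instead of noncentral t.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    ncp = d * sqrt(n_per_group / 2)
    # Probability of landing in either rejection region.
    return z.cdf(ncp - z_crit) + z.cdf(-ncp - z_crit)


for n in (20, 100):
    print(f"d = 0.5, n = {n} per group: power ~ {approx_power(0.5, n):.2f}")
```

Under this approximation, the power with 20 participants per condition is about .35, far from 95%, even though the effect is assumed to be real. Only with about 100 participants per condition does it rise above .90.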

Instead, we should accept and understand that in academia, researchers submit, reviewers recommend to accept, and editors decide to publish articles that from the very outset have a high probability not to replicate (Wait – you said you couldn't quantify the probability something replicates?! I know – I’ll explain in a later blog how you can get at least some ideas about this probability).

Second, people might think that a researcher who submits an article for publication has established the reliability of this finding in separate, unreported studies. This is an interesting perspective, but I think it is unlikely that published studies with tiny samples (e.g., 20 participants per between-subjects condition) were preceded by studies with very large samples, which a researcher nevertheless decided not to publish. So, for the published studies least likely to replicate, the assumption that researchers established the reliability of these findings in pilot studies with 4 or 5 times the sample size of the published study seems improbable.

If someone believes a failed replication tarnishes the reputation of an individual, then logically this person should believe that publishing a study that is relatively unlikely to replicate (e.g., with a medium effect size and 20 participants in each condition, testing an a-priori unlikely idea) should tarnish a researcher’s reputation. After all, the probability that a finding replicates is fixed at the moment the data are published. Therefore, I don’t understand why you would question a researcher’s reputation when a replication fails, but not when the replication succeeds, and I’m pretty sure academia would be a lot more fun, and slightly more professional, if we would all relate the success or failure of a replication to characteristics of the data, and not to characteristics of the researcher.

The Problem in Behavioral Priming Research is Equal Parts of Theory and Data

posted Jan 16, 2014, 2:17 AM by Daniel Lakens   [ updated Apr 26, 2014, 11:21 PM ]

In this recent article, Ap Dijksterhuis welcomes back theory in research on behavioral priming. Some have argued that social priming research lacks a strong theoretical foundation (e.g., Cesario, 2014, in the same Perspectives on Psychological Science issue, which is regrettably not open access). I am actually pretty impressed with early theoretical work on social priming, and recently re-read a lot of papers by Wyer, Srull, Smith, Carlston, Hastie, and others who provided social priming research with a very solid foundation in the early eighties. It is a real shame that this work has been almost completely ignored in the last 20 years, or at least has seen practically no progress. So yes, welcome back, theory!

Here, I want to focus on a problem that is just as big, but much less appreciated: Lack of data. It seems some people do not really appreciate how very, very little data they have to build theories on. In the article by Dijksterhuis, it is clear that there are two things he fails to appreciate. The first is the file-drawer problem. The second is the reliability of the available empirical support for social priming. This becomes clear from the following quote (Dijksterhuis, 2014, p. 73):

“It is interesting that the few published nonreplications have led some to suggest that behavioral priming may not exist. However, there are good reasons to believe that the fear that psychology is infested with false positives is largely unnecessary (Dalton, Aguinis, Dalton, Bosco, & Pierce, 2012; Murayama, Pekrun, & Fiedler, in press); in the case of behavioral priming, the hundreds of papers cannot be erased by the mere flick of a skeptic magic wand, no matter how hard you try.”

Let me explain why this statement is wrong, and in my view represents a fundamental misunderstanding of the current state of psychological science (or any other discipline that relies on statistical inferences and suffers from publication bias).

1 The file-drawer: No biggie?

First of all, I encountered the references to Dalton, Aguinis, Dalton, Bosco, & Pierce (2012), and Murayama, Pekrun, & Fiedler (in press) in a submission for a special issue of Social Cognition I reviewed (I was asked to submit a manuscript about embodiment research for the same special issue, which is currently under editorial consideration, perhaps more on that later). In this manuscript by John Bargh (cited by Dijksterhuis, 2014) the same references were used to downplay the file-drawer problem. I tried to prevent these claims from getting into the published literature in my review of the manuscript (which I signed, as I always do), but since the argument is now part of the published record through the publication by Dijksterhuis, I will share my criticisms of this line of reasoning here.

The work by Dalton et al. (2012) states that the file drawer problem does not pose a serious threat to the validity of meta-analytically derived conclusions (it appeared in a journal called 'Personnel Psychology' - I wonder how it would be treated in a methods journal). This must seem complete nonsense to anyone familiar with the file-drawer problem, and it obviously is. The conclusion by Dalton et al. cannot be applied to social priming research, because Dalton et al. (2012) focus on nonexperimental research (more specifically, correlation matrices), and there is no reason to assume their conclusions generalize to experimental research, where a single test of a hypothesis is either significant or not, and is published only in the former case (unlike correlation matrices, which are reported whether or not the individual correlations are significant). It might be interesting to read how the article is discussed by others. For example, Kepes & McDaniel (2013) write, in a footnote:

We acknowledge that Dalton, Aguinis, Dalton, Bosco, and Pierce (2012) concluded that publication bias is not worrisome in our literature. However, that paper stands alone in that conclusion. We note that the Dalton et al. (2012) effort differed from all other published publication bias analyses in that did not examine any specific research topic in the literature (Kepes, McDaniel, Brannick, & Banks, 2013). As such, we do not find it an informative contribution to the publication bias literature (i.e., publication bias is concerned with the availability of effect sizes on a particular relation of interest).

Although I would have liked to read in the footnote that the conclusion does not generalize to experimental research (instead of the focus on a specific relation of interest), these authors understand that the article stands alone in its conclusion. It might hold for correlation matrices in some subdisciplines, but the conclusion does not hold for social priming research. In my view, the citation of Dalton et al. in the article by Dijksterhuis is bad scholarship.

Similarly, referring to Murayama et al. as support for the claim that the number of Type 1 errors is not a big problem is not really fair. As Murayama et al. state: “We do not by any means intend to argue that current research in social psychology is this healthy and has sufficient self-cleansing capabilities” (p. 1). The research practices discussed by Murayama et al. (replications, a-priori hypotheses, etc.) WOULD reduce Type 1 errors IF they are used – and we should interpret the likelihood that there are Type 1 errors in a research area based on the extent to which these practices have been used. In social priming research (as in most other domains), there are no pre-registered hypotheses, very few close replications, and no way to estimate the number of failed experiments (but see the p-curve section below, for a key to the file-drawer). This citation seems to suggest the article by Murayama et al. argues all is well in social psychology – but nothing could be further from the truth.

2 Erasing Social Priming by the Flick of a P-curve.

Dijksterhuis correctly notes that you cannot erase hundreds of papers by the flick of a wand. Indeed, magic is not real. Luckily for researchers, p-curve analyses (Simonsohn, Nelson, & Simmons, 2013) are real. Let’s imagine social priming studies have observed significant effects with the following p-values:

p = .032, p = .001, p = .021, p = .045, p = .012, p = .002, p = .028, p = .038, p = .016, p = .044, p = .015, p = .023, etc.

Anyone familiar with the distribution of p-values under the null-hypothesis will see what I was typing in: a completely uniform distribution of p-values. If the null-hypothesis is true, every p-value is equally likely. If all significant studies that reveal social priming effects have a distribution that is uniform, the studies would lack evidential value. We would have to treat social priming as a research area that consists primarily (there might be some exceptions) of research findings that represent selection bias (either of possible tests performed on the same set of data, or of the datasets themselves). It’s not the flick of a wand, but it’s much better: It’s a thorough understanding of statistics, applied at a meta-level, to draw inferences from published findings, without being limited by publication bias. If, instead, the distribution contained a lot of p-values between .00 and .01, and only very few p-values between .04 and .05, social priming would show evidential value.
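The contrast between a flat and a right-skewed p-curve can be sketched in a few lines of Python. This is not the p-curve test of Simonsohn et al. itself, only an illustration of the expected shape of significant p-values for a two-sided z-test. The function names and the noncentrality value of 1.58 (roughly d = 0.5 with 20 participants per group, under a normal approximation) are assumptions for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
EDGES = (0.01, 0.02, 0.03, 0.04, 0.05)


def prob_p_below(x, ncp):
    """P(two-sided p-value < x) when the z statistic ~ Normal(ncp, 1)."""
    z_crit = STD_NORMAL.inv_cdf(1 - x / 2)
    return STD_NORMAL.cdf(ncp - z_crit) + STD_NORMAL.cdf(-ncp - z_crit)


def expected_shares(ncp):
    """Expected share of significant p-values in each .01-wide bin."""
    cumulative = [0.0] + [prob_p_below(edge, ncp) for edge in EDGES]
    total = cumulative[-1]
    return [(hi - lo) / total for lo, hi in zip(cumulative, cumulative[1:])]


def bin_counts(p_values, n_bins=5, width=0.01):
    """Count observed p-values below .05 in each .01-wide bin."""
    counts = [0] * n_bins
    for p in p_values:
        index = int(p / width)
        if index < n_bins:
            counts[index] += 1
    return counts


# The imagined p-values listed above.
example = [.032, .001, .021, .045, .012, .002,
           .028, .038, .016, .044, .015, .023]

print("observed counts:   ", bin_counts(example))
print("null (ncp = 0):    ", [round(s, 2) for s in expected_shares(0.0)])
print("effect (ncp = 1.58):", [round(s, 2) for s in expected_shares(1.58)])
```

Under the null, each bin is expected to hold 20% of the significant p-values, which is close to how the imagined p-values are spread. With a true effect, the first bin holds well over 20% and the last bin the least.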

What would a p-curve reveal? I started p-curving social embodiment findings (closely related to social priming, and many studies are done by authors who have also published social priming studies) and based on an analysis of 100 effects from over 40 published articles, I am not optimistic (but more on that later as well). I will share my analysis soon, so you can judge the results for yourself, but when it comes to studies that examine effects of concrete primes on behavior, decisions, evaluations, etc., I think there is reason to worry. If the 100 effects in my analysis can, taken together, lack evidential value, so can hundreds of social priming studies. Never underestimate the number of Type 1 errors hundreds of researchers can produce if they apply themselves to a specific research area for more than a decade. Do the math.

The empirical basis of social priming research is not as strong as authors such as Dijksterhuis and Bargh think. With an increase in pre-registered replications, combined with meta-analytic techniques that are not influenced by publication bias, such as p-curve analyses, we will start to get a much more realistic view of the reliability of our research, in social priming and beyond.

See also:

Dalton, D. R., Aguinis, H., Dalton, C. M., Bosco, F. A., & Pierce, C. A. (2012). Revisiting the file drawer problem in meta-analysis: An assessment of published and nonpublished correlation matrices. Personnel Psychology, 65, 221-249.

Dijksterhuis, A. (2014). Welcome back theory! Perspectives on Psychological Science, 9, 72-75.

Simonsohn, U., Nelson, L., & Simmons, J. (2013). P-curve: A key to the file drawer. Journal of Experimental Psychology: General.


The Greatest Innovation in Science Will Come from Better Management

posted Oct 12, 2013, 4:40 AM by Daniel Lakens

This blog post also appears in TH&MA, the Dutch professional journal for higher education. A response to the article below by NWO, the Dutch research funding organisation, will appear in the same journal.
The goal of science is to gather robust knowledge. Society hands over part of its money to science to develop new technologies, to learn how we can keep or make people healthy, to understand how our world works, and to understand how interventions bring about change. If society buys a television that does not work, people can easily establish that for themselves and return the device to ask for their money back. With science, a citizen cannot ask for a refund, and it is often very hard to judge whether science is doing its job well. Scientists, and the people who shape the policy around scientific research, therefore have a great responsibility to make sure that science gathers robust knowledge.
Let us assume that scientists are intrinsically motivated to gather robust knowledge. Naturally, this motivation is only one of the factors that determine their eventual behavior. For instance, many scientists want to get paid for their work. There is not enough money to hire everyone who wants to work in science, so the money has to be divided among those scientists who are best able to generate robust knowledge. Science policy is really nothing more than solving this relatively simple problem in the way that leaves society most satisfied.
Current science policy breaks down on two points. The first problem is that science policy focuses on outstanding individual researchers, while science is by definition a cooperative enterprise. The second problem is that innovative research is intrinsically motivating, and will therefore always be done by scientists, but the robustness and applicability of such research is by definition still relatively uncertain. You need to stimulate scientists to do research that is important but less intrinsically rewarding, such as replicating previous research and putting knowledge to practical use (valorization).

The invisible benefits of collaboration

Policy makers are, like the rest of humanity, rarely creative enough to think about solutions that require thinking outside the box. The best-known illustration is the puzzle of connecting nine dots by drawing four straight lines without lifting your pen. The problem is that people impose restrictions on themselves that are not there. The straight lines are allowed to extend a little beyond the dots in the 4 corners. If managers want to lead science, it is intuitive to keep thinking within the current competitive framework. Within that framework you play scientists off against each other by giving the largest rewards to the few who perform better than the rest. If one person scores a 10, and the rest a 9, you give the person with the 10 the money. Simple.
It is harder to see that there are systems in which the quality of science does not run from 1 to 10, but from 1 to 20. Yet this is the difference between current science, in which competition is central and where we will collectively never score higher than a 10, and a system in which cooperation is stimulated, where scientists can suddenly score much higher. Let me give an example a toddler can understand. You are moving house, and you want to take your couch with you. You cannot carry it on your own. The solution is to leave the couch behind, and buy a new one after the move. Of course, that solution is not very efficient. It is not for nothing that people call their friends when they move, and carry the couch together. In a moving system in which competition is central, people only buy small items to put in their house. In a scientific system in which competition is central, researchers only take on small problems to study.
If you want to move toward a science system in which we deliver real achievements, individual rewards and grants make no sense. They limit scientists in the problems that get tackled. Proclaiming that people are excellent on their own (conveniently ignoring, for now, the question of whether excellence can be measured in a way that yields a one-dimensional ranking) is closing your eyes to the fact that the real question is how excellently people can work together. Suppose that collaboration increases quality by 20%. Then a smart manager wants to hire two nines who can collaborate (together they score 2 × 9 × 1.2 = 21.6, after all), rather than two tens who can only do individual research (and remain stuck at 20). Institutions such as NWO should therefore fund PhD positions only for two nines who can collaborate, and never again for a single ten who cannot.
There are still a few people in our world who enjoy helping others without directly getting anything in return. This endangered species is essential for the quality of science. They explain to others how to do research better, without immediately wanting to become a coauthor on the article that results from the research. Suppose such a person can raise the quality of research by 50%, but never produces anything themselves. In a world where managers did their job well, universities would fight over such a person. In practice, managers cannot quantify the 50% improvement this person brings about. During the annual performance review there is a real risk that this person gets grilled about how little was published in the past year, or about the fact that no grant has been brought in yet. The disappearance of this type of employee is just one of the important factors that reduce the quality of the scientific enterprise. If collaboration were rewarded, this could be prevented. Although it is not my job to help managers do their work well, I can propose a simple solution. You ask all employees to estimate how much a colleague who is not a coauthor has helped to improve the quality of their research. This score can be used during a performance review to evaluate a researcher's work.
What should managers manage?
Innovative research is the easiest type of research to do as a researcher. Not in terms of the time and effort needed to do innovative research. It takes a lot of time, you need a lot of knowledge, preferably broad, and you have to sink your teeth into a problem for weeks or months. But the motivation scientists have for this type of research is so incredibly high that any researcher worth their salt would keep doing it even if they were not paid for it (think of all the emeritus professors who simply keep working). The only constraint on doing innovative research is time (and having a bit of creativity). To stimulate innovative research, you can best do two things. First, you hire people based on the creativity of their previous work, and perhaps based on the outcome of some psychological tests or individual interviews. Second, you fire all managers, and use that money to give the researchers as much time as possible. The money can, for example, be used to expand the secretarial staff, so that administrative work for researchers is kept to a minimum. After that, things will sort themselves out. Top institutes produce more Nobel Prize winners the less they are managed.

I have to admit that this last claim is not based on scientific research. I find that fitting in a way, because most science policy is not based on scientific research either. For example, the size of the grants NWO awards is currently not based on how much money yields the most knowledge. NWO simply does not know. If we can discover the Higgs boson, you would think we can also figure out how money should be distributed among researchers so that they are optimally efficient. That the funding of scientific research is not based on empirical data is as ironic as it is problematic. But do take the remark about the relation between the number of managers and scientific achievements with a grain of salt.

It is, however, likely that everything managers do will ultimately work against innovative research (unless they make collaboration possible, and give scientists as much time as possible). The reason is that doing innovative research is the most natural state of the dynamic science system. These are points in the scientific landscape toward which researchers flow because of their intrinsic motives. Almost everything currently done in the name of managing science effectively works like building dams within this landscape. Rewarding people based on the number of publications? A blockade that researchers have to get around on the way to their goal.
What managers should manage is the robustness and applicability of research. Just as most people would rather sit in front of the TV than exercise, most scientists would rather do new things that yield immediate rewards than work on the fitness of that new knowledge. Let me give two examples. First, scientists do not, of their own accord, redo the research of others often enough to check its robustness. This replication research is undervalued by scientists themselves, but it is important. Just like people who only sit in front of the TV and never exercise, a science with only new results and no replication research runs a high risk of disease. Replication research, however, rarely ends up in a journal with a high impact factor. Managers should therefore mainly encourage scientists to build on existing research more often, instead of stimulating innovative research. To be clear: you do not achieve that by attaching importance to the impact factor of the journal in which a researcher publishes.
Second, society pays scientists for knowledge, and at least part of that knowledge should also be usable to make people's lives a little better. Often this means that the outcomes of research have to be applied. To guarantee that research is applicable, working out all the smaller details often takes much more research time than the innovative research itself did. In general, it is much harder to take the last steps in a valorization trajectory than the first steps. The first 'innovative' steps, however, are rewarded much more by managers. Researchers who do the real work in the valorization of knowledge do so not to advance their own careers, but out of conviction. In the time it takes to do one field study, you can easily do four experiments in the lab. As long as managers make no distinction in the amount of work a publication costs, scientists become less likely to take the trouble to investigate how knowledge can be applied. Managers should therefore differentiate in how performance is measured. For the ultimate goal of science, a good field study may well be worth more than 10 lab experiments.



We can be proud of Dutch science. We have many intrinsically motivated researchers who love nothing more than discovering new things, and who are very good at it. At the same time, things can always be better. I have sketched three problems here that stand in the way of a robust and excellent science. First, we reward individual excellence, while we should stimulate excellent collaboration. Second, management targets innovative research, while that is precisely the natural, intrinsically motivating behavior that Dutch scientists will show if you leave them alone. Management should focus on ensuring the robustness of research, and the practical application of results. Finally, changes to the reward structures within science should be based on scientific insights. If you reward people with extra money or the prospect of a permanent position, then the reward structures should be set up so that they lead to the most robust new knowledge. How you achieve that is an empirical question, whose answer must be found in data. It could well be that the greatest innovation within Dutch science will be achieved by redesigning the reward structures.
