Post date: Apr 22, 2014 7:07:24 PM
It is noteworthy that he found 44 replications that did not confirm the hypothesis of the original study – given present-day concerns that replications should be published more frequently (e.g., Koole & Lakens, 2012) and the disappearance of non-significant results from the literature (e.g., Fanelli, 2010), Hanson seems to have had more research to work with in his analyses. He made seven observations that are just as relevant today as they were more than half a century ago. It’s only one of the reasons I love reading older articles, and this one is without a doubt a hidden gem that deserves to be more widely known.
1) Original propositions advanced with relevant evidence are more frequently confirmed in independent tests than propositions advanced lacking relevant evidence. This might seem too trivial to mention – obviously any claim (or proposition) that is supported by data is more likely to be replicated than statements that lack data. What Hanson noticed is nevertheless still relevant today: Authors sometimes make statements that are not supported by data, but assumed to be an underlying mechanism. An example that comes to mind is the idea that primed concepts make a construct more accessible, and the increased accessibility subsequently influences behavior. Authors might conclude that a prime has influenced the accessibility of a construct (and indeed, primes often do influence the accessibility of constructs), but if this is not demonstrated, the authors advance a proposition lacking relevant evidence. I would like to add another example of findings that I think might fit under this category, namely studies that predict a crossover interaction consisting of two simple effects, where authors observe a significant interaction, but with only one significant simple effect (while the other is not significant), and interpret the data as support for their hypothesis (sometimes both simple effects are only marginally significant, or not significant at all; a small simulated sketch of this pattern follows below). This happens more often than you’d like, and I believe this is also a situation where propositions are advanced while evidence is lacking (for an example where crossover interactions that never yielded two significant simple effects turned out to provide better support for an alternative hypothesis, see Lakens, 2012).
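To make the crossover point concrete, here is a minimal sketch in Python with simulated data (the means, sample sizes, and variable names are my own illustrative assumptions, not anyone’s actual data): the interaction test can come out clearly significant while one of the two simple effects does not.

```python
# Minimal sketch with simulated data (all numbers hypothetical): a significant
# 2x2 interaction does not guarantee that both simple effects are significant.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(1)
n = 40  # per cell

# True pattern: a clear treatment effect in condition A, a weak reversed
# effect in condition B -- the 'crossover' is mostly carried by one cell.
cells = {("A", "control"): 0.0, ("A", "treat"): 0.7,
         ("B", "control"): 0.0, ("B", "treat"): -0.15}
rows = [(cond, grp, y) for (cond, grp), mu in cells.items()
        for y in rng.normal(mu, 1.0, n)]
df = pd.DataFrame(rows, columns=["condition", "group", "y"])

# Test the interaction with OLS (equivalent to a two-way between-subjects ANOVA).
fit = smf.ols("y ~ condition * group", data=df).fit()
print(f"interaction p = {fit.pvalues['condition[T.B]:group[T.treat]']:.3f}")

# Simple effects: treatment vs. control within each condition separately.
for cond in ["A", "B"]:
    sub = df[df.condition == cond]
    t, p = stats.ttest_ind(sub[sub.group == "treat"].y,
                           sub[sub.group == "control"].y)
    print(f"simple effect in {cond}: t = {t:.2f}, p = {p:.3f}")
```

With illustrative means like these, the interaction will typically be significant while the simple effect in condition B is not – exactly the pattern that should not be presented as a demonstrated crossover.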
2) Original propositions based on a large amount of evidence are more frequently confirmed in independent tests than propositions based on a small amount of evidence. This one, we know. More data is more reliable (see Lakens & Evers, 2014, for an accessible introduction to why, and for an explanation of how to calculate the v-statistic by Davis-Stober and Dana, 2014, which can tell you when you have too little data for your conclusions to beat random guessing). What I like is that Hanson presents this fact as an empirical reality. Nowadays, it would be impossible not to follow such a statement with the (hopefully well-understood) statistical fact that small studies are underpowered (Cohen, 1962, or Cohen, 1988; the sketch below illustrates this). Note that by ‘small’ Hanson means studies with fewer than one hundred units of observation. If we assume between-subject comparisons, that is a fair classification.
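To put a number on ‘small’ in Hanson’s sense, here is a quick power calculation sketch in Python (the effect size and alpha are illustrative assumptions; the calculation uses the standard power routines in statsmodels):

```python
# A quick power check (a sketch, not a value from Hanson): with 50 subjects per
# group (i.e., 'small' in Hanson's sense of fewer than 100 observations) and a
# medium effect of d = 0.5, power is only about .70.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
power = analysis.power(effect_size=0.5, nobs1=50, alpha=0.05)
print(f"power at n = 50 per group, d = 0.5: {power:.2f}")

# Sample size needed per group for the conventional 80% power:
n_needed = analysis.solve_power(effect_size=0.5, power=0.80, alpha=0.05)
print(f"n per group for 80% power: {n_needed:.0f}")
```

Under these assumptions, a study with fewer than 100 observations in total falls well short of the conventional 80% power for a medium effect.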
3) Source of data. Hanson’s article is published in the American Journal of Sociology. Here, he distinguishes between ‘given data’, or data already existing in databases (e.g., marriage license information), ‘contrived data’ (questionnaires, paper and pencil tests), and ‘observed data’, such as field notes. Although Hanson did not have an a priori hypothesis, an interesting pattern was that contrived data were most reliable, followed by given data, followed by observed data. I found this interesting. It’s almost as if given data, especially if they can be accessed without too much effort, afford an easy way to test a hypothesis, but if 20 people test a hypothesis on an easily available dataset, there is a higher risk of Type 1 errors (a quick back-of-the-envelope calculation below shows how fast this risk adds up).
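As a rough illustration of that last worry (treating the 20 tests as independent for simplicity, which tests on the same dataset are not, so read this as a ballpark sketch rather than an exact figure):

```python
# Back-of-the-envelope sketch: if 20 teams each test one (false) hypothesis on
# the same easily available dataset at alpha = .05, and we treat the tests as
# independent, the chance that at least one finds a 'significant' result is
# 1 - (1 - alpha)^20, i.e., roughly 64%.
alpha, k = 0.05, 20
p_at_least_one = 1 - (1 - alpha) ** k
print(f"P(at least one Type 1 error among {k} tests) = {p_at_least_one:.2f}")
```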
4) Initial organization of data. Here he refers to data that is more or less precisely organized, for each of the categories under point 3. For example, in addition to field notes, observations can be collected in a structured manner in the lab. Data that is already organized is more reliable. It's a slightly less clear, and thus less interesting, point, I think.
5) Original propositions based on data collected under a systematic rule of selection are more frequently confirmed in independent tests than propositions based on data collected under a non-systematic selection procedure. Under ‘systematic’ selection rules, Hanson categorizes samples that were representative of the population. Non-systematic selection rules involve studies with convenience samples, ‘typically in the use of subjects available in college classes or in the local community.’ There might be confounds here, such as the type of research question you would address in huge representative samples, compared to the questions you try to address in studies with college students, which are less risky to run. That is, this might be due to the prior probability that the examined effect is true (the lower this prior, the more likely published findings are Type 1 errors, see Lakens & Evers, 2014; the sketch below makes this concrete). Still interesting, and it deserves to be explored more, given our huge reliance on convenience samples in psychology.
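Here is a small sketch of the prior-probability argument (the priors, power, and alpha below are illustrative assumptions in the spirit of Lakens & Evers, 2014, not values from Hanson):

```python
# Sketch of why the prior matters (all numbers are illustrative assumptions):
# the positive predictive value (PPV) of a significant finding is
#     PPV = power * prior / (power * prior + alpha * (1 - prior)),
# i.e., the probability that a significant result reflects a true effect.
def ppv(prior, power=0.8, alpha=0.05):
    """Probability that a significant result reflects a true effect."""
    return (power * prior) / (power * prior + alpha * (1 - prior))

# A risky question (low prior) vs. a safer question in a convenience sample:
for prior in (0.1, 0.5):
    print(f"prior = {prior:.1f}: PPV = {ppv(prior):.2f}")
```

Under these assumptions, with a risky hypothesis (prior of .10) roughly a third of significant findings would be false positives, while a safer hypothesis (prior of .50) fares much better.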
6) Original propositions formulated as a result of quantitative analysis of data are more frequently confirmed in independent tests than propositions formulated as a result of qualitative analysis of data. Quantitative data with test statistics, or qualitative data summarized with numbers (!), were more likely to replicate than qualitative data without numbers.
7) Original propositions advanced with explicit confirmation criteria are more frequently confirmed in independent tests than propositions advanced without explicit confirmation criteria. The question here is whether the results can be expected to generalize, either because all examined instances show the proposed relation with no contradictory evidence, or (more likely) because a statistical technique is used to reject a null hypothesis at the 5 percent level of significance. Studies with such criteria were more likely to replicate (over 70%), while studies without them replicated less often (only 46%). This is a great reminder that you can criticize null-hypothesis significance testing all you want, and we can definitely make some improvements, but not using significance testing led to many more conclusions that were not reliable.
Overall, I think these conclusions are interesting to examine in more detail, or even to replicate (!), for example in the Reproducibility Project. They might not be too surprising, but they are worth keeping in mind when you evaluate the likelihood that published research is true.