Blog Posts

Quantity vs quality: Many analysts vs robust analyses (27/05/2022)

This is a stream of consciousness on the topic of “many-analysts[1] approaches” that I wrote in a few hours, so please excuse the typos! This is intended to discuss my thoughts on and experiences in many-analysts projects; it isn’t intended to directly reply to Wagenmakers, Sarafoglou, and Aczel (2022), though I was certainly motivated to write down my thoughts on the topic by their recent, thought-provoking article that advocated for many-analysts approaches. I would also highly recommend giving Wagenmakers et al. a read (if you haven’t already) before reading this blog post, as it may provide some context about what these approaches are, and some of the arguments in favour of them.

When I first heard about the idea of a many-analysts approach – in December 2015 when my PhD supervisor was invited to participate in a many-analysts style project for response time modelling – I immediately thought that it was a brilliant idea. In our lab we would always have long discussions about how models for each experiment should be specified (even though they were usually just variants of a standard diffusion or linear ballistic accumulator model), what methods we should use to fit the model, what robustness analyses we should perform, and so on and so forth. To me, the many-analysts approach seemed like the natural step forward for science as we know it: taking all the different ideas that different people already come up with about how to model a data set, and then seeing how they compare. However, after having been an analyst in two different many-analysts projects, having done a lot of modelling both independently and in collaborative teams, and having done a lot of thinking about the topic, I feel that the adoption of many-analysts approaches – at least in the context of applying cognitive models to data – constitutes scientific backsliding, from a focus on the quality of analyses to a focus on quantity of analysts.

My experience with many-analysts projects

Here I detail my anecdotal experiences in my first many-analysts project, with minor references to my second, and use this to showcase potential practical issues that I experienced with many-analysts approaches. These experiences obviously don’t reflect what necessarily happens in all many-analysts projects, or perhaps even most, but they do show that these issues *can* arise, at the very least.

After my PhD supervisor was invited to participate in the many-analysts project mentioned above in mid-December 2015, he offered for me to join him on the project (the project was invite-only, with each invitee able to ask one other person to participate with them), and I felt very fortunate to be included in this potentially ground-breaking project. However, the end of the year was fast approaching, and while we briefly chatted about the project, it wasn’t until early February 2016 that we started thinking about how to analyse the data. As the deadline for submissions was the 1st of March 2016, and we were both busy with a conference and several other projects during February, we weren’t able to put as much time into this project as we would into a standard modelling project. The other many-analysts project that I participated in had an even more extreme timeline, with the data being released in early July (2017), and the analysis being due in late August (2017). This highlights the first practical problem that I have with many-analysts projects; they can impose tight, inflexible deadlines, which might require researchers to complete their analyses within a faster timeframe than they typically would in their own research.

Regardless, we were both still determined to participate, and had big plans for how to analyse the data. Specifically, we had planned to combine visual assessments of the data (as modellers usually do implicitly, but rarely explicitly report or discuss) and our (at the time) newly developed method of estimating Bayes factors for a specific response time model. However, after doing both types of analyses on some of the data sets (there were 14 data sets in total for analysts to assess), we noticed some disagreements between our two methods. This was difficult to express in the results submission form that we were given, as we had to give a dichotomous, yes/no response[2] on whether each data set manipulated each of four cognitive parameters (and while we were given an option for “confidence”, it was optional, and we weren’t really sure how to express this uncertainty as a single measure of “confidence”, so we left it blank). We soon decided that we had to select one analysis or the other to submit, and given that the visual assessments were already completed and the deadline was fast approaching, they were the natural choice. While the other many-analysts project that I participated in improved on this by making a confidence level a fundamental aspect of the submitted results, this isn’t how we would typically express uncertainty in our results (and it certainly didn’t feel appropriate for expressing the results of separate, potentially qualitatively differing analyses), and in the end we just made this a transformation of our Bayes factors. This highlights the second practical problem that I have with many-analysts projects; they can discourage, or even not allow for, qualitative differences in results within a single analyst, meaning that they might underestimate typical within-analyst uncertainty.
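As a concrete aside on what such a transformation could look like, the sketch below is a minimal Python illustration (not the exact mapping we submitted) of one common way to squash a Bayes factor into a single bounded “confidence” value: converting it to a posterior probability under the assumption of equal prior odds. The function name and the example Bayes factors are hypothetical.

```python
def bf_to_confidence(bf_10):
    """Map a Bayes factor BF10 (evidence for an effect over no effect)
    onto a 0-1 "confidence" scale, assuming equal prior odds for the
    two hypotheses. Under that assumption, the posterior probability
    of the effect is BF10 / (1 + BF10), which is one common (though
    not unique) way to bound an unbounded Bayes factor."""
    return bf_10 / (1.0 + bf_10)

# Hypothetical Bayes factors for whether a data set manipulated each parameter
example_bfs = {"drift rate": 12.0, "threshold": 0.8, "non-decision time": 1.1}

for parameter, bf in example_bfs.items():
    print(f"{parameter}: BF10 = {bf:.1f}, confidence = {bf_to_confidence(bf):.2f}")
```

The appeal of a mapping like this is that it is monotonic and bounded between 0 and 1, but it still collapses qualitatively different sources of uncertainty into a single number, which was exactly what felt uncomfortable to us.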

Furthermore, another reason that we decided to use the more heuristic, visual assessment approach is that we found model specification a substantially more challenging process than we have in most typical circumstances. Specifically, in most of my cognitive modelling projects, there’s a specific, theoretically relevant question that we’re interested in answering. Based on this question, as well as what model(s) we’re interested in using and what paradigms they’re most applicable to, the experimental design begins to naturally form, as well as some of the possible modelling approaches that we could take to answer the question in these data. However, in both many-analysts projects the details on the experiments were extremely vague (as intended, as they were both designed to also serve as “prediction competitions” of sorts, based on the assumed effects of specific experimental manipulations), and the questions were fairly arbitrary and atheoretical (perhaps to encourage more potential difference in methods between researchers), making it difficult to confidently design a reasonable modelling approach. For the first many-analysts project, several analyst teams reported similar feelings in their “post-hoc considerations”, and this also led to some dramatic differences in interpretations of the research question between groups. For example, while most groups believed the goal was to decide whether or not each of four cognitive constructs varied across conditions in each experiment, two groups believed that the goal (as it seemed more theoretically sensible) was to decide which (if any) single construct was most likely to have been manipulated in each experiment. Importantly, while these two teams showed extremely close agreement with one another across the 14 experiments, their results differed a great deal from those of the other research teams; however, this was not a result of differences in analysis approaches per se, but of different interpretations of the question of interest. This highlights the third practical problem that I have with many-analysts projects; they can ask research questions more vaguely than independent research teams typically would, with data that may not be the most appropriate to answer all the different interpretations of this research question.

A final reason that we picked the more heuristic analysis was that it seemed more original than our model-based analysis. When discussing how to analyse the data at the start, as well as which analysis to submit for the project, our desire was to try and represent the more qualitative aspects of assessment that form a part of typical modelling projects. We reasoned that due to the nature of the many-analysts project, most teams would likely just report the output of some model comparisons between standard models, and that including a more heuristic method would provide some more interesting results to think about in the project (i.e., how do our heuristic intuitions about the data differ from what the models actually tell us?). After all, isn’t the point of getting many analysts together to get as many different ideas as possible on how to assess the data? However, to our surprise, there was another analyst team that also chose a heuristic approach, meaning that heuristic approaches – which no one involved in the project would actually ever use as their primary inference approach in practice – represented more than 10% of all of the approaches used (2/17). Furthermore, when I’ve discussed how researchers are selected for invitations with organisers of invitation-based many-analysts projects, one of the criteria seemed to be avoiding inviting too many researchers who might choose similar methodologies. While I think that such an invitation approach makes sense from an “originality” angle, where the goal is to obtain analyses that differ from one another as much as possible, it seems to me that these recruitment strategies may artificially create more disagreement than one would typically expect to see in a more representative sample of the field, and potentially compromise inferences about the levels of disagreement. This highlights the fourth practical problem that I have with many-analysts projects; they can bias the analyses to differ to a more extreme extent than would typically be expected in the field, either through biases in analyst recruitment strategy, or a desire among analysts to come up with something original that differs from other teams.

Once we had submitted our results for the first many-analysts project, I thought that the process from there to a paper would be fairly straightforward; after all, the results were already complete, and would only need to be tabulated and written up. However, this was far from the case. Once the full experimental details were finally released, there was a lot of disagreement amongst the analysts and project leaders as to the appropriateness of the design. Some felt that the paradigm wasn’t the most appropriate one. Some felt that the manipulations weren’t the most sensible ones. Some felt that the experimental procedure and data set creation left the door open for a whole host of confounds (I was certainly in this group). I had a similar feeling in the other many-analysts project that I participated in. In each of these cases, based on the amount of disagreement between analysts on the experimental design, I began to question what could really be gained from these projects. After all, if the data used in the analyses were not truly adequate, then couldn’t disagreement between researchers be partially put down to poor data quality? In hindsight, it seems strange to me to include researchers only as “analysts”, and to not consult them on the experimental design; such a separation between experimental design and analysis design does not seem like a sensible scientific approach, but this seems to be the idea behind at least some many-analysts projects. This highlights the fifth practical problem that I have with many-analysts projects; they can defer important aspects of data acquisition to the project leaders, which may make the entire exercise meaningless.

In both many-analysts projects, another key point of disagreement was with the interpretation of the results of the study. While it may seem straightforward to simply collate the results from all of the analysts, it soon became clear that there were just as many ways to analyse the results of the analysts as there were to analyse the original data. Honestly, it felt as though we could have had a many-analysts project just on how to interpret the results of the actual many-analysts project! In both cases the project leaders decided on some specific ways to showcase and interpret the results, with some input from everyone else on the project, but I felt as though there was some bias here. Not an intentional bias or anything like that; I think that both projects were excellently run, and that all decisions made were reasonable. But I think that bias is something that can naturally occur in science, and I don’t think that the interpretation of many-analysts results is exempt from this. I think that results that show disagreement are much more exciting in many-analysts approaches (it shows the importance of the method), and results that show agreement are much more boring (others might question why one should bother using this approach), and in my opinion these biases seeped through in both of these projects, at least to some extent. This highlights the sixth practical problem that I have with many-analysts projects; they can create biases in how project leaders interpret the patterns of analyses, where project leaders might emphasise more exciting trends (i.e., aspects of disagreement), and downplay less exciting trends (i.e., patterns of agreement).

To me, one of the strangest parts of the entire first project was the review process. The paper was rejected from several initial outlets, mostly due to a lack of potential interest, but the first journal that didn’t outright reject it accepted it with essentially no revisions. This seemed strange, as very few cognitive modelling papers that I have been involved in went through the review process without *any* new analyses being suggested. However, in this instance, the paper was immediately “accepted with minor revisions” after the first round of reviews. I guess this makes sense to some extent, as the goal was to look at the variability in different researchers’ chosen analyses, but it made me feel like something was missing from the process. If we were actually trying to answer an important theoretical question with this many-analysts approach, then I would have had very low confidence in our results. While reviews can sometimes be crushing, they often suggest important potential limitations of the chosen analyses, which can lead to further robustness analyses. While these can be a lot more work, each additional robustness analysis makes me a little more confident in my results, as I feel like I’m leaving no reasonable stone unturned (at least in the minds of these reviewers). However, I did not feel that the same level of “quality assurance” was applied to any, let alone all, of the chosen analyses in the review process. I’m not sure why exactly this was the case – perhaps reviewers see some level of authority in the paper given the number of experts typically on it, or reviewers recognise the difficulty of selectively updating some analyses, or adding completely new analyses. Regardless though, this highlights the seventh practical problem that I have with many-analysts projects; they can potentially create situations where reviewers may not suggest additional analyses, providing a lesser level of “quality assurance”.

How these experiences compare to when I’m the “only-analyst”

Here I briefly discuss the differences between my many-analysts experiences and my only-analyst experiences, and why my only-analyst experiences have felt much more thorough, accurate, and robust than my many-analysts experiences.

While I detailed the issues that I experienced when participating in many-analysts projects, I think a natural question is how this compares to “only-analyst” situations. After all, while many-analysts approaches may have potential flaws, they could still be the lesser of two evils. However, when I think about the amount of analytic time and effort that I have put into my own work – as well as the amount that others in my area have put into their work – and the number of systematic robustness assessments that are commonly performed, many-analysts approaches do not feel as though they come even remotely close.

When I think about the cognitive modelling papers that I have led, and papers from many others that I have read, one image comes to my mind: extremely long papers, with sometimes even longer supplementary materials. Models upon models upon models run, all to assess if/when the results hold. Modelling research is often so meticulous, with modellers trying to systematically assess whether specific changes in the analyses will influence the results and/or conclusions of the study. Furthermore, while I try to get my analyses to run as quickly as possible, sometimes cognitive modelling can just take a lot of time, even with extremely good computational resources. While I now typically view analyses that take around a month or longer to run as computationally infeasible, some analyses can realistically take a week or two. While this can be annoying, I think it’s an aspect of “slow and careful” science in cognitive modelling; a lot of time can often be lost in running the models, but this is a small sacrifice for the quality of the assessments. To me, this level of detail, commitment, and robustness is a hallmark of high-quality cognitive modelling work.

However, I had the opposite feeling when it came to the many-analysts projects that I participated in. Deadlines were tight and inflexible, particularly given that it’s usually impossible to drop everything else to immediately dedicate a large amount of time to the many-analysts project. Analysis strategies seemed more based on novelty and originality than well-researched modelling principles, and robustness checks were largely non-existent. Even when robustness checks were performed by some researchers, they were usually overlooked in favour of the summary statistic that each group had to provide as their “complete” analysis. The questions asked of the analysts didn’t seem to be of key theoretical interest to anyone, and instead served mainly to assess the consistency amongst analysts. Reviewers, who are usually full of insights regarding even more robustness checks that can be performed (sometimes too much so), had mysteriously gone quiet. My many-analysts experiences didn’t feel like the more “slow and careful” science that I’m used to; they felt more like “rushed and reckless” science, with a hope that the sheer number of analysts would overcome the rushed nature of the projects. While I understand the motivation for this – these were huge projects with many teams, and performing the analyses was just the first step, meaning that allowing too much time could make the project drag on for half a decade, and allowing more time for some teams than others seems unfair – it still felt like the motivation was to get results quickly, even if it was at the expense of the quality of the analyses.

Furthermore, while I feel that standard collaborative projects can often share the decision-making power over experimental design, analysis approach, and results interpretation amongst several researchers, my experience with many-analysts approaches is that they only seem to provide the illusion of joint collaborative effort, and in practice largely centralise the decision-making about the experimental design and results interpretation with the project leaders. When I perform a project by myself, I’m the one designing and creating the experiment, the one designing and creating the analyses, and the one interpreting the findings; importantly, all of these aspects integrate together with a certain level of synergy. When I perform a project with formal collaborators, we usually work as a team on each of these aspects of the project. Importantly, the experiment and analyses are designed to perfectly coexist, so that the theoretical question is answered in a satisfactory manner, and the method of answering it is appropriate. However, in my experience many-analysts projects have felt like the project leaders are merely outsourcing the analysis aspect of the project. Given my only-analyst experiences working in situations where I was uninvolved in the experimental design, I feel that this limited involvement on the part of the analysts is problematic. When I’m not involved in the experimental design, sometimes I have to make compromises in my analyses to accommodate limitations in the data. Sometimes I can’t answer the exact research question of interest because I don’t think it’s possible with the current data. Sometimes I just get plain confused and misunderstand a fundamental aspect of the data, and then need to re-think the analyses. Not being involved in the initial design simply makes the later aspects of the research process substantially more difficult. Obviously, I accept these things as being unavoidable parts of the scientific process in certain collaborative situations, but intentionally and unnecessarily creating these circumstances in all situations – as well as removing the power of overall results interpretation from the analysts and centralising it with the project leaders – by advocating for the general use of many-analysts approaches seems counterproductive.

One final thing that I feel greatly differs between my standard scientific experiences and my many-analysts experiences is the focus. Specifically, as discussed above, I feel like my standard experiences have focused on all aspects of the project: the experimental design (or, when looking to re-analyse previous data, finding a data set with an appropriate experimental design), the analysis approach, and the results interpretation. However, my many-analysts experiences have seemed to shift all the focus of the project to the “analysis approach” aspect of things. Experimental design and results interpretation are largely viewed as objective aspects of the project with a clear solution decided by the project leaders, while the analysis approach is viewed as highly uncertain, flexible, and in need of many analysts to solve. However, I feel like this uncertainty and flexibility is equally true of all aspects of a project. Isn’t there sufficient uncertainty and flexibility in experimental design to warrant many-experimenters, and isn’t there sufficient uncertainty and flexibility in potential patterns of results – particularly when there are many analysts, and therefore a potentially greater number of different results to make sense of – to warrant many-interpreters of the results and their overall conclusions? Or is this a case of “everyone is biased, except me” on the part of project leaders? Interestingly, I feel like these many levels of uncertainty and flexibility are naturally expressed within a standard scientific literature[3] – where each study involves a specific set of interconnected choices for experimental design, analysis approach, and results interpretation – but not within a many-analysts paper, at least in my experience.

A not-so-serious interpretation of the scientific literature

Here I discuss what seems to me like a more sensible alternative to the large-scale adoption of many-analysts approaches: evaluating the uncertainty and flexibility in all aspects of scientific studies across a scientific literature.

One intuitively weird point that I made above (it’s kind of weird even to me; it’s a conclusion that I came to while writing) is that my experiences make it seem as though the standard scientific literature does a better job of expressing the potential diversity in all aspects of a project than many-analysts projects do: the standard scientific literature provides different approaches for experimental design, analyses, and results interpretation, whereas many-analysts projects (in my experience) only allow for different analysis approaches. However, while I feel as though the scientific literature expresses this uncertainty and flexibility, researchers often wish to rely on a “conclusive study” for their inferences – such as the recent “decisive” many-labs study on ego-depletion – rather than a broader interpretation of the findings in the literature (or at least a series of studies). To me, this feels like an example of researchers being uncomfortable with uncertainty in science, and the desire for each paper to provide a clear and decisive conclusion that we can blindly trust.

I do not feel that this is the way science needs to be, though. I remember that at Psychonomics in Vancouver in 2017, I watched a talk by Stephan Lewandowsky about treating the academic literature as a “marketplace of ideas” (which I believe led to this paper). While I wouldn’t say that I agree with everything in the talk or the paper (though for those who know me, I think I rarely completely agree with anything), one idea strongly resonates with me: why do we feel the need to take published papers so seriously, and treat every piece of literature as conclusive fact? Given how we currently seem to treat the scientific literature, where a paper is viewed as conclusive fact by its proponents, and hot garbage by its critics, I can see why many-analysts approaches seem necessary; they might be the only way to try and find a conclusive result across all possible analysis approaches within a single paper. But why does this level of conclusiveness need to be achieved within a single paper? To me, the issue isn’t the lack of analytic conclusiveness in each paper, which many-analysts approaches attempt to solve; it’s the desire for conclusiveness in each paper, which scientists seem to relentlessly seek. Overall, I feel as though many-analysts approaches are an impressive feat of over-engineering to try and create certainty within a single, centralised paper, when a simpler and more efficient approach would be to take single papers – or even series of papers by the same authors – a little less seriously.


Final thoughts

Here I briefly reflect on my overall opinion after getting my thoughts out, both for myself and for any potential readers.

While I can see some potential utility in many-analysts projects, I do not think that they are the scientific messiah that some have claimed them to be. Specifically, in the context of my research field (cognitive modelling), I feel as though many-analysts approaches sacrifice the quality of the analysis of each individual for a larger number of analysts. While some may see this as a good sacrifice, and I can see contexts outside of my research field where this might be the case, I think that it is a bad sacrifice in the context of cognitive modelling, and that the large-scale adoption of many-analysts approaches would lead to a less robust literature. Personally, I would much prefer to base my conclusions on a range of papers with single, different analysts, rather than a single paper with many analysts, and I think that making interpretations based on “many-papers” is a substantially more robust way of drawing conclusions about the state of the scientific literature than a single paper with “many-analysts”.

Footnotes

1: Wagenmakers et al. (2022) use the term “multi-analyst” for these approaches. However, I use the term “many-analysts” throughout, as I think that it better reflects my experience in these types of projects, where the focus has been on recruiting a large number of analysts, rather than just having more than one analyst.

2: Technically there were three options, as researchers had to state the direction of any effects. However, there were very few instances of researchers selecting the wrong direction, so it was essentially a dichotomous response as to whether there was an effect or not.

3: While one could argue that the scientific literature commonly has issues, such as publication bias and p-hacking, other research practices have been suggested to try and counteract these issues that do not require many-analysts, such as registered reports.