The Reproducibility Crisis in ML‑based Science

July 28, 2022

10 AM–4:30 PM ET


The use of machine learning (ML) methods for prediction and forecasting has become widespread across the quantitative sciences. However, a reproducibility crisis is brewing: we found 20 reviews across 17 scientific fields that identify errors in a total of 329 papers in ML-based science.

Hosted by the Center for Statistics and Machine Learning at Princeton University, our online workshop aimed to highlight the scale and scope of the crisis, identify root causes of the observed reproducibility failures, and make progress towards solutions.

We have made the workshop materials public: the talks and slides below, and the annotated reading list.

Talks and slides

Background on the workshop and the crisis

Arvind Narayanan, Princeton University (7 minutes)

Leakage and the reproducibility crisis in ML-based science

Sayash Kapoor, Princeton University (7 minutes)

Overly optimistic prediction results on imbalanced data

Gilles Vandewiele, Ghent University (20 minutes)

Is the ML reproducibility crisis a natural consequence?

Michael Roberts, University of Cambridge (20 minutes)

Towards a definition of reproducibility

Odd Erik Gundersen, NTNU (20 minutes)

How to avoid machine learning pitfalls: a guide for academic researchers

Michael Lones, Heriot-Watt University (20 minutes)

Consequences of reproducibility issues in ML research and practice

Inioluwa Deborah Raji, University of California, Berkeley (20 minutes)

When (and why) we shouldn't expect reproducibility in ML-based science

Momin M. Malik, Mayo Clinic (20 minutes)

The replication crisis in social science: does science self-correct?

Marta Serra-Garcia, University of California San Diego (20 minutes)

Integrating explanation and prediction in ML-based science

Jake Hofman, Microsoft Research (20 minutes)

The worst of both worlds: a comparative analysis of errors in learning from data in psychology and machine learning

Jessica Hullman, Northwestern University (20 minutes)

What is your estimand? Implications for prediction and machine learning

Brandon Stewart, Princeton University (20 minutes)

Reading list and interactive session

In addition to the public session on July 28th, we prepared further material for participants interested in going deeper into reproducibility:

  • Annotated reading list: We prepared a reading list of relevant research on reproducibility from the last few years. Most of these papers were presented by speakers at the workshop. The list is meant as a companion resource for participants who want to explore the topic further.

  • Tutorial and interactive session on July 29th, 3–4:30 PM ET: In a recent preprint, we (Kapoor and Narayanan) introduced model info sheets for improving reproducibility by detecting and preventing leakage. In our testing so far, users have been able to detect leakage in models they previously built by filling out model info sheets.

On the day after the workshop (July 29th, 3–4:30 PM ET), we gave a brief tutorial on how model info sheets can help you prevent leakage in your own research, and then hosted an interactive session.


Sayash Kapoor | Ph.D. candidate, Princeton University

Priyanka Nanayakkara | Ph.D. candidate, Northwestern University

Kenny Peng | Incoming Ph.D. student, Cornell University

Hien Pham | Undergraduate student, Princeton University

Arvind Narayanan | Professor of Computer Science, Princeton University

Questions? Contact