Participant responses on reproducibility

We believe perspectives from across disciplines are essential for understanding best practices for ML reproducibility. To that end, we asked participants two questions at the end of the workshop. We share their responses here.


What are the main challenges facing ML reproducibility in your field?


"There are few applications and robust datasets for smaller scientific areas like oceans data science. Data is often messy, stakeholders need information and training about ML, and there are many misconceptions about what ML can and can't do as well as how much work it is to do it properly. On top of this, stakeholders are often unsure of their goals and the tradeoffs of decisions."

—Chris Whidden, Assistant Professor, Computer Science - Algorithm Engineering and ML for Oceans


"There is a lack of established standards and validation methods for the pathology AI models that are being published in the field. Domain knowledge inputs and expertise considered are minimal and non-intuitive. Additionally, the commercial "solutions" provided by vendors lack the transparency and access for assessing the real utility of the proposed solution. It seems like a money grab rather than being a true advancement in patient diagnostics."

—Rama Gullapalli, Assistant Professor, Pathology


"Little training in ML theory and methods, difficulty of accessibility to data, shortage of guidelines for standardized coding practice, lack of recurrent meetings of interdisciplinary groups to discuss any of these challenges and follow them up."

—Hugo Corona Hernández, PhD student in Linguistics at UMCG, The Netherlands


"Incentivization structure (at all levels: publications, jobs, grants, etc.) and desire to do "novel" things, versus doing fewer things, even if not as novel, to a higher degree of quality allowing to dig into the detail of what works and developing deep understanding. These systemic issues go beyond ML reproducibility, but I think has led to some of the issues here, as well as cultural aspects (it's just code, do it until it "works" / produces the desired results, etc.)."

—Taylor Johnson, Associate Professor of Computer Science, Vanderbilt University


"Data sharing, code sharing (to a lesser extent), poor practice in training and validation. Data/code sharing are really basic issues but we're still not getting even this right. I don't see how more important issues will be resolved without getting data/code sharing fixed."

—Kira Mourao, Machine Learning Scientist in Biotech


"Lack of knowledge on ML, big and complex data and sometimes human data (so GDPR issues in Europe)"

—Anonymous, Bioimaging


"The initial issue with ML reproducibility is the no availability of the code to reproduce results, a big majority of cases this is not available at all. Then the data sharing of training and test sets, sometimes the training is shared but not the test set used."

—Anonymous


"The variation between datasets from measurement (such as genomics or transcriptomics data) and observable traits (such as medical trait or chemical substance) that are varies per samples. We may use statistics to tackle the issue, but sometimes the samples amount (since genomics or transcriptomics quite costly) is insufficient to get significant result."

—Anonymous, Bioinformatics


"Small sample sizes, data leakage, little practical utility"

—Anonymous, Computational social science


What is a possible fix that you would like to see implemented?


"One technique I didn't see mentioned in the workshop today was human-in-the-loop training. One key technique to help generate robust results, teach stakeholders and get stakeholder buy-in is to provide the ability to check and correct results (ideally which are then fed back in as training examples). Given reproducibility challenges, I suggest we try to move away from full automation where possible and encourage direct checking of results (or a sample of results) beyond looking at simple metrics."

—Chris Whidden, Assistant Professor, Computer Science - Algorithm Engineering and ML for Oceans
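
The correction loop described in this response can be illustrated with a small example. The sketch below is a minimal illustration assuming a scikit-learn-style classifier: low-confidence predictions are flagged for review, and the corrected labels are appended to the training set before retraining. The `ask_reviewer` function is a hypothetical placeholder for whatever review interface a project uses, not part of any participant's tooling.

```python
# Minimal sketch of a human-in-the-loop correction loop (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

def ask_reviewer(example):
    """Placeholder: route the example to a domain expert and return their label."""
    raise NotImplementedError

def correction_loop(X_train, y_train, X_new, threshold=0.8, rounds=3):
    model = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        model.fit(X_train, y_train)
        # Flag predictions whose top-class probability falls below the threshold.
        confidence = model.predict_proba(X_new).max(axis=1)
        flagged = np.where(confidence < threshold)[0]
        if len(flagged) == 0:
            break
        # A reviewer checks the uncertain predictions; corrections become training data.
        corrected = np.array([ask_reviewer(X_new[i]) for i in flagged])
        X_train = np.vstack([X_train, X_new[flagged]])
        y_train = np.concatenate([y_train, corrected])
        X_new = np.delete(X_new, flagged, axis=0)
    return model
```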


"I think there needs to be development of genuine standards for reproducibility based on established statistical methods and techniques. Or at least, in internally consistent framework thereof. Guidelines for developing gold standards against which AI performance can be assessed yet remain to be developed. A key consideration is the lack of domain expertise in AI validations and assessments. Computer scientists are guilty of ignoring domain expert knowledge in developing AI models. This needs to change. "

—Rama Gullapalli, Assistant Professor, Pathology


"Creation of an international/inter-institutional/interdisciplinary consortium on ML and reproducibility"

—Hugo Corona Hernández, PhD student in Linguistics at UMCG, The Netherlands


"Education and training are probably the most impactful solutions. As possible field fixes, repeatability and artifact evaluation to provide peer review of code/data/etc could be tried beyond increased documentation/checklists: this has become the norm in several CS areas, and while it has a substantial overhead and its own downsides, is probably the best existing approach, as it is a form of peer review of the computational artifacts (data, code, etc.). One has to be very careful with any fixes, particularly if they become mandatory, so as not to yield unintended consequences and punitive decisions (thus disincentivizing fixes), but rather to incentivize (e.g., awards, funding, new paper/publication types, competitions/challenges, etc.) improvements beyond it just being the right thing to do and to increase confidence of results, as the current structure is not incentivizing these improvements."

—Taylor Johnson, Associate Professor of Computer Science, Vanderbilt University


"I liked the model information sheets but I think policing or auditing in some way is also necessary and that requires resources. However, surely we are the point where at least basic policing could be automated? e.g. a system which scans a paper and checks it can locate the relevant data for each experiment or training run in a public repository, and that code exists in a public repository and (maybe) can be run. Combined with a model information sheet might be able to check some modelling aspects e.g. train/test/validation data splits?"

—Kira Mourao, Machine Learning Scientist in Biotech
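
A basic version of the automated check suggested in this response could be a script that scans a manuscript for repository links and verifies that each one resolves. The sketch below is a minimal illustration using only the Python standard library; the URL pattern and the `audit` function are assumptions, and it does not attempt the harder steps of running code or validating data splits.

```python
# Minimal sketch of an automated availability check (illustrative only):
# scan a manuscript's text for repository links and verify each URL resolves.
import re
import urllib.request

REPO_PATTERN = re.compile(
    r"https?://(?:github\.com|gitlab\.com|zenodo\.org|osf\.io)/[\w./-]+"
)

def find_repo_links(manuscript_text):
    """Return the code/data repository links mentioned in the paper text."""
    return sorted(set(REPO_PATTERN.findall(manuscript_text)))

def link_resolves(url, timeout=10):
    """True if the URL answers an HTTP HEAD request with a non-error status."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except Exception:
        return False

def audit(manuscript_text):
    """Report which repository links in the manuscript are reachable."""
    links = find_repo_links(manuscript_text)
    if not links:
        print("No code or data repository links found.")
    report = {url: link_resolves(url) for url in links}
    for url, ok in report.items():
        print(("OK   " if ok else "FAIL ") + url)
    return report
```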


"Have a checklist such as proposed for ML studies that people would use before starting their study, more dedicated ML training for specific topics"

—Anonymous, Bioimaging


"Availability of all the information, following a specific check list on "how to share ML models". Code, splits and sets, pre-processing steps, ML algorithms, etc."

—Anonymous
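
One way to make such a checklist concrete is to ship it as a machine-readable manifest alongside the released model. The sketch below is a hypothetical example of what that could look like; the field names are illustrative assumptions, not an established standard.

```python
# Hypothetical machine-readable "how to share ML models" checklist (illustrative only).
from dataclasses import dataclass, field

@dataclass
class SharingManifest:
    code_repository: str      # public URL of the training/evaluation code
    training_data: str        # where the training set can be obtained
    test_data: str            # where the held-out test set can be obtained
    split_definition: str     # file or script that fixes the train/test/validation split
    preprocessing_steps: list = field(default_factory=list)  # ordered preprocessing applied before modelling
    algorithm: str = ""       # model class and key hyperparameters
    random_seeds: list = field(default_factory=list)          # seeds needed to rerun training

    def missing_items(self):
        """Return the checklist fields that are still empty."""
        return [name for name, value in vars(self).items() if not value]
```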


"I still believe the data source should be confirmed before it becomes input for any ML pipeline, so that I think we must concentrate on data first before we do statistical analysis for the datasets."

—Anonymous, Bioinformatics


"Practical implementations of ML models."

—Anonymous, Computational social science