During the panels, speakers answered several questions from the moderators and the audience. Below, we include a lightly edited transcript of the Q&A from all three panels.

Panel 1 Q&A: Diagnose

Moderator: Priyanka Nanayakkara

Panelists: Gilles Vandewiele, Michael Roberts, Odd Erik Gundersen


Priyanka Nanayakkara: Could each of you describe your journey into researching reproducibility in ML-based science and how you first became interested in this area?


Gilles Vandewiele: It was rather by coincidence. Reproducibility was never a predefined goal to look into; rather, it was in the context of a project where we found that the reported scores were nowhere near what we were achieving, and when we investigated more deeply, we found mistakes in the original research. From that point onwards, reproducibility has always sparked my interest.


Odd Erik Gundersen: I was attending a keynote about reproducibility at the national conference for case-based reasoning. Afterwards, a few researchers and I went up and talked to the speaker, and we immediately thought: we have to try to help our community—the case-based reasoning community—to do better. We looked into how to improve our processes. We presented it at the next year's conference, but no one was really interested, because everyone understood it would require more work from them to publish at ICCBR. We don't have enough time to do the research really well, so I thought I had to push elsewhere.


Michael Roberts: I started in mathematical modeling of images, and then started to move into using machine learning. But I really started to hate what existed in the field. There were so many issues with reproducibility. Code that I found just didn't work or didn't generalize.


I really got exposed to this when we started researching medical imaging for COVID. I'd never had that much exposure to so many code bases and papers discussing the same type of models. We have massive issues with reproducibility, just trying to get some of them to work on basic data.


And then we started to explore clinical data and where else machine learning was going to be used. It felt like we had to break the system to start to rebuild it again, before we could do any COVID modeling ourselves.


Priyanka Nanayakkara: What do you see as some of the primary challenges or barriers to avoiding common failure modes in reproducibility in ML-based science?


Odd Erik Gundersen: So machine learning is so complex, algorithms are complex, experiments are complex, and also computers are complex. All of this complexity means that it's really hard to find what is causing the results.


That is why it's good that we focus on ML-based science here, because we need to explain all of this complexity. It's not that easy to just run an experiment, look at the results, and be happy, because a software bug could lead you to a different result, and you would never know it. You expect that the software from someone else works, but it might not.


Michael Roberts: There are significant training gaps between groups. There's a complete wild west of techniques, and there's not one concrete way that every group follows. I think everyone kind of knows that that's wrong, and there's an acceptance in the community that something is strange.


The pressure to publish is huge and that's causing this problem of having spent six months on a project and then feeling that you need to turn a positive result out of it.


It could be better to spend two years designing something that works perfectly, but you have to have the belief of your research group and your PI and everyone around you, to let you have that freedom to work for so long on just one problem.


Gilles Vandewiele: From my side, especially within the medical domain, it is often hard to reproduce studies because the data is usually very sensitive, especially with GDPR and so on.


Code is often not made available either, and then you need to go based on the description within a paper. And then, if you obtain a result that is significantly different from the one in the paper, you're never sure whether you made a mistake in your own implementation or whether the original authors made the mistake.



Priyanka Nanayakkara: Gilles, there was one question in the chat for you. Isn't there a potential data leakage problem when extracting features before splitting into train and test sets? How did you avoid that?


Gilles Vandewiele: Indeed, there is potential data leakage when calculating features, but maybe before we go deeper into that, we should make a distinction between supervised and unsupervised features. If you're not using any label information to calculate your feature—as a very trivial example, taking the second power of a certain number—you can safely reapply these functions to both your training and your testing data.


On the other hand, an oversampling procedure uses label information. To give another example, there's a technique called target encoding, in which you process categorical information by replacing each category with the mean or the mode of the label for that category. That's where label leakage can occur.


So when you're calculating these types of features, it's very important to make sure that you never use label information from your test set. A good rule of thumb is to always think about how this model would work in production: if a new data sample comes in, can you calculate all these features without knowing the label of that sample?
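
To make this concrete, here is a minimal sketch of target encoding done the leakage-free way, using made-up toy data: the per-category label means are computed on the training split only and then applied, frozen, to the test split, mirroring how the encoder would behave in production.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 'city' is a categorical feature, 'label' is the target.
df = pd.DataFrame({
    "city":  ["A", "A", "B", "B", "B", "C", "C", "A", "B", "C"],
    "label": [1,   0,   1,   1,   0,   0,   1,   0,   1,   0],
})

train, test = train_test_split(df, test_size=0.3, random_state=42)
train, test = train.copy(), test.copy()

# Fit the encoding on the training labels only.
encoding = train.groupby("city")["label"].mean()
global_mean = train["label"].mean()

# Apply the frozen mapping to both splits; unseen categories fall back to the
# global training mean, just as they would for a new sample in production.
train["city_enc"] = train["city"].map(encoding)
test["city_enc"] = test["city"].map(encoding).fillna(global_mean)

print(test[["city", "label", "city_enc"]])
```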



Priyanka Nanayakkara: Okay, we also have a specific question for Odd Erik. The participant says there is some evidence that science is getting harder to read, and they cite a 2017 paper by Ball in Nature. They ask: do you see this growing inaccessibility of the narrative to be an impediment to reproducibility and, if so, how would you go about fixing this?


Odd Erik Gundersen: Our research is becoming more and more complex. But we cannot explain everything. And the more complex our research is, the harder it is to explain.


I totally agree that this is a problem, and it means we need to be as open and transparent as possible. Sharing open code and data happens more and more, and I think this is the right way to go, because we are not able to capture everything in the document itself.



Priyanka Nanayakkara: Yeah, that transitions really well into one of the audience questions. One participant is curious to hear the presenters' thoughts on setting up a dedicated agency or office of transparency, like NIH, DARPA, or NHS.


Michael Roberts: I definitely think there should be some checks and balances and a reward structure for people that do really good research but get negative results, because that doesn't exist at the moment—and also a way of acknowledging that some research is published and gets great acclaim but is not reproducible.


I don't know if the answer is to set up independent agencies. I can only work at a base level and say that my PhD students have to do stuff in a certain way, and we enforce that they develop models in a careful, reproducible way and test every small component. As we develop models, checklists are used.


Journals enforcing checklists when you publish is incredibly important, because you'll start to realize that if you write your paper, submit it, and are then asked for a checklist you haven't followed, you have to redo so much of your research.


So I wonder whether it's more that at a lot of levels we need to enforce good practices, and we kind of fix it at both ends of the pipeline.


Gilles Vandewiele: Indeed, I think responsibility for reproducibility can be placed in many places. First of all, the authors are quite responsible, but the journals themselves are definitely responsible as well.


And I can understand that having this performed by a third party could potentially increase trust, but on the other hand, the public themselves can check how much effort a journal, for instance, puts into these reproducibility aspects, so it's hard to say that a third-party agency would really be needed.


It would be ideal to have as many efforts as possible, but I think the first effort should be taken by the journals.


Odd Erik Gundersen: Yeah, so I think first of all it's important to understand that science progresses through papers and errors, so I don't think we can or should strive towards removing all false results in publications. But we can certainly be better—according to a 2005 paper, 50 to 70% of all research findings are false, and that is too much, of course—so we have to do something.


I think the problem here is that we as researchers have different incentives. We need to publish. But, of course, there are other actors: funders could require open data and code. Journals could also make these requirements. We can also hire people based on how good their research is: are they following good practices, are they using good methodology, not only whether they are publishing at the top venues.


Science also has this publication bias where more positive results are accepted than negative results. So if you have negative results, it's really hard to publish. And even if you reproduce something, the journal will say you're not doing new research. I think it is possible to close this feedback loop, now that we have the Internet and everything.



Priyanka Nanayakkara: Great. Okay, so I think we have time for just one last question. One person asks: Should the number of trial runs also be published as part of ML studies? Could you discuss what you think would be important to include in studies so that reproducibility efforts can be made possible? Another participant asked: When you analyze published papers regarding reproducibility or leakage, do you contact the authors first asking for code and data?


Gilles Vandewiele: So in my case, I did contact each of the authors, but never got a reply after many weeks, so that's why I decided to implement it myself.


And regarding the number of trial runs, it's definitely a fact that you can overfit on your cross-validation results as well. If you throw enough stuff at the wall, something will stick. So it would be ideal if no test labels were provided in the data at all, so that it becomes more difficult to overfit on the specific cross-validation set you are using. But I think it would already be a good step forward to report the number of trial runs, that is, how many things you've tried, in a paper.
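
As a rough illustration of the "throw enough stuff at the wall" effect, the sketch below scores many random classifiers against the same held-out labels even though there is no signal at all; the best score found still creeps upward as the number of trials grows. The data and trial counts are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Held-out labels with no real signal: any apparent skill is a selection effect.
y_val = rng.integers(0, 2, size=100)

best = 0.0
checkpoints = {1, 10, 100, 1000}
for trial in range(1, 1001):
    # Each "trial" is a model that guesses at random on the same validation set.
    preds = rng.integers(0, 2, size=y_val.size)
    best = max(best, (preds == y_val).mean())
    if trial in checkpoints:
        print(f"best validation accuracy after {trial:4d} trials: {best:.2f}")
```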


Odd Erik Gundersen: All the top AI conferences require you to say how many runs you did and provide the central tendency of the result, as well as the variation, because the results are always stochastic. For example, the random seed can drastically affect results.
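
A minimal sketch of that kind of reporting, on synthetic data with an arbitrary model choice: train the same model under several seeds and report the mean and spread rather than a single run.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real task.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Repeat training under different random seeds and collect test accuracy.
scores = []
for seed in range(10):
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(X_tr, y_tr)
    scores.append(clf.score(X_te, y_te))

# Report central tendency and variation over runs, not one cherry-picked run.
print(f"accuracy over {len(scores)} seeds: "
      f"{np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```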


Michael Roberts: I think that's all correct. The idea that you don't say how many experiments you ran, or how many times you repeated the analysis just feels like you're hiding something from the audience in a way.


But, again, there is absolutely no motivation for you doing it unless you're forced to do it. If I did two runs and one of them is great and one of them's rubbish, why would I report the rubbish one if I'm trying to get into the best journal?


Odd Erik Gundersen: I just wanted to add that this publication bias could be solved by accepting registered reports, where you describe the type of experiment you want to run, and a journal accepts it beforehand, regardless of whether it ends up having positive or negative results.


Michael Roberts: Yeah, and by preregistering your study, when you publish you say exactly what study number you had, so that people can check that you didn't change your conclusions or methodology because you were tempted to go in a particular direction.

Panel 2 Q&A: Fix

Moderator: Sayash Kapoor

Panelists: Michael Lones, Inioluwa Deborah Raji, Momin M. Malik, Marta Serra-Garcia


Sayash Kapoor: So, first of all I want to thank you all so much for such a wide range of topics that we discussed, and we also got to go quite a bit into the depth of these topics so thanks a lot for all of your presentations.


My first question is directed towards Marta and Michael. Both of you have looked at how hype can lead to issues for reproducibility. What is the hardest part of changing scientific practices?


Michael Lones: What a good question. I think it's making people aware that they're not doing things the right way. I think a big part of the issue is that people don't necessarily have much education in machine learning before they start practicing machine learning.


And things that often seem like good ways of approaching a problem have pitfalls that they're just not aware of. And there's this question of how you can make people aware of these things, and I don't really know what the answer is. I just put this guide together and make it available in the hope that people read it, but whether that happens or not, who knows.


Marta Serra-Garcia: I agree with that. There are also incentives, and that's how we get new and interesting ideas.


There is an incentive, in order to get a paper published, to say that my result is more right than it actually is. During the review process, it may help us as reviewers to say that this is a great research idea and that we don't need to see 100 versions of a test, because that can create additional pressure on the researcher.


Sometimes we forget that this is a learning process: we do want to publish an idea and want to learn, and others will come after and add to it, correct it, or find limitations.


Michael Lones: I was just gonna say that I think another problem is that in many applied areas people don't get good feedback from reviewers and editors, because the reviewers and editors don't understand machine learning. And I see in the fields I work in that very good journals publish pretty bad machine learning papers.


Momin Malik: I think, talking to people with some training in econometrics or social statistics, that just explaining machine learning as kind of an instrumental use of correlations goes a long way. Why do we even need data splitting? Why don't we do data splitting in econometrics or social statistics generally? Maybe we should, but it's because if you're building a theory-based model, the ways in which things can go wrong are very different. Overfitting is seldom a concern. If you're doing stepwise regression, which probably nobody should do now that we have the lasso, then it's a concern.


But the concerns are very different, so just explaining how correlations can go wrong, why we do what we do, why you should avoid overfitting to the test set, think about data leakage—this has been somewhat effective, but that's just my anecdotal experience.


Deborah Raji: Yeah, I think that education is really important, and that's definitely an important intervention. I also think you can twist people's arm a little bit more—set some restrictions or boundaries of what we allow and don't allow, whether that's through the policies of different conferences or guidance provided to reviewers.


Michael's paper and other work point to best practices, and we can enforce that as the standard required for publication, either in machine learning venues or other venues where people are beginning to use machine learning as a method. So I do think that we can definitely leverage some mechanisms we have for enforcement of other research integrity issues to also apply to reproducibility issues.


Another option that's been presented is funders setting requirements around publishing your data or publishing your code. That is another opportunity to guide or push people a little bit more forcefully to comply with some of these best practices. So yeah, I think there are lots of opportunities for these kinds of things.


Sayash Kapoor: Cool. Actually, that leads us very well into our next question which was targeted to Deb, and then I think Momin can also take a moment to answer it because you also went into this.


So the question is: Deb, you mentioned that we don't know what the best model across all settings is. But as data distributions and collection practices differ across contexts, do you think we should strive for generalizability, or do you think we should just be more precise about the context in which our model is appropriate?


Deborah Raji: So I think for deployment of the current state-of-the-art machine learning models, I definitely prefer a more specific articulation of when the model is appropriate to deploy. So just being clear and communicating the data sets and the scenarios for which we have validated performance. And then every scenario outside of those scenarios is effectively an unknown, and being very honest about that I think is a very critical step forward. That's why a lot of my interest has been in documentation strategies: just communicating to different stakeholders the limits of where a system can be applied.


I understand the desire of the machine learning community to extrapolate. I'm also very curious what Momin's response will be to this question. But yeah, I totally agree with you that there is a very strong interest in generalization in the machine learning community. And that's because you want sort of this all-encompassing, all-in-one machine learning model that you can deploy everywhere.


The challenge of course being that that's not the practical case. So for me, it's just honest reporting on evaluation and where you have validated results, where the results are unknown, and which populations you validated results on. Communicating that through a model card or other sort of transparency mechanism for me is sort of the approach that makes the most sense.


In our paper we talked about how, if you want to be able to claim broader general applicability, then you need to evaluate the model in a wider range of scenarios, report those results, and still state that there are limitations beyond that. But if you want to apply it to more than one case, then you have to evaluate it on more than one case to demonstrate the generalizability there, and we don't even do that.


So even starting there would be one way. I also wanted to add a final thing, which is that there is this trend in machine learning (we have another paper called "AI and the Everything in the Whole Wide World Benchmark" where we talk about this) where data benchmarks have a history of being anchored to very specific tasks and applications. But in the machine learning community there are these higher-level intellectual objectives that people want the model to achieve, like visual understanding or natural language processing, so they develop these generalized cognitive-ability benchmarks. Things like GLUE or ImageNet are examples of this, but they're not actually tied to any practical task. So what performance on GLUE or ImageNet means with respect to some downstream task, we don't actually know.


But the prevalence and elevation of these general benchmarks, these "everything in the whole wide world benchmarks" that try to unreasonably fit the whole world into a benchmark, has also distorted our understanding of what it means to do well on downstream tasks. So I do think ignoring the way that data is localized, and the way that these benchmarks represent very specific contexts, has already led to this delusion of generalizability that we see in the machine learning space. I'm against the use of data benchmarks as the approach for that.


Sayash Kapoor: Amazing, thank you. Anything to add Momin?


Momin Malik: I also would say I'm much more in favor of limiting our claims and scope. In my dissertation, which was very much in computational social science, I concluded that Twitter doesn't generalize. If we study Twitter, the only thing it's valid to use Twitter to study is what people do on Twitter.


We can try and link that externally, but that's so fraught and so noisy that I'm not sure it's ever worth doing.


There still are valuable and interesting things on Twitter, but our goal should almost only be to study it anthropologically, as a one-off. It tells us maybe about the variability of human behavior and experience, not the signal of a central tendency.


I think we should also think about whether a particular estimation procedure, rather than the specific estimate or the specific trained, fitted model, might generalize. And that's much more realistic, to think that we should allow retraining. It's not very satisfying that different genes might be selected into a model in two populations in different states, but we should be okay with that. We should be okay with what's selected not being stable, necessarily, as long as it works effectively under robust out-of-sample testing for the relevant population. That's why I like the Cardoso et al. example: we can discover things about how the model does generalize, and we can do audit testing on some groups.


But yeah, definitely limiting our claims, giving up on generalizability, and giving up on universality; that's a much larger theme that I try to argue for around science.


Sayash Kapoor: Amazing, thank you. Actually, someone from the audience just posted what was going to be my next question to the entire panel. So the audience question is: How difficult is it to point out such issues, such as failures, in published papers? In my experience, researchers are people who potentially do not like being told when they're wrong. Do you think that such needed investigative papers are difficult to publish, or, even worse, that such papers can tarnish one's reputation?


Marta Serra-Garcia: It seems to me that it really differs by field, so I think psychology, after the replication crisis, became really open and there's lots of interest and very well-published work criticizing prior work.


My experience in economics is that that's more rare and people take it perhaps a little bit more personally, but maybe that culture will change. Maybe in other fields it's a bit more open and I hope it will be, I think that would be helpful for all of us.


Deborah Raji: You know, I think people take it personally in every field.


I think reanalysis is definitely an objective. It seems like it was more common in the biological sciences, where they went through this open data movement in the last couple decades. Quantitative political science spaces will also share data and try to reanalyze things.


But yeah, people always take it personally.


There are RCTs that have since been debunked. I don't know if anyone is familiar with the "worm wars," where an RCT said that deworming pills led to all kinds of benefits, but when the authors actually published their data and other researchers reexamined it, they found that the results fell far short of what was initially reported.


But then I read a book that was published like two years ago, and the first example of a successful RCT that they gave was this deworming RCT, which has since been debunked by multiple studies. So I feel that there are a lot of challenges with getting people to actually take in the critique, but also even more challenges with making sure that people understand that it's not personal and it's really just in service of the science.


In a lot of my audit work as well, I find that when you confront a company saying their system doesn't work, they are also very defensive.


I think that we have strategies to mitigate this with corporations. In information security, if there's a vulnerability, there is a strict communication protocol where researchers tell the company but also have an obligation to the public. So they tell the company: we found a vulnerability, you have 30 or 90 days to respond, and then at that point we're going to release it to the public. But if you have a response within that period, we're going to publish your response with our results, so the public understands that you're addressing the issue. And I think something like that might minimize corporate hostility towards the results. We did that for the Gender Shades audit and it worked well.


I mentioned bug bounties as a possible solution, and the idea of incentivizing people to reexamine some major studies and major findings by just paying them. There's also a Nature article where they talked about classes of students participating in reanalysis studies as part of their education. The reproducibility challenge at NeurIPS was very student-heavy in terms of participation, where, as part of their learning process, students reimplement an algorithm. Someone also mentioned the idea of a reproducibility journal, or having academic publishing incentives around reproducibility studies and reanalysis studies. At least a workshop where people can demonstrate these results would be nice.


Sayash Kapoor: That's actually super interesting, because in our next session Dr. Jake Hofman is speaking, and he's written a paper on what he calls "data analysis replications," where they work with undergrads to uncover these problems. So yeah, that's a super interesting set of points. Michael, do you want to go next?


Michael Lones: Yeah, it can definitely be a problem getting critical work published, particularly if you're working in a very niche area. I've only had one paper where I'm pretty sure it took too critical a view and got rejected because of that.


I guess preprints help get the information out there, even if it's not formally published, and I think it's worth trying because it's important to get the idea across to people.


Sayash Kapoor: Okay, Momin, any thoughts?


Momin Malik: There is another quote, I think, from that same Phil Agre paper that when you come from a technical field, structural analysis can seem like a personal insult.


And he means something like sociological analysis of power structures, not methodological critique, but I think a similar principle applies: we should do a better job of building that in. One of my academic heroes and mentors is Cosma Shalizi. For those of you who don't know, he's famous for kind of calling bullshit.


He has a paper debunking the claims of the Christakis and Fowler Framingham Heart Study contagion study, and work rejecting the idea that power-law distributions are being found everywhere, because people are using kind of hundred-year-old, outdated stats methods.


I asked him about this: why don't you write more of these critiques? And he said there's only so far you can go in your career describing negative things. So, even though I love that work, that's the advice I got.


Sayash Kapoor: Well, this is fascinating advice. It's also interesting that everyone is having the same experiences. It just shows you how difficult doing some of the critical work is. So it's great we are all here together today.


OK, the next question is for Deb again, but I think we can have like a lot of perspectives from all of the speakers as well.


And this question was asked when you were talking about reproducibility as an integrity issue.


So the question is: Framing reproducibility as an integrity issue is a fascinating intervention. It could be very effective, but how do we make sure that we don't incorrectly accuse researchers of malfeasance in cases where reproducibility failures are due to oversight? And could you speak to how it has played out so far?


Deborah Raji: Yeah, so I kind of mentioned it in the context of the NeurIPS ethics review process. We decided to integrate the checklist as something for the technical reviewers to flag if they thought that a shortcoming could compromise the integrity of the results. And then that could be sent to an ethics reviewer or the PC, depending on the nature of the problem. What we found was that a lot of the cases that were flagged ended up not being insurmountable, so it ended up being a sort of feedback session for the authors.


Within the review timeframe, a lot of the authors were able to fix the issue, and I don't think we ended up rejecting any paper because of reproducibility issues. It was just another way to flag for authors that this was a serious ethical consideration on par with the data ethics and broader impacts considerations. It would have been a really interesting question for the organizing committee to decide whether or not to reject papers that fell really short of the reproducibility best practice standards. That’s a conversation that the community is having right now.


But in my experience so far, it is actually not that much work to meet those standards, especially when the issues are pointed out to you, so they could be addressed within the time frame and didn't practically interfere with authors publishing.


Sayash Kapoor: Thank you. That's so fascinating. Any other thoughts from the panelists?


If not, we can move to our next question: What do the speakers think of MINIMAR and other reporting guidelines for research? For context, MINIMAR is a reporting guideline for medical AI.


So I can also give some context on this. A lot of checklists have been published in the last few years across domains, but especially coming from the medical sciences. So, if you have experience in your research dealing with any of them, as well as with model cards, which Deb has knowledge of: what do you think constitutes a good checklist?


Michael Lones: I think checklists are a good idea, but I do worry about whether they're understandable to everyone who is going to use them. Because in a lot of fields, as I keep saying, people don't have much experience, and these checklists tend to be written in quite a technical way.


And I think there's certainly scope for helping people more to understand them, perhaps by producing documents that go into more detail and guide them through the process.


Deborah Raji: Yeah, I think I would agree. For the NeurIPS checklist, collaborators in other domains, and even people submitting to the applications track from other domains, struggle to fully understand what is being discussed in these checklists. There is quite a lot of vocabulary, and the terms aren't necessarily clear across domains. So I think having domain-specific checklists is always really helpful, just to make sure that everyone using the checklist understands the vocabulary and is doing the same thing in response to the requirements.


I will say that even something like model cards is really hard to get people to implement. I know that it was a challenge even initiating that effort at Google. Sometimes you have engineers that are just not used to doing certain things when it comes to evaluation, even if it's a very simple set of questions. This points back to Momin's talk about the need for cultural change.


The checklist kind of pushes you towards behaviors that you might not be used to doing. It's not something you can do like two hours before submission. It requires effort and redesigning how you evaluate these systems. Ideally, checklists actually shift practices in meaningful ways. Getting people to react to checklists in the right way is a big challenge.


Momin Malik: I agree with that completely. I’m working on a checklist internally with Mayo right now as well.


Part of my frustration is that the things that I care about I often don't see reflected in some of the proposed existing documentation. As the most direct example, fundamentally, I’m interested in what is being correlated with what. Like if I know that word frequency is being correlated with ICD10 codes, that would help me enormously.


The larger point is discussed in a nice sociological work, Arthur Stinchcombe's "When Formality Works." Computer scientists are familiar with APIs, so let me use that as an example. The idea of an API is that the user doesn't have to understand how something connects to the underlying thing itself, how the abstraction works. They can just learn the abstraction and use it.


And that's how checklists work. But also, Amy Winecoff has a comment in the chat that there's a certain elitism to that. And Stinchcombe talks about how you need people who can go behind the abstraction to repair the inevitable ways in which it will go wrong.


I don't know how to democratize that. I don't know how to give people more of an understanding of the underlying complexity that we're trying to abstract. I think that's the much more reliable way by which people can deal with all the ways in which things go wrong.


Until then, checklists are a way to proceed. Sayash, I'm reading some of the things that you put together about how checklists improve performance, and that's convincing. Fundamentally, I'd love to see this more democratized, but I don't think there's a human solution for the need for abstraction, at least at the scale of our civilization, or for all the flaws that any abstraction will inevitably bring, whether it's an API, a statistical model, a legal framework, or whatever it is.


Marta Serra-Garcia: I want to add that every checklist will have its drawbacks and its weaknesses. We have the same with preregistrations, and there are different approaches. Making it easy, making it accessible, is key. It doesn't have to be complicated and extremely standardized to initiate this process that Deb was talking about. Even when we get started on a project, thinking about how we are doing what we're doing, I think that's already a big step forward.



Additional audience questions


Can you say more about the types of problems where you think better auditing tools (and regulation tied to audits) are particularly valuable vs problems where auditing is potentially a distraction from questions of whether or not algorithmic prediction should even be used?


Momin Malik: This is a good question and something that I am working on! The framework of "prediction policy problems" (Kleinberg et al., 2015) was a start to this, but I found that unsatisfactory, since they list credit scores as an example of a prediction policy problem when in fact this is something that has led to huge injustices (Caley Horan's dissertation, recently published as the book Insurance Age, talks about this, but so do many others). I think we can come up with better formal frameworks, but there need to be places for input and control (e.g., with real veto ability) from affected marginalized groups, who will be able to answer the "should" question better than almost anybody working on ML systems. So for that, we should focus on setting up things like Community Review Boards.


You mentioned that audit research focusing on hiring discrimination misses challenging their basic premises. But you also talked about how audits are required and are one of the solutions to fix existing issues. How do you reconcile the two? Shouldn’t the primary purpose of audits be to first figure out if indeed there is a concern, which can then be addressed?


Momin Malik: Yes and no. We can reason from first principles, and consult with marginalized members of affected communities, for whether something—even if it doesn't result in measurable disparities—is still based on an unjust premise. Since things based on unjust premises (e.g., if and when it is valid to use things over which people bear no responsibility for governance decisions that benefit or punish them) can go wrong in the future even if they are fine now, audits are neither necessary nor sufficient to identify problems.


To what extent can structures like tidymodels in R and pipelines in scikit-learn provide guardrails to prevent problems in ML?


Momin Malik: To some extent. Where there are software tools to make standard best practices easier to implement, it will be that much easier for people to do things, especially when they don't know what best practices are or why they are best. But when we formalize things, we take away some measure of interaction with people reasoning through what is happening and making decisions relevant to the specific case. And there are some things that no pipeline or software tools will address. So while we can and should make progress here, the problem won't be totally solved through approaches like this alone.
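
As one illustration of the kind of guardrail being described here: bundling preprocessing and model into a single scikit-learn Pipeline means the preprocessing is re-fit on the training portion of every cross-validation fold, so one common form of leakage is prevented by construction. This is a minimal sketch on a standard toy dataset with an arbitrary model choice, not a recommendation of any particular estimator.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The scaler travels with the estimator as one object, so it is fit inside
# each training fold and never sees the held-out data.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
scores = cross_val_score(model, X, y, cv=5)
print(f"5-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```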

As a quick example, I talked about how we should do data splitting in ways that minimize dependencies across training and test sets. The problem is that there is no general way to estimate these dependencies. We only observe the data once, and an n=1 is not enough to estimate a totally general variance-covariance matrix between observations (note: not between variables/features). We have to use domain knowledge and reasoning to decide what independence assumptions are justified, such that we can estimate some covariance. E.g., for a time series, we can check the ACF and PACF to decide what AR or MA order is sufficient for splitting the data by. For a social network, we can use a graph coloring approach. Those parts can be automated or standardized, but the choice of which approach to take cannot be automated, and must be reasoned through by a modeler/analyst.
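
To make the automatable part concrete, here is a rough sketch under simple assumptions (a synthetic AR(1) series and an ad hoc autocorrelation threshold): the analyst reasons about dependence via the ACF and chooses a gap, and the mechanical splitting with that gap is then handled by standard tooling.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(0)

# Synthetic AR(1) series standing in for real temporally dependent data.
n = 500
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.7 * y[t - 1] + rng.normal()

# Reasoned-through part: inspect the ACF and pick a lag beyond which
# dependence looks negligible (here, the first lag with |autocorr| < 0.1).
rho = acf(y, nlags=50)
gap = int(np.argmax(np.abs(rho[1:]) < 0.1)) + 1
print(f"gap chosen from the ACF: {gap} observations")

# Automatable part: chronological splits that leave that gap between the
# training and test indices, so the two sets are approximately independent.
X = y[:-1].reshape(-1, 1)          # features: y[t-1]
target = y[1:]                     # target:   y[t]
for train_idx, test_idx in TimeSeriesSplit(n_splits=3, gap=gap).split(X, target):
    print(f"train ends at index {train_idx[-1]}, test starts at {test_idx[0]}")
```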


Is it the case that the cited papers (the non-replicable papers) use proprietary information and hence cannot be made public, but present an approach/technique that is widely applicable and hence results in increased citations?


Marta Serra-Garcia: That’s a great question. It is not the case. All papers were based on laboratory experiments. The instructions and explanation for implementation of the experiments were always available, both for papers that replicated and those which did not. When analyzing how non-replicable papers were cited, we also observed that they were mainly cited for their findings (not their methods) – though we did not measure the exact frequency. Hence, the difference in citations is unlikely to stem from an approach that is broadly applicable (and more prevalent in non-replicable papers).

Panel 3 Q&A: Future paths

Moderator: Arvind Narayanan

Speakers: Jake Hofman, Jessica Hullman, Brandon Stewart


Arvind Narayanan: We've talked about a lot of problems throughout this workshop. That's important, of course, but, considering that this is the final session, perhaps we can strike a more optimistic note, thinking about the future.


Could each of you give a vision, not necessarily a prediction, for the future of machine learning in which things are a lot more rigorous than they are today?



Jake Hofman: Thinking about Jessica and Brandon's talks, I think they both offer a really great vision for what this could look like.


Brandon's talk points to the idea that we're all trying to learn from data and that we need to be explicit about the goals that we have in doing so, separate from the methods that we go about in achieving those goals.


Those methods do look really different, especially between disciplines like social sciences and computational fields like machine learning, as Jessica pointed out.


We should be as rigorous with our evaluation of reforms for doing these sciences as we are with the science itself, as Jessica pointed out.



Jessica Hullman: We need more explicit integration of human domain knowledge, which I think is not just new methods but a whole enterprise in itself.


There's a mismatch between what we thought we were doing and what is actually happening or what we thought was being learned / how it was being learned and what's actually happening.


A first step is understanding human expectations about prediction and learning and where things go wrong.


We need to understand where expertise or hunches are useful versus not, because we shouldn't expect that just putting any human in the loop is always going to lead to performance benefits.


Some of Brandon's points were great: we need to really understand what our goal is and what we're claiming in order to later figure out whether it went wrong or not. We're still at a stage where we can't even tell, because we haven't actually made a lot of things explicit.



Brandon Stewart: I really love the other presentations. One of the themes is that we need to re-center the expectation that it is just really hard to learn things about the world. It's hard to be explicit about what we're trying to do and it's hard to do it well.


As Jessica said in her presentation, making policy around this is really hard. It's easy to hope these things would be better, but saying what you would do to get there is just profoundly difficult. As many people have raised, it is ultimately going to need some change in the incentive structure, because it's just too costly to do some of these things if not everyone's playing by the same rules.


Psychology has done a really good job of restructuring the incentive set really rapidly.


But what I would love to see, both in the social sciences and on the ML side, is an environment where people are replicating the findings that are most important, and where having your work be replicated is like a gold star on your CV that you're excited about, because it meant that someone thought it was an important enough finding to spend the time to go acquire extra evidence.


And if you're incentivized toward other people replicating your findings and finding that they hold up, then that incentivizes you to do the kind of work that's going to hold up in other contexts. Unfortunately, I think the issue is that right now that incentive structure isn't quite there. But possibly going forward.



Arvind Narayanan: Jessica, when you were talking about getting away from dichotomous thinking where a claim is either true or false ... I was wondering if you can talk about what some of the downstream effects are going to have to be. Because when I read papers, and I assume this is true of many of us, we're still looking for claims to be either true or false. So how should we change our behavior as consumers of research? And as teachers in the classroom? Again we have this bad habit of presenting most things in a binary way.



Jessica Hullman: One simple thing is never try to interpret the results of a single paper without looking at the broader context.


Often we end up in situations where there's a lot of unquantified uncertainty in previous work that we just ignore when we write papers. When we read papers that conflict slightly with what someone else found, we just won't worry about it and just cite this paper and forget about that one. We need to stop ourselves when we come across conflicting findings.


I like the idea of multiverse analysis or just much more sensitivity analysis, but there's so much complexity already in ML that I'm not sure how this would be implemented. It makes it harder to write a paper... suddenly in order to write a paper that has a clear claim you really have to understand what you're dealing with, and so I think sensitivity analysis would be wonderful.


One of the hardest parts of science reform is that we cannot encode so much information to memory when we read a paper. So we really need to get better at crafting accurate and uncertainty-sensitive claims in our papers, so that uncertainty is baked into the message.



Brandon Stewart: Part of what goes with dichotomous thinking is universalistic thinking. We should be saying when we would expect our theories to break, and that would ideally be normalized when you present a theory.


Your claims don't really mean anything until you say: this is the context in which I think it won't work anymore. Similarly, when we present a new method and say it's great, we should be showing the point at which it breaks, because that helps us understand it better. I think it also drives home the message to the reader that this isn't a universal thing but a contextual thing, even if it's a context that is super important to understand.



Jessica Hullman: There's an interesting paradox where in order to make a methodological improvement, sometimes we have to act as though there's a universal method and we just need to fix it, but we cannot ever really believe that.



Jake Hofman: In the social sciences and to some extent for machine learning, we seem to want stylized facts to pop out, and we don't want the answer to be "well, it depends", or "the effect will be present under these conditions, but not these other conditions". That's a very hard habit to break ourselves out of, but seems very important, because it seems less and less likely that we're going to achieve universal stylized claims.




Arvind Narayanan: Let's jump to the audience questions. A question for Brandon: In the prediction based on verbal autopsies, would it make sense to select features that are commonly identified at home, or would that be adding information from the testing data into the final model?



Brandon Stewart: One of the results that's incredibly robust from the verbal autopsy literature (in which, I'll be completely clear, I am not an expert) is that the best predictor of accuracy isn't the algorithm you use to do the prediction, but the sites that you train on. So if you train on data from, say, Mexico, that matters a whole lot more than anything else, including which features people are particularly good at answering questions about, as well as the kinds of diseases that are present in that particular circumstance.


We need both an understanding of the psychology of what people are going to answer well and an understanding of what diseases are common in that particular area. The challenge, of course, is that they're trying to standardize the features they're asking about across the whole world, which is a very difficult thing and precludes the possibility of using contextually specific features.




Arvind Narayanan: The next question is for Jake and Jessica, from Momin, who says: While I didn't mention it in my talk, I'm a fan of predictability and I've used that in my work. And I'm glad that I gave some details to set up the talks. I'm curious whether this is purely an issue of theory and study design, or whether specific models could pursue integrated modeling.


Jake Hofman: It relates back to Brandon's point about cats versus people. If you are in a cat situation, and you know the data are present and that features are present, then sure, you can start to do these things. But there are often situations where we're in more of the Fragile Families type of scenario where we're not sure that we even have measured the things that might be relevant, in which case it's hard to imagine any method that can discover these things.


There's really nice work on limits to prediction by Annie Liang and Jon Kleinberg [and Sendhil Mullainathan]. They show theoretical gaps, where we know we have more to learn because, even with the features we have, we can achieve higher performance with machine learning models than with theoretical models. And so that's one case where you think: okay, maybe you have the features you need, but there's still some functional form or relationship you need to learn.


But in a situation where you don't have those features and you don't even necessarily know what they are, I think that's a really interesting and puzzling situation. I wouldn't expect those specific tools to be able to help there necessarily.


Jessica Hullman: I was also going to mention this study by Fudenberg, Liang, and Kleinberg [and Mullainathan]. Sometimes it's useful to think about predictability as part of the environment that you're trying to predict in. They think about how humans predict random sequences or produce random sequences. They do exhaustive data collection so that they can understand how predictable the sequence of 1s and 0s produced by the average person is. There's some underlying randomness in this data, and machine learning can help us understand how much inherent variability there is.


That's less about the theory and more just about understanding the environment. But we do need theory and study design, and if we have a good method it's only because of that.




Arvind Narayanan: The next question is for Jake. Dr. Lones mentioned earlier the difficulty of grokking machine learning for newcomers. One reason is likely the use of complex math. What body of math knowledge do you expect undergrads to have to be able to run data replication studies by themselves?


Jake Hofman: Brandon made the point that there's not the incentive structure to do replication, at least for tenure track faculty. But there are many, many students who are looking to learn methods and statistics and machine learning and we train them in classes.


So we asked the question of whether they could actually do replications that are meaningful. The incentives are there in the sense that if it's a class project and students are assigned to do it, they'll do it.


But they'll also learn things along the way, and so it can be beneficial for the research community and for the students.


We've done this a few times, with pretty fundamental training in probability and statistics and a bit of regression modeling: linear regression, logistic regression.


Students have been able to run these replications, and actually one of the replications that the students did was on the Fryer paper. With just a couple of introductory classes, they were able to find some interesting details that were glossed over methodologically in the paper.


You can certainly pick out examples where maybe you need to know very advanced software packages or mathematics, but I think there is a huge swath of papers with simpler methods where students can run these analyses.


They're much more accessible as well, compared to psychology replications where we have to go out and collect new data. We're talking about re-running the analysis on existing data.




Arvind Narayanan: Brandon, the measurement paper by Jacobs and Wallach came up in the chat. Could you elaborate on the relationship between your work and that paper?


Brandon Stewart: Everyone should read this great paper thinking about the connections between the measurement literature and the fairness literature, published at FAccT 2021. It self-consciously imports some of the ideas from the social science literature. They draw heavily on early work in text analysis in political science by Kevin Quinn, Burt Monroe, and colleagues, who are in turn importing the validity framework that comes out of the psychometrics literature of the 60s and 70s.


The connection between the two is that it's all about being very precise about what you are hoping to get from the data, being explicit about the goal. Validity and measurement rely on the idea that there's something we're trying to capture. Our work clearly rhymes with theirs in the sense that it comes from the same methodological tradition.


What's distinct is that they're thinking primarily about measurement and what it could say about fairness. We're actually taking the measurement for granted, not because we're enthralled by the present state of measurement but because there's enough problems even if you say all the measures are good. There's a whole other layer which is how we choose to measure the things that go into our models and how that interacts with what we're trying to estimate.



Arvind Narayanan: The last question is for everyone, but it's inspired by Jake's four quadrants. While it's clearly true that there's not nearly enough work on integrated modeling, it also seems to me that there's not enough descriptive work. If that is true, why is that, how do you encourage that kind of work, and why is it important?


By descriptive work I mean just collecting datasets, describing them, and describing what's going on in the world, rather than making an inferential claim about some population. Think of a biologist describing the appearance of a new species of bird that they found.


Jessica Hullman: People want to be clever and it's less obvious how clever you are when you're doing something descriptive. We want to test hypotheses or come up with something new.


But it's interesting how much work at the intersection of HCI and AI is getting more descriptive, even if it's not quantitative. And even some of the work on limitations of machine learning: we see researchers just going in and looking at how engineers are working. It's encouraging that some pockets of the community are trying to remove their need to make some clever solution and just describe what's going on.


Jake Hofman: Even a biologist who says "I discovered a new species of bird" is making a clever statement, as opposed to just "I went out and measured the distribution of wing sizes for birds, and I have a new instrument for doing so." That's less clever-sounding and less interesting-sounding, but very important.


So the drive to explain things in a very broad way works against more minor sounding things that are still foundational and very important.


Brandon Stewart: I completely agree. If you describe something interesting, people are going to hunger for an explanation of what is happening.


The Journal of Quantitative Description was started by Kevin Munger, Andy Guess, and Eszter Hargittai. Talking to Andy a year or so in, it's actually really, really hard to tell when a thing is purely descriptive, because you're still baking in some theoretical understanding of the world: why you're choosing to describe the things you're trying to describe, how you're choosing to operationalize those properties, and how you're giving an account of what's happening.


If you think of the study about diffusion dynamics that Jake mentioned, it was about how things spread through social media. It was very descriptive in terms of, here are some patterns, but it also had an account of the way the process unfolded, which is in some sense a causal account, the idea being that the thing makes its way to a broadcaster, who then pushes it out to lots of people, as opposed to being viral in the sense that we tend to think about.


No observation is theory free and that's the fundamental challenge.



Chat question: Does the choice of an imputation strategy subtly change the proposed/claimed estimand?


Brandon Stewart: This is the interesting question: does the procedure someone uses change the goal? So when the goal is stated it doesn’t, but when we are just guessing the goal (estimand) we might infer it is the estimand where a thing works well. When we don't impute for example, we might be interested only in the observed people, but we might also be assuming that the unobserved people are like the observed people (conditional on imputation variables).