How do we develop machine learning models and systems that take fairness, accuracy, explainability, robustness, and privacy into account? How do we operationalize models in production and address their governance, management, and monitoring? Model validation, monitoring, and governance are essential for building trust in and adoption of AI systems in high-stakes domains such as hiring, lending, and healthcare. In this talk, we will highlight the challenges various stakeholders face when operationalizing AI/ML models, and emphasize the need to adopt responsible AI practices not only during model validation but also post-deployment as part of model monitoring. Please refer to our FAccT'22 tutorial for a detailed overview of techniques and tools for monitoring deployed ML models, industry case studies, key takeaways, and open challenges.
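One post-deployment monitoring check of the kind this talk is concerned with can be sketched in a few lines. The sketch below computes a population stability index (PSI) between a reference score distribution and live production scores; the 0.2 alert threshold, the synthetic data, and the function name are illustrative assumptions, not the tooling covered in the tutorial.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference (training-time) score distribution
    and a live (production) score distribution."""
    # Bin edges come from the reference distribution.
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid log(0) / division by zero with a small epsilon.
    eps = 1e-6
    expected_pct = np.clip(expected_pct, eps, None)
    actual_pct = np.clip(actual_pct, eps, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Hypothetical scores: reference from validation, live from production traffic.
rng = np.random.default_rng(0)
reference_scores = rng.beta(2, 5, size=10_000)
live_scores = rng.beta(2.5, 4.5, size=10_000)   # slightly shifted population

psi = population_stability_index(reference_scores, live_scores)
# A commonly used (but application-dependent) rule of thumb: PSI > 0.2 flags drift.
if psi > 0.2:
    print(f"ALERT: score drift detected (PSI={psi:.3f}); trigger model review")
else:
    print(f"OK: PSI={psi:.3f}")
```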
Common practice in Responsible AI is to develop metrics that represent the concerns we want to better understand, such as fairness, safety, and privacy. These metrics are critical: they hold us accountable, help us find gaps in our product development, and drive meaningful, measurable change. Even so, our metrics have limitations, especially in the ways we define who we are evaluating and improving for. Through a set of short case studies, Tulsee will highlight some of the nuances of metric development for fairness evaluation, show how the ways we define demographics can critically change our understanding, and emphasize the importance of clarity and accountability when discussing these goals.
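The point that demographic definitions can change what a fairness metric reports can be made concrete with a minimal sketch. The data, group labels, and metric below are hypothetical: the same predictions are scored with a demographic parity gap under a coarse grouping and under a finer-grained grouping of the same individuals, and the measured disparity differs substantially.

```python
import pandas as pd

# Hypothetical evaluation data: model decisions plus two alternative
# demographic codings of the same individuals (coarse vs. fine-grained).
df = pd.DataFrame({
    "approved":     [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1],
    "group_coarse": ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"],
    "group_fine":   ["A1", "A1", "A2", "A2", "A2", "A2", "B1", "B1", "B1", "B2", "B2", "B2"],
})

def demographic_parity_gap(data, group_col, outcome_col="approved"):
    """Largest difference in positive-outcome rate across the groups in group_col."""
    rates = data.groupby(group_col)[outcome_col].mean()
    return float(rates.max() - rates.min()), rates

gap_coarse, rates_coarse = demographic_parity_gap(df, "group_coarse")
gap_fine, rates_fine = demographic_parity_gap(df, "group_fine")

print("Coarse grouping rates:\n", rates_coarse, "\ngap =", round(gap_coarse, 3))
print("Fine grouping rates:\n", rates_fine, "\ngap =", round(gap_fine, 3))
# The same predictions can look acceptable under one demographic definition
# and reveal a much larger disparity under another.
```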
This talk will explore how existing algorithmic fairness approaches are often ill-suited to capturing fairness-related harms arising from language technologies, suggesting the need for language-focused measurement approaches to support NLP practitioners. At the same time, the talk will illustrate persistent challenges in developing benchmark datasets, one such emerging measurement approach, and call for FATE research that addresses measurement appropriate to language-related harms and, more broadly, NLP practitioners' needs in their ethical work.
Many industries have centered their business development on AI-based innovation. However, trust in AI output is crucial for the broad adoption of AI systems. To ensure reliability, industry practice relies on testing and debugging of applications. In this talk, we discuss unsolved problems in the automated testing and debugging of AI models. Specifically, we emphasize incorporating user-driven specifications to create realistic test data and mapping mispredictions back to the data.
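A minimal sketch of the two ideas the talk emphasizes, generating test data from a user-driven specification and mapping mispredictions back to the data, is shown below. The feature ranges, the oracle rule, and the deliberately flawed model are all hypothetical stand-ins, not the methods presented in the talk.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical user-driven specification: realistic value ranges for each
# feature of a loan-scoring model (names and ranges are illustrative only).
SPEC = {
    "income":      (20_000, 150_000),
    "loan_amount": (1_000, 50_000),
    "age":         (18, 80),
}

def generate_tests(spec, n=1_000):
    """Sample test inputs uniformly within the user-specified ranges."""
    return {f: rng.uniform(lo, hi, size=n) for f, (lo, hi) in spec.items()}

def oracle(tests):
    """Expected behaviour derived from the spec: approve when the loan is
    at most 30% of income (a stand-in for a user-provided rule)."""
    return (tests["loan_amount"] <= 0.3 * tests["income"]).astype(int)

def model_under_test(tests):
    """Placeholder for the deployed model: a deliberately flawed rule
    that rejects all young applicants regardless of income."""
    pred = (tests["loan_amount"] <= 0.3 * tests["income"]).astype(int)
    pred[tests["age"] < 25] = 0
    return pred

tests = generate_tests(SPEC)
expected, predicted = oracle(tests), model_under_test(tests)
mispredicted = expected != predicted

# Map mispredictions back to the data: summarize the failing slice per feature.
print(f"Misprediction rate on spec-generated tests: {mispredicted.mean():.1%}")
for feature, values in tests.items():
    bad = values[mispredicted]
    if len(bad):
        print(f"  failing {feature} range: [{bad.min():.0f}, {bad.max():.0f}]")
```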