Challenges in Deploying and Monitoring Machine Learning Systems

ICML Workshop - Messe Wien Exhibition & Congress Center, Vienna, Austria

Friday, July 17th, 2020


Until recently, Machine Learning was applied in industry mostly by consulting academics, by data scientists within larger companies, and by a small number of dedicated Machine Learning research labs within a few of the world’s most innovative tech companies. Over the last few years we have seen the dramatic rise of companies dedicated to providing Machine Learning software-as-a-service tools, with the aim of democratizing access to the benefits of Machine Learning. All these efforts have revealed major hurdles to ensuring the continual delivery of good performance from deployed Machine Learning systems. These hurdles range from challenges in MLOps, to fundamental problems with deploying certain algorithms, to the legal and ethical issues raised by letting algorithms make decisions for a business.

This workshop invites papers on the challenges of deploying and monitoring ML systems. It encourages submissions on:

subjects related to MLOps for deployed ML systems, such as

    • testing ML systems,

    • debugging ML systems,

    • monitoring ML systems,

    • debugging ML models,

    • deploying ML at scale;

subjects related to the ethics around deploying ML systems, such as

    • ensuring fairness, trust, and transparency of ML systems,

    • providing privacy and security in ML systems;

useful tools and programming languages for deploying ML systems;

specific challenges relating to

    • deploying reinforcement learning in ML systems

    • and performing continual learning and providing continual delivery in ML systems;

and finally, data challenges for deployed ML systems.
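To make the monitoring theme above concrete, the sketch below shows one minimal form such monitoring can take: a mean-shift alert on a live feature stream. It is purely illustrative (the function name, window sizes, and threshold are hypothetical, not tools proposed by the workshop).

```python
import math
import statistics

def drift_alert(reference, live, z_threshold=3.0):
    """Flag the live window if its mean drifts more than z_threshold
    standard errors away from the reference window's mean.
    Deliberately simple: a production monitor would also track
    variance, quantiles, missing-value rates, and label shift."""
    ref_mean = statistics.fmean(reference)
    ref_std = statistics.stdev(reference)
    standard_error = ref_std / math.sqrt(len(live))
    z_score = abs(statistics.fmean(live) - ref_mean) / standard_error
    return z_score > z_threshold

# A stable stream raises no alert; a shifted one does.
reference = [float(i % 10) for i in range(1000)]
print(drift_alert(reference, [float(i % 10) for i in range(100)]))        # False
print(drift_alert(reference, [float(i % 10) + 5.0 for i in range(100)]))  # True
```

Checks like this are typically the first line of defense before heavier debugging of the model itself.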


Makerere University and Google

"Deploying Machine Learning Models in a Developing World Context"

ABSTRACT: Successful deployment of ML models tends to result from a good fit between the technology and the context. In this talk I will focus on the African context, which is often treated as synonymous with a developing-world context, though I want to argue there is a difference. I will expound on the opportunities and challenges that this unique context provides, the assumptions made when deploying in such a context, and how well those assumptions fit. Another angle of the talk will be deployment aimed at societal good, which may differ from deployment in a production system. I will also draw insights from some projects I have been engaged in towards this end.


"Successful Data Science in Production Systems: It’s All About Assumptions"

ABSTRACT: We explore the art of identifying and verifying assumptions as we build and deploy data science algorithms into production systems. These assumptions can take many forms, from the typical “have we properly specified the objective function?” to the much thornier “does my partner in engineering understand what data I need audited?”. Attendees from outside industry will get a glimpse of the complications that arise when we fail to tend to assumptions in deploying data science in production systems; those on the inside will walk away with some practical tools to increase the chances of successful deployment from day one.


"System-wide Monitoring Architectures with Explanations"

ABSTRACT: I present a new architecture for detecting and explaining complex system failures. My contribution is a system-wide monitoring architecture, which is composed of introspective, overlapping committees of subsystems. Each subsystem is encapsulated in a "reasonableness" monitor, an adaptable framework that supplements local decisions with commonsense data and reasonableness rules. This framework is dynamic and introspective: it allows each subsystem to defend its decisions in different contexts--to the committees it participates in and to itself.
For reconciling system-wide errors, I developed a comprehensive architecture that I call "Anomaly Detection through Explanations" (ADE). The ADE architecture contributes an explanation synthesizer that produces an argument tree, which in turn can be traced and queried to determine the support of a decision, and to construct counterfactual explanations. I have applied this methodology to detect incorrect labels in semi-autonomous vehicle data, and to reconcile inconsistencies in simulated anomalous driving scenarios.
In conclusion, I discuss the difficulties in evaluating these types of monitoring systems. I argue that meaningful evaluation tasks should be dynamic: designing collaborative tasks (between a human and a machine) that require explanations for success.

Facebook AI and Inria

"Conservative Exploration in Bandits and Reinforcement Learning"

ABSTRACT: A major challenge in deploying machine learning algorithms for decision-making problems is the lack of guarantee for the performance of their resulting policies, especially those generated during the initial exploratory phase of these algorithms. Online decision-making algorithms, such as those in bandits and reinforcement learning (RL), learn a policy while interacting with the real system. Although these algorithms will eventually learn a good or an optimal policy, there is no guarantee for the performance of their intermediate policies, especially at the very beginning, when they perform a large amount of exploration. Thus, in order to increase their applicability, it is important to control their exploration and to make it more conservative.
To address this issue, we define a notion of safety that we refer to as safety w.r.t. a baseline. In this definition, a policy is considered safe if it performs at least as well as a baseline, which is usually the current strategy of the company. We formulate this notion of safety in bandits and RL and show how it can be integrated into these algorithms as a constraint that must be satisfied uniformly in time. We derive contextual linear bandit and RL algorithms that minimize their regret while ensuring that, at any given time, their expected sum of rewards remains above a fixed percentage of the expected sum of rewards of the baseline policy. This fixed percentage depends on the amount of risk that the manager of the system is willing to take. We prove regret bounds for our algorithms and show that the cost of satisfying the constraint (conservative exploration) can be controlled. Finally, we report experimental results to validate our theoretical analysis. We conclude the talk by discussing a few other constrained bandit formulations.
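As a rough sketch of the idea described in this abstract (not the speaker's actual algorithm; all names and parameters are hypothetical, and the baseline arm's mean reward is assumed known, as is common in conservative bandit setups), an epsilon-greedy bandit can be made conservative by falling back to the baseline arm whenever an exploratory pull could push realized rewards below the (1 - alpha) budget:

```python
import random

def conservative_bandit(arm_means, baseline_arm, alpha, horizon, seed=0):
    """Illustrative conservative epsilon-greedy on Bernoulli arms.
    An exploratory arm is only pulled when, even if that pull earns
    zero reward, cumulative reward stays above (1 - alpha) times the
    baseline arm's expected cumulative reward up to the current step."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms
    totals = [0.0] * n_arms
    cum_reward = 0.0
    for t in range(1, horizon + 1):
        # Candidate arm: explore uniformly with prob 0.1, else exploit
        # the empirically best arm (unpulled arms treated optimistically).
        if rng.random() < 0.1:
            candidate = rng.randrange(n_arms)
        else:
            estimates = [totals[i] / counts[i] if counts[i] else float("inf")
                         for i in range(n_arms)]
            candidate = estimates.index(max(estimates))
        # Conservative check: assume the worst case (reward 0) for this pull.
        budget_ok = cum_reward >= (1 - alpha) * t * arm_means[baseline_arm]
        arm = candidate if budget_ok else baseline_arm
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
        cum_reward += reward
    return cum_reward

total = conservative_bandit([0.5, 0.7], baseline_arm=0, alpha=0.1, horizon=2000)
```

Early on, the constraint forces mostly baseline pulls; as realized rewards accumulate a buffer above the budget, exploration is released gradually, which is the intuition behind the controlled cost of conservatism mentioned in the abstract.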

Startup on ML production pipelines

"Bridging the gap between research and production in machine learning"

ABSTRACT: Machine learning has found increasing use in the real world, and yet a framework for productionizing machine learning algorithms is lacking. This talk discusses how companies can bridge the gap between research and production in machine learning. It starts with the key differences between the research and production environments: data, goals, compute requirements, and evaluation metrics. It then breaks down the different phases of a machine learning production cycle, the infrastructure currently available for the process, and industry best practices.


  • Zhenwen Dai (Spotify): Model Selection for Production Systems

  • Erick Galinkin (Montreal AI Ethics Institute and Rapid7): Green Lighting ML

  • Camylle Lanteigne (MAIEI and McGill University): SECure: A Social and Environmental Certificate for AI Systems

  • Yuzhui Liu (Bloomberg): Deploy machine learning models serverlessly at scale

  • Alexander Lavin (Augustus Intelligence): ML lacks the formal processes and industry standards of other engineering disciplines.

  • Alexander Lavin (Augustus Intelligence): Approaches to AI ethics must consider second-order effects and downstream uses, but how?



Mind Foundry


Mind Foundry


University of Cambridge