Research Projects

Collaboration Challenges in Building Production Machine Learning (ML) Systems

Introduction

From the interviews, we identified three collaboration points that seemed particularly challenging:

(1) Identifying and decomposing requirements,

(2) negotiating training data quality and quantity, and

(3) integrating data science and software engineering work.

We found that organizational structure, team composition, power dynamics, and responsibilities differ substantially, but we also found common organizational patterns at specific collaboration points and challenges associated with them.


Our observations suggest four themes that would benefit from more attention when building ML-enabled systems.

(1) Communication: Invest in helping interdisciplinary teams work together (including education and avoiding silos).

(2) Documentation: Pay more attention to collaboration points and clearly document responsibilities and interfaces.

(3) Engineering: Consider engineering work a key contribution to the project.

(4) Process: Invest more in process and planning.


Methods

To better understand collaboration challenges and avenues toward better practices, we conducted interviews with 45 participants contributing to the development of ML-enabled systems for production use. Participants come from 28 organizations, ranging from small startups to big tech companies, and hold diverse roles in these projects, including data scientist, software engineer, and manager. During our interviews, we explored organizational structures, interactions among project members with different technical backgrounds, and where conflicts arise between teams.


Results: Collaboration Point - Requirements and Planning

In an idealized top-down process, one would first solicit product requirements and then plan and design the product by dividing work into components (ML and non-ML), deriving each component's requirements and specifications from the product requirements. This process involves multiple collaboration points: (1) the product team needs to negotiate product requirements with clients and other stakeholders; (2) the product team needs to plan and design the product decomposition, negotiating with component teams the requirements for individual components; and (3) the product's project manager needs to plan and manage the work across teams in terms of budgeting, effort estimation, milestones, and work assignments. Few organizations, if any, follow such an idealized top-down process, and it may not even be desirable. We see differences in the order in which teams identify product and model requirements:

Model-first trajectory -- the team focuses on building the model first and builds a product around it later;

Product-first trajectory -- models are built later to support an existing product; and

Parallel trajectory -- no clear temporal order, model and product teams work in parallel.


Product and Model Requirements

We found a constant tension between product and model requirements in our interviews.

- Product requirements require input from the model team. It is difficult to elicit product requirements without a good understanding of ML capabilities, which almost always requires involving the model team and performing some initial modeling during requirements elicitation. Regardless of whether product or model requirements are elicited first, data scientists often mentioned that stakeholders hold unrealistic expectations of model capabilities.

- Model development without clear model requirements is common. Participants from model teams frequently explained that they are expected to work independently but are given only sparse model requirements. They try to infer the intentions behind those requirements, but are constrained by a limited understanding of the product that the model will eventually support.

- Provided model requirements rarely go beyond accuracy and data security. Requirements given to model teams primarily relate to some notion of accuracy. Beyond accuracy, requirements for data security and privacy are common, typically imposed by the data owner or by legal constraints. Even when prompted, very few of our interviewees reported considering fairness or explainability at either the product or the model level.


Project Planning

- ML uncertainty makes effort estimation difficult. Irrespective of trajectory, 19 participants mentioned that the uncertainties associated with ML components make it difficult to estimate the development timeline for those components and, by extension, for the product. Model development is typically seen as a science-like activity, in which iterative experimentation and exploration are needed to identify whether and how a problem can be solved, rather than as an engineering activity that follows a somewhat predictable process.


Results: Collaboration Point - Training Data

Data is essential for machine learning, but disagreements and frustrations around training data were the most common collaboration challenges mentioned in our interviews. In most organizations, the team that is responsible for building the model is not the team that collects, owns, and understands the data, making data a key collaboration point between teams in ML-enabled systems development. We observed three patterns around data that influence collaboration challenges from the perspective of the model team:

Provided data -- the product team has the responsibility of providing data to the model team;

External data -- the model team relies on external data, either using publicly available resources or hiring a third party to collect or label data; and

In-house data -- product, model, and data teams are all part of the same organization and the model team relies on internal data from that organization.


Negotiating Data Quality and Quantity

- Provided and public data is often inadequate. In organizations where data is provided by the product team, the model team commonly states that it is difficult to get sufficient data. The data they receive is often of low quality, requiring significant investment in data cleaning. When the model team uses public data sources, its members likewise have little influence over data quality and quantity and report significant effort spent cleaning low-quality, noisy data.

- Data understanding and access to domain experts is a bottleneck. Existing data documentation (e.g., data item definitions, semantics, schema) is almost never sufficient for model teams to understand the data. In the absence of clear documentation, data understanding and debugging often involve members from different teams and thus cause challenges at this collaboration point.

- Ambiguity when hiring a data team. When the model team hires an external data team to collect or label data, the model team has much more negotiation power in setting data quality and quantity expectations. Our interviews did not surface the same frustrations as with provided and public data; instead, participants from these organizations reported vague communication and hidden assumptions as key challenges at this collaboration point.

- Need to handle evolving data. In most projects, models need to be regularly retrained with more data or adapted to changes in the environment (e.g., data drift), which is a challenge for many model teams. When product teams provide the data, they often have a static view and deliver only a single snapshot rather than preparing for updates, and model teams with their limited negotiation power have a difficult time fostering a more dynamic mindset. Conversely, if data is provided continuously, model teams struggle to ensure consistency over time: data sources can suddenly change without announcement, surprising model teams that make but do not check assumptions about the data (see the validation sketch after this list).

- In-house priorities and security concerns often obstruct data access. In in-house projects, we frequently heard about the product or model team struggling to work with another team within the same organization that owns the data. Security and privacy concerns can limit access, especially when the data is owned by a team in a different part of the organization, causing frustration, lengthy negotiations, and sometimes expensive data-handling restrictions for model teams.
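
To make the point about unchecked data assumptions concrete: a lightweight validation step at the data hand-off can surface surprises before retraining. The following is only a minimal sketch, assuming tabular data delivered as a CSV; the column names, dtypes, and thresholds are illustrative placeholders, not from our study.

    # Lightweight checks of the model team's assumptions about provided data,
    # run on every new snapshot. Schema and bounds are illustrative examples.
    import pandas as pd

    EXPECTED_SCHEMA = {"user_id": "int64", "age": "int64", "label": "object"}

    def check_data_assumptions(df: pd.DataFrame) -> list:
        """Return a list of violated assumptions instead of failing silently later."""
        problems = []
        for column, dtype in EXPECTED_SCHEMA.items():
            if column not in df.columns:
                problems.append(f"missing column: {column}")
            elif str(df[column].dtype) != dtype:
                problems.append(f"unexpected dtype for {column}: {df[column].dtype}")
        if "age" in df.columns and pd.api.types.is_numeric_dtype(df["age"]):
            if not df["age"].between(0, 120).all():
                problems.append("age values outside the expected range [0, 120]")
        if df.isna().mean().max() > 0.10:  # more than 10% missing in some column
            problems.append("excessive missing values in at least one column")
        return problems

    # Example: run the checks on a freshly delivered snapshot (hypothetical file).
    snapshot = pd.read_csv("snapshot.csv")
    for problem in check_data_assumptions(snapshot):
        print("DATA ASSUMPTION VIOLATED:", problem)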


Results: Collaboration Point - Product-Model Integration

To build an ML-enabled system, both ML components and traditional non-ML components need to be integrated and deployed, requiring data scientists and software engineers to work together, typically across multiple teams. We found many conflicts at this collaboration point, stemming from unclear processes and responsibilities as well as from differing practices and expectations. We saw large differences among organizations in how engineering responsibilities were assigned and found the following patterns:

Shared model code -- the model team is responsible only for model development, while the product team takes responsibility for deployment and operation of the model;

Model as API -- the model team is responsible for developing and deploying the model (a minimal sketch of this pattern follows the list); and

All-in-one -- a single team shares all responsibilities.
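
To make the Model as API pattern concrete, here is a minimal illustrative sketch, not an implementation from any studied organization. It assumes a pickled scikit-learn-style model with a predict method and a hypothetical /predict endpoint served with Flask; the product team consumes only this HTTP interface.

    # "Model as API" sketch: the model team owns training, packaging, and serving;
    # the product team calls a stable, versioned HTTP endpoint.
    import pickle
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # The model team trains and evaluates the model in its own pipeline;
    # the serving code only loads the packaged artifact at startup.
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json(force=True)  # e.g. {"features": [1.0, 2.0, 3.0]}
        prediction = model.predict([payload["features"]])[0]
        # Assumes a numeric prediction; the version field makes the interface explicit.
        return jsonify({"prediction": float(prediction), "model_version": "1.0.0"})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)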


Responsibility and Culture Clashes

While interdisciplinary collaboration is already challenging, we observed many conflicts between data science and software engineering culture, made worse by unclear responsibilities and boundaries.

- Team responsibilities often do not match capabilities and preferences. When the model team's responsibilities require substantial engineering work, we observed dissatisfaction among members assigned undesired responsibilities. Data scientists prefer engineering support over doing everything themselves, but often find it hard to convince management to hire engineers. In contrast, when deployment is the responsibility of software engineers in the product team or of dedicated engineers in all-in-one teams, some of those engineers report problems integrating the models, due to insufficient knowledge of the model's context or domain and to model code not being packaged well for deployment. In several organizations, we heard about software engineers performing ML tasks without sufficient ML understanding.

- Siloing data scientists fosters integration problems. We observed data scientists often working in isolation--known as siloing--in all types of organizational structures, even within single small teams and within engineering-focused teams. In such settings, data scientists work from weak requirements without understanding the larger context and seriously engage with others only during integration, when problems surface.

- Technical jargon challenges communication. Participants frequently described communication issues arising from the differing terminology used by members with different backgrounds, leading to ambiguity, misunderstandings, and inconsistent assumptions. These challenges arise more frequently between teams, but they occur even within a single all-in-one team whose members have different backgrounds.

- Code quality, documentation, and versioning expectations differ widely and cause conflicts. Many participants reported conflicts around development practices between data scientists and software engineers that surface during integration and deployment. Participants reported poor practices that can also be observed in traditional software projects, but software engineers in particular expressed frustration that data scientists do not follow the same development practices or hold the same quality standards when writing code.


Quality Assurance for Model and Product

During development and integration, questions of responsibility for quality assurance frequently arise, often requiring coordination and collaboration between multiple teams.

- Model adequacy goals are difficult to establish. Offline accuracy evaluation is almost always performed by the model team that builds the model, yet that team often has difficulty deciding locally when the model is good enough. Model team members often receive little guidance on adequacy criteria and are unsure about the actual distribution of production data.

- Limited confidence without transparent model evaluation. Participants in several organizations reported that model teams do not prioritize model evaluation and have no systematic evaluation strategy (especially when they have no established adequacy criteria to meet), performing occasional "ad-hoc inspections" instead.

- Unclear responsibilities for system testing. Teams often struggle with testing the entire product, integrating ML and non-ML components. Model teams frequently stated explicitly that they assume no responsibility for product quality (including integration testing and testing in production), are not involved in planning for system testing, and consider their responsibilities to end with delivering a model evaluated for accuracy.

- Planning for online testing and monitoring is rare. Due to possible training-serving skew and data drift, the literature emphasizes the need for online evaluation, which usually requires coordination among the teams responsible for the product, the model, and operations. We observed that most organizations do not perform monitoring or online testing: it is considered difficult, and standard processes, automation, and even test awareness are lacking (a minimal drift-monitoring sketch follows this list).
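
As an illustration of the kind of monitoring that was rarely in place, the sketch below flags distributional drift for a single numeric feature. It assumes a reference sample saved at training time; the statistical test, window sizes, and threshold are illustrative choices, not recommendations from the study.

    # Minimal online drift check for one feature, one building block of monitoring.
    import numpy as np
    from scipy.stats import ks_2samp

    def feature_drift_alert(reference, production_window, p_threshold=0.05):
        """Flag drift when the production window differs distributionally from the
        training-time reference sample (two-sample Kolmogorov-Smirnov test)."""
        statistic, p_value = ks_2samp(reference, production_window)
        return p_value < p_threshold

    # Synthetic example: the shifted production window should trigger an alert.
    rng = np.random.default_rng(0)
    reference = rng.normal(loc=0.0, scale=1.0, size=5000)  # sample saved at training time
    recent = rng.normal(loc=0.8, scale=1.0, size=1000)     # recent production values
    print(feature_drift_alert(reference, recent))          # True: drift detected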


The detailed results can be found in this paper. The paper received an ACM SIGSOFT Distinguished Paper Award at ICSE 2022. Find the conference talk here.

Mining Machine Learning Production Systems in Open Source

As Machine Learning (ML) receives massive attention for impressive advances, in some applications surpassing human-level performance, analyzing ML repositories has become important in many contexts, such as evaluating maintainability and identifying common practices around ML components. However, mining open-source repositories to find ML production systems is a challenge in itself, because there are few reliable indicators that distinguish production systems from ML tools or toy projects. Our goal in this research project is therefore to define a set of such indicators and to report a dataset of ML production systems.
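
To illustrate the idea, the sketch below scores a locally cloned repository on a few coarse example signals of "production-ness" (ML library dependencies, deployment configuration, tests, a license). These example indicators and the threshold are placeholders for illustration only, not the indicator set this project aims to define.

    # Score cloned repositories on illustrative production indicators.
    from pathlib import Path

    ML_LIBRARIES = ("tensorflow", "torch", "scikit-learn", "xgboost")
    DEPLOYMENT_MARKERS = ("Dockerfile", "docker-compose.yml", ".github/workflows", "Jenkinsfile")

    def production_indicators(repo: Path) -> dict:
        """Collect a few coarse indicators for one repository."""
        requirements = repo / "requirements.txt"
        deps = requirements.read_text(errors="ignore").lower() if requirements.exists() else ""
        return {
            "uses_ml_library": any(lib in deps for lib in ML_LIBRARIES),
            "has_deployment_config": any((repo / marker).exists() for marker in DEPLOYMENT_MARKERS),
            "has_tests": (repo / "tests").is_dir(),
            "has_license": (repo / "LICENSE").exists(),
        }

    # Keep repositories that satisfy most indicators as candidate production systems.
    candidates = [repo.name
                  for repo in Path("cloned_repos").iterdir()   # hypothetical directory of clones
                  if repo.is_dir() and sum(production_indicators(repo).values()) >= 3]
    print(candidates)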