A Tutorial Workshop on ML for Systems and Systems for ML
Manisha Luthra, TU Darmstadt  |  Andreas Kipf, Amazon Web Services  |  Matthias Böhm, TU Berlin

Local Chair: Lucas Woltmann, TU Dresden


Workshop 1
Tue, 7th
09:00 - 12:30
13:30 - 15:00
16:00 - 18:00
APB E023

Workshop Description

Recent advances in machine learning (ML) have led to wide adoption of ML in different application areas across academia and industry. On the one hand, these advances help existing data management systems improve, sometimes by completely replacing their components with so-called learned database components (the ML for Systems area). On the other hand, well-thought-out and carefully engineered systems approaches help improve current ML techniques (the Systems for ML area).

It is often challenging for researchers to keep up with the pace of these two emerging research areas; this workshop tutorial aims to make that easier.

Therefore, this workshop serves as a tutorial on recent work in the two research areas, ML for Systems and Systems for ML. The invited researchers working in these areas will present their peer-reviewed work and give an outlook on ongoing work and the open research challenges they currently face.

With this, we want to foster discussion and collaboration among the participants and, at the same time, give the speakers a platform to promote their work and gain visibility.

List of Speakers 

09:00 - 12:30   Systems for ML

Alexander Renz-Wieland
TU Berlin

09:15   W-1  1   Adaptive Parameter Management for Efficient Distributed ML Training

For large ML tasks, distributed training has become a necessity for keeping up with increasing dataset sizes and model complexity. A key challenge in distributed training is to synchronize model parameters among cluster nodes and to do so efficiently. Parameter managers (PMs) facilitate the implementation of distributed training by providing cluster-wide read and write access to the parameters, and transparently handling partitioning and synchronization in the background (either among the cluster nodes directly or via physically separate server nodes).

In this talk, I report on a line of research that improves the efficiency of PMs for ML tasks with sparse parameter access, i.e., tasks in which each update step reads and writes only a small (or tiny) part of the model. Standard PMs are inefficient for such tasks: in our experiments, distributed implementations were slower than efficient single-node implementations due to communication overhead. Our research aims to increase efficiency by making the PM adapt to the underlying ML task. We first present and evaluate a series of potential performance improvements in this direction, each making the PM more adaptive. For example, we explore (i) dynamically adapting the allocation of model parameters, i.e., relocating parameters among nodes during training according to where they are accessed, and (ii) adapting the management technique of the PM to the access patterns of individual parameters, i.e., employing a suitable management technique for each parameter. Each of these aspects can improve PM efficiency. However, each aspect also makes the PM more complex to use, because the application (i.e., the component that interacts with the PM) needs to control adaptivity manually.

To reduce complexity, we present a mechanism that enables automatic adaptivity, i.e., adaptivity without requiring the application’s manual control. With this mechanism, the application merely provides information about parameter accesses, in a way that naturally integrates into common ML systems. We describe a novel PM—called AdaPM—that adapts to ML tasks automatically based on the information provided by this mechanism. It decides what to do (i.e., which management technique to use for a specific parameter and where to allocate each parameter) and when to do so. It does so automatically, i.e., without further user input, and dynamically, i.e., based on the current situation. In our experiments, AdaPM enabled efficient distributed ML training for multiple ML tasks: in contrast to previous PMs, it provided near-linear speed-ups over efficient single node implementations.
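
To make the access-information mechanism above more concrete, here is a minimal Python sketch of a parameter manager that accepts such hints before each sparse update step. All names (ParameterManager, signal_intent, pull, push) are hypothetical and do not reflect the actual AdaPM API.

# Hypothetical sketch: the application announces which parameters it will
# touch; a real adaptive PM could use these hints to prefetch, relocate,
# or replicate the affected parameters before the access happens.
import numpy as np

class ParameterManager:
    def __init__(self, dim, num_keys):
        self.store = {k: np.zeros(dim) for k in range(num_keys)}
        self.access_counts = {k: 0 for k in range(num_keys)}

    def signal_intent(self, keys):
        # Record upcoming accesses; placement decisions could be based on this.
        for k in keys:
            self.access_counts[k] += 1

    def pull(self, keys):
        return {k: self.store[k] for k in keys}

    def push(self, updates):
        for k, delta in updates.items():
            self.store[k] += delta

# One sparse SGD-style step: only a handful of keys are read and written.
pm = ParameterManager(dim=8, num_keys=1_000)
touched = [3, 17, 256]
pm.signal_intent(touched)                           # hint issued ahead of access
params = pm.pull(touched)
pm.push({k: -0.01 * np.ones(8) for k in touched})   # toy gradient update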

With these results, we argue that PMs can be efficient for sparse ML tasks, and that this efficiency can be reached with limited additional effort from application developers.

Arnab Phani
TU Berlin

09:50   W-1  2   Fine-grained Lineage Tracing and Reuse in Multi-backend Machine Learning Systems

Machine learning (ML) and data science workflows are inherently exploratory. Data scientists pose hypotheses, integrate the necessary data, and run ML pipelines of data cleaning, feature engineering, model selection, and hyper-parameter tuning. The repetitive nature of these workflows and their hierarchical composition from building blocks exhibit high computational redundancy. Existing work addresses this redundancy with coarse-grained lineage tracing and reuse for ML pipelines. This approach views entire algorithms as black boxes and thus fails to eliminate fine-grained redundancy and to handle internal non-determinism. In this talk, we first introduce Apache SystemDS, a declarative ML system for the end-to-end data science lifecycle. Following that, we present LIMA, a practical framework for efficient, fine-grained lineage tracing and reuse, implemented in SystemDS. Lineage tracing of individual operations creates new challenges and opportunities. We address the large size of lineage traces with multi-level lineage tracing and reuse, as well as lineage deduplication for loops and functions; exploit full and partial reuse opportunities across the program hierarchy; and integrate this framework with task parallelism and operator fusion. The resulting framework performs fine-grained lineage tracing with low overhead, provides versioning and reproducibility, and is able to eliminate fine-grained redundancy. Finally, we discuss our ongoing work on extending the LIMA framework to support multi-backend caching and reuse, and distributed memory management with efficient exchange of intermediates.
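
As a rough illustration of fine-grained lineage tracing and reuse (not the SystemDS/LIMA implementation), the following Python sketch fingerprints each operation by its operator and input lineage and caches intermediates under that fingerprint, so identical sub-computations are reused:

# Toy sketch: lineage-keyed reuse cache for individual operations.
import hashlib
import numpy as np

reuse_cache = {}

def lineage_key(op, input_keys, args=()):
    # Deterministic fingerprint of "operator applied to these inputs".
    payload = op + "|" + "|".join(input_keys) + "|" + repr(args)
    return hashlib.sha256(payload.encode()).hexdigest()

def traced_op(op, fn, inputs, args=()):
    key = lineage_key(op, [k for k, _ in inputs], args)
    if key in reuse_cache:                  # full reuse of an intermediate
        return key, reuse_cache[key]
    result = fn(*[v for _, v in inputs], *args)
    reuse_cache[key] = result
    return key, result

X = np.random.rand(1000, 10)
kx = "input:X"                              # lineage of a read is its source
k1, XtX = traced_op("tsmm", lambda m: m.T @ m, [(kx, X)])
k2, XtX_again = traced_op("tsmm", lambda m: m.T @ m, [(kx, X)])  # cache hit
assert XtX is XtX_again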

Ziawasch Abedjan
Leibniz University Hannover

10:25   W-1  3   Enforcing Constraints for Machine Learning Pipelines

Responsible usage of machine learning (ML) systems in practice requires not only enforcing high prediction quality, but also accounting for other constraints, such as fairness, privacy, or execution time. Typically, these types of constraints are tackled through multi-objective functions and dedicated models.
In this talk, I present our ideas on how to leverage the step of feature selection to support constraints. We propose Declarative Feature Selection (DFS) to simplify the design and validation of ML systems satisfying diverse user-specified constraints. We benchmark and evaluate a representative series of feature selection algorithms. From our extensive experimental results, we derive concrete suggestions on when to use which strategy and show that a meta-learning-driven optimizer can accurately predict the right strategy for an ML task at hand.
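
The following Python sketch illustrates, under simplifying assumptions, what a declarative constraint specification for feature selection might look like; the field names and the toy strategy picker are hypothetical and not the DFS interface:

# Hypothetical sketch: user-declared constraints plus a trivial strategy picker.
from dataclasses import dataclass

@dataclass
class Constraints:
    max_features: int = 10          # model complexity budget
    max_runtime_s: float = 60.0     # selection-time budget
    min_fairness: float = 0.8       # e.g., a demographic parity ratio

candidate_strategies = [
    {"name": "mutual_information", "est_runtime_s": 20, "est_fairness": 0.85},
    {"name": "recursive_elimination", "est_runtime_s": 300, "est_fairness": 0.9},
    {"name": "variance_threshold", "est_runtime_s": 2, "est_fairness": 0.7},
]

def pick_strategy(constraints, candidates):
    # A meta-learning-driven optimizer would predict these estimates per task;
    # here we simply filter on the declared constraints.
    ok = [c for c in candidates
          if c["est_runtime_s"] <= constraints.max_runtime_s
          and c["est_fairness"] >= constraints.min_fairness]
    return min(ok, key=lambda c: c["est_runtime_s"]) if ok else None

print(pick_strategy(Constraints(), candidate_strategies))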

Sebastian Schelter
University of Amsterdam

11:00   W-1  4   Provenance-based Screening of Machine Learning Pipelines

Software systems that learn from data with machine learning (ML) are ubiquitous. ML pipelines in these applications often suffer from a variety of data-related issues, such as data leakage, label errors or fairness violations, which require reasoning about complex dependencies between their inputs and outputs. These issues are usually only detected in hindsight after deployment, when they already caused harm in production. This talk covers our ongoing work on enabling data scientists to proactively screen their ML pipelines for data-related issues as part of continuous integration. We detail how to achieve this by instrumenting, executing and screening ML pipelines for declaratively specified pipeline issues, and analyzing data artifacts and their provenance to catch potential problems early before deployment to production.
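
As one concrete (and simplified) example of such a declaratively specified issue, the Python sketch below screens for train/test leakage by intersecting record-level provenance identifiers before training; the function name and the provenance column are illustrative:

# Toy sketch: a leakage check over record-level provenance ids.
import pandas as pd

def check_no_train_test_leakage(train: pd.DataFrame, test: pd.DataFrame,
                                provenance_col: str = "row_id") -> list:
    overlap = set(train[provenance_col]) & set(test[provenance_col])
    return sorted(overlap)          # non-empty result -> pipeline issue

customers = pd.DataFrame({"row_id": range(6), "spend": [10, 5, 8, 3, 9, 7]})
train, test = customers.iloc[:4], customers.iloc[3:]   # overlapping split (bug)
issues = check_no_train_test_leakage(train, test)
assert issues == [3], "leaked provenance ids should be flagged before deployment"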

Madelon Hulsebos
University of Amsterdam

11:35   W-1  5   Towards Table Representation Learning for End-to-End Data Analysis

We develop models that “understand” images, code, and natural language, and put them to use for driving cars, completing code, and writing essays. However, we have long ignored the tables that dominate the enterprise data landscape and drive many data analysis pipelines. To intelligently support or automate data analysis pipelines end-to-end, we need to shift our attention to Table Representation Learning (TRL). In this talk, I will discuss some work done towards automating data analysis tasks through learned representations of tables. I will discuss Sherlock, a model for surfacing the semantics of table columns to facilitate data validation and visualization, and present GitTables, a large-scale corpus of tables extracted from CSV files on GitHub to fuel models for TRL. I will conclude by discussing some ongoing efforts and open challenges of TRL for end-to-end data analysis.

Stefanie Scherzinger
University of Passau

12:10   W-1  6   Challenges with JSON Schema Data Modeling in Systems for ML

JSON Schema is an important, evolving standard schema language for describing families of JSON documents. It is based on a complex combination of structural and Boolean assertions, and features negation and recursion. In this talk, we provide an introduction to this language and present practically relevant problems in the static analysis of machine learning pipelines that involve JSON Schema, specifically schema satisfiability, inclusion, and equivalence. We conclude with a demonstration of a tool for JSON Schema witness generation, which makes it possible to answer such questions.
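
To give a flavor of the satisfiability problem, the Python sketch below (using the jsonschema package) probes two schemas with a handful of candidate witnesses; the first schema is unsatisfiable because it conjoins contradictory type assertions. The brute-force probing is only an illustration, whereas deciding satisfiability in general is the hard static-analysis problem the tool addresses:

# Illustration only: probing schemas for witnesses with a fixed candidate set.
from jsonschema import Draft202012Validator

unsatisfiable = {"allOf": [{"type": "string"}, {"type": "integer"}]}
satisfiable = {"allOf": [{"type": "integer"}, {"not": {"enum": [0]}}]}

candidates = ["abc", 0, 1, 3.5, True, None, [], {}]

def has_witness(schema):
    v = Draft202012Validator(schema)
    return any(v.is_valid(c) for c in candidates)

print(has_witness(unsatisfiable))  # False: no witness among the candidates
print(has_witness(satisfiable))    # True: e.g., 1 is a valid witness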

12:30 - 13:30   Lunch Break

13:30 - 15:00   Transition to ML for Systems

Maximilian E. Schüle
University of Bamberg

13:30   W-1  7   Teaching Blue Elephants the Maths for Machine Learning and Inspection

In this tutorial, we show how SQL can facilitate machine learning without requiring an extended SQL grammar. This comprises inspection to detect technical biases: a technical bias occurs when a pipeline removes tuples and thereby changes the frequency distribution with regard to a sensitive value. We further show how to express gradient descent in SQL based on rules for automatic differentiation.
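
As a toy illustration of pushing a gradient-descent step into plain SQL aggregates (not the tutorial's actual formulation), the following Python/SQLite sketch fits a univariate linear model y ≈ w * x by repeatedly computing the gradient with a standard SELECT:

# Illustration: one gradient-descent update per iteration, computed in SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany("INSERT INTO points VALUES (?, ?)",
                 [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)])

w, lr = 0.0, 0.05
for _ in range(100):
    # d/dw of mean squared error: avg(2 * (w*x - y) * x), computed as a SQL aggregate.
    (grad,) = conn.execute(
        "SELECT AVG(2 * (? * x - y) * x) FROM points", (w,)
    ).fetchone()
    w -= lr * grad

print(round(w, 2))   # ≈ 2.04, close to the least-squares slope of the toy data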

Stefan Hagedorn
TU Ilmenau

14:05   W-1  8   An Exploration of Approaches for Machine Learning in Database Systems

Python and Pandas DataFrames are a popular combination to implement data processing pipelines. However, when data is already stored in a database system, it is often better to execute the data processing steps inside the database system instead of downloading large data sets to the client machine. While translating DataFrame operations to SQL is not a big challenge, advanced features such as user-defined functions and the application of machine learning models for prediction and classification often cannot be translated directly and open up opportunities for various optimizations.

In this talk we present optimization opportunities for the execution of user-defined Python functions inside a database system. We further explore approaches to execute pre-trained machine learning models by using such UDFs and other techniques. We investigate challenges, different implementation possibilities as well as their benefits and trade-offs.
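
The following SQLite-based Python sketch illustrates both ideas under simplifying assumptions: a DataFrame-style group-by translated to SQL, and a stand-in "pre-trained model" registered as a user-defined function and evaluated inside the database; all names are hypothetical:

# Illustration: ship work to the database instead of pulling data to the client.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("a", 10.0), ("a", 25.0), ("b", 5.0)])

# (i) df.groupby("customer")["amount"].sum() translated to SQL:
sums = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer").fetchall()

# (ii) a "pre-trained model" registered as a scalar UDF, evaluated in-database:
def predict_churn(total_amount):
    return 1 if total_amount < 10 else 0      # stand-in for a real model

conn.create_function("predict_churn", 1, predict_churn)
preds = conn.execute(
    "SELECT customer, predict_churn(SUM(amount)) FROM orders GROUP BY customer"
).fetchall()
print(sums, preds)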

Giorgio Vinciguerra
University of Pisa

14:40   W-1  9   Advances in data-aware compressed-indexing schemes for integer and string keys

Compressed-indexing schemes for a collection of keys are the backbone of modern data systems, and their space and query-time efficiency is crucial to the system's performance, particularly under continuously growing volumes of data.

We discuss some recent developments of these schemes, beginning with integer keys, and then delving into the more complex case of variable-length string keys.

The underlying principle driving these advancements is to take advantage of the regularities and trends in the input data, either by integrating learned models, which uncover new regularities and thus enable more effective compression, or by employing data-aware optimization approaches that orchestrate known encoding schemes to synthesize the best data structure design.
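
A minimal Python sketch of the first principle for integer keys: fit a simple linear model from key to position in the sorted array, remember its maximum error, and answer lookups with a search confined to that error window. Real designs (e.g., piecewise learned indexes) are considerably more refined; this is only an illustration:

# Illustration: linear model + bounded local search over sorted integer keys.
import bisect

keys = sorted([3, 7, 12, 20, 31, 45, 46, 70, 90, 120])

# Fit position ≈ slope * key + intercept by interpolating the endpoints.
slope = (len(keys) - 1) / (keys[-1] - keys[0])
intercept = -slope * keys[0]
err = max(abs(i - (slope * k + intercept)) for i, k in enumerate(keys))

def lookup(key):
    pred = int(slope * key + intercept)
    lo = max(0, pred - int(err) - 1)
    hi = min(len(keys), pred + int(err) + 2)
    i = bisect.bisect_left(keys, key, lo, hi)   # search only the error window
    return i if i < len(keys) and keys[i] == key else None

assert lookup(45) == 5 and lookup(44) is None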

We present experiments that demonstrate the robustness of these new approaches compared to the input-sensitive space-time efficiency of existing solutions. Finally, we conclude by discussing new research opportunities.

15:00 - 16:00   Coffee Break

16:00 - 18:00   ML for Systems and remote talks

Immanuel Trummer
Cornell University

16:00   W-1  10   Towards AI-Generated Database Management Systems

The year 2022 has been marked by several breakthrough results in the domain of generative AI, culminating in the rise of tools like ChatGPT that are able to solve a variety of language-related tasks without specialized training. In this talk, I outline novel opportunities in the context of data management enabled by these advances. I discuss several recent research projects aimed at exploiting advanced language processing for tasks such as parsing a database manual to support automated tuning, or mining data for patterns described in natural language. Finally, I discuss our recent and ongoing research aimed at synthesizing code for SQL processing in general-purpose programming languages, while enabling customization via natural language commands.

Theodoros (Theo) Rekatsinas
Apple

16:35   W-1  11   Marius: Machine Learning over Billion-scale Graphs on a Single GPU

This talk describes Marius, a software system that aims to make training modern AI models over billion-edge graphs dramatically cheaper. Marius focuses on a key bottleneck in the development of machine learning systems over large-scale graph data: data movement during training. Marius addresses this bottleneck with a novel pipelined architecture that maximizes resource utilization of the entire memory hierarchy (including disk, CPU, and GPU memory). Marius’ no-code paradigm allows users to simply define a model and enjoy resource-optimized training out of the box. This talk will describe how Marius can train deep learning models over graphs with more than a billion nodes and 100 billion edges using a single GPU.

Benjamin Hilprecht
TU Darmstadt

17:05   W-1  12   Learned DBMS Components 2.0

Database management systems (DBMSs) are the backbone for managing large volumes of data efficiently and thus play a central role in business and science today. To provide high performance, many of the most complex DBMS components, such as query optimizers or schedulers, involve solving non-trivial problems. To tackle such problems, recent work has outlined a new direction of so-called learned DBMS components, where core parts of a DBMS are replaced by machine learning (ML) models, which has been shown to provide significant performance benefits. However, a major drawback of current workload-driven learning approaches is that they not only cause very high overhead for training an ML model to replace a DBMS component, but that this overhead occurs repeatedly, which renders these approaches far from practical.

Hence, in this talk we present our vision, called Learned DBMS Components 2.0, to tackle the high costs and inflexibility of workload-driven learning. First, we introduce data-driven learning, where the idea is to learn the data distribution over a complex relational schema. In contrast to workload-driven learning, no large workload has to be executed on the database to gather training data. While data-driven learning has many applications, such as cardinality estimation or approximate query processing, many DBMS tasks, such as physical cost estimation, cannot be supported. We thus propose a second technique called zero-shot learning, a general paradigm for learned DBMS components. Here, the idea is to train models that generalize to unseen data sets out of the box: a model that has observed a variety of workloads on different data sets can generalize to new ones. Initial results on the task of physical cost estimation suggest the feasibility of this approach. Finally, we discuss further opportunities enabled by zero-shot learning.
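
As a toy illustration of the data-driven idea (a drastic simplification of the learned models actually used), the Python sketch below "learns" per-column frequencies from the data itself and answers cardinality estimates for unseen predicates without executing any training queries:

# Illustration: data-driven cardinality estimation from column statistics.
import pandas as pd

orders = pd.DataFrame({
    "country": ["DE", "DE", "US", "US", "US", "FR"],
    "status":  ["open", "paid", "open", "paid", "paid", "open"],
})

# "Training": estimate per-column frequencies from the data itself.
model = {col: orders[col].value_counts(normalize=True) for col in orders}

def estimate_cardinality(predicates):
    # Assume column independence (the classic simplification a learned joint
    # model would avoid) and multiply the per-column selectivities.
    sel = 1.0
    for col, val in predicates.items():
        sel *= model[col].get(val, 0.0)
    return sel * len(orders)

print(estimate_cardinality({"country": "US", "status": "paid"}))  # ≈ 1.5 rows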

Ryan Marcus
University of Pennsylvania

17:35   W-1  13   Kepler: Robust Learning for Faster Parametric Query Optimization

Most work on parametric query optimization (PQO) has focused on reducing optimization time while still approaching the plan quality of a full run of the query optimizer. But what if you could have your cake and eat it too? This talk will present Kepler, a robust learning framework for faster parametric query optimization that both optimizes query plans faster than prior techniques and produces higher-quality query plans than DBMS optimizers. Central to our method is Row Count Evolution (RCE), a novel plan generation algorithm based on perturbations in the sub-plan cardinality space. While previous approaches require accurate cost models, we bypass this requirement by evaluating candidate plans via actual execution data and training an ML model to predict the fastest plan given parameter binding values. Our models leverage recent advances in neural network uncertainty in order to robustly predict faster plans while avoiding regressions in query performance. Joint work with Lyric Doshi, Vincent Zhuang, Gaurav Jain, Haoyu Huang, Deniz Altinbuken, Eugene Brevdo, and Campbell Fraser.
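
The following Python sketch gives a rough, hypothetical flavor of Row Count Evolution as described above: sub-plan cardinality estimates are perturbed multiplicatively over several generations, each perturbed estimate is re-planned, and the distinct candidate plans are collected for later evaluation by actual execution. The stub optimizer and the perturbation factors are illustrative only:

# Illustration: generate candidate plans by perturbing cardinality estimates.
import random

base_cardinalities = {"scan_orders": 10_000, "scan_items": 500, "join_oi": 2_000}

def plan_for(cards):
    # Stub "optimizer": the join order depends on which input looks smaller.
    return ("items_first" if cards["scan_items"] < cards["scan_orders"]
            else "orders_first")

def row_count_evolution(cards, generations=3, children=8, factors=(0.1, 10)):
    population, candidates = [cards], {plan_for(cards)}
    for _ in range(generations):
        next_gen = []
        for parent in population:
            for _ in range(children):
                child = {k: v * random.uniform(*factors) for k, v in parent.items()}
                candidates.add(plan_for(child))
                next_gen.append(child)
        population = next_gen
    return candidates   # candidate plans to execute and learn from

random.seed(0)
print(row_count_evolution(base_cardinalities))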

Closing remarks