Title: The data systems grammar
Abstract
Data structures are everywhere. They define the behavior of modern data systems and data-driven algorithms. For example, data systems that use the right data structure design for the problem at hand can reduce the monthly bill of large-scale applications on the cloud by hundreds of thousands of dollars. We can accelerate data science tasks by dramatically speeding up the computation of statistics over large amounts of data. We can train drastically more neural networks within a given time budget, improving accuracy.
However, knowing the right data structure and data system design for any given scenario is a notoriously hard problem: there is a massive space of possible designs, yet no single design is perfect across all data, query, and hardware scenarios. We will discuss our quest for the first principles of data structures and data system design. We will show signs that it is possible to reason about this massive design space, and we will present early results from a prototype self-designing data system that can take drastically different shapes to optimize for the workload, hardware, and available cloud budget using machine learning and what we call machine knowing. These shapes include data structure and system designs that are discovered automatically and do not exist in the literature or industry.
Bio
Stratos Idreos is an associate professor of Computer Science at Harvard University, where he leads the Data Systems Laboratory. His research focuses on making it easy, and even automatic, to design workload- and hardware-conscious data structures and data systems, with applications to relational, NoSQL, and data science problems. For his PhD thesis on adaptive indexing, Stratos was awarded the 2011 ACM SIGMOD Jim Gray Doctoral Dissertation Award and the 2011 ERCIM Cor Baayen Award from the European Research Consortium for Informatics and Mathematics. In 2015 he received the IEEE TCDE Rising Star Award from the IEEE Technical Committee on Data Engineering for his work on adaptive data systems, and in 2020 he received the ACM SIGMOD Contributions Award for his work on reproducible research. Stratos is also a recipient of the National Science Foundation CAREER Award and the Department of Energy Early Career Award.
Title: Lessons Learned from Using Machine Learning to Optimize Database Configurations
Abstract
Database management systems (DBMSs) expose dozens of configurable knobs that control their runtime behavior. Setting these knobs correctly for an application's workload can improve the performance and efficiency of the DBMS. But such tuning requires considerable effort from experienced administrators, which does not scale to large DBMS fleets. This problem has led to research on using machine learning (ML) to devise strategies that automatically optimize DBMS knobs for any application. Current research suggests that ML can generate better DBMS configurations more quickly than human experts can. And since these ML algorithms do not require humans to make decisions, they can also scale to tuning thousands of databases at a time. Despite the advantages of ML-based approaches, there are still several problems that one must overcome to deploy an automated tuning service for DBMSs.
In this talk, I will discuss the challenges we faced in using ML to optimize DBMS knobs and the solutions we developed to address them. My presentation will be in the context of the OtterTune database tuning service. I will also highlight the insights we learned from real-world installations of OtterTune for MySQL, Postgres, and Oracle.
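As an illustration of the kind of loop such a tuning service runs, here is a minimal sketch: observe how the DBMS performs under a few configurations, fit a surrogate model over (configuration, performance) pairs, and use the model to choose the next configuration to try. The knob names, the benchmark() stub, and the simple upper-confidence-bound rule are illustrative assumptions, not OtterTune's actual algorithm.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Hypothetical knobs, tuned over normalized values in [0, 1].
KNOBS = ["shared_buffers_mb", "work_mem_mb", "effective_cache_size_mb"]

def benchmark(x):
    """Placeholder: map the normalized values to real knob settings, apply them
    to the DBMS, replay the workload, and return a performance score. Stubbed
    with a synthetic objective so the sketch runs end to end."""
    return -float(np.sum((x - 0.6) ** 2))

rng = np.random.default_rng(0)
X, y = [], []
for it in range(30):
    if it < 5:
        x = rng.random(len(KNOBS))                  # random bootstrap configurations
    else:
        gp = GaussianProcessRegressor(normalize_y=True)
        gp.fit(np.array(X), np.array(y))            # surrogate model: config -> score
        cand = rng.random((256, len(KNOBS)))
        mu, sigma = gp.predict(cand, return_std=True)
        x = cand[np.argmax(mu + sigma)]             # simple upper-confidence-bound pick
    X.append(x)
    y.append(benchmark(x))

best = int(np.argmax(y))
print("best normalized config:", dict(zip(KNOBS, np.round(X[best], 3))), "score:", y[best])
```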
Bio
Andy Pavlo is an Associate Professor of Databaseology in the Computer Science Department at Carnegie Mellon University. He is also the co-founder of OtterTune (https://ottertune.com).
Title: AI-Native Database
Abstract
In the big data era, database systems face three challenges. First, traditional heuristics-based optimization techniques (e.g., cost estimation, join order selection, knob tuning) cannot meet the high-performance requirements of large-scale, diverse data and varied applications. We can design learning-based techniques to make databases more intelligent. Second, many database applications need AI algorithms, e.g., image search inside the database. We can embed AI algorithms into the database, utilize database techniques to accelerate AI algorithms, and provide AI capabilities inside databases. Third, traditional databases focus on general-purpose hardware (e.g., CPUs) and cannot fully utilize new hardware (e.g., AI chips). Moreover, besides the relational model, we can utilize a tensor model to accelerate AI operations. Thus, we need to design new techniques to make full use of new hardware.
To address these challenges, we aim to design an AI-native database. On one hand, we integrate AI techniques into databases to provide self-configuring, self-optimizing, self-healing, self-protecting, and self-inspecting capabilities. On the other hand, we enable databases to provide AI capabilities through declarative languages, in order to lower the barrier to using AI. In this talk, I will present the open challenges of designing an AI-native database. I will also use automatic knob tuning, a deep-reinforcement-learning-based optimizer, machine-learning-based cardinality estimation, and automatic index/view advisors as examples to showcase the advantages of AI-native databases.
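As a flavor of one such component, here is a minimal sketch of learning-based cardinality estimation: train a regressor that maps per-column predicate selectivities to (log) result sizes. The featurization and the synthetic training data are assumptions made for illustration; real systems use much richer query and plan encodings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
N_ROWS = 1_000_000

# Each training example: per-column predicate selectivities for a 3-column table,
# labeled with the (synthetic) true cardinality observed when running the query.
sel = rng.random((5000, 3))
true_card = N_ROWS * sel[:, 0] * sel[:, 1] * np.sqrt(sel[:, 2])  # correlated columns
y = np.log1p(true_card)                       # learn in log space, as is common

model = GradientBoostingRegressor().fit(sel, y)

query = np.array([[0.1, 0.5, 0.2]])           # selectivities of a new conjunctive query
est = float(np.expm1(model.predict(query))[0])
print(f"estimated cardinality: {est:,.0f} rows")
```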
Bio
Guoliang Li is a Professor in the Department of Computer Science, Tsinghua University, Beijing, China. His research interests include AI-native databases, big data analytics and mining, and large-scale data cleaning and integration. He is the general co-chair of SIGMOD 2021 and demo chair of VLDB 2021. He serves as an associate editor for IEEE Transactions on Knowledge and Data Engineering, the VLDB Journal, and ACM Transactions on Data Science. He has received several best paper awards at top conferences, including the CIKM 2017 best paper award, ICDE 2018/KDD 2018/VLDB 2020 best paper candidates, the DASFAA 2014 best paper runner-up, and the APWeb 2014 best paper award. He received the VLDB Early Career Research Contribution Award in 2017 and the IEEE TCDE Early Career Award in 2014.
Title: When DB for AI Meets AI for DB
Abstract
Vector data, i.e., embedding data, is a common and critical data type in many kinds of AI applications. Vector databases are emerging due to the ever-rising demand for unstructured data analytics in AI-powered applications. Over the past few years, we have worked on Milvus, an open-source vector database system that has been adopted by over 1000 enterprises worldwide. (For more details, see our SIGMOD '21 paper, "Milvus: A Purpose-Built Vector Data Management System".)
Although many studies on vector processing have sought a better trade-off between cost, accuracy, and performance, most of them are sensitive to the data and require careful parameter tuning. Unfortunately, the high dimensionality of vector data makes it difficult, if not impossible, for humans to know the best configuration for a specific data collection and user requirement. Moreover, naive methods (e.g., grid search) for finding potentially good configurations are inefficient due to the complex interaction of parameters and the high cost of each trial. To this end, we resort to machine learning methods, which are naturally good at handling high-dimensional data, to capture the features of vector data and predict how well the database performs with different configurations.
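To make the idea concrete, here is a minimal sketch of using a learned model to predict how an approximate vector index performs under a given configuration, so a tuner can avoid exhaustive grid search. The IVF-style nlist/nprobe parameters, the synthetic recall measurements, and the random-forest model are illustrative assumptions, not Milvus's actual tuning method.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)

# Pretend we benchmarked 200 (nlist, nprobe) configurations on a data sample and
# recorded recall@10; in practice these labels come from real trial runs.
nlist = rng.integers(256, 8192, size=200)
nprobe = np.minimum(rng.integers(1, 256, size=200), nlist)
recall = 1.0 - np.exp(-3.0 * nprobe / np.sqrt(nlist))            # fake response surface

X = np.column_stack([nlist, nprobe])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, recall)

# Score unseen configurations and pick the cheapest one predicted to reach 95% recall.
cand = np.array([[4096, p] for p in range(1, 257)])
pred = model.predict(cand)
ok = cand[pred >= 0.95]
print("smallest nprobe predicted to reach 95% recall at nlist=4096:",
      int(ok[0, 1]) if len(ok) else "none found")
```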
In this talk, we will discuss both the critical problems we try to solve with machine learning methods in a complex system built for high-dimensional data management and the challenges we encountered in developing a vector database system.
Bio
Charles Xie is an expert in databases and AI with more than 20 years of experience. He is the founder and CEO of Zilliz, an open-source software company developing database systems for unstructured data in AI applications. He also serves as the board chairperson of LF AI & Data, an umbrella foundation of the Linux Foundation supporting open-source innovations in artificial intelligence, big data, and analytics. Before Zilliz, Charles worked for many years at Oracle's US headquarters, where he developed Oracle's relational database systems and became a founding member of the Oracle 12c cloud database project. The project has proven to be a huge business success, accumulating over $10 billion in revenue to date. Charles holds a master's degree in computer science from the University of Wisconsin-Madison and a bachelor's degree from Huazhong University of Science and Technology.
Dr. Xiaomeng Yi is a senior researcher at Zilliz. His research interests include high-dimensional data indexing and search, applied machine learning in systems, and workload scheduling and resource allocation in distributed systems. He received his PhD from the School of Computer Science, Huazhong University of Science and Technology, in 2017. Before joining Zilliz, he was a senior engineer at Huawei.
Carlo A. Curino (Microsoft)
Title: Cloud Tuning, a first practical step in the epic AIDB journey
Abstract
In this talk, I argue that the AIDB vision arcs from the basic use of ML techniques to tune system defaults all the way to replacing large portions of a DBMS with ML models. I focus on Cloud Tuning, a practical instantiation of the AIDB vision, and describe it from my vantage point in the Gray Systems Lab (GSL) at Microsoft. Cloud Tuning is today's cloud reality, and it is substantially more impactful than many would expect. I share some of the lessons we learned, and conclude by observing that Cloud Tuning is much more of a systems/data problem than a modeling one, making efforts from this community particularly relevant.
Bio
Carlo A. Curino received a Bachelor's, Master's, and PhD in Computer Engineering (Ingegneria Informatica) from Politecnico di Milano, and a Master's in Computer Science from the University of Illinois at Chicago (UIC).
During his PhD at Politecnico di Milano, he spent time as a visiting scholar at the University of California, Los Angeles (UCLA), working with Prof. Carlo Zaniolo and Prof. Alin Deutsch (UCSD). He then did a postdoc at MIT CSAIL, working with Prof. Samuel Madden and Prof. Hari Balakrishnan. At MIT he was also the primary lecturer for the databases course CS630, taught in collaboration with Mike Stonebraker. He spent a year as a Research Scientist at Yahoo! Research.
Carlo is currently the Principal Scientist Group Manager leading Microsoft's Gray Systems Lab (GSL), an applied research group focusing on database, systems, and machine-learning research. At GSL, Carlo has led projects in the areas of systems for machine learning, geo-distributed analytics, big data, and resource scheduling.