Johannes Gehrke (Microsoft)
Title: AI-Powered DBs and the Future of Software
Our industry is in the midst of a major revolution in which we are replacing traditional, human-written software with machine-learned models. This is also happening in the database field, where we have initial results on machine-learned query optimizers, indices, and selectivity estimation. I will talk about this new generation of AI-powered database systems and how the practice of software development is changing as a result of the AI shift.
Johannes Gehrke is a Technical Fellow at Microsoft in the Experiences and Devices Group, working on machine learning and Big Data. From 1999 to 2015, he was on the faculty in the Department of Computer Science at Cornell University, where he graduated 25 PhD students. Johannes has received an NSF Career Award, a Sloan Research Fellowship, a Humboldt Research Award, and the 2011 IEEE Computer Society Technical Achievement Award, and he is an ACM Fellow. He co-authored the undergraduate textbook “Database Management Systems” (McGraw-Hill, 2002; currently in its third edition), and he was Program co-Chair of ACM KDD 2004, VLDB 2007, IEEE ICDE 2012, ACM SOCC 2014, and IEEE ICDE 2015.
Feifei Li (Alibaba and University of Utah)
Title: Journey to SDDP: Towards Building a Self-Driving Database Platform
The Self-Driving Database Platform (SDDP) provides cloud databases with automatic operation and maintenance, and enables intelligent database kernels (such as adaptive hot/cold data separation). A DBMS running on SDDP is capable of self-detection, self-recovery, self-decision-making, and self-optimization, and serves cloud users with transparent non-stop services. For cloud databases, tuning the buffer size appropriately is critical to performance, since memory is usually the resource bottleneck. For large-scale databases supporting heterogeneous applications, configuring the individual buffer sizes for a significant number of database instances presents a scalability challenge. Manual optimization is neither efficient nor effective, and is not even feasible for large cloud clusters, especially when the workload may change dynamically on each instance. The difficulty lies in the fact that each database instance requires a different, highly individualized buffer size, subject to the constraint of the total buffer memory space. It is imperative to resort to algorithms that automatically orchestrate buffer tuning across all database instances.
To this end, we enable SDDP to automatically tune buffer sizes based on a deep learning method called iBTune. It has been deployed for more than 10,000 OLTP database instances in Alibaba's production system. Specifically, it leverages information from similar workloads to find the tolerable miss ratio of each instance. Then, it uses the relationship between miss ratios and allocated memory sizes to individually optimize the target buffer sizes. To guarantee the service level agreement (SLA), we design a pairwise deep neural network that uses features from measurements on pairs of instances to predict upper bounds on the request response times. A target buffer size can be adjusted only when the predicted response time upper bound is within a safe limit. The successful deployment in Alibaba's production environment, which safely reduces the memory footprint by more than 17% compared to the original system that relied on manual configuration, demonstrates the effectiveness of our solution.
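The tuning loop described above can be illustrated with a minimal sketch. It is not the iBTune implementation: the power-law relationship between miss ratio and buffer size, the `alpha` parameter, and the names `Instance`, `target_buffer_gb`, and `propose_resize` are all assumptions made for illustration, and the predicted response time bound is simply passed in as a number rather than produced by a pairwise neural network.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    buffer_gb: float             # current buffer pool size
    miss_ratio: float            # currently observed miss ratio
    tolerable_miss_ratio: float  # assumed to be derived from similar workloads


def target_buffer_gb(inst: Instance, alpha: float = 1.0) -> float:
    """Target size under an ASSUMED curve: miss_ratio ~ size ** (-alpha)."""
    scale = (inst.miss_ratio / inst.tolerable_miss_ratio) ** (1.0 / alpha)
    return inst.buffer_gb * scale


def propose_resize(inst: Instance, predicted_rt_upper_ms: float,
                   rt_limit_ms: float, alpha: float = 1.0) -> float:
    """Apply the target size only if the predicted response time bound is safe."""
    if predicted_rt_upper_ms <= rt_limit_ms:
        return target_buffer_gb(inst, alpha)
    return inst.buffer_gb  # unsafe prediction: keep the current size


# Hypothetical instance that can tolerate twice its current miss ratio.
inst = Instance("db-001", buffer_gb=64.0, miss_ratio=0.01,
                tolerable_miss_ratio=0.02)
print(propose_resize(inst, predicted_rt_upper_ms=8.0, rt_limit_ms=10.0))
print(propose_resize(inst, predicted_rt_upper_ms=12.0, rt_limit_ms=10.0))
```

The point of the sketch is the ordering of the two decisions: the target size is derived per instance from its tolerable miss ratio, and the safety check on the predicted response time gates whether that target is applied at all.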
Feifei Li is currently a Vice President of Alibaba Group, an ACM Distinguished Scientist, and President of the Database Products Business Unit of Alibaba Cloud and the Database and Storage Lab of DAMO Academy. Before joining Alibaba, he was a tenured full professor at the School of Computing, University of Utah. He has won multiple awards from NSF, ACM, IEEE, Visa, Google, HP, and Huawei. In particular, he is a recipient of the IEEE ICDE 2014 10 Years Most Influential Paper Award, the ACM SIGMOD 2016 Best Paper Award, the ACM SIGMOD 2015 Best System Demonstration Award, the IEEE ICDE 2004 Best Paper Award, and the NSF Career Award from the US National Science Foundation. He has served as an associate editor, co-chair, and core committee member for many prestigious journals and conferences.
Andrew Pavlo (Carnegie Mellon University)
Title: Self-Driving Databases: The Hard Parts
The current research trend is to develop "learned" components to supplement and replace legacy components in database management systems (DBMSs). Such learned components use machine learning (ML) methods to identify non-trivial trends and correlations in the DBMS's runtime behavior. They then use this information to create execution strategies and data structures that are tailored to the application's access patterns. The hope is that learned components will enable new optimizations that are not possible today because the complexity of managing DBMSs has surpassed the abilities of humans. This could then lead to the ultimate goal of achieving a "self-driving" DBMS that is able to configure, manage, and optimize itself automatically as the database and its workload evolve over time. The bad news is that creating such a fully autonomous DBMS is harder than that. The problem requires both holistic systems engineering and novel ML solutions; it cannot be solved by just adding learned components to an existing DBMS.
In this talk, I will discuss the pressing unsolved problems in self-driving DBMSs. These include how to support training data collection, fast state changes, succinct state and action representations, and accurate reward observations. I will also present techniques for building a new autonomous DBMS or the steps needed to retrofit an existing one to enable automated control.
This talk is part of the "My Wife is Pregnant" Speaking Tour. More information is available at https://cmudb.io/tour2019.
Andy Pavlo is an Associate Professor of Databaseology in the Computer Science Department at Carnegie Mellon University.
Yao Lu (Microsoft)
Title: Relational Query Optimization Meets Machine Learning
There has been an increasing trend towards optimizing machine learning workloads on tabular or unstructured (e.g., image and video) inputs. However, key questions remain: (1) is a relational system useful for machine learning? (2) how can relational query optimization techniques be applied to these workloads? In this talk, I will try to answer these questions and provide use cases to support relational machine learning.
On the other hand, using machine learning to improve relational query optimization has also attracted great attention. Current solutions provide improvements in different aspects, including workload modeling, cost and cardinality estimation, and plan selection. I will analyze how generalizable these approaches are, how far we are from a workable system, and potential solutions to some open questions.
I am a researcher in the Data Management, Exploration and Mining (DMX) group at the Microsoft Research Redmond Lab. Of late, I have been working at the intersection of ML and systems towards improved big-data platforms for large-scale machine learning, as well as using machine learning to improve current data platforms. I got my PhD from the Paul G. Allen School of Computer Science and Engineering at the University of Washington in 2018.
Ryan Marcus (Massachusetts Institute of Technology)
Title: Query Optimization and Deep Learning: Derailing the Hype Train
Recent trends in academic research portray deep learning as a messianic silver bullet: "if there's something that takes numbers in and spits numbers out, it's time to take that something, bury it 6 feet underground, and welcome our savior, the deep neural network, and its bounty of generous Parameters." In this talk, I will highlight recent developments in the machine learning literature -- namely, the idea of inductive bias and biased parameter maps -- that simultaneously (1) explain why "just add deep learning" approaches to systems problems fail, and (2) provide a problem-oriented roadmap for more rigorously applying deep learning techniques to the unique problems of database systems. Then, I will briefly show how this roadmap has been successfully applied in our recent work on Neo, a learned query optimizer. Finally, I will propose a sufficient but not necessary condition for establishing the quality of a particular application of deep learning to systems problems called "the random weights test."
Ryan Marcus is a postdoc researcher at MIT, working under Tim Kraska. Ryan recently graduated from Brandeis University, where he studied applications of machine learning to cloud databases under Olga Papaemmanouil. Before that, Ryan took courses in gender studies and mathematics at the University of Arizona, while banging his head against supercomputers at Los Alamos National Laboratory. He enjoys long walks down steep gradients, and shorter walks down gentler ones. He's also looking for a job! You can find Ryan online at https://rmarcus.info.
Zeke Wang (ETH Zurich)
Title: How do Database Technologies Help Machine Learning Systems?
It is well known that machine learning training can benefit from low precision, because machine learning algorithms are robust to noisy input data. However, current machine learning systems support only a limited set of precision levels. For example, Google's TPU only supports 8-bit and Cambricon's DianNao only supports 16-bit. In my talk, I will discuss how to teach the new dog (machine learning systems) old skills (e.g., the database's bit-sliced memory layout), such that machine learning systems allow arbitrary-precision training to take full advantage of low precision. It turns out that the old skill can work even better on machine learning systems than on the original database systems.
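To make the idea of a bit-sliced layout concrete, here is a minimal sketch in plain Python. It is only an illustration of the general layout, not the system presented in the talk: the 8-bit value width and the function names `to_bit_planes` and `read_with_precision` are assumptions chosen for the example. Values are stored as one bit plane per bit position, most significant first, so a reader can fetch only the top k planes and obtain a k-bit approximation of every value.

```python
BITS = 8  # assumed full precision of the stored unsigned values


def to_bit_planes(values):
    """Split each value into BITS planes, most significant bit first."""
    planes = []
    for bit in range(BITS - 1, -1, -1):
        planes.append([(v >> bit) & 1 for v in values])
    return planes


def read_with_precision(planes, k):
    """Rebuild values from the top k planes only; lower bits read as 0."""
    n = len(planes[0])
    out = [0] * n
    for i in range(k):
        weight = 1 << (BITS - 1 - i)  # place value of plane i
        for j in range(n):
            out[j] += planes[i][j] * weight
    return out


values = [200, 37, 255, 16]
planes = to_bit_planes(values)
print(read_with_precision(planes, 8))  # full precision: [200, 37, 255, 16]
print(read_with_precision(planes, 4))  # 4-bit precision: [192, 32, 240, 16]
```

Because precision is chosen at read time by the number of planes touched, the same stored data can serve any precision from 1 to `BITS` bits, and lower precision reads proportionally less memory.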
I am currently a post-doc researcher at ETH Zurich. I am now working on building machine learning systems, with a focus on FPGA-enhanced computation (e.g., an any-precision accelerator) and FPGA-enhanced communication (e.g., parameter server and MPI collective operations). I will be a tenure-track Assistant Professor at Zhejiang University starting in January 2020.
Zeyi Wen (National University of Singapore)
Title: ThunderML: Machine Learning Systems on Heterogeneous Architectures
The recent success of machine learning is due not only to more effective algorithms, but also to more efficient systems and implementations. We have initiated a project called ThunderML, which aims at offering high-performance machine learning as a service to users. So far, we have developed two systems for ThunderML: ThunderSVM and ThunderGBM, both of which exploit Graphics Processing Units (GPUs). ThunderSVM supports all the functionalities of LibSVM (including classification, regression, and distribution estimation), and is often 100 times faster than LibSVM. ThunderGBM is for fast Gradient Boosting Decision Trees (GBDTs) and random forests, and is often 10 times faster than XGBoost and LightGBM. Both of them are open source on GitHub, and we welcome you to contribute. In this talk, I will present the background knowledge, key techniques, and experimental results of ThunderSVM and ThunderGBM. More information about ThunderML can be found at https://github.com/Xtra-Computing/.
Zeyi Wen is currently a research fellow at the National University of Singapore, and will join The University of Western Australia as a Lecturer (equivalent to Assistant Professor in the US) in September 2019. Before working in Singapore, he was a research fellow at The University of Melbourne from 2015 to 2016, and completed his PhD degree at The University of Melbourne in 2015. His areas of research include systems for machine learning, automatic machine learning, and machine learning for databases.
Jan Kossmann (Hasso Plattner Institute)
Title: Learned Operator Cost Models
Cost estimation of entire query plans and single operators is indispensable for query optimization, resource planning (e.g., to achieve SLAs), and self-managing database systems. Nowadays, database systems are employed, especially in cloud environments, in heterogeneous application scenarios featuring different hardware landscapes where diverse workloads on various data sets are processed. This reality renders one-model-fits-all approaches impractical. Therefore, context-specific, custom-made cost estimation models, which can be generated quickly, are needed. In this talk, I will give an introduction to our work on learned operator cost models.
Jan Kossmann is a Ph.D. student at the Hasso Plattner Institute (HPI). His research focuses on various topics in the area of self-managing database systems as well as on main memory database systems in general. He is one of the maintainers of the open source system Hyrise. He also obtained his B.Sc. and M.Sc. from HPI.
Chenggang Wu (UC Berkeley)
Title: Learning from query vs learning from data
Recently, there has been growing interest in using machine learning for cardinality estimation. These techniques generally fall into one of two categories: learning from query, where supervised learning is used to learn models from past query traces, and learning from data, where unsupervised learning is used to learn models from the data distribution. In this talk, I will introduce the techniques behind both categories as well as their trade-offs.
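The difference between the two categories can be shown with a small sketch for range predicates on a single column. Both models here are deliberately simple stand-ins chosen for illustration, not the techniques covered in the talk: an equi-width histogram plays the role of the model learned from data, and a one-feature least-squares fit plays the role of the model learned from query traces. All function names and the toy data are hypothetical.

```python
def fit_from_data(column, lo, hi, buckets=4):
    """Learning from data: summarize the column as an equi-width histogram."""
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for v in column:
        idx = min(int((v - lo) / width), buckets - 1)
        counts[idx] += 1
    return {"lo": lo, "width": width, "counts": counts}


def estimate_from_data(hist, q_lo, q_hi):
    """Estimate rows with q_lo <= x < q_hi, assuming uniformity per bucket."""
    total = 0.0
    for i, count in enumerate(hist["counts"]):
        b_lo = hist["lo"] + i * hist["width"]
        b_hi = b_lo + hist["width"]
        overlap = max(0.0, min(q_hi, b_hi) - max(q_lo, b_lo))
        total += count * overlap / hist["width"]
    return total


def fit_from_queries(trace):
    """Learning from query: least-squares fit of cardinality on range width."""
    xs = [q_hi - q_lo for q_lo, q_hi, _ in trace]
    ys = [card for _, _, card in trace]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var = sum((x - mean_x) ** 2 for x in xs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var
    return slope, mean_y - slope * mean_x


def estimate_from_queries(model, q_lo, q_hi):
    slope, intercept = model
    return slope * (q_hi - q_lo) + intercept


column = list(range(100))                           # toy data: 0..99
trace = [(0, 20, 20), (10, 50, 40), (30, 90, 60)]   # (q_lo, q_hi, true rows)

hist = fit_from_data(column, lo=0, hi=100)
model = fit_from_queries(trace)
print(estimate_from_data(hist, 10, 60))
print(estimate_from_queries(model, 10, 60))
```

Note what each model is allowed to see: `fit_from_data` reads the column and never sees a query, while `fit_from_queries` reads only past predicates with their observed cardinalities and never touches the data.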
Chenggang Wu is a Ph.D. student at UC Berkeley working with Professor Joseph M. Hellerstein. His research interests lie in coordination-free distributed systems, consistency models, serverless infrastructure, and applied machine learning in systems. Prior to joining Berkeley, he obtained his B.S. degree in computer science from Brown University in 2015.