Invited Talks

Keynotes

Andrew Witkowski, Rafi Ahmed & Murali Thiyagarajan (Oracle)

Title: Oracle Autonomous DB: Challenges and Some Solutions

Abstract:

The new Oracle Autonomous Database automates tasks that were typically performed by DBAs or experienced users, including Auto-Indexing, Auto-Materialized Views, Auto-Zonemaps, and Auto-Partitioning. These autonomous tasks eliminate the need for DBAs, so users can concentrate on running their applications rather than on tuning them. In this talk we will elaborate on the tools that were needed to automate these tasks, give an overview of the features and their main challenges, and describe the Machine Learning technology applied in some of their implementations. In addition, we will elaborate on some challenging future database areas where automation could be very beneficial, including management of the database itself and automatic performance improvement of the series of SQL statements generated by typical multi-statement reports.

Bio:

Andrew Witkowski is a Vice President at Oracle Corporation. He holds a Ph.D. in computer science and manages the top layer of query processing, including the Optimizer, execution of SQL statements, External Tables, Parallel Query, the Oracle procedural language PL/SQL, Materialized Views, and Online Redefinition. He has worked on many SQL extensions, including Analytic Functions, SQL Spreadsheet, SQL Pattern Matching, Multi-Dimensional Zonemaps, and External and Hybrid Partitioned Tables. He has published several papers at the SIGMOD and VLDB conferences and holds 57 US patents, including 8 pending. Previously he worked at Teradata and the Jet Propulsion Laboratory.


Rafi Ahmed holds an M.S. in mathematics and a Ph.D. in computer science. Currently, he is a Consulting MTS at Oracle Corporation. He has worked in the areas of temporal databases, version management, heterogeneous databases, query transformation and optimization, external partitioned tables, and materialized view rewrite. He has published many research papers in database journals and in data engineering, SIGMOD and VLDB conferences. He has written an article in an encyclopedia and several chapters in database books. He has 28 U.S. patents. He previously worked at Informix and Hewlett-Packard Laboratories.


Tim Kraska (MIT)

Title: Towards Learned Algorithms, Data Structures, and Systems

Abstract:

All systems and applications are composed of basic data structures and algorithms, such as index structures, priority queues, and sorting algorithms. Most of these primitives have been around since the beginnings of computer science (CS) and form the basis of every CS intro lecture. Yet, we might soon face an inflection point: recent results show that machine learning has the potential to alter the way those primitives, or systems at large, are implemented in order to provide optimal performance for specific applications.

In this talk, I will provide an overview of how machine learning is changing the way we build systems and outline different ways to build learned algorithms and data structures to achieve “instance-optimality”, with a particular focus on data management systems.

Bio:

Tim Kraska is an Associate Professor of Electrical Engineering and Computer Science in MIT's Computer Science and Artificial Intelligence Laboratory and co-director of the Data Systems and AI Lab at MIT (DSAIL@CSAIL). Currently, his research focuses on building systems for machine learning, and using machine learning for systems. Before joining MIT, Tim was an Assistant Professor at Brown, spent time at Google Brain, and was a postdoc in the AMPLab at UC Berkeley. Tim is a 2017 Alfred P. Sloan Research Fellow in computer science and has received several awards, including the VLDB Early Career Research Contribution Award, the VMware Systems Research Award, the university-wide Early Career Research Achievement Award at Brown University, an NSF CAREER Award, as well as several best paper and demo awards at VLDB and ICDE.


Ippokratis Pandis (AWS)

Title: Practical Use of Machine Learning in Amazon Redshift

Abstract:

Machine learning is an excellent tool for scaling out operations and improving the efficiency, performance, and cost of database systems and managed data management services. In this talk we will discuss Amazon Redshift, Amazon’s managed, petabyte-scale data warehouse service, and the different ways in which machine learning is useful to, and is being used by, this service.

Bio:

Ippokratis Pandis is a senior principal engineer at Amazon Web Services working on Amazon Redshift. Redshift is Amazon’s fully managed, petabyte-scale data warehouse service. Previously, Ippokratis held positions as a software engineer at Cloudera, where he worked on the Impala SQL-on-Hadoop query engine, and as a member of the research staff at the IBM Almaden Research Center. He received his PhD from Carnegie Mellon University.

Invited Talks

Bailu Ding (Microsoft Research)

Title: Autonomous index tuning

Abstract:

Index tuning is crucial for database performance. State-of-the-art index tuners rely on the query optimizer’s cost estimates to search for the index configuration with the largest estimated improvement in execution cost. Due to well-known limitations in the optimizer’s estimates, in a significant fraction of cases an index that is estimated to improve a query’s execution cost, e.g., CPU time, makes it worse when implemented, i.e., the query regresses. Such errors are a major impediment to automated indexing in production systems.

In this talk, I will give an overview of autonomous index tuning and describe two techniques to improve index tuning without causing query regressions.

We observe that comparing the execution costs of two plans of the same query corresponding to different index configurations is a key step during index tuning. Instead of using the optimizer’s estimates for this comparison, our key insight is that formulating it as a classification task in machine learning results in significantly higher accuracy. We present a study of the design space for this classification problem. We further show how to integrate this classifier into state-of-the-art index tuners with minimal modifications. Our evaluation using industry-standard benchmarks and a large number of real customer workloads demonstrates up to 5× reduction in the errors in identifying the cheaper plan in a pair, which eliminates almost all query execution cost regressions when the model is used in index tuning.
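The plan-pair comparison described above can be sketched as a binary classifier over pairwise features of the two plans. The features, weights, and linear scoring below are illustrative assumptions, not the actual design of the tuner; a real system would train a model (e.g., a tree ensemble) on observed execution costs.

```python
# Hypothetical sketch: decide which of two plans for the same query is
# cheaper by classifying pairwise plan features, rather than trusting
# the optimizer's raw cost estimates.

def plan_features(plan_a, plan_b):
    """Pairwise features: ratios/differences of plan statistics."""
    return [
        plan_a["est_cost"] / max(plan_b["est_cost"], 1e-9),  # cost ratio
        plan_a["est_rows"] - plan_b["est_rows"],             # row diff
        float(plan_a["uses_index"]) - float(plan_b["uses_index"]),
    ]

def is_cheaper(plan_a, plan_b, weights=(-1.0, -1e-6, 0.5), bias=1.0):
    """Return True if the classifier predicts plan_a is cheaper than plan_b.

    The fixed linear score stands in for a trained model; in practice the
    weights would be learned from executed query pairs.
    """
    score = bias + sum(w * f for w, f in zip(weights, plan_features(plan_a, plan_b)))
    return score > 0
```

The key design point this illustrates is that the model answers only the relative question "which plan is cheaper?", which is easier to get right than predicting absolute execution cost.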

While the machine learning model works well when sufficient execution statistics are available, we observe that its performance degrades significantly when the test data diverges from the data used to train the model. We address this performance degradation by using B-instances to collect additional data during deployment. We propose an active data collection platform, ADCP, that employs active learning (AL) to gather relevant data cost-effectively. We develop a novel AL technique, Holistic Active Learner (HAL), that robustly combines multiple noisy signals for data gathering in the context of database applications. HAL applies to various ML tasks, budget sizes, cost types, and budgeting interfaces for database applications. We evaluate ADCP on both industry-standard benchmarks and real customer workloads. Our evaluation shows that, compared with other baselines, our technique improves ML models’ prediction performance by up to 2× with the same cost budget. In particular, on production workloads, our technique reduces the prediction error of ML models by 75% using about 100 additionally collected queries.
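The budgeted data-collection idea can be sketched as follows. This is a simplified, hypothetical illustration: the two signals, their weights, and the greedy selection are assumptions for exposition, not HAL's actual combination strategy.

```python
# Hypothetical sketch of budgeted active data collection: rank candidate
# queries by a combined "usefulness" score and execute the top ones on a
# B-instance until the cost budget is exhausted.

def select_queries(candidates, budget):
    """candidates: list of dicts with noisy signals and an execution cost.

    Each candidate has 'id', 'uncertainty' (model confidence signal),
    'novelty' (distance from training workload), and 'cost' (exec cost).
    """
    def score(q):
        # Combine multiple noisy signals into one usefulness score.
        return 0.7 * q["uncertainty"] + 0.3 * q["novelty"]

    chosen, spent = [], 0.0
    for q in sorted(candidates, key=score, reverse=True):
        if spent + q["cost"] <= budget:
            chosen.append(q["id"])
            spent += q["cost"]
    return chosen
```

The point of robustly combining signals is that any single signal (e.g., model uncertainty alone) can be misleading on shifted workloads, so the selector hedges across several.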

Bio:

Bailu Ding is a Senior Researcher at Microsoft Research. She has worked on transaction processing, index tuning, and query optimization. Her recent work includes using machine learning techniques for index tuning and leveraging query processing techniques to improve query optimization quality.


Zongheng Yang (UC Berkeley)

Title: Self-supervised learning for cardinality estimation

Abstract:

Cardinality estimation is a critical component in nearly every database system and query engine. Research has suggested that our current cardinality estimators yield low accuracy, while improved estimates can result in faster queries and lower resource usage by sizable margins.

This talk introduces a new angle of attack, deep self-supervised learning, for cardinality estimation. We describe two systems that we worked on in the past year, Naru and NeuroCard. Naru is a single-table estimator that captures all possible correlations among columns, without making any independence assumptions. It learns by simply reading tuples in a purely unsupervised fashion. NeuroCard builds on top of Naru and enables independence-free estimation for joins. Compared to prior approaches, both estimators provide orders of magnitude more accurate estimates for challenging multidimensional range queries. We describe the two systems and the techniques that enable them, and discuss future directions.
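The independence-free estimation idea rests on the autoregressive factorization P(c1, c2, …) = P(c1) · P(c2 | c1) · …, which preserves all column correlations. In Naru-style estimators a deep neural network learns these conditionals; in the toy sketch below, counting over tuples stands in for the learned model, and the two-column point-predicate setting is a simplifying assumption for illustration.

```python
from collections import Counter, defaultdict

# Toy sketch of the autoregressive factorization behind learned
# cardinality estimators: |T| * P(c1=a) * P(c2=b | c1=a).

class TinyAutoregressiveEstimator:
    def __init__(self, tuples):
        self.n = len(tuples)
        self.p1 = Counter(t[0] for t in tuples)        # marginal of column 1
        self.p2 = defaultdict(Counter)                  # conditional of column 2
        for a, b in tuples:
            self.p2[a][b] += 1

    def estimate(self, a, b):
        """Estimated cardinality of the predicate c1=a AND c2=b."""
        pa = self.p1[a] / self.n
        pba = self.p2[a][b] / self.p1[a] if self.p1[a] else 0.0
        return self.n * pa * pba
```

Because the second factor is conditioned on the first column's value, correlated columns are handled exactly; a naive independence assumption would instead multiply the two marginals and can be off by orders of magnitude.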

Bio:

Zongheng Yang is a fourth-year PhD student in the RISELab at UC Berkeley. He works on applying advances in deep learning to systems, with a current focus on making query engines more efficient. In the past, he worked at Google Brain, and interned at Microsoft Research, Databricks, and Twitter.


Taize Wang (4Paradigm)

Title: Towards the Real-World AI Applications Deployment: Feature Engineering and Database

Abstract:

At present, popular machine learning frameworks, such as TensorFlow and PyTorch, rarely focus on feature engineering and lack best practices for deploying feature engineering in production. However, feature engineering plays an essential role in applying machine learning models to real-world applications. As a result, companies face the following problems when deploying AI models into production. First, developing and deploying an AI model in practice can consume expensive R&D resources. Second, inconsistency between offline and online features leads to poor model performance. Third, processing thousands of complex online features leads to poor online performance, which cannot meet the real-time demands of many real-world applications. To this end, we will share our experience addressing these issues as well as one of our core pieces of AI infrastructure, RTIDB. RTIDB is designed to close the gap between model training and model deployment. It is particularly optimized for efficient feature engineering, based on our rich experience deploying AI applications to thousands of customers. Furthermore, we have adopted specific techniques such as skip lists and multi-level indexing to improve performance.
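As a rough illustration of the skip-list technique the abstract mentions, here is a minimal toy version supporting insert and exact-match search. This is a generic textbook sketch, not RTIDB's actual implementation, which would additionally handle time-windowed reads, concurrency, and multi-level indexes.

```python
import random

class _Node:
    def __init__(self, key, value, level):
        self.key, self.value = key, value
        self.forward = [None] * level  # next pointers, one per level

class SkipList:
    """Minimal skip list: probabilistic balanced index over sorted keys."""
    MAX_LEVEL = 8
    P = 0.5  # probability of promoting a node to the next level

    def __init__(self):
        self.head = _Node(None, None, self.MAX_LEVEL)
        self.level = 1

    def _random_level(self):
        lvl = 1
        while random.random() < self.P and lvl < self.MAX_LEVEL:
            lvl += 1
        return lvl

    def insert(self, key, value):
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        # Record the rightmost node before `key` at every level.
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        new = _Node(key, value, lvl)
        for i in range(lvl):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def search(self, key):
        node = self.head
        # Descend from the sparsest level to the densest.
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node.value if node and node.key == key else None
```

Skip lists give expected O(log n) lookups while keeping keys in sorted order, which makes them attractive for in-memory feature stores that must serve both point reads and ordered (e.g., most-recent-first) scans.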

Bio:

Taize Wang is the team leader of the Feature Engineering Storage Group at 4Paradigm and the inventor, architect, and maintainer of 4Paradigm's core AI storage infrastructure, RTIDB. RTIDB is specially designed to optimize storage efficiency for AI applications and to tackle the data storage and query challenges of offline-online data consistency, high throughput, and low latency. RTIDB has been deployed for thousands of customers across over 10,000 AI scenarios, including DHL, Budweiser, Friso, Yum China, etc. Before joining 4Paradigm, Taize Wang was a senior researcher and engineer at Baidu, where he was a major contributor to Baidu's web search engine.