January 29th, 2021


9:00 START

9:10 KEYNOTE: "The case for sort-based query processing"; Goetz Graefe, Google

10:00 "Early Adventures in Streaming SQL Extensions"; Tyler Akidau, Snowflake

10:20 "Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite"; Mosha Pasumansky & Julian Hyde, Google

10:40 "Automating Data Visualization with Program Synthesis"; Chenglong Wang, UW

11:00 BREAK

11:10 "Order and Consistency at Scale"; Daniel Chia, Google

11:30 "Coroutine-oriented Transaction Execution"; Tianzheng Wang, SFU

11:50 "Streams at Snowflake: A Technical Overview"; Tyler Jones, Snowflake

12:10 LUNCH

1:10 KEYNOTE: "Data Equity Systems"; Bill Howe, UW

2:00 "Bottomless: A Cloud Native Architecture for Mixed Transactional and Analytic Workloads"; Joseph Victor, SingleStore

2:20 "TASM: A Tile-Based Storage Manager for Video Analytics"; Maureen Daum, UW

2:40 "Deriving Real-Time Insights Over Petabytes of Time Series Data with Amazon Timestream"; Sudipto Das, Amazon Web Services

3:00 BREAK

3:10 "Query Optimization for Novel Big Data Systems"; Remy Wang, UW

3:30 "DIAMetrics: Benchmarking Query Engines at Scale"; Stratis Viglas, Google

3:50 "Python at Speed and Scale using Cloud Backends"; Alekh Jindal, Microsoft

4:10 "Delta Engine: Building a modern execution engine for Lakehouse"; Shant Hovsepian, Databricks

4:30 END



Goetz Graefe has worked on database query optimization, query execution, indexing algorithms, database utilities, logging and recovery, and transactional concurrency control. He has published survey papers on query optimization, query execution, sorting, b-tree concurrency control, and b-tree recovery, as well as monographs on modern b-tree techniques, instant recovery based on write-ahead logging, and transactional concurrency control. He invented the Cascades optimizer framework and the exchange operator for encapsulating query parallelism, both adopted in many products. In 2017, he received the ACM SIGMOD Edgar F. Codd Innovations Award. He graduated from UW-Madison with a Ph.D. in computer science in 1987.


Common wisdom holds that hash-based query execution algorithms are the best choice for unsorted inputs, e.g., intermediate query results, and that any efficient database query processor must include hash-based query execution algorithms, e.g., hybrid hash join and hash aggregation. In contrast, this paper argues that sort-based query execution

1. can be as efficient as hash-based query execution, even for large unsorted inputs;

2. requires less memory, less overflow, and less CPU effort for sorted and partially sorted inputs;

3. exploits sort orders commonly found in index structures, in column stores, and in intermediate query results; and

4. provides operational advantages, e.g., for progress estimation, for resource management, and for pause-and-resume or pause-migrate-and-resume with minimal wasted effort.

In other words, hash-based algorithms are not required for database query processing because equivalent sort-based algorithms, carefully designed and implemented, are always just as efficient and very often more efficient. Parallel external merge sort can serve as the only stop-and-go operation, also known as a pipeline breaker, in a database query execution engine.
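To make the argument concrete, here is a minimal sketch (not from the talk; names and data are hypothetical) of the sort-based alternative to a hash join: a merge join that sorts both inputs on the join key and then scans them in lockstep. When an input is already sorted or partially sorted (e.g., it arrives from a B-tree index or a sorted column store), the sort step degenerates to a cheap verification pass, which is the case the abstract highlights.

```python
def merge_join(left, right, key=lambda row: row[0]):
    """Sort-based equi-join of two lists of tuples on a key.

    Both inputs are sorted first; an already-sorted input costs
    little extra (Python's Timsort exploits existing runs, loosely
    analogous to how a database sort exploits partial sort order).
    """
    left = sorted(left, key=key)
    right = sorted(right, key=key)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # Emit the cross product of the two matching key groups,
            # rewinding the right cursor for each left row with this key.
            j0 = j
            while j < len(right) and key(right[j]) == kl:
                out.append(left[i] + right[j])
                j += 1
            i += 1
            j = j0
    return out

# Hypothetical example inputs: (customer_id, item) and (customer_id, name).
orders = [(1, "pen"), (2, "ink"), (2, "pad")]
customers = [(2, "Ada"), (1, "Bob"), (3, "Cyd")]
print(merge_join(orders, customers))
```

Unlike a hash join, the merge phase reads each input sequentially and its progress is directly proportional to input consumed, which illustrates the operational advantages the abstract lists (progress estimation, pause-and-resume).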


Bill Howe is an Associate Professor in the Information School and an Adjunct Associate Professor in the Allen School of Computer Science & Engineering and the Department of Electrical Engineering. His research interests are in data management, machine learning, and visualization, particularly as applied in the physical and social sciences. As Founding Associate Director of the UW eScience Institute, Dr. Howe played a leadership role in the Moore-Sloan Data Science Environment program through a $32.8 million grant awarded jointly to UW, NYU, and UC Berkeley, and founded UW’s Data Science for Social Good Program. With support from the MacArthur Foundation, NSF, and Microsoft, Howe directs UW’s participation in the Cascadia Urban Analytics Cooperative. He founded the UW Data Science Masters Degree, serving as its inaugural Program Chair, and created one of the first MOOCs on data science, which attracted over 200,000 students. His research has been featured in the Economist and Nature News, and he has authored award-winning papers in conferences across data management, machine learning, and visualization. He has a Ph.D. in Computer Science from Portland State University and a Bachelor’s degree in Industrial & Systems Engineering from Georgia Tech.


As the deployment of automated decision tools in society continues to accelerate, their interactions with fundamental questions in law, in the social sciences, and in public policy have become impossible to ignore. Although the technology holds the promise of reducing costs, reducing errors, and improving objectivity, there is enormous potential for harm. As we train algorithms on biased data, we are amplifying, operationalizing, and, most insidiously, legitimizing the historical discrimination and opacity that technology was in part intended to address.

End-to-end data systems, with the DBMS playing a central role, provide abstractions to hide complexity. But as these systems are deployed in real social contexts, those abstractions can reduce transparency and exacerbate equity issues: they make it very easy to do the wrong thing. In this talk, I’ll describe our take on "data equity" and argue that data systems research needs to broaden its scope to explicitly model, manage, and communicate assumptions and requirements about the contexts in which systems are deployed, making equity issues a first-class design consideration. I'll provide some examples of the tensions and describe some technical research areas in this space, including learning fair representations across heterogeneous data sources, providing interactive warning labels to communicate fitness for use, and model transferability as it relates to equity.