April 19, 4:45 p.m.

The periodic table of data structures and the path toward self-designing data systems

Data structures are everywhere. They define the behavior of modern data systems and data-driven algorithms. For example, data systems that use the right data structure design for the problem at hand can reduce the monthly bill of large-scale cloud applications by hundreds of thousands of dollars. We can accelerate data science tasks by dramatically speeding up the computation of statistics over large amounts of data. We can train drastically more neural networks within a given time budget, improving accuracy.

However, knowing the right data structure and system design for any given scenario is a notoriously hard problem: the space of possible designs is massive, and no single design is perfect across all data, query, and hardware scenarios. We will discuss our quest for the first principles of data structure and data system design. We will show signs that it is possible to reason about this massive design space, and we will show early results from a prototype self-designing data system that can take drastically different shapes to optimize for the workload, hardware, and available cloud budget. These shapes include data structure and system designs that are discovered automatically and do not exist in the literature or industry.

Stratos Idreos


Stratos Idreos is an associate professor of Computer Science at Harvard University, where he leads the Data Systems Laboratory. His research focuses on making it easy and even automatic to design workload- and hardware-conscious data structures and data systems, with applications to relational, NoSQL, and data science problems. For his PhD thesis on adaptive indexing, Stratos was awarded the 2011 ACM SIGMOD Jim Gray Doctoral Dissertation Award and the 2011 ERCIM Cor Baayen Award from the European Research Consortium for Informatics and Mathematics. In 2015 he was awarded the IEEE TCDE Rising Star Award from the IEEE Technical Committee on Data Engineering for his work on adaptive data systems, and in 2020 he received the ACM SIGMOD Contributions Award. Stratos is also a recipient of the National Science Foundation CAREER Award and the Department of Energy Early Career Award.

June 21, 4:30 p.m.

Deep Entity Matching: Challenges and Opportunities

Entity matching refers to the task of determining whether two different representations refer to the same real-world entity. It continues to be a prevalent problem for many organizations where data resides in different sources and duplicates need to be identified and managed. The term “entity matching” also loosely refers to the broader problem of determining whether two heterogeneous representations of different entities should be associated with each other. This problem has an even wider scope of applications with significant consequences, from determining the subsidiaries of companies to matching jobs to job seekers.

In this talk, I will present Ditto, an example of a modern entity matching system based on pre-trained language models. I will also summarize recent solutions that apply deep learning and pre-trained language models to the entity matching task and discuss some challenges and opportunities for further exciting work in this area.

This talk is based on work with Yuliang Li, Jinfeng Li, Yoshihiko Suhara, Jin Wang, and Wataru Hirota while the speaker was at Megagon Labs.

Wang-Chiew Tan


Wang-Chiew is a research scientist at Facebook AI. Prior to joining Facebook AI, she led the research efforts at Megagon Labs with the goal of building advanced technologies to enhance search by experience, where her team conducted research on data integration, information extraction, text mining and summarization, knowledge base construction and commonsense reasoning, and data visualization. Prior to that, she was a Professor of Computer Science at the University of California, Santa Cruz. She also spent two years at IBM Research - Almaden. Her research interests include data integration and exchange, data provenance, and natural language processing. She is a co-recipient of the 2014 ACM PODS Alberto O. Mendelzon Test-of-Time Award, the 2018 ICDT Test-of-Time Award, and the 2020 Alonzo Church Award. She received the 2019 VLDB Women in Database Research Award. She was on the VLDB Board of Trustees (2014-2019) and she is a Fellow of the ACM.