Poly'22:

Polystore systems for heterogeneous data in multiple databases with privacy and security assurances

Co-located with VLDB 2022


Introduction:

Enterprises are routinely divided into independent business units to support agile operations. However, this leads to "siloed" information systems. Such silos generate a host of problems, such as:

DISCOVERY of data relevant to the problem at hand. For example, Merck has roughly 4,000 Oracle databases, a data lake, large numbers of files, and an interest in public data from the web. Finding relevant data in this sea of information is a challenge.

INTEGRATING the discovered data. Independently constructed schemas are never compatible.

CLEANING the resulting data. A good rule of thumb is that 10% of all data is missing or wrong.

ENSURING EFFICIENT ACCESS to the resulting data. At scale, operations must be performed "in situ", and a good polystore system is a requirement.

It is often said that data scientists spend 80% (or more) of their time on these tasks, and it is crucial to have better solutions.

In addition, the EU has enacted the GDPR, which requires enterprises to assuredly delete personal data on request. This "right to be forgotten" is one of several GDPR requirements, and GDPR-like requirements are likely to spread to other jurisdictions, for example, California. Privacy and security are also a growing concern for large internet platforms. In enterprises, these issues will be front and center in the distributed information systems in place today.

Lastly, enterprise access to data in practice will require queries constructed from a variety of programming models. A “one size fits all” model just won’t work in these cases.

At VLDB’18, VLDB’19, VLDB’20 and VLDB’21 we organized the Poly workshop. These successful workshops brought together experts from around the world working on novel advances in the field. Poly’22 will continue to focus on the broader real-world polystore problem, which includes data management, data integration, data curation, privacy, and security.


Keynote talks

Raul Castro Fernandez (University of Chicago)

Data Discovery: Past, Present, and Future



Our ability to extract value from data is limited by our ability to find relevant data sources. Data discovery is the problem of identifying data that satisfies an information need. Hence, solutions to the data discovery problem are crucial to benefiting from the large volumes of data available. In this talk, I will motivate the problem with scenarios from the sciences, open data repositories, and use cases from private organizations. First, I will offer a definition of data discovery to crystallize the challenges addressed so far, the challenges that remain, and others that are arising. Then, I will use the definition of data discovery to walk through the state of the art, offering some of the questions the community of researchers (in data management and beyond) has asked and the ones they have answered. I will conclude by presenting a few open problems and an overview of some of the ongoing efforts in my group. Overall, I expect to convey that addressing data discovery is more important than ever and that although the contributions of the last few years have made significant advances, there is still much to be done, which is exciting for our community.


Biswapesh Chattopadhyay (Meta)

Shared Foundations of Open Data Lakehouse Analytics



Data processing systems have evolved significantly over the last decade, driven by factors such as the advent of cloud computing and the increasing complexity of applications such as ML, HTAP, streaming, observability, and graph processing. Historically, however, these frameworks have evolved independently, leading to significant fragmentation of the stack. In this talk, I will describe how this has played out in open source and at Meta, and how we are solving this problem through the Shared Foundations effort, leading to composable systems. This has resulted in significantly better performance, more features, higher engineering velocity, and a more consistent user experience.


Research topics:

  • Data discovery from heterogeneous data sources (e.g., data lakes)

  • Privacy, Security, and Policy in heterogeneous data management

  • Languages/Models for integrating disparate data such as graphs, arrays, relations

  • Query evaluation and optimization in polystore and other multi-DBMS systems

  • Efficient data movement and scheduling, failures and recovery for polystore analytics

  • High Performance/Parallel Computing Platforms for Big Data

  • Data Discovery, Integration, Cleaning, and Best Practices

  • Privacy and Access control in Polystore and multi-DBMS systems

  • Enterprise support for GDPR and similar privacy regulations

  • Policy implications of GDPR and similar privacy regulations

  • Mathematics for Polystore and other multi-DBMS systems

  • Demonstrations of new tools and techniques for heterogeneous data

Important Dates


July 21st, 2022: Due date for full workshop paper submissions

July 30th, 2022: Notification of paper acceptance to authors

August 10th, 2022: Camera-ready

September 9th, 2022: Workshop


Submission page

https://cmt3.research.microsoft.com/POLY2022/