
Agenda for DI Systems

Data integration (DI) has been a long-standing challenge in the data management community. So far the vast majority of DI work has focused on developing DI algorithms. Going forward, we argue that far more effort should be devoted to building DI systems in order to advance the field. DI is engineering by nature, so we cannot keep developing DI algorithms in a vacuum. At some point we must build end-to-end systems to evaluate the algorithms, to integrate research and development efforts, and to have practical impact.

The question then is what kinds of DI systems we should build, and how. In this project we first identify problems with current DI systems, then develop a radically new agenda for building DI systems. DI systems built under this agenda have the following distinguishing characteristics:
  • They guide the user through the end-to-end DI workflow, step by step. 
  • For each step, they provide automated or semi-automated tools to address the "pain points" of the step. 
  • Tools seek to cover the entire DI workflow, not just a few steps as current DI systems often do. 
  • Tools are built on top of an existing data science and big data ecosystem. Today the two most popular such ecosystems are built on R and Python. We currently target the Python data science and big data ecosystem.
There are other novelties in this new agenda, such as a distinction between the development stage and the production stage, a focus on serving power users for now, and open-world versus closed-world systems. More details will be added in the rest of 2016. While it focuses on DI, our agenda can potentially be applied to other kinds of problems in the data science pipeline as well (e.g., data cleaning, data exploration and profiling, and information extraction).
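
To make the idea of a step-by-step workflow with one tool per step more concrete, here is a minimal sketch in Python for a toy entity matching task. It is an illustration only and is not the API of any system described on this page: the tables, the function names (block, similarity, match, evaluate), the choice of string similarity, and the 0.7 threshold are all hypothetical assumptions, and only the Python standard library is used.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical toy tables; a real workflow would load these from files.
table_a = [
    {"id": "a1", "name": "Dave Smith", "city": "Madison"},
    {"id": "a2", "name": "Joe Wilson", "city": "San Jose"},
]
table_b = [
    {"id": "b1", "name": "David Smith", "city": "Madison"},
    {"id": "b2", "name": "Daniel Smith", "city": "Middleton"},
    {"id": "b3", "name": "Joseph Wilson", "city": "San Jose"},
]


def block(left, right, key):
    """Step 1 (blocking): keep only the pairs that agree exactly on `key`."""
    return [(l, r) for l, r in product(left, right) if l[key] == r[key]]


def similarity(l, r, field):
    """Case-insensitive string similarity in [0, 1] for one field."""
    return SequenceMatcher(None, l[field].lower(), r[field].lower()).ratio()


def match(candidates, field, threshold):
    """Step 2 (matching): predict a match when similarity reaches the threshold."""
    return [
        (l["id"], r["id"])
        for l, r in candidates
        if similarity(l, r, field) >= threshold
    ]


def evaluate(predicted, gold):
    """Step 3 (evaluation): precision and recall against labeled pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Run the steps in order; each intermediate result can be inspected
# before moving on, which is the point of a guided workflow.
candidates = block(table_a, table_b, key="city")
matches = match(candidates, field="name", threshold=0.7)
gold_pairs = [("a1", "b1"), ("a2", "b3")]  # hypothetical labeled matches
precision, recall = evaluate(matches, gold_pairs)
```

Each step here is a separate function so that a user could swap in a different blocker or matcher, or inspect the candidate pairs before matching. A real system would replace these toy functions with far more capable tools.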

Current Progress
  • See this talk for the overall agenda (and motivation for it)
  • As an example of the new kinds of DI systems that we are building in the context of this agenda, see the Magellan entity matching management system, as described in the VLDB-16 paper and on the Magellan project homepage.