Data Cleaning & Data Integration

There is no clear boundary between data cleaning and data integration. Although our work addresses both, for brevity we discuss only data integration (DI) below. Much of the discussion, however, applies to data cleaning as well.

Overall Agenda

DI has been a long-standing challenge for the data management community. Our work explores new ways to do DI research, build systems, educate students, and do outreach. It also seeks to create a "virtuous cycle" for DI.

Research: Current DI work, with some notable exceptions, studies only a few well-known steps of the DI process and focuses on developing algorithmic solutions for those steps. For example, most current work on entity matching studies only the blocking and matching steps, developing increasingly complex algorithmic solutions for them. Our work, however, argues that the DI process often involves many other "pain points". Accordingly, we propose a new research style in which we would

    • develop a step-by-step how-to guide that a user can use to execute the end-to-end DI process,

    • examine the guide to identify all pain points (not just those that are currently well-known), then

    • develop solutions and tools for the pain points.

For example, an end-to-end DI problem could be "given these two tables, perform entity matching with precision and recall of at least 90%". A how-to guide for this problem involves steps such as blocking, taking a sample, labeling the sample, selecting and training a good matcher, debugging the accuracy, cleaning the data, and so on. Many of these steps are highly challenging and are true pain points in practice, yet have received little attention to date. Our work has found that solving these pain points is critical for developing practical DI tools, and that addressing them raises many research challenges.
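To make these steps concrete, the following minimal sketch walks through the guide on a toy matching problem: block candidate pairs by token overlap, label the candidates, tune a simple similarity-threshold matcher on the labels, then measure precision and recall. All data and helper names here are illustrative, not taken from any particular DI tool.

```python
# Illustrative end-to-end entity matching sketch (toy data, hypothetical
# helpers -- not the API of any specific DI system).

def tokens(s):
    return set(s.lower().split())

def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

# Two input tables, each mapping a record id to a product name.
A = {1: "apple iphone 7", 2: "samsung galaxy s7", 3: "google pixel phone"}
B = {10: "iphone 7 32gb", 20: "galaxy s7 edge", 30: "nokia 3310 phone"}

# Step 1: blocking -- keep only pairs that share at least one token.
candidates = [(i, j) for i in A for j in B if tokens(A[i]) & tokens(B[j])]

# Step 2: label the candidate pairs (in practice a user hand-labels a
# random sample; here the toy set is small enough to label fully).
gold = {(1, 10), (2, 20)}                     # true matches
labels = {p: p in gold for p in candidates}

# Step 3: "train" a matcher -- a Jaccard threshold chosen to maximize
# F1 on the labeled data.
def evaluate(threshold):
    predicted = {p for p in candidates
                 if jaccard(A[p[0]], B[p[1]]) >= threshold}
    actual = {p for p, is_match in labels.items() if is_match}
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    return precision, recall

def f1(threshold):
    p, r = evaluate(threshold)
    return 2 * p * r / (p + r) if p + r else 0.0

best = max((t / 10 for t in range(1, 10)), key=f1)

# Step 4: debug/report accuracy on the labeled data.
precision, recall = evaluate(best)
```

Even this toy version surfaces the pain points the guide is meant to expose: how to sample and label efficiently, how to pick and debug the matcher, and how to know when the accuracy target is actually met.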

System Building: Our work introduces a new way to build DI systems. Most current DI systems are built as stand-alone monolithic systems, which are very difficult to extend, customize, and combine. We observe that many DI steps essentially perform data analysis, i.e., data science tasks, and that there already exist vibrant ecosystems of open-source data science tools (e.g., those in Python and R) that data scientists use heavily to solve such tasks. Thus, we propose to develop DI tools within these data science ecosystems. This way, DI tools can easily exploit other tools in the ecosystems and, at the same time, make the ecosystems better at solving DI problems.
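As a small illustration of this point (using plain Python records rather than any specific DI system's API), a DI step such as attribute normalization can be written as an ordinary function over the ecosystem's native data model, after which downstream analysis is just regular data science code:

```python
import re
from collections import Counter

# Illustrative only: a DI step (attribute normalization) written as a
# plain function over the ecosystem's record format (dicts here; in
# PyData it would typically be pandas DataFrames). Because the step
# speaks the ecosystem's data model, it composes directly with
# general-purpose tools instead of living inside a monolithic system.

def normalize_phone(record, field="phone"):
    """Strip a phone attribute down to its digits -- one small, reusable DI step."""
    cleaned = dict(record)
    cleaned[field] = re.sub(r"\D", "", record.get(field, ""))
    return cleaned

rows = [
    {"name": "Ann", "phone": "(608) 555-0101"},
    {"name": "Bo", "phone": "608.555.0101"},
]

cleaned = [normalize_phone(r) for r in rows]

# Downstream analysis is then ordinary data science code, e.g.,
# counting records that share the same normalized phone number.
phone_counts = Counter(r["phone"] for r in cleaned)
```

The same function could be dropped into a larger pipeline (blocking, matching, deduplication) without the pipeline having to know it is a "DI tool" at all; that composability is the argument for building inside the ecosystem.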

This differs from current system-building practice in two ways. First, it suggests that instead of building isolated stand-alone systems for DI (the way we built RDBMSs for relational data management), we should focus on building ecosystems of tools. Second, it suggests that researchers working on DI in our community should "connect" with the vibrant and expanding ecosystems of open-source data science tools and build DI tools directly into those ecosystems.

We then seek to capitalize on these tools to build DI systems for collaborative/cloud/crowd/lay user settings. We also work on fostering an ecosystem of DI tools (as a part of PyData, the Python data science ecosystem).

Education: The new style of research and system building described above suggests a new way to educate and train our students in DI. Today, we teach our students isolated research problems in DI and ask them to do projects using mostly stand-alone research-prototype DI tools that industry is often unfamiliar with. Under the new approach, we first teach our students to solve DI problems end-to-end, to identify pain points, and to find or develop new tools that address those pain points, so that they solve DI problems grounded in practice. Second, we train them to use tools in the open-source data science ecosystems (e.g., PyData), which they are likely to use again in industry. Finally, any new research tools we develop will be part of such ecosystems and thus can be naturally evaluated by students in their class projects.

Outreach: Work in research, system building, and education, as described above, provides a strong foundation for us to do outreach, which we view as a critical part of our overall agenda. In working with domain scientists and companies, we seek to understand the tools that they are using and their pain points, and look for opportunities to deploy and evaluate our tools.

A Virtuous Cycle for DI: Back in the heyday of RDBMSs, our community had a virtuous cycle. There was a clear blueprint for building RDBMSs, as exemplified by System R and Ingres. System building efforts focused on building these systems. Research focused on solving problems arising from building them. Courses taught students how to build and use them. And companies (e.g., IBM, Oracle, Microsoft) and users also built and used such systems. A "virtuous cycle" tied together research, system building, education, and outreach, and at the heart of this cycle was an agreement on how RDBMSs should be built. We believe this cycle played a major part in the rapid advances and successes of relational data management. Note that a strong system building focus was key to enabling this virtuous cycle.

So far, no such cycle exists for DI. There is no agreement on how DI systems should be built. As a result, research drifts toward developing ever more complex algorithmic solutions, each group builds (prototype) DI systems in a different way, and students are taught mostly with such prototypes, which industry is largely unfamiliar with.

In this direction, we hope to foster a virtuous cycle for DI, in which there is general agreement that DI tools should be built (in a certain way) as part of the ecosystem of data science tools. This may allow us to tie together research, system building, education, and outreach into a virtuous cycle that lets the field move far faster and have far more practical impact. Again, we believe a strong focus on building practical DI systems is critical to enabling this cycle.

Current Progress

  • Papers and talks describing the overall agenda

  • Research and system development

    • As an example of the new kinds of DI systems that we are building in the context of the above agenda, see the Magellan entity matching management system, as described in the VLDB-16 paper and on the Magellan project homepage. Magellan is being built as a part of the PyData ecosystem.

      • Building on Magellan, we are developing CloudMatcher, a set of services that enable entity matching for collaborative/cloud/crowd/lay user settings.

      • We are partnering with Recruit Institute of Technology to build BigGorilla, a repository of DI tools (as a part of the PyData ecosystem).

  • Education

    • We have used Magellan, BigGorilla, and related tools in several data science classes at UW-Madison: 838 (graduate level) and 638 (undergraduate level). Students have extensively evaluated our tools (e.g., resulting in the VLDB-16 paper) and have successfully learned to use them together with tools in the PyData ecosystem to solve a range of data cleaning and integration problems.

  • Outreach

    • We have been working with several data science teams at UW-Madison and with several companies to use and evaluate our tools, and to examine their DI problems and understand their pain points. More will be reported here soon.