Entity matching (EM) is a fundamental problem in data integration. So far the vast majority of EM work has focused on developing EM algorithms. Going forward, we argue that far more effort should be devoted to building EM systems in order to truly advance the field. We identify four major problems with current EM systems that prevent them from widespread practical use:
- When performing EM, users often must execute many steps. Current systems, however, do not cover the entire EM pipeline; they support only a few steps (e.g., matching, blocking).
- EM steps often exploit many techniques, e.g., learning, visualization, outlier detection, information extraction, and crowdsourcing. Today, however, it is very difficult to exploit a wide range of such techniques. Incorporating them all into a single EM system is extremely difficult, but moving data repeatedly between an EM system, an IE system, a visualization system, etc. is equally troublesome and time consuming.
- Users often need an interactive scripting environment to write code to patch the system. But few current EM systems provide such a facility.
- Finally, and most importantly, in many EM scenarios users often do not know what steps to take. How to start? What should they do next? Current systems provide no how-to guides to help the users navigate the complicated EM process.
Magellan addresses these problems.
- It provides how-to guides that tell users what to do in each EM scenario, step by step.
- It provides tools that help users address the "pain points" in these steps. The tools seek to cover the entire EM pipeline.
- The tools are being built on top of the Python data science and big data eco-system, allowing users to easily exploit a wide range of techniques in learning, visualization, cleaning, etc. (as captured in numerous Python packages in this eco-system).
- An added benefit of integration with the Python data eco-system is that Magellan has a powerful interactive scripting environment that users can use to prototype code to patch the system.
Magellan is thus an example of a new kind of data management system that we call an "open-world system", because it relies on many other systems in the eco-system in order to provide the fullest amount of support to the user doing EM. Building such open-world systems raises non-trivial challenges, e.g., designing data structures, managing metadata, and handling missing values.
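To make the metadata challenge more concrete, here is a minimal sketch, assuming tables are plain Python lists of dicts and that a table's key attribute is recorded in a separate catalog. Because other tools in the eco-system may modify a table after the key was recorded, the catalog re-checks the key every time it is read. The names `Catalog` and `is_valid_key` are hypothetical and are not Magellan's API.

```python
def is_valid_key(table, key):
    """Return True if `key` is present, non-missing, and unique in every row."""
    values = [row.get(key) for row in table]
    return None not in values and len(set(values)) == len(values)


class Catalog:
    """Hypothetical side store that maps a table (by identity) to its metadata."""

    def __init__(self):
        self._meta = {}

    def set_key(self, table, key):
        """Record `key` as the key attribute of `table` after validating it."""
        if not is_valid_key(table, key):
            raise ValueError(f"{key!r} is not a valid key for this table")
        self._meta[id(table)] = {"key": key}

    def get_key(self, table):
        """Return the recorded key, re-checking that it still holds.

        The table may have been changed by code that knows nothing about
        the catalog, so the recorded metadata can be stale.
        """
        key = self._meta[id(table)]["key"]
        if not is_valid_key(table, key):
            raise ValueError(f"metadata is stale: {key!r} is no longer a key")
        return key


catalog = Catalog()
people = [{"id": 1, "name": "Dave Smith"}, {"id": 2, "name": "Joe Wilson"}]
catalog.set_key(people, "id")
```

Validating on every read costs a pass over the table, but it means a command never silently trusts metadata that another tool has invalidated.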
Magellan is named after Ferdinand Magellan, the first end-to-end explorer of the globe.
- This project is coordinated by Pradap Konda. The entire group, however, is working on Magellan, with different people addressing different aspects of the system.
Code & Data
- Magellan: Toward Building Entity Matching Management Systems, P. Konda, S. Das, P. Suganthan G.C., A. Doan, A. Ardalan, J. R. Ballard, H. Li, F. Panahi, H. Zhang, J. Naughton, S. Prasad, G. Krishnan, R. Deep, V. Raghavendra. VLDB-16, extended version.
- Magellan: Toward Building Entity Matching Management Systems over Data Science Stacks, P. Konda, S. Das, P. Suganthan G.C., A. Doan, A. Ardalan, J. R. Ballard, H. Li, F. Panahi, H. Zhang, J. Naughton, S. Prasad, G. Krishnan, R. Deep, V. Raghavendra. VLDB-16, demo paper.
The Magellan system consists of three Python packages (all three have been open sourced):
- py_stringmatching: This package implements string tokenizers and string similarity functions.
- py_stringsimjoin: Given two large sets A and B of strings, this package efficiently finds all pairs of strings (a in A, b in B) that match. This package uses py_stringmatching.
- py_entitymatching: Given two tables A and B, this package finds all pairs of tuples (a in A, b in B) that match. It uses the above two packages.
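The layering of the three packages can be illustrated with a small standard-library sketch: a tokenizer and a similarity function, a string join built on them, and a table matcher built on the join. This is not the API of py_stringmatching, py_stringsimjoin, or py_entitymatching. The function names, the choice of padded q-grams with Jaccard similarity, and the nested-loop join are all assumptions made for illustration, and a real join would use indexing and filtering rather than comparing every pair.

```python
from collections import defaultdict
from itertools import product


def qgram_tokens(text, q=3):
    """Return the set of q-grams of a lowercased string padded with # and $."""
    padded = "#" * (q - 1) + text.lower() + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(tokens_a, tokens_b):
    """Jaccard similarity of two token sets; 1.0 when both are empty."""
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def string_sim_join(strings_a, strings_b, threshold, q=3):
    """Return (a, b, score) for every pair with Jaccard score >= threshold.

    Naive nested loop for clarity: every string in A is compared with
    every string in B.
    """
    tokens_a = [(a, qgram_tokens(a, q)) for a in strings_a]
    tokens_b = [(b, qgram_tokens(b, q)) for b in strings_b]
    matches = []
    for (a, ta), (b, tb) in product(tokens_a, tokens_b):
        score = jaccard(ta, tb)
        if score >= threshold:
            matches.append((a, b, score))
    return matches


def match_tables(table_a, table_b, key, attr, threshold):
    """Return sorted (key_a, key_b) pairs whose `attr` values are similar."""
    ids_a, ids_b = defaultdict(list), defaultdict(list)
    for row in table_a:
        ids_a[row[attr]].append(row[key])
    for row in table_b:
        ids_b[row[attr]].append(row[key])
    pairs = []
    for a, b, _ in string_sim_join(list(ids_a), list(ids_b), threshold):
        pairs.extend((ia, ib) for ia in ids_a[a] for ib in ids_b[b])
    return sorted(pairs)


# Made-up example tables, matched on a single attribute.
table_a = [
    {"id": "a1", "name": "Dave Smith"},
    {"id": "a2", "name": "Joe Wilson"},
]
table_b = [
    {"id": "b1", "name": "David Smith"},
    {"id": "b2", "name": "Joseph Wilson"},
    {"id": "b3", "name": "Mary Jones"},
]
print(match_tables(table_a, table_b, key="id", attr="name", threshold=0.45))
```

Matching on one attribute with a fixed threshold keeps the sketch short. The threshold of 0.45 was picked by hand for this toy data and is not a recommended setting.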
We gratefully acknowledge funding from Google, WalmartLabs, and Johnson Controls. This project is also supported by NIH BD2K grant U54 AI117924 and a UW2020 grant.