Magellan
News
The DeepMatcher package, which applies deep learning to EM, can be found at deepmatcher.ml.
Introduction
Entity matching (EM) is a fundamental problem in data integration. So far, the vast majority of EM work has focused on developing EM algorithms. Going forward, we argue that far more effort should be devoted to building EM systems in order to truly advance the field. We identify four major problems with current EM systems that prevent their widespread practical use:
When performing EM, users must often execute many steps. Current systems, however, do not cover the entire EM pipeline; they support only a few steps (e.g., blocking, matching).
EM steps often exploit many techniques, e.g., learning, visualization, outlier detection, information extraction (IE), and crowdsourcing. Today, however, it is very difficult to exploit a wide range of such techniques. Incorporating them all into a single EM system is extremely difficult, but moving data repeatedly between an EM system, an IE system, a visualization system, etc. is just as troublesome and time-consuming.
Users often need an interactive scripting environment to write code to patch the system. But few current EM systems provide such a facility.
Finally, and most importantly, in many EM scenarios users often do not know what steps to take. How to start? What should they do next? Current systems provide no how-to guides to help the users navigate the complicated EM process.
Magellan addresses these problems.
It provides how-to guides that tell users what to do in each EM scenario, step by step.
It provides tools that help users address the "pain points" in these steps. The tools seek to cover the entire EM pipeline.
The tools are being built on top of the Python data science and big data eco-system, allowing users to easily exploit a wide range of techniques in learning, visualization, cleaning, etc. (as captured in numerous Python packages in this eco-system).
An added benefit of integration with the Python data eco-system is that Magellan has a powerful interactive scripting environment that users can use to prototype code to patch the system.
Magellan is thus an example of a new kind of data management system that we call an "open-world system", because it relies on many other systems in the eco-system to give the user doing EM as much support as possible. Building such open-world systems raises non-trivial challenges, e.g., designing data structures, managing metadata, and handling missing values.
Magellan is named after Ferdinand Magellan, the first end-to-end explorer of the globe.
People
Matt Christie has been the lead researcher and developer of this project since Feb 2020, with contributions from Jatin Arora.
Pradap Konda (now at Facebook) managed this project until May 2019. The entire group, however, worked on Magellan, with different people addressing different aspects of the system.
Collaborators:
WalmartLabs, Johnson Controls
Center for Predictive Computational Phenotyping (CPCP), UW-Madison
Center for High Throughput Computing (CHTC), UW-Madison
Publications
Deep Learning for Entity Matching: A Design Space Exploration, S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute, V. Raghavendra. SIGMOD-18.
MatchCatcher: A Debugger for Blocking in Entity Matching, H. Li, P. Konda, P. Suganthan G.C., A. Doan, B. Snyder, Y. Park, G. Krishnan, R. Deep, V. Raghavendra. EDBT-18.
Magellan: Toward Building Entity Matching Management Systems, P. Konda, S. Das, P. Suganthan G.C., P. Martinkus, A. Doan, A. Ardalan, J. R. Ballard, Y. Govind, H. Li, F. Panahi, H. Zhang, J. Naughton, S. Prasad, G. Krishnan, R. Deep, V. Raghavendra. SIGMOD Record, 2018.
Magellan: Toward Building Entity Matching Management Systems, P. Konda, S. Das, P. Suganthan G.C., A. Doan, A. Ardalan, J. R. Ballard, H. Li, F. Panahi, H. Zhang, J. Naughton, S. Prasad, G. Krishnan, R. Deep, V. Raghavendra. VLDB-16. extended version. (Provides a motivation and overview of the Magellan system.)
Magellan: Toward Building Entity Matching Management Systems over Data Science Stacks, P. Konda, S. Das, P. Suganthan G.C., A. Doan, A. Ardalan, J. R. Ballard, H. Li, F. Panahi, H. Zhang, J. Naughton, S. Prasad, G. Krishnan, R. Deep, V. Raghavendra. VLDB-16. (A demo of Magellan.)
Toward a System Building Agenda for Data Integration, A. Doan, A. Ardalan, J. Ballard, S. Das, Y. Govind, P. Konda, H. Li, E. Paulson, P. Suganthan G.C., H. Zhang. ArXiv 2017.
CloudMatcher: A Cloud/Crowd Service for Entity Matching, Y. Govind, E. Paulson, M. Ashok, P. Suganthan G.C., A. Hitawala, A. Doan, Y. Park, P. Peissig, E. LaRose, J. Badger. BIGDAS Workshop @ KDD-17.
Human-in-the-Loop Challenges for Entity Matching: A Midterm Report, A. Doan, A. Ardalan, J. Ballard, S. Das, Y. Govind, P. Konda, H. Li, S. Mudgal, E. Paulson, P. Suganthan G.C., H. Zhang. HILDA Workshop @ SIGMOD-17.
Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services, S. Das, P. Suganthan G.C., A. Doan, J. Naughton, G. Krishnan, R. Deep, E. Arcaute, V. Raghavendra, Y. Park. SIGMOD-17. extended version. (Shows how to scale up an EM DAG that involves complex rules, crowdsourcing, and machine learning.)
Towards Interactive Debugging of Rule-Based Entity Matching, F. Panahi, W. Wu, A. Doan, J. Naughton. EDBT-17.
Code & Data
The Magellan system consists of three Python packages (all three have been open sourced):
py_stringmatching: This package implements string tokenizers and string similarity functions.
py_stringsimjoin: Given two large sets A and B of strings, this package efficiently finds all pairs of strings (a in A, b in B) that match. This package uses py_stringmatching.
py_entitymatching: Given two tables A and B, this package finds all pairs of tuples (a in A, b in B) that match. It uses the above two packages.
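To make the division of labor among the three packages concrete, here is a minimal pure-Python sketch of the three tasks they address: tokenizing and scoring strings, joining two sets of strings on similarity, and matching tuples across two tables. It does not use the Magellan packages or their APIs. The function names (qgram_tokenize, jaccard, string_sim_join, match_tables), the choice of q-grams with Jaccard similarity, the equality-based blocking, the thresholds, and the sample rows are all assumptions made for illustration. The actual packages offer many more tokenizers, similarity measures, blockers, and learning-based matchers, and they use far more efficient join algorithms than the nested loop shown here.

```python
def qgram_tokenize(s, q=3):
    """Tokenizer: split a string into its set of lowercased, overlapping q-grams."""
    s = s.lower()
    if len(s) < q:
        return {s} if s else set()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(x, y):
    """Similarity function: Jaccard score of two token sets, in [0, 1]."""
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


def string_sim_join(strings_a, strings_b, threshold, q=3):
    """String similarity join (naive nested loop, for illustration only).

    Returns every (a, b, score) whose Jaccard score is at least threshold.
    """
    tokens_b = [(b, qgram_tokenize(b, q)) for b in strings_b]
    results = []
    for a in strings_a:
        tokens_a = qgram_tokenize(a, q)
        for b, tb in tokens_b:
            score = jaccard(tokens_a, tb)
            if score >= threshold:
                results.append((a, b, score))
    return results


def match_tables(table_a, table_b, block_attr, match_attr, threshold):
    """Entity matching over two tables given as lists of dicts.

    Blocking: keep only tuple pairs that agree exactly on block_attr.
    Matching: declare a surviving pair a match if the Jaccard score of
    their match_attr values is at least threshold.
    """
    # Index table B by the blocking attribute so that each tuple of A
    # is compared only against tuples in the same block.
    blocks = {}
    for row in table_b:
        blocks.setdefault(row[block_attr], []).append(row)

    matches = []
    for a in table_a:
        for b in blocks.get(a[block_attr], []):
            score = jaccard(qgram_tokenize(a[match_attr]),
                            qgram_tokenize(b[match_attr]))
            if score >= threshold:
                matches.append((a["id"], b["id"]))
    return matches


# Hypothetical sample data, invented for this sketch.
table_a = [
    {"id": "a1", "name": "Dave Smith", "city": "Madison"},
    {"id": "a2", "name": "Joe Wilson", "city": "Chicago"},
]
table_b = [
    {"id": "b1", "name": "David Smith", "city": "Madison"},
    {"id": "b2", "name": "Dan Smith", "city": "Chicago"},
]

matches = match_tables(table_a, table_b, "city", "name", threshold=0.4)
print(matches)  # [('a1', 'b1')]
```

In this sketch the table-level matcher reuses the tokenizer and the similarity function, which mirrors the dependency described above: the entity matching package builds on the two string packages. Blocking on an exact attribute value is only one simple strategy. It keeps the number of compared pairs small, but it would miss a true match whose blocking values differ.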
Many data sets (used to evaluate Magellan and other EM systems) are available in the Magellan Data Repository.
Acknowledgment
We gratefully acknowledge funding from Google, WalmartLabs, and Johnson Controls. This project is also supported by NIH BD2K grant U54 AI117924 and a UW2020 grant.