py_entitymatching

This project seeks to build a Python software package to match entities between two tables using supervised learning. This problem is often referred to as entity matching (EM). Given two tables A and B, the goal of EM is to discover the tuple pairs between the two tables that refer to the same real-world entities. There are two main steps involved in entity matching: blocking and matching. The blocking step removes obvious non-matching tuple pairs, reducing the set of pairs considered for matching; the matching step then decides which of the remaining pairs are matches. In practice, entity matching involves many more steps than just blocking and matching. While performing EM, users often execute many other steps, e.g., exploring, cleaning, debugging, sampling, and estimating accuracy. Current EM systems, however, do not cover the entire EM pipeline: they support only a few steps (e.g., blocking, matching) and ignore less well-known yet equally critical steps (e.g., debugging, sampling). This package seeks to support all the steps involved in the EM pipeline.
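To make the two main steps concrete, here is a minimal sketch in plain Python. It does not use the py_entitymatching API. The toy tables, the helper names (`block`, `name_similarity`, `match`), and the 0.8 threshold are all assumptions made for illustration, and the fixed similarity rule is only a stand-in for a matcher that would normally be learned from labeled pairs.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical toy tables; each tuple is (id, name, zipcode).
table_a = [
    ("a1", "Dave Smith", "53703"),
    ("a2", "Joe Wilson", "53706"),
    ("a3", "Dan Smith", "94107"),
]
table_b = [
    ("b1", "David Smith", "53703"),
    ("b2", "Joseph Wilson", "10001"),
    ("b3", "Maria Lopez", "94107"),
]


def block(a_rows, b_rows):
    """Blocking: drop obvious non-matches by keeping only pairs that share a zipcode."""
    return [(a, b) for a, b in product(a_rows, b_rows) if a[2] == b[2]]


def name_similarity(a, b):
    """Similarity in [0, 1] between the lower-cased name fields of two tuples."""
    return SequenceMatcher(None, a[1].lower(), b[1].lower()).ratio()


def match(candidates, threshold=0.8):
    """Matching: a fixed-threshold rule standing in for a learned matcher."""
    return [(a[0], b[0]) for a, b in candidates if name_similarity(a, b) >= threshold]


candidates = block(table_a, table_b)  # far fewer pairs than the full cross product
matches = match(candidates)           # pairs predicted to be the same entity

print(len(candidates), "candidate pairs out of", len(table_a) * len(table_b))
print(matches)
```

Blocking on an exact zipcode is cheap but can discard true matches when the blocking attribute is dirty, which is one reason steps such as debugging and sampling matter in practice.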

For Users

  • An initial draft of the how-to guide to do entity matching can be found here.

  • The package is free, open-source, and BSD 3-Clause licensed.

  • The latest version is 0.3.2 (released 6/5/2019)

      • Provides tools for 12 different steps involved in matching entities between two tables using supervised learning.

For Contributors and Developers

  • How to Contribute describes the logistics of contributing (e.g., forking code on GitHub, editing documentation).

  • Source code on GitHub (the master branch)

  • You can always post to the Google group, or email uwmagellan@gmail.com.

  • Current and future development plan.

For Educators

    • The package has been successfully used by 74 students in CS 784 Spring 2016, a graduate-level data science class at UW-Madison.

All Documentation

  • An initial version of the how-to guide to do entity matching can be found here.

  • User Manual (including installation instructions and API reference), single-page version.

  • Guides in User Manual containing Jupyter notebooks for end-to-end entity matching.

  • Guides in User Manual containing Jupyter notebooks for each of the steps involved in entity matching.

  • How to Contribute.

Contact

    • For any questions, you can check the FAQ, post to the Google group, or email uwmagellan@gmail.com.

    • To contribute, see the section "For Contributors and Developers".

People and Organizations

  • Matt Christie has been the lead researcher and developer of this project since Feb 2020, with contributions from Jatin Arora.

  • See the release page for the complete list of contributors for each release.

Acknowledgment

    • We gratefully acknowledge financial support from WalmartLabs, Google, and Johnson Control Inc. This project is also supported by the Center for Predictive Computational Phenotyping (CPCP), an NIH Center of Excellence for Big Data, on the grant NIH BD2K U54 AI117924.