py_entitymatching
This project seeks to build a Python software package to match entities between two tables using supervised learning. This problem is often referred to as entity matching (EM). Given two tables A and B, the goal of EM is to discover the tuple pairs between the two tables that refer to the same real-world entities. There are two main steps involved in entity matching: blocking and matching. The blocking step aims to remove obvious non-matching tuple pairs and reduce the set considered for matching. The matching step then predicts which of the remaining pairs are matches. In practice, entity matching involves many more steps than just blocking and matching. While performing EM, users often execute many other steps, e.g., exploring, cleaning, debugging, sampling, and estimating accuracy. Current EM systems, however, do not cover the entire EM pipeline, providing support only for a few steps (e.g., blocking, matching), while ignoring less well-known yet equally critical steps (e.g., debugging, sampling). This package seeks to support all the steps involved in the EM pipeline.
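The two main steps can be illustrated with a minimal, standard-library-only sketch. It does not use the py_entitymatching API. The toy tables, the column names, the helper functions, and the 0.85 threshold are all assumptions made for illustration, and the fixed similarity threshold is only a stand-in for the supervised matcher that the package would train.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical toy tables; the column names are assumptions for illustration.
table_a = [
    {"id": "a1", "name": "Dave Smith", "zipcode": "53703"},
    {"id": "a2", "name": "Joe Wilson", "zipcode": "94107"},
    {"id": "a3", "name": "Dan Smith", "zipcode": "53703"},
]
table_b = [
    {"id": "b1", "name": "David Smith", "zipcode": "53703"},
    {"id": "b2", "name": "Joseph Wilson", "zipcode": "94107"},
    {"id": "b3", "name": "Maria Lopez", "zipcode": "53703"},
]


def block(left, right, attr):
    """Blocking: keep only the tuple pairs that agree exactly on `attr`."""
    return [(l, r) for l, r in product(left, right) if l[attr] == r[attr]]


def name_similarity(l, r):
    """Feature: similarity of the two names, a float in [0, 1]."""
    return SequenceMatcher(None, l["name"].lower(), r["name"].lower()).ratio()


def match(candidates, threshold):
    """Matching: a fixed threshold stands in for a trained classifier."""
    return [(l["id"], r["id"]) for l, r in candidates
            if name_similarity(l, r) >= threshold]


# Blocking shrinks the 3 x 3 cross product to the pairs sharing a zipcode.
candidates = block(table_a, table_b, "zipcode")

# Matching then decides which surviving pairs refer to the same entity.
matches = match(candidates, threshold=0.85)

print(len(candidates), "candidate pairs after blocking")
print("predicted matches:", matches)
```

In a supervised setting, the threshold rule would be replaced by a classifier trained on labeled candidate pairs, with features such as `name_similarity` computed for each pair.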
For Users
An initial draft of the how-to guide to do entity matching can be found here.
The package is free, open-source, and BSD 3-Clause licensed.
The latest version is 0.3.2 (released 6/5/2019).
Provides tools for 12 different steps involved in matching entities between two tables using supervised learning.
Requires Python 2.7 or 3.5+ (also requires other Python packages, see here for the complete list of dependencies)
Has been tested on Linux, OS X, and Windows (more information).
To install using pip, first run "pip install py_entitymatching", which retrieves the package from PyPI and installs it along with its dependencies, except for PyQt5, XGBoost, pandastable, and OpenRefine. To install those packages, follow the installation instructions.
To install using conda, refer to the issues page.
To install using the source code, download the code in tar.gz format, then follow the installation instructions.
You can browse the source code on GitHub (version 0.3.2).
To start using the package, see the guides in User Manual (single-page version).
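After installing, it can help to check which of the separately installed dependencies Python can already find. The following is a minimal Python 3 sketch, not part of the package. The helper name `installed` is hypothetical, and the import names `PyQt5`, `xgboost`, and `pandastable` are assumed to be the module names those packages provide. OpenRefine is a separate application rather than an importable module, so it is not checked.

```python
import importlib.util

# Import names are assumptions: PyQt5, xgboost and pandastable are the usual
# module names for those packages. OpenRefine is a separate application, not
# an importable module, so it is not checked here.
MODULES = ["py_entitymatching", "PyQt5", "xgboost", "pandastable"]


def installed(module_names):
    """Map each top-level module name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None
            for name in module_names}


status = installed(MODULES)
for name, found in status.items():
    print("{}: {}".format(name, "found" if found else "missing"))
```

Any module reported as missing still needs to be installed by following the installation instructions.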
For Contributors and Developers
How to Contribute describes the logistics of contributing (e.g., forking code on GitHub, editing documentation).
Source code on GitHub (the master branch)
You can always post to the Google group, or email uwmagellan@gmail.com.
Current and future development plan.
For Educators
The package has been successfully used by 74 students in CS 784 Spring 2016, a graduate-level data science class at UW-Madison.
All Documentation
An initial version of the how-to guide to do entity matching can be found here.
User Manual (including installation instructions and API reference), single-page version.
Guides in User Manual containing Jupyter notebooks for end-to-end entity matching.
Guides in User Manual containing Jupyter notebooks for each of the steps involved in entity matching.
Contact
For any questions, you can check the FAQ, post to the Google group, or email uwmagellan@gmail.com.
To contribute, see the section "For Contributors and Developers".
People and Organizations
Matt Christie has been the lead researcher and developer of this project since Feb 2020, with contributions from Jatin Arora.
See the release page for the complete list of contributors for each release.
External collaborators:
Johnson Control Inc.
WalmartLabs
Recruit Institute of Technology
Center for High Throughput Computing (CHTC), UW-Madison
Acknowledgment
We gratefully acknowledge financial support from WalmartLabs, Google, and Johnson Control Inc. This project is also supported by the Center for Predictive Computational Phenotyping (CPCP), an NIH Center of Excellence for Big Data, under grant NIH BD2K U54 AI117924.