
py_entitymatching

This project seeks to build a Python software package that matches entities between two tables using supervised learning. This problem is often referred to as entity matching (EM). Given two tables A and B, the goal of EM is to find the tuple pairs across the two tables that refer to the same real-world entity. EM has two main steps: blocking and matching. The blocking step removes obviously non-matching tuple pairs, reducing the set of pairs considered for matching; the matching step then predicts, for each remaining pair, whether it is a match. In practice, entity matching involves many more steps than just blocking and matching. While performing EM, users often carry out other steps as well, e.g., exploring, cleaning, debugging, sampling, and estimating accuracy. Current EM systems, however, do not cover the entire EM pipeline: they support only a few steps (e.g., blocking, matching) and ignore less well-known yet equally critical ones (e.g., debugging, sampling). This package seeks to support all the steps in the EM pipeline.
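The blocking and matching steps described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the py_entitymatching API: the helper names (`tokens`, `block`, `jaccard`, `match`), the toy tables, and the thresholds are all hypothetical, and a simple Jaccard-similarity threshold stands in for the supervised matcher that the package would learn from labeled pairs.

```python
# Illustrative sketch only: plain Python, NOT the py_entitymatching API.
# All names, data, and thresholds below are assumptions made for this example.

def tokens(text):
    """Lowercase a string and split it into a set of word tokens."""
    return set(text.lower().split())


def block(table_a, table_b, attr):
    """Blocking: keep only the pairs that share at least one token in `attr`.

    Pairs with no token in common are treated as obvious non-matches.
    """
    return [
        (a, b)
        for a in table_a
        for b in table_b
        if tokens(a[attr]) & tokens(b[attr])
    ]


def jaccard(x, y):
    """Jaccard similarity between two token sets."""
    union = x | y
    return len(x & y) / float(len(union)) if union else 0.0


def match(candidates, attr, threshold):
    """Matching: a fixed similarity threshold stands in for a trained classifier."""
    return [
        (a["id"], b["id"])
        for a, b in candidates
        if jaccard(tokens(a[attr]), tokens(b[attr])) >= threshold
    ]


# Two toy tables A and B.
A = [
    {"id": "a1", "name": "Dave Smith"},
    {"id": "a2", "name": "Joe Wilson"},
]
B = [
    {"id": "b1", "name": "David Smith"},
    {"id": "b2", "name": "Daniel Wilson Jr"},
    {"id": "b3", "name": "Ann Lee"},
]

candidates = block(A, B, "name")                 # 2 of the 6 possible pairs survive
matches = match(candidates, "name", threshold=0.3)
print(matches)                                   # [('a1', 'b1')]
```

Blocking here shrinks the six possible pairs to two candidates, and the matching rule then keeps only the pair whose names are similar enough. In a supervised setting, the threshold rule would be replaced by a classifier trained on labeled candidate pairs.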

For Users
  • An initial draft of the how-to guide to do entity matching can be found here.
  • The package is free, open-source, and BSD 3-Clause licensed.
  • The latest version is 0.1.0 (released 12/01/2017)
    • Provides tools for 12 different steps involved in matching entities between two tables using supervised learning.
    • Requires Python 2.7 or 3.4+  (also requires other Python packages, see here for the complete list of dependencies)
    • Has been tested on Linux, OS X, and Windows (more information).
    • To install using conda: execute "conda install -c uwmagellan py_entitymatching", which retrieves the package from the uwmagellan conda channel and installs it.
    • To install using pip: first execute "pip install -U numpy scipy py_entitymatching", which retrieves the package from PyPI and installs it along with its dependencies, except PyQt4. Then install PyQt4 by following the installation instructions.
    • To install using the source code, download the code in tar.gz format (for Linux and OS X) or zip format (for Windows), then follow the installation instructions.
    • You can browse the source code on GitHub (version 0.1.x).
    • To start using the package, see the guides in User Manual (single-page version).
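
After installing, a quick environment check can confirm that the interpreter falls in the supported range (Python 2.7 or 3.4+) and that the package can be imported. The sketch below is a minimal example using only the standard library; the helper names `python_supported` and `is_installed` are hypothetical and are not part of the package.

```python
# Minimal environment check; helper names are hypothetical, not part of py_entitymatching.
import importlib
import sys


def python_supported(version_info=None):
    """Return True for Python 2.7 or 3.4+, the versions listed above."""
    if version_info is None:
        version_info = sys.version_info
    major, minor = version_info[0], version_info[1]
    return (major, minor) == (2, 7) or (major, minor) >= (3, 4)


def is_installed(module_name):
    """Return True if `module_name` can be imported in this environment."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


print("Python version supported: {}".format(python_supported()))
print("py_entitymatching importable: {}".format(is_installed("py_entitymatching")))
```

If the second line prints False, the conda or pip installation step did not complete in the environment that this interpreter is using.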

For Contributors and Developers
  • How to Contribute describes the logistics of contributing (e.g., forking code on GitHub, editing documentation).
  • Source code on GitHub (the master branch)
  • You can always post to the Google group, or email uwmagellan@gmail.com. 
  • Current and future development plan.

For Educators
  • The package has been successfully used by 74 students in CS 784 Spring 2016, a graduate-level data science class at UW-Madison. 
All Documentation
  • An initial version of the how-to guide to do entity matching can be found here.
  • User Manual (including installation instructions and API reference), single-page version.
  • Guides in User Manual containing Jupyter notebooks for end-to-end entity matching.
  • Guides in User Manual containing Jupyter notebooks for each of the steps involved in entity matching.
  • How to Contribute.
Contact 
  • For any questions, you can check the FAQ, post to the Google group, or email uwmagellan@gmail.com.  
  • To contribute, see the section "For Contributors and Developers". 
People and Organizations

Acknowledgment
  • We gratefully acknowledge financial support from WalmartLabs, Google, and Johnson Control Inc. This project is also supported by the Center for Predictive Computational Phenotyping (CPCP), an NIH Center of Excellence for Big Data, under grant NIH BD2K U54 AI117924.