
py_stringsimjoin

This project seeks to build a Python software package that provides a scalable implementation of string similarity joins over two tables, for commonly used similarity measures such as Jaccard, Dice, cosine, overlap, overlap coefficient, and edit distance.
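To make the idea of a string similarity join concrete, here is a minimal sketch of the token-based measures named above and a naive nested-loop join. It is not the package's API or implementation: the helper names (tokenize, naive_join), the whitespace tokenizer, and the convention of scoring empty token sets as 0.0 are all assumptions made for illustration.

```python
import math


def tokenize(s):
    """Hypothetical tokenizer: lowercase, split on whitespace, return a set."""
    return set(s.lower().split())


def overlap(a, b):
    """Number of tokens the two sets share."""
    return len(a & b)


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|; 0.0 for two empty sets (a choice made for this sketch)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a, b):
    """2 * |A ∩ B| / (|A| + |B|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def cosine(a, b):
    """|A ∩ B| / sqrt(|A| * |B|) on token sets."""
    denom = math.sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0


def overlap_coefficient(a, b):
    """|A ∩ B| / min(|A|, |B|)."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def naive_join(left, right, sim, threshold):
    """Compare every (id, string) pair and keep those scoring >= threshold.

    This is the quadratic baseline that a scalable join tries to avoid.
    """
    matches = []
    for lid, ls in left:
        lt = tokenize(ls)
        for rid, rs in right:
            score = sim(lt, tokenize(rs))
            if score >= threshold:
                matches.append((lid, rid, score))
    return matches


left = [(1, "data science lab"), (2, "apple inc")]
right = [(10, "data science"), (20, "apple incorporated")]
result = naive_join(left, right, jaccard, 0.5)
```

With these toy tables, only the pair (1, 10) survives a Jaccard threshold of 0.5, since the two strings share two of three distinct tokens.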

For Users
  • The package is free, open-source, and BSD 3-Clause licensed.
  • The latest version is 0.1.0 (released 07/15/2016):
    • Supports joins using 6 similarity measures - cosine, Dice, edit distance, Jaccard, overlap and overlap coefficient.
    • Contains 5 filters - Overlap filter, Size filter, Prefix filter, Position filter and Suffix filter.
    • Requires Python 2.7 or 3.3+.
    • Required dependencies to build the package are pandas 0.16.0 or higher, py_stringmatching 0.2.1 or higher, joblib, pyprind and six (these dependencies will be automatically installed).
    • Has been tested on Linux, OS X, and Windows.
    • To install using pip: execute "pip install py_stringsimjoin", which retrieves the package from PyPI then installs it.
    • To install using the source code, download the code in tar.gz format (for Linux and OS X) or zip format (for Windows), then follow the installation instructions.
    • You can browse source code on GitHub (version 0.1.x). 
    • To start using the package, read the guides specified in the User Manual (single-page version)
      and consult the book chapter on string matching if necessary.
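As an illustration of what a filter such as the size filter listed above does, the following sketch prunes candidate pairs by token count before computing Jaccard. It relies on the general fact that Jaccard(x, y) can never exceed min(|x|, |y|) / max(|x|, |y|). The function names, the whitespace tokenizer, and the assumption that the threshold lies in (0, 1] are hypothetical; the package's own filters may be implemented quite differently.

```python
def tokenize(s):
    """Hypothetical tokenizer: lowercase, split on whitespace, return a set."""
    return set(s.lower().split())


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|; 0.0 for two empty sets (a choice made for this sketch)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def passes_size_filter(size_x, size_y, threshold):
    """Return False when the token counts alone rule out Jaccard >= threshold.

    The intersection is at most the smaller set and the union is at least
    the larger set, so min/max is an upper bound on the Jaccard score.
    """
    if size_x == 0 or size_y == 0:
        return False
    return min(size_x, size_y) / max(size_x, size_y) >= threshold


def filtered_jaccard_join(left, right, threshold):
    """Nested-loop join that only verifies pairs surviving the size filter.

    Returns the matches and the number of pairs actually verified.
    """
    right_tokens = [(rid, tokenize(rs)) for rid, rs in right]
    matches = []
    verified = 0
    for lid, ls in left:
        lt = tokenize(ls)
        for rid, rt in right_tokens:
            if not passes_size_filter(len(lt), len(rt), threshold):
                continue  # pruned without computing the similarity
            verified += 1
            score = jaccard(lt, rt)
            if score >= threshold:
                matches.append((lid, rid, score))
    return matches, verified


left = [(1, "a b c d e")]
right = [(10, "a b c d"), (20, "a"), (30, "a b c d e f g h i j")]
matches, verified = filtered_jaccard_join(left, right, 0.8)
```

In this toy run, two of the three candidate pairs are discarded on size alone, so the similarity is computed only once. A filter of this kind never drops a true match; it only skips verification work.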
For Contributors and Developers
All Documentation
Contact 
  • For any questions, you can check the FAQ, post to the Google group, or email uwmagellan@gmail.com.  
  • To contribute, see the section "For Contributors and Developers". 
People and Organizations
Additional Links
  • The internal project page (permission required). 
Acknowledgment
  • We gratefully acknowledge financial support from WalmartLabs, Google, and Johnson Controls Inc. This project is also supported by the Center for Predictive Computational Phenotyping (CPCP), an NIH Center of Excellence for Big Data, on the grant NIH BD2K U54 AI117924.
Related Projects and Resources 

An incomplete list of related efforts:
  • py_stringmatching: a project in our group at UW-Madison that builds a Python package consisting of a comprehensive and scalable set of string tokenizers and string similarity measures.
  • Magellan: a project in our group at UW-Madison that builds an end-to-end entity matching management system.