py_stringsimjoin
This project seeks to build a Python software package that provides scalable implementations of string similarity joins over two tables, for commonly used similarity measures such as Jaccard, Dice, cosine, overlap, overlap coefficient, and edit distance.
For Users
The package is free, open-source, and BSD 3-Clause licensed.
The latest version is 0.3.1 (released 05/17/2019); see the release history.
Supports joins using 6 similarity measures - cosine, Dice, edit distance, Jaccard, overlap and overlap coefficient.
Contains 5 filters - Overlap filter, Size filter, Prefix filter, Position filter and Suffix filter.
Requires Python 2.7 or 3.5+ and a C++ compiler.
Required dependencies to build the package are pandas 0.16.0 or higher, py_stringmatching 0.2.1 or higher, joblib, pyprind and six (these dependencies will be automatically installed).
Has been tested on Linux, OS X, and Windows.
To install using pip: execute "pip install py_stringsimjoin", which retrieves the package from PyPI then installs it.
To install using conda, refer to the issues page.
To install from the source code, download the code in tar.gz format (for Linux and OS X) or zip format (for Windows), then follow the installation instructions.
You can browse source code on GitHub (version 0.3.1).
To start using the package, read the guides specified in the User Manual (single-page version) and consult a book chapter on string matching if necessary.
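The similarity joins and filters listed above can be illustrated with a minimal pure-Python sketch. It is not the package's implementation or API: the function names (tokenize, jaccard, size_filter_passes, naive_jaccard_join), the whitespace tokenizer, and the sample records are all assumptions made for illustration. The sketch runs a Jaccard join over two small tables and uses a size filter to discard candidate pairs before the exact similarity is computed.

```python
def tokenize(text):
    """Hypothetical whitespace tokenizer: returns a set of lowercase tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def size_filter_passes(a, b, threshold):
    """Size filter: Jaccard(a, b) >= t implies t * |a| <= |b| <= |a| / t.

    Pairs that violate this bound cannot reach the threshold, so they can
    be dropped without computing the exact similarity.
    """
    return threshold * len(a) <= len(b) <= len(a) / threshold


def naive_jaccard_join(left, right, threshold):
    """Return (left_id, right_id, score) for all pairs with score >= threshold.

    This compares every pair (quadratic time); it only demonstrates the
    filter-then-verify idea, not a scalable join.
    """
    if not 0 < threshold <= 1:
        raise ValueError("threshold must be in (0, 1]")
    left_tokens = [(key, tokenize(value)) for key, value in left]
    right_tokens = [(key, tokenize(value)) for key, value in right]
    results = []
    for l_id, l_set in left_tokens:
        for r_id, r_set in right_tokens:
            # Filter step: cheap check on token-set sizes.
            if not size_filter_passes(l_set, r_set, threshold):
                continue
            # Verify step: exact similarity on surviving candidates.
            score = jaccard(l_set, r_set)
            if score >= threshold:
                results.append((l_id, r_id, score))
    return results


# Illustrative sample tables as (id, string) records.
left = [(1, "data base systems"), (2, "string matching")]
right = [
    (10, "data base management systems"),
    (11, "approximate string matching"),
    (12, "query processing"),
]

matches = naive_jaccard_join(left, right, threshold=0.6)
print([(l_id, r_id) for l_id, r_id, _ in matches])  # [(1, 10), (2, 11)]
```

With this sample data, the size filter prunes the pair (2, 10) without computing its Jaccard score, because a 2-token string and a 4-token string can reach a score of at most 0.5, which is below the 0.6 threshold.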
For Contributors and Developers
How to Contribute describes the logistics of contributing (e.g., forking code on GitHub, editing documentation).
A book chapter on string matching provides background material on scaling up string matching.
Contact
For any questions, you can check the FAQ or email uwmagellan@gmail.com.
People and Organizations
See the release page for the list of contributors for each release.
External collaborators:
Johnson Control Inc.
WalmartLabs
Recruit Institute of Technology
Center for High Throughput Computing (CHTC), UW-Madison
Additional Links
The internal project page (permission required).
Acknowledgment
We gratefully acknowledge financial support from WalmartLabs, Google, and Johnson Control Inc. This project is also supported by the Center for Predictive Computational Phenotyping (CPCP), an NIH Center of Excellence for Big Data, on the grant NIH BD2K U54 AI117924.
Related Projects and Resources
An incomplete list of related efforts:
py_stringmatching: a project in our group at UW-Madison that builds a Python package consisting of a comprehensive and scalable set of string tokenizers and string similarity measures.
Magellan: a project in our group at UW-Madison that builds an end-to-end entity matching management system.