py_stringsimjoin

This project seeks to build a Python software package that provides scalable implementations of string similarity joins over two tables for commonly used similarity measures: Jaccard, Dice, cosine, overlap, overlap coefficient, and edit distance.
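
To make the idea of a string similarity join concrete, here is a minimal sketch of a naive Jaccard join over two small tables. It is not the package's API or implementation: the function names (`tokenize`, `jaccard`, `naive_jaccard_join`), the whitespace tokenizer, and the list-of-tuples table layout are all assumptions chosen for illustration. The sketch compares every pair of rows, which is exactly the quadratic cost a scalable join tries to avoid.

```python
from itertools import product


def tokenize(s):
    """Lowercase whitespace tokenizer returning a set of tokens (an assumption)."""
    return set(s.lower().split())


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two token sets."""
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty sets are identical
    return float(len(a & b)) / len(union)


def naive_jaccard_join(ltable, rtable, threshold):
    """Compare every row pair; keep (l_id, r_id, score) with score >= threshold."""
    out = []
    for (l_id, l_str), (r_id, r_str) in product(ltable, rtable):
        score = jaccard(tokenize(l_str), tokenize(r_str))
        if score >= threshold:
            out.append((l_id, r_id, score))
    return out


# Hypothetical tables: each row is (id, string).
ltable = [(1, "data science"), (2, "string similarity join")]
rtable = [(10, "string similarity joins"), (11, "data science lab")]
result = naive_jaccard_join(ltable, rtable, 0.5)
```

With these inputs, rows 1 and 11 share two of three distinct tokens, and rows 2 and 10 share two of four, so both pairs meet the 0.5 threshold.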

For Users

The package is free, open-source, and BSD 3-Clause licensed.
The latest version is 0.3.1 (released 05/17/2019); see the release history.
- Supports joins using 6 similarity measures: cosine, Dice, edit distance, Jaccard, overlap, and overlap coefficient.
- Contains 5 filters: Overlap filter, Size filter, Prefix filter, Position filter, and Suffix filter.
- Requires Python 2.7 or 3.5+ and a C++ compiler.
- Required dependencies to build the package are pandas 0.16.0 or higher, py_stringmatching 0.2.1 or higher, joblib, pyprind and six (these dependencies will be automatically installed).
- Has been tested on Linux, OS X, and Windows.
- To install using pip: execute "pip install py_stringsimjoin", which retrieves the package from PyPI then installs it.
- To install using conda, refer to the issues page.
- To install from the source code, download the code in tar.gz format (for Linux and OS X) or zip format (for Windows), then follow the installation instructions.
- You can browse source code on GitHub (version 0.3.1).
- To start using the package, read the guides specified in the User Manual (single-page version), and consult a book chapter on string matching if necessary.
- Frequently Asked Questions
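
The filters listed above prune candidate pairs before the similarity measure is computed. As a rough illustration of the idea behind a size filter, the sketch below uses the standard bound for Jaccard: if two token sets x and y have Jaccard similarity at least t, then t·|x| ≤ |y| ≤ |x|/t. This is not the package's implementation; the function names, the whitespace tokenizer, the table layout, and the small floating-point tolerance `EPS` are assumptions made for this example.

```python
EPS = 1e-9  # tolerance so rounding error never prunes a valid pair (assumption)


def tokenize(s):
    """Lowercase whitespace tokenizer returning a set of tokens (an assumption)."""
    return set(s.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets; two empty sets score 1.0 here."""
    union = a | b
    return float(len(a & b)) / len(union) if union else 1.0


def size_filter_passes(l_size, r_size, threshold):
    """True if set sizes alone do not rule out Jaccard >= threshold."""
    return threshold * l_size - EPS <= r_size <= l_size / threshold + EPS


def filtered_jaccard_join(ltable, rtable, threshold):
    """Jaccard join that skips pairs rejected by the size filter.

    Returns the matches and the number of pairs that were actually verified.
    """
    l_tokens = [(i, tokenize(s)) for i, s in ltable]
    r_tokens = [(i, tokenize(s)) for i, s in rtable]
    matches, verified = [], 0
    for l_id, l_set in l_tokens:
        for r_id, r_set in r_tokens:
            if not size_filter_passes(len(l_set), len(r_set), threshold):
                continue  # pruned without computing the similarity
            verified += 1
            score = jaccard(l_set, r_set)
            if score >= threshold:
                matches.append((l_id, r_id, score))
    return matches, verified


# Hypothetical tables: each row is (id, string).
ltable = [(1, "apple"), (2, "apple pie recipe book")]
rtable = [(10, "apple pie recipe"), (11, "apple")]
matches, verified = filtered_jaccard_join(ltable, rtable, 0.7)
```

Here only two of the four candidate pairs reach the verification step; the other two are discarded because their token-set sizes differ too much to reach a 0.7 threshold. This sketch still enumerates every pair to apply the filter, so it illustrates the pruning condition rather than an index-based, scalable join.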

For Contributors and Developers

How to Contribute describes the logistics of contributing (e.g., forking code on GitHub, editing documentation).
A book chapter on string matching provides background materials on scaling up string matching.

Contact

- For any questions, you can check the FAQ or email uwmagellan@gmail.com.

People and Organizations

See the release page for the list of contributors for each release.
- External collaborators:
  - Johnson Control Inc.
  - WalmartLabs
  - Recruit Institute of Technology
  - Center for High Throughput Computing (CHTC), UW-Madison

Additional Links

- The internal project page (permission required).

Acknowledgment

- We gratefully acknowledge financial support from WalmartLabs, Google, and Johnson Control Inc. This project is also supported by the Center for Predictive Computational Phenotyping (CPCP), an NIH Center of Excellence for Big Data, under grant NIH BD2K U54 AI117924.

Related Projects and Resources

An incomplete list of related efforts:

- py_stringmatching: a project in our group at UW-Madison that builds a Python package consisting of a comprehensive and scalable set of string tokenizers and string similarity measures.
- Magellan: a project in our group at UW-Madison that builds an end-to-end entity matching management system.
