py_stringsimjoin

This project seeks to build a Python software package that provides scalable implementations of string similarity joins over two tables for commonly used similarity measures such as Jaccard, Dice, cosine, overlap, overlap coefficient, and edit distance.
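To illustrate what a string similarity join computes, here is a minimal pure-Python sketch of a Jaccard join over two small tables. It is not the package's API or its implementation: the names `tokenize`, `jaccard`, and `naive_jaccard_join` are hypothetical, the tables are assumed to be lists of `(id, string)` pairs, tokens are assumed to be lowercased whitespace-delimited words, and two empty token sets are treated as a perfect match by convention. The nested loop compares every pair of rows, which is exactly the quadratic cost that a scalable implementation aims to avoid.

```python
def tokenize(text):
    """Split a string into a set of lowercased whitespace-delimited tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets: |a & b| / |a | b|."""
    if not a and not b:
        # Convention chosen for this sketch: two empty sets match perfectly.
        return 1.0
    return len(a & b) / len(a | b)


def naive_jaccard_join(left, right, threshold):
    """Return (left_id, right_id, score) for every pair scoring >= threshold.

    `left` and `right` are lists of (id, string) pairs. Every left row is
    compared with every right row, so the cost is O(len(left) * len(right)).
    """
    # Tokenize the right table once instead of once per left row.
    right_tokens = [(rid, tokenize(rtext)) for rid, rtext in right]
    matches = []
    for lid, ltext in left:
        ltokens = tokenize(ltext)
        for rid, rtokens in right_tokens:
            score = jaccard(ltokens, rtokens)
            if score >= threshold:
                matches.append((lid, rid, score))
    return matches


if __name__ == "__main__":
    left_table = [(1, "data science"), (2, "apple pie")]
    right_table = [(10, "data science lab"), (20, "cherry pie")]
    for match in naive_jaccard_join(left_table, right_table, threshold=0.5):
        print(match)
```

With a threshold of 0.5, only the pair (1, 10) qualifies: the two strings share two of three distinct tokens, while "apple pie" and "cherry pie" share only one of three.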

For Users

  • The package is free, open-source, and BSD 3-Clause licensed.

  • The latest version is 0.3.1 (released 05/17/2019). Release history

      • Supports joins using six similarity measures: cosine, Dice, edit distance, Jaccard, overlap, and overlap coefficient.

For Contributors and Developers

Contact

    • For any questions, you can check the FAQ or email uwmagellan@gmail.com.

People and Organizations

Additional Links

    • The internal project page (permission required).

Acknowledgment

    • We gratefully acknowledge financial support from WalmartLabs, Google, and Johnson Controls Inc. This project is also supported by the Center for Predictive Computational Phenotyping (CPCP), an NIH Center of Excellence for Big Data, under grant NIH BD2K U54 AI117924.

Related Projects and Resources

An incomplete list of related efforts:

    • py_stringmatching: a project in our group at UW-Madison that builds a Python package consisting of a comprehensive and scalable set of string tokenizers and string similarity measures.

    • Magellan: a project in our group at UW-Madison that builds an end-to-end entity matching management system.