py_stringmatching
This project seeks to build a Python software package that consists of a comprehensive and scalable set of string tokenizers (such as alphabetical and whitespace tokenizers) and string similarity measures (such as edit distance, Jaccard, and TF/IDF).
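To illustrate what such tokenizers and measures compute, here is a minimal plain-Python sketch of whitespace tokenization, Jaccard similarity, and Levenshtein edit distance. Note that this is illustrative pseudological code, not the py_stringmatching API; the function names here are hypothetical.

```python
# Illustrative sketches of concepts the package covers
# (NOT the py_stringmatching API).

def whitespace_tokenize(s):
    """Split a string into tokens on whitespace."""
    return s.split()

def jaccard(tokens1, tokens2):
    """Jaccard similarity: |intersection| / |union| of the token sets."""
    s1, s2 = set(tokens1), set(tokens2)
    if not s1 and not s2:
        return 1.0
    return len(s1 & s2) / len(s1 | s2)

def edit_distance(s, t):
    """Levenshtein edit distance via dynamic programming."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

tok1 = whitespace_tokenize("data science class")
tok2 = whitespace_tokenize("data integration class")
print(jaccard(tok1, tok2))                 # 2 shared tokens / 4 total -> 0.5
print(edit_distance("kitten", "sitting"))  # classic example -> 3
```

The package itself wraps these ideas (and many more measures, such as TF/IDF) in reusable, scalable classes; see the Tutorial in the User Manual for the actual API.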
For Users
The package is free, open-source, and BSD 3-Clause licensed.
The current version is 0.4.1, released 02/22/2019; it is the latest. A list of all released versions is also available.
Requires Python 2.7 or 3.4+ and a C or C++ compiler (numpy and six are also required, but they are installed automatically).
Has been tested on Linux, OS X, and Windows.
To install using pip: execute "pip install py_stringmatching", which retrieves the package from PyPI and installs it.
To install using conda, refer to the issues page.
To install from the source code, download the code in tar.gz format (for Linux and OS X) or zip format (for Windows), then follow the installation instructions.
You can browse source code on GitHub (version 0.4.x).
To start using the package, read the Tutorial in the User Manual (single-page version), and consult the book chapter on string matching from the book "Principles of Data Integration" if necessary.
For Contributors and Developers
How to Contribute describes the logistics of contributing (e.g., forking code on GitHub, editing documentation).
Developer Manual describes the motivations and goals of the package, design decisions, and planned future work.
A book chapter on string matching provides background materials on string similarity measures.
For Educators
The package has been successfully used by 74 students in CS 784 Spring 2016, a graduate-level data science class at UW-Madison.
Contact
For any questions, you can check the FAQ or email uwmagellan@gmail.com.
People and Organizations
See the release page for the list of contributors for each release.
External collaborators: Johnson Control Inc., WalmartLabs, Megagon Labs, American Family Insurance, Center for High Throughput Computing (CHTC)
Acknowledgment
We gratefully acknowledge financial support from WalmartLabs, Google, Johnson Control Inc., and American Family Insurance. This project is also supported by the Center for Predictive Computational Phenotyping (CPCP), an NIH Center of Excellence for Big Data, under grant NIH BD2K U54 AI117924.
Related Projects and Resources
An incomplete list of related efforts:
py_stringsimjoin: a project in our group at UW-Madison that studies how to scale string matching to Big Data.
Magellan: a project in our group at UW-Madison that builds an end-to-end entity matching management system.
Flamingo: C++ library for approximate string matching.
Abydos: NLP/IR library in Python containing string similarity measures.
python-Levenshtein: Python extension for computing string edit distances and similarities.
editdistance: Python library with fast implementation of edit distance.
Jellyfish: Python library for doing approximate and phonetic matching of strings.
FuzzyWuzzy: Python library that uses Levenshtein distance to calculate differences between sequences.
Simmetrics: Java library of similarity and distance metrics.
SecondString: Java string matching library by W. Cohen.
StringMetric: Scala string matching library.