py_stringmatching
This project seeks to build a Python software package that consists of a comprehensive and scalable set of string tokenizers (such as alphabetical and whitespace tokenizers) and string similarity measures (such as edit distance, Jaccard, and TF/IDF).
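To illustrate what such tokenizers and measures compute, here is a minimal plain-Python sketch of whitespace tokenization, Jaccard similarity, and Levenshtein edit distance. Note that this is illustrative pseudological code, not the py_stringmatching API; the function names here are hypothetical.

```python
# Illustrative sketches of concepts the package covers
# (NOT the py_stringmatching API).

def whitespace_tokenize(s):
    """Split a string into tokens on whitespace."""
    return s.split()

def jaccard(tokens1, tokens2):
    """Jaccard similarity: |intersection| / |union| of the token sets."""
    s1, s2 = set(tokens1), set(tokens2)
    if not s1 and not s2:
        return 1.0
    return len(s1 & s2) / len(s1 | s2)

def edit_distance(s, t):
    """Levenshtein edit distance via dynamic programming."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

tok1 = whitespace_tokenize("data science class")
tok2 = whitespace_tokenize("data integration class")
print(jaccard(tok1, tok2))                 # 2 shared tokens / 4 total -> 0.5
print(edit_distance("kitten", "sitting"))  # classic example -> 3
```

The package itself wraps these ideas (and many more measures, such as TF/IDF) in reusable, scalable classes; see the Tutorial in the User Manual for the actual API.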
For Users
The package is free, open-source, and BSD 3-Clause licensed.
The current version is 0.4.1, released 02/22/2019; it is the latest. A list of all released versions is also available.
Requires Python 2.7 or 3.4+ and a C or C++ compiler (numpy and six are also required, but they are installed automatically).
Has been tested on Linux, OS X, and Windows.
To install using pip: execute "pip install py_stringmatching", which retrieves the package from PyPI and installs it.
To install using conda, refer to the issues page.
To install from the source code, download the code in tar.gz format (for Linux and OS X) or zip format (for Windows), then follow the installation instructions.
You can browse source code on GitHub (version 0.4.x).
To start using the package, read the Tutorial in the User Manual (single-page version), and consult the book chapter on string matching from the book "Principles of Data Integration" if necessary.
For Contributors and Developers
How to Contribute describes the logistics of contributing (e.g., forking code on GitHub, editing documentation).
Developer Manual describes the motivations and goals of the package, design decisions, and planned future work.
A book chapter on string matching provides background materials on string similarity measures.
For Educators
The package has been successfully used by 74 students in CS 784 Spring 2016, a graduate-level data science class at UW-Madison.
Contact
For any questions, you can check the FAQ or email uwmagellan@gmail.com.
People and Organizations
See the release page for the list of contributors for each release.
External collaborators: Johnson Control Inc., WalmartLabs, Megagon Labs, American Family Insurance, Center for High Throughput Computing (CHTC)
Acknowledgment
We gratefully acknowledge financial support from WalmartLabs, Google, Johnson Control Inc., and American Family Insurance. This project is also supported by the Center for Predictive Computational Phenotyping (CPCP), an NIH Center of Excellence for Big Data, under grant NIH BD2K U54 AI117924.
Related Projects and Resources
An incomplete list of related efforts:
py_stringsimjoin: a project in our group at UW-Madison that studies how to scale string matching to Big Data.
Magellan: a project in our group at UW-Madison that builds an end-to-end entity matching management system.
Flamingo: C++ library for approximate string matching.
Abydos: NLP/IR library in Python containing string similarity measures.
python-Levenshtein: Python extension for computing string edit distances and similarities.
editdistance: Python library with fast implementation of edit distance.
Jellyfish: Python library for doing approximate and phonetic matching of strings.
FuzzyWuzzy: Python library that uses Levenshtein distance to calculate differences between sequences.
Simmetrics: Java library of similarity and distance metrics.
SecondString: Java string matching library by W. Cohen.
StringMetric: Scala string matching library.