
py_stringmatching

This project seeks to build a Python software package that provides a comprehensive and scalable set of string tokenizers (such as alphabetical and whitespace tokenizers) and string similarity measures (such as edit distance, Jaccard, and TF/IDF).
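
To make the two kinds of components concrete, here is a minimal pure-Python sketch of a whitespace tokenizer, an alphabetical tokenizer, the Jaccard measure, and edit distance. It only illustrates the concepts and is not the package's implementation or API: the function names (`whitespace_tokenize`, `alphabetical_tokenize`, `jaccard`, `edit_distance`) are hypothetical, and the choices to restrict the alphabetical tokenizer to ASCII letters and to score two empty token sets as 1.0 are assumptions.

```python
import re


def whitespace_tokenize(text):
    """Split a string on runs of whitespace (hypothetical helper)."""
    return text.split()


def alphabetical_tokenize(text):
    """Return maximal runs of ASCII letters (assumption: ASCII only)."""
    return re.findall(r"[A-Za-z]+", text)


def jaccard(tokens_a, tokens_b):
    """Jaccard similarity: |A intersect B| / |A union B| over token sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        # Assumption: two empty token sets are treated as identical.
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def edit_distance(s, t):
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        curr = [i]
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            curr.append(min(prev[j] + 1,          # delete char_s
                            curr[j - 1] + 1,      # insert char_t
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Tokenize two strings, then compare them at the token and character level.
left = whitespace_tokenize("data science")
right = whitespace_tokenize("data management")
token_similarity = jaccard(left, right)       # 1 shared token out of 3 distinct
char_distance = edit_distance("kitten", "sitting")  # 3
```

Set-based measures such as Jaccard operate on tokenizer output, while edit distance works directly on the character sequences, which is why a package of this kind pairs tokenizers with similarity measures.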


For Users

For Contributors and Developers
For Educators
All Documentation
Contact 
  • For questions, check the FAQ, post to the Google group, or email uwmagellan@gmail.com.
  • To contribute, see the section "For Contributors and Developers". 
People and Organizations
Additional Links
  • The internal project page (permission required). 
Acknowledgment
  • We gratefully acknowledge financial support from WalmartLabs, Google, and Johnson Controls Inc. This project is also supported by the Center for Predictive Computational Phenotyping (CPCP), an NIH Center of Excellence for Big Data, under grant NIH BD2K U54 AI117924.
Related Projects and Resources 

An incomplete list of related efforts:
  • py_stringsimjoin: a project in our group at UW-Madison that studies how to scale string matching to Big Data.
  • Magellan: a project in our group at UW-Madison that builds an end-to-end entity matching management system.
  • Flamingo: C++ library for approximate string matching.
  • Abydos: NLP/IR library in Python containing string similarity measures.
  • python-Levenshtein: Python extension for computing string edit distances and similarities.
  • editdistance: Python library with fast implementation of edit distance.
  • Jellyfish: Python library for doing approximate and phonetic matching of strings.
  • FuzzyWuzzy: Python library which uses Levenshtein Distance to help calculate differences between sequences.
  • Simmetrics: Java library of similarity and distance metrics.
  • SecondString: Java string matching library by W. Cohen.
  • StringMetric: Scala string matching library.
  • Presentation on String Similarity Measures