

This project aims to build a Python package that provides a comprehensive and scalable set of string tokenizers (such as alphabetical and whitespace tokenizers) and string similarity measures (such as edit distance, Jaccard, and TF/IDF).
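
To make the two building blocks concrete, here is a minimal pure-Python sketch of a whitespace tokenizer, the Jaccard measure over token sets, and edit distance. It is only an illustration of the techniques named above, not the package's own API: the function names `whitespace_tokenize`, `jaccard`, and `edit_distance` are hypothetical, and the sketch makes no attempt at the scalability the project targets.

```python
def whitespace_tokenize(text):
    """Split a string into tokens on runs of whitespace."""
    return text.split()


def jaccard(tokens_a, tokens_b):
    """Jaccard similarity |A & B| / |A | B| over the two token sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        # Assumption made for this sketch: two empty sets count as identical.
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def edit_distance(s, t):
    """Levenshtein distance by dynamic programming, keeping one row at a time."""
    previous = list(range(len(t) + 1))  # distance from "" to each prefix of t
    for i, char_s in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to ""
        for j, char_t in enumerate(t, start=1):
            substitution = previous[j - 1] + (char_s != char_t)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


a = whitespace_tokenize("data science at scale")
b = whitespace_tokenize("data management at scale")
print(jaccard(a, b))                       # 3 shared tokens of 5 distinct: 0.6
print(edit_distance("kitten", "sitting"))  # 3
```

Tokenizers and similarity measures compose naturally: a set-based measure such as Jaccard takes the output of a tokenizer, while a sequence-based measure such as edit distance works on the raw strings.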

For Users
For Contributors and Developers
For Educators
  • For any questions, you can check the FAQ or email  
People and Organizations
  • We gratefully acknowledge financial support from WalmartLabs, Google, Johnson Controls Inc., and American Family Insurance. This project is also supported by the Center for Predictive Computational Phenotyping (CPCP), an NIH Center of Excellence for Big Data, under grant NIH BD2K U54 AI117924.
Related Projects and Resources 

An incomplete list of related efforts:
  • py_stringsimjoin: a project in our group at UW-Madison that studies how to scale string matching to Big Data.
  • Magellan: a project in our group at UW-Madison that builds an end-to-end entity matching management system.
  • Flamingo: C++ library for approximate string matching.
  • Abydos: NLP/IR library in Python containing string similarity measures.
  • python-Levenshtein: Python extension for computing string edit distances and similarities.
  • editdistance: Python library with fast implementation of edit distance.
  • Jellyfish: Python library for doing approximate and phonetic matching of strings.
  • FuzzyWuzzy: Python library that uses Levenshtein distance to calculate differences between sequences.
  • Simmetrics: Java library of similarity and distance metrics.
  • SecondString: Java string matching library by W. Cohen.
  • StringMetric: Scala string matching library.
  • Presentation on String Similarity Measures