SACHICA++: c++ implementation of sachica (Scalable Algorithm for Characteristic/Homogeneous Interval Calculation)
The algorithm sachica designed by Uno takes as input a set of strings which are the same length, and enumerates all string pairs whose hamming distance is no more than a user-defined threshold. The sachica uses block-wise sorting method for the efficient enumeration.
National Institute of Advanced Industrial Science and Technolog (AIST)
Sachica is free software; you can redistribute it and/or modify it under the terms of the new BSD Licence.
c++ compiler with STL (Standard Template Library)
How to install
% tar -xvzf sachica_ver2.0.tar.gz
% cd sachica_ver2.0
2009-11-15: accerate file read and write2009-11-7 : sachica++ ver2.0 release
./sachica [option] inputfile outputfile
-dist NUM: set maximum hamming distance (default: 1)
-numchunks NUM: set the number of chunks (default: 3)
Each row in input file corresponds to a string. The input strings should be the same length.
An examle for input data is as follows:
The program outputs the pairs of string identifiers into the outputfile.
Here is an example:
Takeaki Uno, An Efficient Algorithm for Finding Similar Short Substrings from Large Scale String Data, In Proceedings of PAKDD2008, Lecture Notes in Artificial Intelligence 5012, pp. 345-356, 2008
Yasuo Tabei, Takeaki Uno, Masashi Sugiyama, Koji Tsuda: Single Versus Multiple Sorting in All Pairs Similarity Search, The 2nd Asian Conference on Machine Learning (ACML2010), Tokyo, Japan, 2010.