SACHICA++: c++ implementation of sachica (Scalable Algorithm for Characteristic/Homogeneous Interval Calculation)

Introduction


The algorithm sachica designed by Uno[1] takes as input a set of strings which are the same length, and enumerates all string pairs whose hamming distance is no more than a user-defined threshold. The sachica uses block-wise sorting method for the efficient enumeration.


Author


Computational Biology Research Center
National Institute of Advanced Industrial Science and Technolog (AIST)

Download


Sachica is free software; you can redistribute it and/or modify it under the terms of the new BSD Licence.


Source

Installation


Requirements
   c++ compiler with STL (Standard Template Library)
How to install
  % tar -xvzf sachica_ver2.0.tar.gz
  % cd sachica_ver2.0
  % make

News


2009-11-15: accerate file read and write
2009-11-7 : sachica++ ver2.0 release

Usage


./sachica [option] inputfile outputfile
option:
 -dist NUM: set maximum hamming distance (default: 1)
 -numchunks NUM: set the number of chunks (default: 3)

Format of input data


Each row in input file corresponds to a string. The input strings should be the same length.
An examle for input data is as follows:

ATGCTATCGTCTCTGCTGC
ATGCTAGGGTCTCTGCTGC
CAAGCGAACGCGAGCTGGC
CGCGCGGATGCGATGGCGG
GGGCCGCGAGCGCGAGCGC
CGCGCGGAGGGGAGGGCGG

The program outputs the pairs of string identifiers into the outputfile. 
Here is an example:
0 1
3 4
2 3
2 4
3 5
.
.


Reference


[1]Takeaki Uno, An Efficient Algorithm for Finding Similar Short Substrings from Large Scale String Data, In Proceedings of PAKDD2008, Lecture Notes in Artificial Intelligence 5012, pp. 345-356, 2008
[2]Yasuo Tabei, Takeaki Uno, Masashi Sugiyama, Koji Tsuda: Single Versus Multiple Sorting in All Pairs Similarity Search, The 2nd Asian Conference on Machine Learning (ACML2010), Tokyo, Japan, 2010.
ċ
sachica_ver2.0.tar.gz
(154k)
Yasuo Tabei,
Nov 19, 2009, 3:59 AM
Comments