A formal version of this work, 'Exploiting Sparseness in de novo Genome Assembly' can be found here:
Chengxi Ye1,2,7, Zhanshan (Sam) Ma2,4, Charles H. Cannon5,6, Mihai Pop7, Douglas W. Yu2,3
1Ecology & Evolution of Plant-Animal Interaction Group, Xishuangbanna Tropical Botanic Garden, Chinese Academy of Sciences, Menglun, Jinghong, Yunnan 666303 China.
2Ecology, Conservation, and Environment Center , State Key Laboratory of Genetic Resources and Evolution, 32 Jiaochang East Rd., Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan 650223 China.
3School of Biological Sciences, University of East Anglia, Norwich, Norfolk NR47TJ UK.
4Computational Biology and Medical Ecology Lab, Kunming Institute of Zoology, Chinese Academy of Sciences.
5Ecological Evolution Group, Xishuangbanna Tropical Botanic Garden, Chinese Academy of Sciences, Menglun, Jinghong, Yunnan 666303 China.
6Department of Biological Sciences, Texas Tech University, Lubbock, TX 79410 USA.
7Department of Computer Science and Center for Bioinformatics and Computational Biology, Institute for Advanced Computer Studies, University of Maryland, College Park, MD, USA.
Motivation: A major limitation of the de Bruijn graph approach for de novo genome assembly is the large memory requirement for graph construction, particularly for large genomes. In this paper, we introduce a new general model for genome assembly based upon a sparse k-mer graph model, implemented in the SparseAssembler.
Results: We demonstrate that the exhaustive decomposition of reads into all overlapping k-mers is unnecessary. Instead, a sparse graph structure, saving only a sparse subset of observed k-mers and the links between them, greatly reduces the memory space requirements for de novo genome assembly. We traverse this sparse k-mer graph to assemble contigs and ultimately complete the genome assembly. We adopt a Dijkstra-like breadth-first search algorithm to circumvent sequencing errors and resolve polymorphisms. Here, we test our SparseAssembler with both simulated and real data, achieving ~90% memory savings and retaining assembly accuracy, but without sacrificing speed in comparison to existing assemblers.
The source code of SparseAssembler can be downloaded here: http://sourceforge.net/projects/sparseassembler/
Tips for testing SparseAssembler:
k: kmer size, support 15~127 (important).
g: sparseness factor, or number of skipped intermediate k-mers, support 1-25, suggest a variety from 10-25 (important).
f: single end reads data, input filename, input multiple files like f ReadFile1 f ReadFile2 f ReadFile3.
GS: estimated genome size, set a larger value to preallocate space for both the real k-mers and the false k-mers (sequencing errors). (You can set it to 10X real genome size to preallocate space for the errors.)
NodeCovTh: coverage threshold for spurious k-mers, support 0-16, suggest 0-3 (important).
EdgeCovTh: coverage threshold for spurious links, support 0-16, suggest 0-2 (important, set to 0 for low-coverage data).
PathCovTh: coverage threshold for merging bubbles in the breadth-first search, support 0-1000 (important).
LD: 1 to load a previously built sparse k-mer graph, so you can set different values for the above 3 thresholds.
The attached version is not fully tested on paired-end reads but could produce better contigs,
ResolveBranchesPE: 1 to use paired reads to produce longer contigs.
LinkCovTh: threshold for the paired links.
use p1 p2 for paired input files with inward direction, for outward direction use l1, l2.
Because it is a newly developed showing-the-purpose assembler, it can have some problems. If you have further concerns, please contact Chengxi Ye: cxy AT umd.edu.