DBG2OLC: Efficient Assembly of Large Genomes Using the Compressed Overlap Graph

Chengxi Ye, Chris Hill1, Shigang Wu2, Jue Ruan2, Zhanshan (Sam) Ma

1Department of Computer Science,  University of Maryland, College Park, MD 20742, USA.

2Agricultural Genome Institute, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong 518120, China.

3Computational Biology and Medical Ecology Lab, State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan 650223 China.


If you have any suggestions or comments, please feel free to contact Chengxi Ye: cxy@umd.edu.

Our work is published in Scientific Reports, if you find our work helpful please kindly cite it:

Ye, C. et al. DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies. Sci. Rep. 6, 31900; doi: 10.1038/srep31900 (2016).

http://www.nature.com/articles/srep31900



Abstract

Background

The genome assembly computational problem is preventing the transition from the prevalent second generation to third generation sequencing technology. The problem emerged because the long reads (which is essential for the third generation technology to deliver its potentially huge benefits over the second generation technology) and the high sequencing error rates made the assembly pipeline prohibitive expensive in terms of computational time and memory space consumption.

Methods

In this paper, we propose and demonstrate a novel algorithm that allows efficient assembly of long erroneous reads of mammalian size genomes. Our algorithm converts the de novo genome assembly problem from the de Bruijn graph to the overlap layout consensus framework. We only focus on overlaps composed of reads that are non-contained within any contigs built with de Bruijn graph algorithm, rather than on all the overlaps in the genome data sets. For each read spanning through several contigs, we compress the regions that lie inside each de Bruijn graph contigs, which greatly lowers the length of the reads and therefore the complexity of the assembly problem. The new algorithm transforms previously prohibitive tasks such as pair-wise alignment into jobs that can be completed within small amount of time. A compressed overlap graph that preserves all necessary information is constructed with the compressed reads to enable the final-stage assembly.

Results

We implement the new algorithm in a proof-of-concept software package DBG2OLC. Experiments with the sequencing data from the third generation technologies show that our method is able to assemble large genomes much more efficiently than existing methods. On a large PacBio human genome dataset we calculated the pair-wise alignment of 54x erroneous long reads of human genome in 6 hours on a desktop computer compared to the 405,000 CPU hours using a clusters, previously reported by Pacific Biosciences. The final assembly results were in comparably high quality.


How to run the programs:

https://github.com/yechengxi/DBG2OLC/blob/master/Manual.docx?raw=true

Related code and precompiled files:

https://github.com/yechengxi/DBG2OLC

https://github.com/yechengxi/Sparc

https://github.com/yechengxi/SparseAssembler


Comments