Genome Assembly

Exploiting Sparseness in de novo Genome Assembly

Chengxi Ye, Zhanshan Ma, Charles H. Cannon, Mihai Pop, Douglas W. Yu

The very large memory requirements for the construction of assembly graphs for de novo genome assembly limit current algorithms to super-computing environments.

In this work, we demonstrate that constructing a sparse assembly graph, which stores only a small fraction of the observed k-mers or reads as nodes together with the links between these nodes, allows the de novo assembly of even moderately sized genomes (~500 Mbp) on a typical laptop computer.

We implement this sparse graph concept in a proof-of-principle software package, SparseAssembler, utilizing a new sparse k-mer graph structure evolved from the de Bruijn graph. We test our SparseAssembler with both simulated and real data, achieving ~90% memory savings and retaining high assembly accuracy, without sacrificing speed in comparison to existing de novo assemblers. Related programs and code are available at: http://sourceforge.net/projects/sparseassembler/
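The sparse k-mer graph idea can be illustrated with a small Python sketch. This is not the SparseAssembler code: the function names, the sampling rule (keep one k-mer in every g, staying in phase with k-mers already stored), and the single-strand, error-free setting are all assumptions made for illustration.

```python
def build_sparse_kmer_graph(reads, k, g):
    """Keep roughly one k-mer in every g as a node and link consecutive kept k-mers.

    Returns {kmer: {next_kmer: bridging_bases}}.
    Assumptions of this sketch: single strand only, no sequencing errors,
    no coverage counts, no reverse complements.
    """
    graph = {}
    for read in reads:
        n_kmers = len(read) - k + 1
        if n_kmers <= 0:
            continue
        # Stay in phase with nodes chosen from earlier reads, so that
        # overlapping reads reuse existing nodes instead of adding
        # shifted near-duplicates.
        anchor = next((i for i in range(n_kmers) if read[i:i + k] in graph), 0)
        prev_pos = None
        for pos in range(anchor % g, n_kmers, g):
            kmer = read[pos:pos + k]
            graph.setdefault(kmer, {})
            if prev_pos is not None:
                prev = read[prev_pos:prev_pos + k]
                # Store the g bases that extend the previous node to this one.
                graph[prev][kmer] = read[prev_pos + k:pos + k]
            prev_pos = pos
    return graph


def spell_unbranched_path(graph, start):
    """Follow links from `start` while each node has exactly one successor."""
    seq, node, seen = start, start, {start}
    while len(graph.get(node, {})) == 1:
        (nxt, bridge), = graph[node].items()
        if nxt in seen:  # guard against cycles
            break
        seq += bridge
        node = nxt
        seen.add(nxt)
    return seq
```

With g = 4, a read contributes about a quarter as many nodes as a dense de Bruijn graph would, while the bridging bases stored on each link still allow the sequence between nodes to be spelled out.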

Full paper:

http://www.biomedcentral.com/1471-2105/13/S6/S1

DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies

Chengxi Ye, Christopher M. Hill, Shigang Wu, Jue Ruan & Zhanshan (Sam) Ma

The highly anticipated transition from next generation sequencing (NGS) to third generation sequencing (3GS) has been difficult primarily due to high error rates and excessive sequencing cost. The high error rates make the assembly of long erroneous reads of large genomes challenging because existing software solutions are often overwhelmed by error correction tasks. Here we report a hybrid assembly approach that simultaneously utilizes NGS and 3GS data to address both issues. We gain advantages from three general and basic design principles: (i) Compact representation of the long reads leads to efficient alignments. (ii) Base-level errors can be skipped; structural errors need to be detected and corrected. (iii) Structurally correct 3GS reads are assembled and polished. In our implementation, preassembled NGS contigs are used to derive the compact representation of the long reads, motivating an algorithmic conversion from a de Bruijn graph to an overlap graph, the two major assembly paradigms. Moreover, since NGS and 3GS data can compensate for each other, our hybrid assembly approach reduces both of their sequencing requirements. Experiments show that our software is able to assemble mammalian-sized genomes orders of magnitude more quickly than existing methods without consuming a lot of memory, while saving about half of the sequencing cost.
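Design principle (i) can be sketched as follows, assuming each long read has already been reduced to the ordered list of NGS contig identifiers it aligns to. That list form, the function names, and the `min_shared` threshold are illustrative assumptions, not the DBG2OLC implementation, and contig orientation is ignored.

```python
from collections import defaultdict
from itertools import combinations


def candidate_overlaps(compressed_reads, min_shared=2):
    """Find read pairs that share at least `min_shared` contig identifiers.

    compressed_reads: {read_id: [contig_id, ...]} (hypothetical input format).
    Returns {(read_a, read_b): number_of_shared_contigs}.
    """
    # Inverted index: contig identifier -> reads that contain it.
    index = defaultdict(set)
    for read_id, contigs in compressed_reads.items():
        for contig in contigs:
            index[contig].add(read_id)

    shared = defaultdict(int)
    for read_ids in index.values():
        for a, b in combinations(sorted(read_ids), 2):
            shared[(a, b)] += 1
    return {pair: n for pair, n in shared.items() if n >= min_shared}


def suffix_prefix_overlap(a, b):
    """Length of the longest suffix of list `a` that equals a prefix of list `b`."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0
```

Because each read shrinks from thousands of bases to a handful of identifiers, comparing two reads becomes a short list comparison, and base-level errors inside a read do not affect the comparison as long as the contig anchors are found.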

Full paper:

https://www.nature.com/articles/srep31900

See also:

--The latest update of a famous genome assembler that utilizes our sparse k-mer graph strategy.

To understand what genome assembly is and the critical memory issue:

J. C. Venter et al. Science 2001.

N. Nagarajan & M. Pop. Nature Reviews Genetics 2013.

http://www.pacb.com/blog/data-release-54x-long-read-coverage-for/