Tools

LJA is a new genome assembly algorithm based on de Bruijn graphs and designed for PacBio HiFi read assembly. LJA reduces the error rate of (already very accurate) HiFi reads by three orders of magnitude (making them nearly error-free) and constructs the de Bruijn graph for large genomes and large k-mer sizes. Since the de Bruijn graph constructed for a fixed k-mer size is typically either too tangled or too fragmented, LJA uses a new concept of a multiplex de Bruijn graph with varying k-mer sizes. LJA improves on the state-of-the-art assemblers with respect to both accuracy and contiguity and enables automated telomere-to-telomere assemblies of entire human chromosomes (e.g., 11 chromosomes of the human genome were completely assembled by LJA).

Bankevich et al., J. Comp. Biol. 2012, Nurk et al., J. Comp. Biol. 2013, Prjibelski et al., Bioinformatics, 2014

The initial goal of this project was to enable genome assembly from single-cell sequencing data obtained using MDA amplification. However, the resulting tool not only solved the single-cell genome assembly problem but also showed excellent results on regular isolate genomes. As a result, SPAdes is now a tool of choice for genome assembly in thousands of laboratories all over the world, and the SPAdes paper (Bankevich et al., J. Comp. Biol. 2012) is the most cited genome assembly paper (more than 10000 citations so far).

The SPAdes code became the basis for many other genome assembly tools that address different cases of the genome assembly problem, including metagenome assembly, RNA assembly, and hybrid assembly (using long-read and/or synthetic long-read technologies). I personally took part in three such projects described below: dipSPAdes, truSPAdes and cloudSPAdes, each of which is available as a part of the SPAdes tool.

Safonova et al., J. Comp. Biol. 2014

While the number of sequenced diploid genomes has been steadily increasing in the last few years, assembly of highly polymorphic (HP) diploid genomes remains challenging. As a result, there is a shortage of tools for assembling HP genomes from next-generation sequencing (NGS) data. The initial approaches to assembling HP genomes were proposed in the pre-NGS era and are not well suited for NGS projects. To address this limitation, we developed dipSPAdes, the first de Bruijn graph assembler for HP genomes, which significantly improves on the state-of-the-art assemblers for HP diploid genomes.

Bankevich and Pevzner, Nat. Methods 2016

The recently introduced Illumina TruSeq synthetic long read (TSLR) technology generates long and accurate virtual reads from an assembly of barcoded pools of short reads. The assembly of these read pools presents unique algorithmic challenges. We have shown that many of these challenges are similar in nature to the ones present in the single-cell assembly problem that were addressed by SPAdes. The truSPAdes algorithm further optimizes the assembly of synthetic long reads from pools of short reads. We showed that truSPAdes produces synthetic long reads superior in length and quality to those of other assembly tools, including the Illumina in-house assembler for barcode assembly.

Tolstoganov et al., Bioinformatics 2019

The recently developed barcoding-based synthetic long read (SLR) technologies have already found many applications in genome assembly and analysis. However, although some new barcoding protocols are emerging and the range of SLR applications is being expanded, the existing SLR assemblers are optimized for a narrow range of parameters and are not easily extendable to new barcoding technologies and new applications such as metagenomics or hybrid assembly.

cloudSPAdes is a genome assembly tool designed to address the algorithmic challenges of the SLR assembly. cloudSPAdes is based on analyzing the de Bruijn graph of SLRs. We benchmarked cloudSPAdes across various barcoding technologies/applications and demonstrated that it improves on the state-of-the-art SLR assemblers in accuracy and speed.

Bankevich and Pevzner, Cell Systems 2018

Reduced microbiome diversity has been linked to several diseases. However, estimating the diversity of bacterial communities—the number and the total length of distinct genomes within a metagenome—remains an open problem in microbial ecology. MetaLen is an algorithm for estimating the microbial diversity in a metagenomic sample based on a joint analysis of short and long reads. Unlike previous approaches, the algorithm does not make any assumptions on the distribution of the frequencies of genomes within a metagenome and does not require a large database that covers the total diversity. We estimate that genomes comprising a human gut metagenome have total length varying from 1.3 to 3.5 billion nucleotides, with genomes responsible for 50% of total abundance having total length varying from only 25 to 61 million nucleotides. In contrast, genomes comprising an aquifer sediment metagenome have more than two orders of magnitude larger total length (≈840 billion nucleotides).

JumboDBG

De Bruijn graphs are traditionally constructed from short reads, and thus very small values of k are used. However, with the appearance of long accurate reads, constructing the de Bruijn graph for large values of k became an important problem. JumboDBG solves this problem: it is able to construct the de Bruijn graph from reads or genomes using arbitrarily large values of k.
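As a rough illustration of what constructing a de Bruijn graph for a given k means, here is a minimal in-memory sketch. It is not the JumboDBG algorithm: it stores every k-mer explicitly, which is exactly what stops scaling for large genomes and large k. The function names `build_de_bruijn` and `count_branching` and the toy reads are assumptions made for this example only.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Naive de Bruijn graph: nodes are (k-1)-mers, edges are k-mers.

    Illustrative only; every k-mer is held in memory, so this does not
    scale to large genomes or large values of k.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # The edge goes from the k-mer's prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def count_branching(graph):
    """Number of nodes with more than one outgoing edge (a rough 'tangledness' measure)."""
    return sum(1 for successors in graph.values() if len(successors) > 1)


# Two toy overlapping reads (hypothetical data).
reads = ["ACGTACGGA", "GTACGGATT"]
small_k_graph = build_de_bruijn(reads, 3)
large_k_graph = build_de_bruijn(reads, 6)

print(count_branching(small_k_graph))  # small k: the repeated "ACG" creates a branch
print(count_branching(large_k_graph))  # larger k: the same reads form a simple path
```

On this toy input, the small-k graph branches at the repeated 3-mer while the larger-k graph does not, which is the usual motivation for raising k once reads are long and accurate enough.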

Bankevich and Pevzner, RECOMB 2020

Long-read technologies revolutionized genome assembly and enabled resolution of bridged repeats (i.e., repeats that are spanned by some reads) in various genomes. However, the problem of resolving unbridged repeats (such as long segmental duplications in the human genome) remains largely unsolved, making it a major obstacle towards achieving the goal of complete genome assemblies. Moreover, the challenge of resolving unbridged repeats is not limited to eukaryotic genomes but also impairs assemblies of long repeats in bacterial genomes and metagenomes. The mosaicFlye algorithm was shown to resolve complex unbridged repeats based on differences between various repeat copies and was applied to improve assemblies of bacterial genomes and metagenomes.

Although insecticidal proteins have become an important biopesticide against a wide range of insects, their prolonged use has led to toxin resistance developing in various insect species. Thus, it is important to search for novel insecticidal protein genes (IPGs) that are effective in controlling resistant insect populations. However, although IPG prediction in complete genomes is a well-studied problem, IPG prediction in fragmented genomic/metagenomic assemblies remains challenging. The existing gene prediction tools often assume that each gene is encoded within a single contig in the assembly, a condition that is violated for many IPGs that are scattered across multiple contigs, making it difficult to reconstruct them. The situation is even more severe in shotgun metagenomics, where the contigs are often short, and the existing tools fail to predict a large fraction of IPGs. While it is difficult to assemble IPGs in a single contig, the structure of the genome assembly graph often provides clues on how to combine multiple contigs into segments encoding a single IPG. Our algorithm, ORFeus, uses these observations to predict IPGs in assembly graphs. We applied ORFeus to multiple isolate and metagenomic datasets and discovered hundreds of potential novel IPGs.
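To make the idea of a gene scattered across contigs concrete, the toy sketch below spells sequences along short paths of an assembly graph and looks for the longest open reading frame. It is not the ORFeus algorithm: it assumes a fixed overlap between adjacent contigs, scans the forward strand only, and uses hypothetical names (`longest_orf`, `spell_paths`, `best_orf_path`) and made-up contigs.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Longest ATG...stop ORF on the forward strand (all three frames)."""
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start > len(best):
                    best = seq[start:i + 3]
                start = None
    return best


def spell_paths(contigs, edges, overlap, max_contigs):
    """Yield (path, sequence) for every simple path of at most max_contigs contigs.

    Adjacent contigs are assumed to share exactly `overlap` nucleotides.
    """
    stack = [([name], seq) for name, seq in contigs.items()]
    while stack:
        path, seq = stack.pop()
        yield path, seq
        if len(path) < max_contigs:
            for nxt in edges.get(path[-1], []):
                if nxt not in path:
                    stack.append((path + [nxt], seq + contigs[nxt][overlap:]))


def best_orf_path(contigs, edges, overlap, max_contigs=3):
    """Return the (path, ORF) pair with the longest ORF over all spelled paths."""
    candidates = ((path, longest_orf(seq))
                  for path, seq in spell_paths(contigs, edges, overlap, max_contigs))
    return max(candidates, key=lambda item: len(item[1]))


# Hypothetical toy graph: the gene starts in c1 and ends in c2; c3 is a competing branch.
contigs = {"c1": "ATGAAACCC", "c2": "CCCGGGTTTTAA", "c3": "CCCTAGGGA"}
edges = {"c1": ["c2", "c3"]}

path, orf = best_orf_path(contigs, edges, overlap=3)
print(path, orf)
```

In this example no single contig contains a complete ORF, while the path through `c1` and `c2` spells one, which mirrors the situation described above where per-contig gene prediction misses genes that the graph structure can recover.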