K-mers in genome assembly

Introduction

When running a modern genome assembler you are frequently required to select a k-mer size (or range of k-mer sizes) that is most appropriate for your assembly problem. Choosing the value of this parameter is the subject of vigorous debate among practitioners, in no small part due to the fact that the principles underlying the use of k-mer-based approaches in assembly are poorly understood. Here I attempt to clarify some of these principles.

Let's start at the beginning. Assembly is necessary because the sequences "read" by a sequencing machine are commonly much shorter than the segments of DNA being sequenced. Assembly is the process that takes a shredded version of the DNA and "glues" the shreds back together into a reconstruction of the original DNA (see the genome assembly primer for more details). Assembly is simply impossible if the fragments of DNA do not overlap each other. To ensure the fragments overlap, it is necessary to sequence the DNA multiple times over, so that each base from the original DNA is present, on average, in c different reads. The variable c is called coverage and usually takes values larger than ~5 (i.e., one sequences about 5 times more DNA than is contained in the sequence being reconstructed).
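
To make the arithmetic concrete, coverage is simply the total number of sequenced bases divided by the length of the sequence being reconstructed. The numbers below are hypothetical, and the helper function is just a sketch of this definition.

    def coverage(num_reads, read_length, genome_length):
        # Depth of coverage c: total sequenced bases divided by genome length.
        return num_reads * read_length / genome_length

    # Hypothetical example: 25 million reads of 200 bp over a 1 Gbp genome.
    print(coverage(25_000_000, 200, 1_000_000_000))  # 5.0, i.e., ~5x coverage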

Do fragments overlap exactly?

Making sure that the fragments overlap is necessary but not sufficient for assembly. The assembler must be able to identify these overlaps, and do so efficiently. Setting efficiency aside for a moment, let us focus on what is necessary to find overlaps at all. Assume, for example, that two sequences overlap by exactly one character. This information is clearly useless for assembly: every sequence will share its last character with the first or last character of roughly half of the other sequences. Even deep learning cannot untangle the resulting mess. In other words, for assembly to be possible, the overlap between sequences must be long enough to be distinguished from overlaps that occur simply by chance.

Note that we are focusing at this point only on the true overlaps between reads, assuming that we know exactly where the reads fit in the genome. We'll revisit this point a bit further down. The length of these true overlaps is determined by the depth of coverage. At 1X (or 1-fold) coverage (i.e., if we sequence exactly as much DNA within reads as there is in the sequence being reconstructed) we would expect the reads not to overlap at all, but rather to abut end to end along the sequence, and we wouldn't be able to reconstruct the genome. In reality, the randomness of the sequencing process leads to some reads overlapping by quite a bit, with large gaps occurring between others. This phenomenon can be analyzed through Poisson statistics, as detailed by Eric Lander and Mike Waterman in Genomic Mapping by Fingerprinting Random Clones.
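
The sketch below illustrates the Poisson reasoning with hypothetical numbers; the quantities computed are the standard Lander-Waterman ones (probability that a base is uncovered, expected number of contigs), not a full reimplementation of their model.

    import math

    def lander_waterman(num_reads, read_length, genome_length):
        # Coverage c = N * L / G. Under the Poisson model, a given base is
        # covered by zero reads with probability e^(-c), and the expected
        # number of contigs is roughly N * e^(-c) (ignoring the minimum
        # overlap required to detect a join).
        c = num_reads * read_length / genome_length
        return c, math.exp(-c), num_reads * math.exp(-c)

    # Hypothetical 1 Mbp genome, 10,000 reads of 100 bp (i.e., 1x coverage).
    c, p_uncovered, n_contigs = lander_waterman(10_000, 100, 1_000_000)
    print(c, round(p_uncovered, 3), round(n_contigs))  # 1.0 0.368 3679

Even at 1x coverage, roughly a third of the genome is expected to be missed entirely, and the reads fall apart into thousands of separate pieces.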

What is the impact of errors?

Let's consider what happens when the sequencing technology is not perfect. Assume we are looking at an overlap of exactly 4 basepairs and want to decide whether it is "real", i.e., distinguish it from possible random matches. There are exactly 4^4 = 256 different DNA strings of length 4, i.e., only 1 in every 256 random sequences may generate an equally good overlap. Assume now that one error is possible within this same 4-letter segment. To account for this error, the assembler will only require that 3 out of the 4 letters match; 13 of the 256 possible strings satisfy this criterion, i.e., roughly one in every 20 random sequences may confuse the process.
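
These counts are small enough to check by brute force; the sketch below enumerates all 256 strings of length 4 (the target k-mer it compares against is arbitrary, since only the number of mismatches matters).

    from itertools import product

    def chance_match_count(k, max_mismatches):
        # Number of the 4^k possible DNA strings of length k that match a
        # fixed k-mer in at least k - max_mismatches positions.
        target = "A" * k  # any fixed target gives the same count
        hits = 0
        for candidate in product("ACGT", repeat=k):
            mismatches = sum(a != b for a, b in zip(candidate, target))
            hits += mismatches <= max_mismatches
        return hits, 4 ** k

    print(chance_match_count(4, 0))  # (1, 256): exact overlap
    print(chance_match_count(4, 1))  # (13, 256): one error tolerated, ~1 in 20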

As a result, when errors are present, longer overlaps are necessary to reach the same level of confidence, which in turn requires a higher depth of coverage.

A second impact of errors is that overlaps become a lot harder to identify. When only looking for exact overlaps, we can find them very efficiently using a number of indexing strategies (e.g., suffix trees, suffix arrays, BWT, etc.). These algorithms run in linear or sub-linear time, meaning that the runtime grows at most proportionally to the number of bases in the set of reads. Once errors become part of the equation, the algorithms used to find the overlaps become much more expensive. In the worst case, these algorithms run in quadratic time (proportional to the square of the number of bases in the set of sequences). Even in the best case, the runtime is inflated by a factor that depends on the error rate.
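
As a toy illustration of the exact-match case, the sketch below builds a suffix array naively and answers substring queries by binary search; real assemblers and read mappers use far more sophisticated and memory-frugal constructions, so treat the function names and the complexity here as illustrative only.

    import bisect

    def suffix_array(text):
        # Naive construction: sort all suffix start positions lexicographically.
        # Linear-time constructions exist; this is only a sketch.
        return sorted(range(len(text)), key=lambda i: text[i:])

    def occurrences(pattern, text, sa):
        # Binary search for the block of suffixes that start with `pattern`.
        suffixes = [text[i:] for i in sa]  # materialized for clarity only
        lo = bisect.bisect_left(suffixes, pattern)
        hi = bisect.bisect_right(suffixes, pattern + "~")  # '~' sorts after T
        return sorted(sa[lo:hi])

    text = "ACGTACGT"
    print(occurrences("ACG", text, suffix_array(text)))  # [0, 4]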

K-mers take 1

Efficiency/runtime is a major consideration when reconstructing sequences from tens to hundreds of millions of reads. One strategy for ensuring efficiency even when errors are present is to rely as much as possible on exact matches, and only perform further checks for a small set of the sequences. For convenience, algorithms usually rely on exact matches of fixed length k, leading to the concept of k-mers: substrings of length k from the DNA sequence being assembled.
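
A minimal sketch of this strategy: index every k-mer of every read, and only consider a pair of reads for the expensive overlap verification if they share at least one exact k-mer. The reads below are made up for illustration.

    from collections import defaultdict

    def kmer_index(reads, k):
        # Map each k-mer to the set of reads that contain it.
        index = defaultdict(set)
        for read_id, seq in enumerate(reads):
            for i in range(len(seq) - k + 1):
                index[seq[i:i + k]].add(read_id)
        return index

    def candidate_pairs(reads, k):
        # Only pairs sharing an exact k-mer are passed on to the slower,
        # error-tolerant overlap check.
        pairs = set()
        for read_ids in kmer_index(reads, k).values():
            pairs.update((a, b) for a in read_ids for b in read_ids if a < b)
        return pairs

    reads = ["ACGTACGGTA", "ACGGTATTCA", "TTTTGGGGCC"]
    print(candidate_pairs(reads, k=5))  # {(0, 1)}: only the first two reads share a 5-mer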

The value of k, overlap length (and, implicitly, depth of coverage), and error rate are directly connected. Exact matches of length k within an overlap of size o can only be found if k is small with respect to the average distance between consecutive errors in that overlap (roughly o / num_errors).
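
One way to see the trade-off is to ask how likely it is that an overlap of length o still contains at least one k-mer that is error-free in both reads. The simulation below is a rough sketch under an idealized model (independent errors, substitutions only); the parameter values are hypothetical.

    import random

    def p_shared_errorfree_kmer(o, k, error_rate, trials=20_000, seed=0):
        # Probability that two reads overlapping over o bases share at least
        # one k-mer that neither read has an error in, assuming independent
        # per-base errors at rate `error_rate` in each read.
        rng = random.Random(seed)
        hits = 0
        for _ in range(trials):
            clean = [rng.random() > error_rate and rng.random() > error_rate
                     for _ in range(o)]  # position is error-free in BOTH reads
            run = best = 0
            for ok in clean:
                run = run + 1 if ok else 0
                best = max(best, run)
            hits += best >= k  # a clean run of length >= k is a shared k-mer
        return hits / trials

    # Hypothetical numbers: 100 bp overlaps at a 5% per-base error rate.
    for k in (15, 21, 31):
        print(k, round(p_shared_errorfree_kmer(100, k, 0.05), 2))

As k grows toward the typical distance between errors, this probability drops quickly, which is exactly why noisier data pushes assemblers toward smaller values of k.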

Repeats

K-mers take 2

Metagenomics

Conclusions