Author: Martin C. Frith
Paper by E. Myers
Workshop on Algorithms in Bioinformatics (WABI; Wroclaw, Poland 2014), 52-67
The larger aim is to develop a genome assembler for PacBio DNA reads. PacBio reads are promising for assembly, because:
They are long.
Their errors are allegedly near-random (i.e. non-systematic).
They are allegedly an unbiased (uniform) sampling of the genome.
The difficulty is that their error rate is high (12-15%).
Assembly requires finding overlap alignments between the reads. This study seeks local alignments, which are more general than overlap alignments and may help to deal with artifacts (vector sequence, chimeras, etc.).
The approach is to seek alignments that exceed a minimum length (typically 1000 or 2000 bases) and a minimum %identity (e.g. 70%).
The data is too big to process all at once, so it is split into "blocks". We compare each pair of blocks, and seek alignments between the reads in block A and the reads in block B.
The method (DALIGN) has 2 steps: seed finding and alignment extension.
The aim is to get pairs of reads that share several k-mers (e.g. 14-mers).
Specifically, we seek read pairs with shared k-mers that cover >= h (e.g. 35) bases. "Cover" means that bases in overlapping k-mers are not double-counted (see the sketch below).
Moreover, the k-mers must be in nearby diagonals. (Diagonal = read1 coordinate minus read2 coordinate.)
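To make the no-double-counting concrete, here is my own C sketch (not code from the paper) of counting covered bases from one read pair's shared k-mer A-coordinates, sorted ascending:

```c
#include <stdio.h>

/* My sketch (not the paper's code): count bases of read A covered by
   shared k-mers, given their A-coordinates sorted ascending.  Overlapping
   k-mers are clipped so that no base is counted twice. */
static int covered_bases(const int *posA, int n, int k)
{
    int covered = 0, prev_end = 0;
    for (int i = 0; i < n; i++) {
        int start = posA[i] > prev_end ? posA[i] : prev_end;
        covered += posA[i] + k - start;   /* only the newly covered bases */
        prev_end = posA[i] + k;
    }
    return covered;
}

int main(void)
{
    int posA[] = {100, 105, 300};   /* two overlapping 14-mers, one isolated */
    printf("%d\n", covered_bases(posA, 3, 14));   /* 14 + 5 + 14 = 33 */
    return 0;
}
```

With h = 35, this example pair would be rejected despite sharing three 14-mers.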
Make a list of all k-mers in block A and their positions. Each list item contains 3 things:
(k-mer, read number, read coordinate)
These are encoded as one 64-bit integer (presumably 32 bits for the k-mer and 16 bits each for the read number and the read coordinate; see the sketch after these steps).
Make a similar list for block B.
Sort the lists. (So they will be in k-mer order.)
Do a merge-join, to get all pairs of same k-mers in A and B, and put their positions in a new list:
(read number A, read number B, read coordinate A, read coordinate B)
These are also encoded as one 64-bit integer.
Sort the new list.
In the final sorted list, all shared k-mers for each pair of reads are adjacent, and in order of their A coordinate. It is now easy to get read pairs satisfying the criteria.
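Here is my reconstruction of the encode and merge-join steps in C (not the paper's code; the bit layouts are the presumed ones above, which would cap a block at 65536 reads of length at most 65536):

```c
#include <stdint.h>
#include <stddef.h>

/* Presumed seed layout: k-mer in the high 32 bits, then read number and
   coordinate (16 bits each), so integer order = k-mer order. */
static uint64_t pack_seed(uint32_t kmer, uint16_t read, uint16_t pos)
{ return (uint64_t)kmer << 32 | (uint64_t)read << 16 | pos; }

static uint32_t kmer_of(uint64_t s) { return (uint32_t)(s >> 32); }
static uint16_t read_of(uint64_t s) { return (uint16_t)(s >> 16); }
static uint16_t pos_of(uint64_t s)  { return (uint16_t)s; }

/* Presumed hit layout: (read A, read B, coord A, coord B), high to low,
   so that sorting groups read pairs and orders hits by A coordinate. */
static uint64_t pack_hit(uint64_t a, uint64_t b)
{
    return (uint64_t)read_of(a) << 48 | (uint64_t)read_of(b) << 32
         | (uint64_t)pos_of(a)  << 16 | pos_of(b);
}

/* Merge-join of two k-mer-sorted seed lists: emit one hit per pair of
   equal k-mers (a cross product within each run of equal k-mers).
   Returns the number of hits written to out[]. */
static size_t merge_join(const uint64_t *A, size_t na,
                         const uint64_t *B, size_t nb, uint64_t *out)
{
    size_t ia = 0, ib = 0, nout = 0;
    while (ia < na && ib < nb) {
        if      (kmer_of(A[ia]) < kmer_of(B[ib])) ia++;
        else if (kmer_of(A[ia]) > kmer_of(B[ib])) ib++;
        else {
            uint32_t k = kmer_of(A[ia]);
            size_t ja = ia, jb = ib;
            while (ja < na && kmer_of(A[ja]) == k) ja++;
            while (jb < nb && kmer_of(B[jb]) == k) jb++;
            for (size_t x = ia; x < ja; x++)
                for (size_t y = ib; y < jb; y++)
                    out[nout++] = pack_hit(A[x], B[y]);
            ia = ja; ib = jb;
        }
    }
    return nout;
}
```

Sorting out[] as plain 64-bit integers then yields exactly the adjacency described in the last step.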
For sorting, a threaded radix sort is used. Specifically, it is an LSD (least-significant-digit) radix sort, done out-of-place (it uses an extra array equal in size to the input), with an apparently novel loop-fusion optimization (fusing two loops into one).
Comment: another paper ("Engineering Radix Sort for Strings" by J Kärkkäinen & T Rantala) counter-intuitively optimizes radix sort by loop fission.
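A single-threaded sketch of such a sort (my code, and my guess at which loops are fused: the next digit's counting pass is folded into the current digit's scatter pass, so each pass reads the array only once):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Out-of-place LSD radix sort of 64-bit keys, 8 bits per pass (single-
   threaded sketch; the paper's version is threaded).  Fused loop: while
   scattering digit d we also build the histogram for digit d+1. */
static void radix_sort64(uint64_t *a, size_t n)
{
    uint64_t *b = malloc(n * sizeof *b);   /* extra array, equal size */
    size_t count[256];

    memset(count, 0, sizeof count);
    for (size_t i = 0; i < n; i++) count[a[i] & 0xff]++;   /* digit 0 */

    for (int shift = 0; shift < 64; shift += 8) {
        size_t pos[256], sum = 0;
        for (int c = 0; c < 256; c++) { pos[c] = sum; sum += count[c]; }

        memset(count, 0, sizeof count);
        int next = shift + 8;
        for (size_t i = 0; i < n; i++) {          /* the fused loop: */
            uint64_t x = a[i];
            b[pos[x >> shift & 0xff]++] = x;      /* scatter digit d   */
            if (next < 64)
                count[x >> next & 0xff]++;        /* count digit d + 1 */
        }
        uint64_t *t = a; a = b; b = t;
    }
    free(b);
}
```

With 8-bit digits there are 8 passes, an even number, so after the final pointer swap the sorted data ends up back in the caller's array.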
The aim is to extend an alignment starting from coordinates (i, j) in two reads. Forward and reverse extensions are done separately.
The method is based on a previous algorithm by Myers, which finds longest extensions with up to d differences:
Find the longest extension with zero differences (i.e. gapless extension to the 1st mismatch).
Find 3 longest extensions with one difference (mismatch, insertion, or deletion).
Find 5 longest extensions with two differences (one per diagonal).
Etc.
At each step, we find the longest extension ending on each diagonal. (With d differences, we can reach 2d+1 diagonals.)
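Here is my sketch of that base algorithm (a furthest-reaching "wave" per difference count, with simplified boundary handling and no trimming yet):

```c
#include <string.h>

#define DMAX 64           /* most differences this sketch handles */
#define OFF  (DMAX + 1)   /* so diagonals -DMAX-1 .. DMAX+1 index the array */
#define NONE (-1)

static int fr[DMAX + 1][2 * DMAX + 3];

/* Forward extension of A[0..n) vs B[0..m) from (0,0); a difference is a
   mismatch, insertion, or deletion (cost 1 each).  fr[d][OFF+k] is the
   furthest A-coordinate i reached on diagonal k = i - j after d
   differences.  Returns the furthest antidiagonal (i + j) reached. */
static int wave_extend(const char *A, int n, const char *B, int m, int dmax)
{
    int best, i0 = 0;
    if (dmax > DMAX) dmax = DMAX;
    memset(fr, -1, sizeof fr);                 /* all-0xff bytes == NONE */

    while (i0 < n && i0 < m && A[i0] == B[i0]) i0++;   /* d = 0: slide */
    fr[0][OFF] = i0;
    best = 2 * i0;

    for (int d = 1; d <= dmax; d++) {
        for (int k = -d; k <= d; k++) {
            int p, i = NONE;
            p = fr[d-1][OFF+k];          /* mismatch: stay on diagonal k */
            if (p != NONE && p < n && p - k < m) i = p + 1;
            p = fr[d-1][OFF+k-1];        /* deletion: skip A char, k-1 -> k */
            if (p != NONE && p < n && p + 1 > i) i = p + 1;
            p = fr[d-1][OFF+k+1];        /* insertion: skip B char, k+1 -> k */
            if (p != NONE && p - (k+1) < m && p > i) i = p;
            if (i == NONE) continue;
            while (i < n && i - k < m && A[i] == B[i-k]) i++;   /* slide */
            fr[d][OFF+k] = i;
            if (2*i - k > best) best = 2*i - k;        /* i + j = 2i - k */
        }
    }
    return best;
}
```

(The trimming described next prunes entries of each wave before the next wave is computed.)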
The new method modifies this algorithm, by heuristically trimming (ignoring) unpromising extensions. There are 2 trimming criteria:
Remove points for which the last C (e.g. 60) alignment columns have < M (e.g. 33) matches.
Remove points > L (e.g. 30) antidiagonals behind the furthest point.
Criterion 1 is implemented with length-C bit-vectors: a 1 is shifted in for each match and a 0 for each difference.
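Since C = 60 fits in one 64-bit word, criterion 1 is a shift plus a popcount. My sketch (I pre-fill the window with 1s so points are not trimmed before C real columns have been seen; the paper may handle startup differently):

```c
#include <stdint.h>

#define C 60   /* window length, in alignment columns */
#define M 33   /* minimum matches required in the window */

static uint64_t window_init(void) { return (1ULL << C) - 1; }   /* all 1s */

/* Shift in the newest column (1 = match, 0 = difference); returns 0 if
   the point should now be trimmed.  __builtin_popcountll is GCC/Clang. */
static int window_push(uint64_t *w, int is_match)
{
    *w = (*w << 1 | (uint64_t)is_match) & ((1ULL << C) - 1);
    return __builtin_popcountll(*w) >= M;
}
```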
The extension ends when it hits the end of a sequence, or when all points are trimmed by criterion 1. In the latter case, another criterion is needed to choose a good alignment endpoint. That criterion is:
The furthest point where every suffix of the last E (e.g. 30) columns has >= X %identity.
The implementation is a bit involved, and uses pre-computed lookup tables.
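Those tables are not detailed in my notes; here is a direct, unoptimized version of the endpoint test, reusing criterion 1's column bits (bit 0 = most recent column):

```c
#include <stdint.h>

/* Return 1 iff every suffix of the last E alignment columns has >= X%
   matches, given the columns as bits (bit 0 = newest, 1 = match).
   Direct O(E) scan; the paper does this with precomputed lookup tables. */
static int every_suffix_ok(uint64_t bits, int E, int X)
{
    int matches = 0;
    for (int len = 1; len <= E; len++) {
        matches += (int)(bits >> (len - 1) & 1);
        if (100 * matches < X * len) return 0;  /* length-len suffix fails */
    }
    return 1;
}
```

The chosen endpoint is then the furthest point whose recent columns pass this test.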
Test 1: generate a 1Mb random sequence, generate errors at rate e in 2 copies of it, and try to align them.
The alignment %identity is greater than (1 - 2e), the naive expectation when each copy independently gets errors at rate e (identity ~ (1 - e)^2 ~ 1 - 2e). [Why? Back-mutations? Alignment ambiguity?]
Sensitivity depends critically on M being small enough. For PacBio, M/C = 55% seems good.
Test 2: try the extension algorithm on 2 random sequences.
Fast termination depends on M not being too small. As M/C falls toward ~48%, random DNA will align.
Test 3: 40X coverage of a random 10Mb genome.
"It works".
Test 4: real genomes. These are highly repetitive, so:
Overly-frequent k-mers were suppressed.
Low-complexity sequence was soft-masked.
Result: DALIGN is much faster and more sensitive than BLASR. [But BLASR was not designed for exactly this task.]
I like how it uses classic, simple, excellent methods (e.g. radix sort). But the new extension method seems kludgy, and there are some questions:
The focus on %identity is questionable. PacBio has unequal rates of substitution, insertion, and deletion, suggesting it would be better to use different costs for them. Maybe that would be too slow, maybe not; the paper does not say.
It is strange that almost all the paper considers random, non-repetitive sequences. Aren't repeats the main problem?
The alignment extension method is reminiscent of BLAST's gapped x-drop algorithm, and especially of MegaBLAST's fast version [Z Zhang et al. J Comput Biol. 2000 7:203-14]. The MegaBLAST algorithm seems simpler, and it maximizes a score, so every suffix automatically has positive score - this doesn't have to be kludged in at the end. My biggest criticism is the lack of any comparison to or mention of this alternative.
I'm always saddened when the existence of LAST is completely ignored, though I'm pretty used to it by now...