by TW Lam, WK Sung, SL Tam, CK Wong and SM Yiu
University of Hong Kong & National University of Singapore
Bioinformatics 2008 24(6):791-797.
An exact local alignment algorithm that is much faster than Smith-Waterman-Gotoh. Wow!
How it achieves this speed is not clear unless you read the paper carefully!
Probably, it is fast only for certain scoring schemes (e.g. match score = +1, mismatch score = -3).
Fast local alignment using a suffix tree.
Reduce the memory requirement, by using a Burrows-Wheeler transform.
Find the highest-scoring local alignment of two sequences:
Sequence T has length n.
Sequence P has length m.
Match score = a
Mismatch score = b
Score for a gap of length r = g + r*s (gap-open penalty g, plus s per gapped letter)
a > 0, b < 0, g < 0, s < 0
Smith-Waterman-Gotoh algorithm. This is a dynamic programming (DP) algorithm, which calculates a score matrix of size m*n. So the time requirement is O(m*n). This is too slow for genome-size data.
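For reference, here is a minimal Python sketch of Smith-Waterman-Gotoh (not from the article; the function name and the default scores are illustrative, using the paper's example of match +1 / mismatch -3 and arbitrary gap parameters g, s):

```python
def smith_waterman_gotoh(P, T, a=1, b=-3, g=-5, s=-2):
    """Best local alignment score; a gap of length r costs g + r*s."""
    m, n = len(P), len(T)
    NEG = float("-inf")
    H = [[0] * (n + 1) for _ in range(m + 1)]    # best score ending at (i, j)
    E = [[NEG] * (n + 1) for _ in range(m + 1)]  # ... ending with a gap in P
    F = [[NEG] * (n + 1) for _ in range(m + 1)]  # ... ending with a gap in T
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            E[i][j] = max(E[i][j - 1] + s, H[i][j - 1] + g + s)  # open/extend
            F[i][j] = max(F[i - 1][j] + s, H[i - 1][j] + g + s)
            sub = a if P[i - 1] == T[j - 1] else b
            H[i][j] = max(0, H[i - 1][j - 1] + sub, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best
```

Filling all three m*n matrices is exactly the O(m*n) cost that makes this too slow for genomes.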
To explain this, let's start with a simple but very slow algorithm, and then improve it.
Align P to every suffix of T. For every suffix S, we calculate a DP matrix of size m*|S|. This obviously works, but the time requirement is O(m*n*n).
Only consider alignments that begin at the start of S. This is OK because alignments that do not begin at the start of S will be picked up when we consider other suffixes. To do this, whenever the DP score becomes <= 0, reset it to -infinity. In other words, we are not allowed to re-start alignments past the beginning of S.
"Pruning": Calculate the DP matrix row-by-row, and quit early if all the scores in one row are -infinity. (Each row has m entries.)
Suffix merging: Many of the suffixes probably begin with the same sequence. For example, many suffixes begin with aggct... The first few rows of the DP matrix will be identical for these suffixes. We can avoid repeating these identical calculations.
Example: Suppose two suffixes are tggcac and tggctggcac. First, we calculate the DP matrix for P versus tggcac. Then, we keep the first four rows of the matrix, and fill in the remaining rows for P versus tggctggcac.
When we merge suffixes like this, the shared prefixes form a tree, which is why this structure is called a "suffix tree".
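A toy count (illustrative code, not the paper's) of how many DP rows are shared between alphabetically adjacent suffixes, i.e. how much recomputation merging avoids:

```python
def shared_prefix_len(x, y):
    """Length of the longest common prefix of x and y."""
    k = 0
    while k < min(len(x), len(y)) and x[k] == y[k]:
        k += 1
    return k

# Processing suffixes in alphabetical order, the DP rows for the prefix
# shared with the previous suffix can be kept instead of recomputed.
suffixes = sorted("BANANA"[i:] for i in range(6))
prev, reused = "", 0
for suf in suffixes:
    reused += shared_prefix_len(prev, suf)  # rows we skip recomputing
    prev = suf
# For "BANANA": 6 of the 21 total DP rows are shared and need not be redone.
```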
It's hard to tell. Probably, it works better when the mismatch and gap costs are high compared to the match score, so that pruning happens earlier.
The article reports that it is "at least 1000 times faster" than Smith-Waterman-Gotoh, with a match score of +1 and a mismatch score of -3. It also claims that, for gapless alignment (match score = +1, mismatch score = -3, gap score = -infinity), the time requirement is O(m * n^0.628).
Standard data structures for suffix trees take a lot of memory, making genomic applications difficult.
A suffix array holds the suffixes of the text in alphabetical order. If we take the suffixes in alphabetical order, the "suffix merging" described above will work efficiently. For example:
Text: BANANA$
Suffix array: 6,5,3,1,0,4,2 (the positions of the suffixes $, A$, ANA$, ANANA$, BANANA$, NA$, NANA$, in alphabetical order)
Memory requirement: 4n bytes (assuming each number requires 4 bytes). This is less than for traditional suffix tree data-structures, but it is still a bit high.
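Constructing a suffix array naively is a one-liner (fine for toy examples; real implementations use O(n) construction algorithms):

```python
def suffix_array(text):
    """Naive construction: sort suffix start positions by suffix text.
    O(n^2 log n) worst case, but simple and correct."""
    # '$' sorts before 'A'..'Z' in ASCII, as a sentinel should.
    return sorted(range(len(text)), key=lambda i: text[i:])

suffix_array("BANANA$")  # -> [6, 5, 3, 1, 0, 4, 2]
```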
A Burrows-Wheeler transform (BWT) is closely related to a suffix array. Instead of using numbers, it uses the preceding letter of each suffix. For example:
Text: BANANA$
BWT: ANNB$AA (for the suffix starting at position 0, the "preceding letter" wraps around to the last letter, $)
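Given the suffix array, the BWT falls out directly (sketch; note the wrap-around for the suffix at position 0):

```python
def bwt(text):
    """Burrows-Wheeler transform via the suffix array: for each suffix
    (in alphabetical order), take the letter that precedes it."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return "".join(text[i - 1] for i in sa)  # text[-1] is the wrap-around

bwt("BANANA$")  # -> "ANNB$AA"
```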
How can we use the BWT? It's a bit complicated. Suppose we know that suffixes starting with gcca are at positions [i..j) within the suffix array. As explained below, we can deduce the positions [p..q) where suffixes starting with Zgcca are in the suffix array. (Z = one letter.)
If p < q, then Zgcca exists in the text.
If p = q, then Zgcca does not exist in the text.
Since this method goes backwards (gcca -> Zgcca), we make a Burrows-Wheeler transform of the reversed sequence. Using this method, we can get all the suffixes that exist in the text in alphabetical order (a -> aa, aa -> aaa, etc.)
Method detail: How can we deduce [p..q) from [i..j) ?
p = C(Z) + Occ(Z, i)
q = C(Z) + Occ(Z, j)
C(Z) = number of letters in the text alphabetically smaller than Z.
Occ(Z, i) = occurrences of Z in [0..i) of the BWT.
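The whole backward-search step can be sketched as follows (naive version: C and Occ are computed by scanning the BWT each time; since the BWT is a permutation of the text, counting letters smaller than Z in the BWT works for C):

```python
def backward_search(bwt, pattern):
    """Suffix-array interval [p, q) of suffixes starting with `pattern`,
    computed from the BWT alone."""
    def C(z):       # letters in the text alphabetically smaller than z
        return sum(1 for c in bwt if c < z)
    def occ(z, i):  # occurrences of z in bwt[0..i)
        return bwt[:i].count(z)
    p, q = 0, len(bwt)
    for z in reversed(pattern):  # extend gcca -> Zgcca, one letter at a time
        p = C(z) + occ(z, p)
        q = C(z) + occ(z, q)
        if p >= q:
            return p, p          # pattern does not occur in the text
    return p, q
```

For example, with the BWT "ANNB$AA" of BANANA$, searching for "ANA" gives the interval [2, 4): the two suffixes ANA$ and ANANA$, i.e. "ANA" occurs twice.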
C(Z) can be pre-calculated.
Occ(Z, i):
If we precalculated it for all i, we would need lots of memory!
So, we precalculate it for i = multiples of 128 (for example).
When i is not a multiple of 128, we start from the precalculated value at the nearest multiple of 128 below i, and count the remaining occurrences of Z in the BWT by scanning (at most 127 letters).
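A sketch of the checkpointing idea (class name and block size B are illustrative, not from the paper):

```python
class OccIndex:
    """Occ(Z, i) with checkpoints every B positions: store running counts
    at multiples of B, then scan at most B-1 BWT letters to finish."""
    def __init__(self, bwt, B=128):
        self.bwt, self.B = bwt, B
        self.checkpoints = [{}]          # counts of each letter in bwt[0..kB)
        counts = {}
        for i, c in enumerate(bwt, 1):
            counts[c] = counts.get(c, 0) + 1
            if i % B == 0:
                self.checkpoints.append(dict(counts))

    def occ(self, z, i):
        cp = i // self.B                 # nearest checkpoint at or below i
        n = self.checkpoints[cp].get(z, 0)
        return n + self.bwt[cp * self.B : i].count(z)
```

This trades a small, tunable amount of scanning time for a large memory saving versus storing Occ(Z, i) for every i.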
Memory requirement: 2n bits (assuming each letter requires 2 bits). Also, some memory for C(Z) and Occ(Z, i). Note: we do not need the suffix array, and we do not even need the original sequence!
This BWT algorithm is attributed to: P Ferragina & G Manzini, "Opportunistic data structures with applications", FOCS (2000) 390-398.
Is it important to use less memory (e.g. Burrows-Wheeler transform), or will computer memories get much bigger in the near future?
Can we get an even faster algorithm by making suffix trees of both sequences, not just one sequence?
Can this approach help with other slow algorithms (e.g. multiple alignment, motif discovery, RNA algorithms, HMMs)? (It seems promising for spliced alignment.)