Burrows-Wheeler Transform

A Reversible Sorting Technique


The most important variable in BWT decoding is undoubtedly the row index of the original string in the sorted matrix done by the encoding algorithm. This index will jumpstart the decoding process, correctly recovering the original string. The original string is again shown here for your reference:

What is important to remember in the first place is that the original data to the encoder is the matrix of N individual strings, with their first letter as an actual element in the data block.

The main idea behind BWT is to recognize that the last and first bytes of a cyclically-shifted string are actually consecutive bytes in the original array.

Thus, if the shifted string is “IONCODECOMPRESS”, we can be certain that the last byte of the string is actually its previous byte in the original string. In this string example, the letter ‘S’ (the last byte of the string) is thus the previous byte to the string “IONCODECOMPRESS” in the original array. Conversely speaking, the letter ‘I’ (the first byte of the string “IONCODECOMPRESS”) must be the next byte in the original array (which we are endeavoring to recover):

Therefore, with this knowledge, given just the last byte of a string we can find the next byte as actually positioned in the original array. Since what we have for decoding is indeed an array of these last bytes, it follows that we should be able to recover the first bytes of the strings. Specifically, we can completely recreate the first column of M', without having to know the bytes in between the first and last columns.

As stated earlier we only have the last column and cannot hastily fill in the first column. However, given the nature of the BWT encoding algorithm, we know that the elements of the first column are completely sorted. This knowledge will help us create the first column.

How? As you can see in M', the First column and the Last column contain the same characters, being just different permutations of the original data string. Therefore, since they contain the same characters, we can completely reconstruct the first column by simply sorting the last column! Right. Figure 4 shows the result.

Thus, with the First column created, you can now recover the original string. Given only the row index which points to the original string in this matrix of sorted strings, we can correctly output the original bytes.

Since the row index for this example is 1, you simply output the first byte of the row—the byte ‘C’. What next? The subject of stable sorting now becomes important.

A sorting algorithm is stable if it also preserves the original order of “equal” key items in the unsorted array.Since this ‘C’ is the second ‘C’ in the first column, then—by stable sorting—it must also be the second ‘C’ in the last column! And why would you want to go to its original position (row 9) in the last column? Because, as we have learned, its corresponding byte in the first column is exactly the next byte in the “original” circular array:

This is the reason why the index that BWT outputs—row 1 in this example—exactly points to the original position of the last byte of the original string, which is the letter ‘E’. The row index will quickly reveal the first byte of the string.

 So you quickly jump to row 9 because this is where the second ‘C’ of the last column is found.By the same logic, you output the first byte of this row, which is the letter ‘O’. This letter must also be the second ‘O’ in the last column, so you go to row 6 of the last column, where the second ‘O’ is found. Then you output the first byte of this row 6, which is the letter ‘M’.

Look at the original string and you can see that the entry in the First column is indeed the “next” byte to decode: 

Our recovered string is now the string fragment “COM”. The only letter ‘M’ is found in row 11 of the last column, so you output the letter ‘P’ in the first column of this row:

This process continues until finally, you output the letter ‘E’ to finish the complete recovery of the original string: 

The letter ‘E’ is the first letter ‘E’ so we should process row 1 

but we will not do so because we have already completely recovered the whole string. We will know when to stop since we already have transmitted exactly 15 letters.

When these rows are stacked together, you can clearly read the original string (“COMPRESSIONCODE”) in the First column, from top to bottom:

Therefore, with just the Last column and a row index “indicating” the position of the original string in the sorted matrix of strings (M'), we can recover the whole string perfectly.



Next >>