The first task in the Burrows-Wheeler Transform, for a block of size N, is to create exactly N strings out of the block. This means that the individual strings will begin at each element of the block, and the block is to be seen as a circular array, with some strings going past the end of the block and restarting again at the beginning of the array. The initial data is thus a matrix M of the “cyclic shifts” or rotations of the original string (as shown in Figure 2):
The next task is to sort these strings, thereby creating a new sorted matrix (M'). After sorting the strings, the BWT then quickly emits as output the Last column of the sorted matrix. The sorted matrix is shown in Figure 3. We add a single column to show the row indices of the strings, and the First and Last columns are denoted in the figure by the “F” and “L” letters, respectively. In this data example, the Last column is the string “NEODRSOOCCIMPSE”.
When the Last column of bytes is transmitted, an “index” is also emitted to denote the original string’s position in the sorted matrix. The original string (“COMPRESSIONCODE”) can be found in row 1 (index = 1), so this value is also transmitted as output.
Amazingly, the Last column as well as the index of the original string are the only output codes of BWT. What is even more amazing is the fact that the Last column is now teeming with consecutive bytes, making it ideal for compression. For very large blocks of data, this last column will be replete with redundant (identical) bytes, hence transforming the original data array into a new string highly amenable for compression.
It is easy to understand the sorting process for the individual strings, but by just looking at the now scrambled Last column, it seems that it is impossible to recover the original string from this array. However, there is indeed a way to “unsort” the Last column given just the index of the original string.
The index of the original string certainly refers to an individual byte in the last column, because as you know, the last column is the only data available to the decoder. This individual byte is the byte ‘E’, as found in index 1 of the Last Column. As you shall soon learn, given just this byte as the starting byte, we can recover the whole string completely. The logic of BWT decoding, it turns out, also involves a sorting algorithm.