There exists a great relation between sorting and data compression. You could sort the data file; record the sorted bytes with just their corresponding frequency counts; and somehow restore back the file to its original state. The problem with this idea is how to “unsort” the file. Thus, this is the goal of the programmer: to find a way to represent the sorting task into a minimum number of bits which, ideally, must be smaller than the original file size. When this happens, data compression is achieved. The bit encoding done in the sorting phase would then be needed to decompress the file to its exact original form.
This area has not been studied much, let us presume, that no algorithm has surfaced to solve this problem. Indeed, algorithms books only teach us how to sort an array of integers and never shown us how to unsort the array. Review now the basic sorting algorithms and study how you may “record” each sorting task in the fewest number of bits as possible. Could you use selection sort to achieve compression??? How about shellsort? Or quicksort?
Clearly, total sorting is hard to represent with a smaller number of bits than the original data source but a technique was disclosed that did not promise complete sorting but only a transformation of the source. The output of this transform is not smaller than the original data size; in fact, it is always actually larger in size (by only one integer to signal a block position). However, this transform can sort the file in such a way that the output is highly amenable for a more effective compression. Interestingly, the transform produces many redundant bytes in the output. When the transformation of the data source (usually a fixed-size data block) is finished, a suitable compression algorithm such as RLE, Huffman, or arithmetic coding is then used to compress the output.
This clever transform is called the Burrows-Wheeler Transform (BWT) named after its inventors, Michael Burrows and David Wheeler, who was the one who originally discovered the transform in 1983. They published an article about a data compression algorithm (BWCA) which introduced the transform in 1994, so the transform became known as BWT.
The BWT offered new paths to compression and is best known to be the first block-sorting transform that is reversible allowing data compression. The BZIP2 compression program (by Julian Seward) of the GNU file utilities uses the BWT to achieve better compression than traditional programs that compress files to a ZIP file format.
How does BWT work? To answer that, let us consider the 15-byte data source string “COMPRESSIONCODE”. Figure 1 depicts this string as a block:
Figure 1. The Compression Data Array
The next task is to sort these strings, thereby creating a new sorted matrix (M'). After sorting the strings, the BWT then quickly emits as output the Last column of the sorted matrix. The sorted matrix is shown in Figure 3. We add a single column to show the row indices of the strings, and the First and Last columns are denoted in the figure by the “F” and “L” letters, respectively. In this data example, the Last column is the string “NEODRSOOCCIMPSE”.
Figure 3. The “Sorted” Matrix M'
When the Last column of bytes is transmitted, an “index” is also emitted to denote the original string’s position in the sorted matrix. The original string (“COMPRESSIONCODE”) can be found in row 1 (index = 1), so this value is also transmitted as output.
Amazingly, the Last column as well as the index of the original string are the only output codes of BWT. What is even more amazing is the fact that the Last column is now teeming with consecutive bytes, making it ideal for compression. For very large blocks of data, this last column will be replete with redundant (identical) bytes, hence transforming the original data array into a new string highly amenable for compression.
It is easy to understand the sorting process for the individual strings, but by just looking at the now scrambled Last column, it seems that it is impossible to recover the original string from this array. However, there is indeed a way to “unsort” the Last column given just the index of the original string.The index of the original string certainly refers to an individual byte in the last column, because as you know, the last column is the only data available to the decoder. This individual byte is the byte ‘E’, as found in index 1 of the Last Column. As you shall soon learn, given just this byte as the starting byte, we can recover the whole string completely. The logic of BWT decoding, it turns out, also involves a sorting algorithm.