Burrows-Wheeler Transform

Advanced Variants and Improvements

The fact that the output of BWT is replete with redundant bytes suggests that it is suitable for a move-to-front (MTF) encoding. In their paper [1994], Burrows and Wheeler tested and recommended the use of MTF for their compression algorithm (BWCA). After MTF, arithmetic coding is the best entropy coder for the job of actual compression.

Deorowicz [2001] proposed a Weighted Frequency Count (WFC) scheme to replace MTF as a “second-stage” operation on BWT’s output. Deorowicz showed that WFC, with the right parameters, can perform MTF coding, i.e., MTF coding is simply a subset of WFC. He cited previous works on this line such as two different forms of move-to-front coding (MTF-1 by Balkenhol, Kurtz, and Shtarkov; and MTF-2 by Balkenhol and Shtarkov). MTF-1 moves the symbol to the first position in the list (index 0) only when it is currently in the second position (index 1). If the symbol is in higher positions (index > 1), it is moved to the second position. MTF-2, on the other hand, moves the symbol from the second position “only when the previous transformed symbol was at the first position.”

While others investigated second-stage algorithms for BWCA, Chapin and Tate [1998] focused on the actual sorting operation of BWT itself. They showed that re-ordering the characters of the alphabet provides superior compression than using the original ordering of the characters (i.e., “abcdefg…xyz”). One particularly better ordering of the characters, they found out, is the alphabet “aeioubcdgfhrlsmnpqjktwvxyz”, with the vowels grouped together.

The program szip, by Michael Schindler, is one of the best compressors that employ block sorting. Working on very large blocks for transformation, szip uses Schindler’s fast “block sorting” method (directly derived from BWT) that suitably preserves local characteristics of transformed strings. The BWT is better in compression, but this variant transform is several times faster [Schindler 1997]. The compression program NanoZip of Sami Runsas likewise use the BWT to top benchmark tests.

Recently, David Scott [2007] has developed a variant named BWTS which, for an input block or string, does not require an index for output, and with a different Last Column than the ordinary BWT. In traditional BWT, the Last column needs an index since the Last Column may refer to different rotations of the original string. In BWTS, there is only one distinct output permutation that corresponds to the original input string, so an index is actually unnecessary.

It is clear that BWTS should improve compression without the block indices. Indeed, the Calgary corpus was reported to have been compressed (using BWTS) to about 800K, clearly nearing the results achieved by the best PPM compressors—and outperforming the best Ziv-Lempel coders.

Though the BWTS method currently requires intensive computations to decode the original block (the original BWT decoding is extremely fast), it is fast enough for general purposes. The idea that block sorting would never need an index is so unprecedented in BWT research that we anticipate it may pave the way for further refinements to BWT—and BWTS.

The BWT is indeed a remarkable and most amazing technique of data transformation. The fact that a data block can be sorted and unsorted using just a single index value leaves the programmer wondering about the many more incoming discoveries and inventions. The researchers and computer scientists in the field will surely find a way to extend the current algorithms, all for the purpose of data compression.

Recent studies and research show that the BWT is also suitable to search operations (on strings or raw byte sequences) while using less storage space according to the theoretical compression performed on the data string. That is, you can do searching while the data block is BWT-sorted. Willets [2003] reports that the search is fast in detecting arbitrary substrings as well as in counting the occurrences of the strings. Such ability to find string patterns in large input text also suggests that less memory is consumed when the block is compressed (maybe using RLE) while accommodating the search functions. These extensions to BWT certainly prove that the transform is not only naturally-inclined to the field of data compression, but also to faster and convenient string-processing.

One thing to note is that, in traditional compression algorithms, the “context” of a particular symbol is always viewed as the previous byte or groups of bytes. In contrast, the BWT context of the byte in the Last Column that we are outputting is comprised of all the other forward-context bytes! It is as if we have already transmitted the symbol even before knowing its incoming context…

However, that idea is misleading since you are really “getting” a larger block of symbols from the data source: you thus have instantly acquired the necessary knowledge or “information” about the forward context for that particular symbol when you read the entire input block. As shown by Scott [2007], an output index is not needed; there is enough “information” in the output block to recover the original string.

This simple observation leads us to speculate that there are many more exciting things to discover when we view the data as a whole, and not look at each information symbol as a different and separate state in the compression process.

References

1.] M. Burrows and D. J. Wheeler, “A Block-sorting Lossless Data Compression Algorithm,” Digital Systems Research Center (SRC) Research Report 124 , May 10, 1994.

2.] M. Schindler, “A Fast Block-sorting Algorithm for lossless Data Compression,” Proceedings of the IEEE Data Compression Conference, 1997; 469.

3.] B. Chapin and S. R. Tate, “Higher Compression from the Burrows-Wheeler Transform by Modified Sorting,” 1998 Data Compression Conference, p. 532.

4.] S. Deorowicz, “Second-step algorithms in the Burrows-Wheeler compression algorithm,” (2001 preprint published in) Software—Practice and Experience, 2002; 32(2):99-111, Copyright © 2002, John Wiley and Sons, Ltd.

5.] K. Willets, “Full-Text Searching and the Burrows-Wheeler Transform,” Dr. Dobb’s Journal, December 2003, pp. 48-52.

6.] David A. Scott’s site advocates “bijective” (or one-to-one) compression for secure file encryption.

Source Code:

BWT.ZIP - shows a fundamental implementation of the Burrows-Wheeler Transform. The traditional qsort() is chosen as the sorting function.

Mark Nelson has written an excellent BWT article in the September 1996 issue of Dr. Dobbs Journal (DDJ). The article also points to an archive (named bwtcode.zip) containing source code for various programs which illustrate BWT's performance against Ziv-Lempel variants like PKZIP compression. The archive includes basic implementations for RLE, MTF, arithmetic coding, and of course, BWT.

Michael Dipperstein has a BWT implementation which uses a radix-2 sorting function for the first two characters of the strings, and then applying quick sort to the remaining blocks, an optimization originally suggested by Burrows and Wheeler.

David Scott modified Nelson's code for his own BWT variant (BWTS) which does not require an output index. The output block of BWTS seems to be more-compressible (i.e., more highly-redundant) than the traditional BWT.

Page updated

Google Sites

Report abuse