The Run-Length Transform
Run-length encoding (or RLE) is very effective as a pre-processing stage in the compression process since most files naturally contain succeeding occurrences or “run” of the same byte. Consider the following data source with multiple runs of distinct bytes:
Source String:
WWWWaYYYYbZZZZZeAAAACCCCCCC
As you can see, the above input string is teeming with many runs of different bytes, namely bytes ‘W’, ‘Y’, ‘Z’, ‘A’, and ‘C’. Could we somehow find a method to translate the above source message into just a run of a single byte?
RLT Encoding. A technique which I call the run-length transform (or RLT) is able to convert the above data source into the following string:
NEW String:
W000aY000bZ0000eA000C000000
Now the source string is composed of runs of zeros instead of different bytes. The technique works by first transmitting the encoded byte, and then using just one “run byte” for all the succeeding runs of that byte.
Consequently, it is very useful for a string-gathering compression algorithm like LZW. For example, the original input source above will certainly prompt LZW to create five codes for the runs of bytes W, Y, Z, A, and C while the transformed data will only have a few “defined strings” since LZW is now only seeing the same run of the zero byte, using just one code for these runs of zero bytes. To LZW, the strings "YYY" and "AAA" are now equivalent, i. e., they point exactly to the same string "000".
This implies that LZW can use more of its precious codes for other strings. It is in fact observed to be true for raw image files (e.g., uncompressed BMP files) that contain more runs of colors. Using LZW to the transformed data yields better compression and is indeed found to be very effective for some highly-redundant image files.
RLT Decoding. The above method, however, is coupled with a simple problem: that of a zero/null byte immediately following a run of distinct bytes. For example, if the message string “WWWWW0” will be encoded by this simple method, the encoder will output the string “W00000”. The decoder will then translate this as the string “WWWWWW”, without properly decoding the last zero/null byte of the data source.
Fortunately, the solution to this is also simple: if a null byte is encountered after a run, then just output the encoded byte (‘W’) to indicate that what follows is indeed a real zero byte. Thus, the source string would be encoded as “W0000W”, and the decoder will be able to know that the last ‘W’ is indeed a true zero byte. (Notice that the output string, e.g., “W0000W”, is a palindrome.)
The RLT method is actually an order-0 version of MTF, i. e., single bytes are repeated as output. This new technique is a simpler version of MTF which avoids intensive list-processing operations.
The program that performs this run-encoding technique is named GTRLT.C. The “zero/null byte” can be any byte that you choose and is defined as RUN_CHAR_CODE (see MTF.ZIP). Notice that the GTRLT.C program is patterned after the RLE programs presented earlier, and only differs in the encode and decode functions.
Source Code:
MTF.ZIP - includes an implementation of RLT in C.