Lempel-Ziv (LZ77/LZSS) Coding

Lempel-Ziv (LZ77/LZSS) Coding


Redundancy of symbols oftentimes pervade a data source, and typically, the redundancy is within a local block of the source. For example, the word “compress” might be seen a couple of times in a particular block and never to be encountered again for a large number of blocks. We can therefore store the “most  recently seen” symbols in a “history” buffer or array, in anticipation that they may occur again very soon in the encoding process. This is the idea behind the algorithm presented by Jacob Ziv and Abraham Lempel in 1977, which became known as LZ77 coding [Ziv and Lempel 1977]. An improved implementation (LZSS) was later described by Storer and Szymanski in 1982.

The history buffer is of definite length, and being so, it seems to “slide” through the file as the algorithm runs. This buffer is thus called a “sliding-window” and the technique a “sliding-window method.” Moreover, a “look-ahead” buffer is created which is used to scan a match between the sliding window.  If there is a match, a pair of codes is transmitted which indicates the string's position (or offset) in the window as well as the length of the match. Thus, the LZ77 algorithm achieves compression by replacing the longest-matching string with a <position, length> code pair.

The pseudo-code for LZ77 coding as described in the comp.compression newsgroup's Data Compression FAQ is as follows:


while ( lookAheadBuffer not empty )

{

get a pointer(position, match) to the longest match

in the window for the lookahead buffer;

if ( length > MINIMUM_MATCH_LENGTH ) {

output a (position, length) pair;

shift the window length characters along;

else {

output the first character in the look ahead buffer;

shift the window 1 character along;

}

}


Observe that the LZ77 algorithm continuously performs a search for a longest match in the window. In terms of computing speed, LZ77 encoding is thus not very efficient due to this extensive pattern matching. A brute-force search is clearly not ideal; employing a faster data structure such as a linked list or a hash table is the usual practice [Williams 1991]. On the other hand, decoding in LZ77 is extremely fast because the decoder can simply access string items in the window using a position code. The compressor and decompressor speeds are therefore non-symmetric.


Length and Position Codes

The sliding window should be large enough to scan a match between local blocks, and the maximum match length is also controlled to minimize the bit codes. A typical size for the window is 4096 bytes which is perfect for a 12-bit window <position> code {POS}. The bit-length of the match could be just 4 bits for 16 possible codes (0..15), zero to indicate that there are no matching characters and the remaining codes (1..15) indicate a match and the corresponding length {LEN}.  All in all, the length of an output code is 16 bits (i.e., {POS = 12} + {LEN = 4} = 16) or two bytes, which suggests that if there is no matching character, we actually encode more bits for just a single character.

One way to overcome this problem when a mismatch occurs is to first output the match length instead of the window position: {LEN + ?}. In case a mismatch occurs, the match length (with a value of 0) is decoded first and you then can just output the actual or literal byte, which is only 8 bits in size. This encoding requires a total code of only 12 bits in size (i.e., {LEN = 4} + {BYTE = 8} = 12), at least saving 4 bits. Whatever the size of the window chosen, outputting first the match length will improve the compression.

LZSS coding. Notice that when there are no matching characters (hence transmitting the match length first plus the exact byte), we are actually sending more zero bits, i.e., code “0000” as output. Storer and Szymanski (1982) greatly improved the LZ77 algorithm by suggesting that we can just output a single prefix bit, bit = 0, to denote that an exact byte follows. The code for a literal byte is  thus shortened to 9 bits (i.e., 1 + 8 = 9 bits), saving three more bits.

Due to this arrangement, we also have to output another prefix bit (bit = 1) when there is a longer match and that what follows are the length and position codes (1 + 4 + 12 = 17 bits). Since 17 bits is greater than 2 bytes, we can limit the position codes to just 11 or 10 bits.

Other LZSS (LZ77) compression schemes accept a minimum match length to avoid transmitting length codes equal to (or more than) 16 bits for a match length of two. The LZRW1 algorithm of Ross N. Williams, for example, uses a minimum match length of three, with a maximum match length of 16 [Williams 1991].  LZRW1 was later modified to “LZRW1-a” with a maximum match length of 18, using the <length> code more effectively. 

Most LZ77/LZSS coding variants indeed operate with a look-ahead buffer size, say L, that is limited by the length code size k to specify the length of the transmitted string.  That is,  L =  2k + m - 1, where m  is chosen as a predetermined (i.e., constant) minimum match length >= 2. The code alphabet B = {0,1,...,2k-1} thus corresponds to the match length set M = {m,m+1,...,L}. In theory, the range of the match length can be expanded without increasing the bitsize k of the emitted length code, but with added computing time for the decoder. By treating the 2nd highest match length (i.e., the “partially matched” string) as the minimum match length m, the maximum match length L is effectively increased. Since m is dynamically varying for all “longest-string” matches, L is therefore also unbounded (see LZGT3A.ZIP below).

Results on the Calgary Corpus show that this simple modification to the LZ77/LZSS  algorithm provides better compression than the "differential" Ziv-Lempel variant (diffLZ) of Fenwick  [1995] which runs  the displacements through an arithmetic coder,  and the LZ coders (LZ77-PM and LZW-PM) of Hoang, Long and Vitter [1999] which use "multiple dictionary selection with PPM-style partial matching." This implicit minimum match length coding is likewise competitive  with LZFG-PM, and the LZ3VL algorithm of Fenwick [1993], achieving compression while still being "pure" LZ (i.e., using the same  sliding-window data structure and LZ77 pattern-matching algorithm), without  running the offset and match length codes through a second-stage  encoder.

Further, the  new method is more resilient to coding or transmission errors. Without error-correction schemes both the traditional and the new LZ77/LZSS method cannot recover the whole string given errors in the <offset> code. On the other hand, if the offset code is perfectly encoded and the <length> code contains errors, the new coding method can recover a larger  part of the string (i.e., of length m >= 2), while conventional LZ77 algorithms may produce ambiguous  strings.

Results also show that  encoding  the match lengths using "unary" codes is very effective for the LZ77/LZSS class of algorithms. Compression is further improved by "folding" the unary codes.  This  solves the LZ77 problem of  encoding the whole  matching string while not assigning longer codes for the lesser matches. The codes are thus "dynamic" according to the size of the string.


Circular Access

A common technique to speed up LZ77 or LZSS compression is to write the symbols in the sliding window using a modular addition operation. Instead of shifting the window LEN characters when a match is found (which is too slow), we just insert to the window position mod N whenever we need to write the new characters, where N is the size of the window. This way, the window acts as a circular array, without ever shifting the characters.

This circular access technique already provides a great speed-up to the basic LZ77 algorithm, and by creating a better search function, the program can perform more efficiently. LZ77 offers excellent compression for binary files, prompting us to improve its performance even more.


Hybrid LZ77 Implementations

In general, data compression programs increase speed as more bytes are compressed (i.e., less I/O is performed) since there are fewer input bytes left to process as well as fewer codewords to write to the output file.

One particular technique in exceeding the limits of the basic LZ77 algorithm is to combine various encoding methods. To improve compression further, output from the LZ77 (LZSS) algorithm is usually passed through a Huffman coder (e.g., in ZIP compression) or arithmetic coding.  With the advent of its “deflate” algorithm ZIP had become and is still the dominant compression system of choice for general purposes.


***


References

1.] J. Ziv and A. Lempel, “A Universal Algorithm for Sequential Data Compression,” IEEE Transactions on Information Theory, May 1977, pp.337-343.

2.] R. N. Williams, “An Extremely Fast Ziv-Lempel Data Compression Algorithm,” Proceedings of the IEEE Data Compression Conference, IEEE Computer Society Press, April 1991, pp. 362-371.


Source Code: 

LZSS.ZIP - includes basic LZ77 (LZSS) coding implementations using linked-list search, as well as a 3-byte hash function with a Boyer-Moore search variant plus second-stage adaptive Huffman coding (Algorithm FGK). If you're a beginner new to LZ77/LZSS, these are the programs to study but you need to complete the "masking" of the lzh and lzhh hash functions in general.


LZGT3A.ZIP - contains simple LZ77/LZSS programs which illustrate that there is another "information" in the transmitted window <position> code, aside from being a mere pointer to the location of the longest string in the sliding window. That "information" refers to the partially matched strings, and clearly demonstrates that LZ77's output strings are of "unbounded" length. In theory, the method improves compression performance of "all LZ77/LZSS algorithms that use a sliding window and output a <position, length> pair of codes." However, decompression speed (the very  hallmark of LZ77 coding) suffers to a large degree due to additional pattern-matching which makes the algorithm a mere theoretical (and empirical) curiosity.


The above archive contains the LZGT1.C program. The program was later modified  into LZGT1A.C, which reserves the last <length> code value  for a *perfect* match. Tests on the "BWTS-ed" Calgary  Corpus files show LZGT1A's better performance for *highly-redundant* files like BWT output streams. By unary coding and "folding" the suffix string, we are able to encode the whole string completely. These basic programs are meant to illustrate LZ77's asymptotic performance, and that we can actually encode longer strings by creating larger look-ahead buffers while maintaining, or being able to shorten, the <length> codes.


LZUF22, LZUF55, LZHHF and LZUF622 and LZUF624 are improved versions of my basic LZ77/LZSS compressors. LZUF55 starts with a small window size to limit bitsize of transmitted offsets but does not achieve better compression than LZUF22 as intended. So in LZ77, big window size is king. LZUF62 improves on LZUF22 with settable sliding window size at execution time. LZHHF uses Golomb codes on the match lengths and dynamic Huffman codes on the literals. LZUF622 and LZUF624 are single file coder/decoders.