Lempel-Ziv-Welch (LZW) Compression

In 1978, J. Ziv and A. Lempel introduced the idea of sequentially gathering “phrases” from the input symbols [Ziv and Lempel 1978]. Because the algorithm simply accumulates strings of two or more characters, compression can run faster than in LZ77, which must perform an extensive search for the longest matching string. Instead of a sliding-window buffer, a “table” of strings is maintained. As the algorithm runs, new phrases are added to the string table and assigned codewords. When a string is encountered a second time, its codeword (its index in the table) is transmitted as the output code.

In 1984, T. A. Welch presented a highly improved implementation of the LZ78 algorithm, naming it LZW [Welch 1984]. The Lempel-Ziv-Welch algorithm is designed so that the decoder can duplicate the string table while performing decompression. Therefore, no dictionary needs to be prepared in advance or attached to the compressed file, since the decoder perfectly reconstructs the string table created by the compressor. This is LZW's most remarkable property.

 

LZW Compression

The LZW coding algorithm keeps gathering strings and testing whether a particular string is already in the string table; if not, the new string is inserted into the table, and the code for the previous string is transmitted as LZW output. The lzwcode is usually 12 bits wide, allowing 4096 possible codes.

Since there may be no match of even a two-byte string, the output code must sometimes represent a single byte. The first 256 codes (0..255) are therefore reserved for the individual bytes; a non-matching single byte is thus transmitted in 12 bits instead of the usual 8. LZW is effective enough, however, that later output codes represent long strings, enabling better compression.

The pseudo-code for LZW compression is as follows, where <<pattern, K>> is the concatenation of the string pattern and the character K:


pattern = get input character
while ( not end-of-file ) {
    K = get input character
    if ( <<pattern, K>> is NOT in the string table ) {
        output the code for pattern;
        add <<pattern, K>> to the string table;
        pattern = K;
    }
    else { pattern = <<pattern, K>> }
}
output the code for pattern;
output EOF_CODE;

Figure 1.   The LZW Compression Process

 

Let us now employ the above LZW compression algorithm to encode the following data source:


Source String:

YESNOHUFTHUFTHUFYHUFYHUFYHK

 

The first pattern would be the letter ‘Y’; entering the while loop, we get the letter ‘E’. Since “YE” is not in the string table, we output the code for pattern, which at this point is still the single letter ‘Y’. We then add the pattern “YE” to the string table (lzwcode = 257) and assign the character ‘E’ as the “new” pattern. The next string pattern “ES” gets code 258, and so on. We start with code 257 because the end-of-stream code (EOF_CODE) is assigned the first available code (i.e., code 256) to allow “variable-length” codes. LZW compression then proceeds as shown in the table below:

As the table shows, the LZW algorithm compresses the entire message by transmitting codes for strings. The last column lists the output codes transmitted by the compression program; these output codes are the only data available to the decompression program.
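The table's contents can be reproduced programmatically. The following Python sketch implements the compression loop of Figure 1 under the conventions of the text (codes 0..255 are single bytes, EOF_CODE = 256, new strings defined from 257 up); the function name and the dictionary-based string table are illustrative choices, not from any particular implementation.

```python
def lzw_compress(data):
    """LZW compression per Figure 1: codes 0..255 are single bytes,
    256 is EOF_CODE, and new strings are assigned codes from 257 up."""
    EOF_CODE = 256
    table = {chr(i): i for i in range(256)}   # pre-load the single bytes
    next_code = 257
    output = []
    pattern = data[0]
    for K in data[1:]:
        if pattern + K in table:              # string already known: grow it
            pattern += K
        else:                                 # new string: emit code, define it
            output.append(table[pattern])
            table[pattern + K] = next_code
            next_code += 1
            pattern = K
    output.append(table[pattern])             # flush the final pattern
    output.append(EOF_CODE)
    return output

codes = lzw_compress("YESNOHUFTHUFTHUFYHUFYHUFYHK")
print(codes)
# [89, 69, 83, 78, 79, 72, 85, 70, 84, 262, 264, 266, 89, 268, 270, 75, 256]
```

The 27 input characters compress to 17 output codes (including EOF_CODE); note that code 270 (“HUFYH”) is transmitted immediately after the compressor defines it, which is the special case discussed below.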

 

LZW Decompression

The output stream, as shown in the last column of the table, is the only “data” given to the decompression program. The LZW decoder must be able to reconstruct the string table but, as the last column shows, the decompression algorithm is always one code behind the compression algorithm. For example, after receiving the code for ‘Y’, the decompression program must first receive the second code (for ‘E’) before it can define the string “YE” in the string table, with a code of 257. This process goes on until all the codes are received. Thus, there can be a situation in which the decompression algorithm receives a code that it has not yet defined. The last transmitted string code in the table (code 270) precisely demonstrates this instance in a decompression process.

The question mark (?) in the table illustrates the special case that arises when the decoder encounters a code which is not yet defined (in this case, code 270). This happens when the compression program has just defined a string and immediately sees that same string again, thereby immediately transmitting its code. Because the decompression algorithm is always one code behind, it has not yet defined the code it just received and so cannot directly translate this new code into the correct string. However, this situation only occurs when a string “PATTERN” (with “PATTERN” already defined previously) appears consecutively in the data source and a character equal to the first letter of the string (‘P’) immediately follows the second occurrence. The decompression algorithm is specifically designed to handle this situation:


PREV_CODE = get input code
K = string(PREV_CODE)
output K
while ( not end-of-file ) {
    CURR_CODE = get input code;
    if ( CURR_CODE is the EOF_CODE ) break;
    if ( CURR_CODE is NOT in the string table ) {
        pattern = string(PREV_CODE);
        pattern = <<pattern, K>>;
    }
    else { pattern = string(CURR_CODE) }
    output pattern;
    K = first character of pattern;
    add <<string(PREV_CODE), K>> to the string table;
    PREV_CODE = CURR_CODE;
}

Figure 3.   The LZW Decompression Algorithm
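The decoder of Figure 3 can be sketched in Python as follows, under the same conventions as the text (codes 0..255 are literal bytes, 256 is EOF_CODE); the names are illustrative. The undefined-code branch builds the missing string as the previous string plus its first character, exactly as the pseudo-code specifies.

```python
def lzw_decompress(codes):
    """LZW decompression per Figure 3, rebuilding the string table
    while staying one code behind the compressor."""
    EOF_CODE = 256
    table = {i: chr(i) for i in range(256)}
    next_code = 257
    prev_code = codes[0]
    K = table[prev_code]                  # the first code is always a single byte
    output = [K]
    for curr_code in codes[1:]:
        if curr_code == EOF_CODE:
            break
        if curr_code in table:
            pattern = table[curr_code]
        else:                             # special case: code not yet defined
            pattern = table[prev_code] + K
        output.append(pattern)
        K = pattern[0]
        table[next_code] = table[prev_code] + K
        next_code += 1
        prev_code = curr_code
    return "".join(output)

# The output codes from the compression example (the table's last column)
# decode back to the original source string:
codes = [89, 69, 83, 78, 79, 72, 85, 70, 84, 262, 264, 266, 89, 268, 270, 75, 256]
print(lzw_decompress(codes))   # YESNOHUFTHUFTHUFYHUFYHUFYHK
```

When the decoder reaches code 270, the table only goes up to 269; the special-case branch translates it as “HUFY” plus ‘H’, then defines 270 itself, keeping the two tables in lock-step.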


In the source message example, the string segment “HUFYHUFYHUFYH” triggers an undefined code (notice the three superimposed “HUFYH” strings). This happens when the string “HUFY” is first defined as code 268 (i.e., restarting the search at ‘Y’, and defining the string “YH” as code 269), and is transmitted as output when the second “HUFYH” string is encountered by the compression algorithm. Then the algorithm defines code 270 for the pattern “HUFYH”, and starts the search at the last ‘H’ of this newly defined string (the second “HUFYH” string). Thus, it will immediately encounter the third string “HUFYH” and transmit its “newly-defined” code, which is 270.

Complex Decoding. As shown in the table, when the decompression algorithm receives code 268, it first defines code 269. Immediately afterwards, however, it receives code 270, which it has not yet entered into the string table. The decompression algorithm handles this by first translating the previous code it received (i.e., code 268 = "HUFY") and then tacking the first character of that translated string (i.e., the character 'H') onto it. Next, the decompressor writes the newly translated string to the output file and enters it into the string table. Thus, the unknown string is always the previous string plus its first character.


Improving LZW Codes

One of the most important considerations in designing compression programs is optimizing the lengths of the transmitted codes, since these ultimately dictate the compression ratio of the file. The smaller the codewords, the smaller the resulting compressed file.

In line with this, we can improve LZW by assigning shorter codelengths at the beginning of the compression process [Welch 1984]. Since the lzwcode count begins at code 256, we can start with 9-bit codes (code 256 already requires 9 bits, and this codelength accommodates integer codes 0..511), adding more bits as necessary. Specifically, once code 512 has been defined, the bit length should grow to 10 bits to properly output the new codes (512..1023).
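As an illustration (a sketch, not taken from any particular implementation), the following Python function packs a stream of LZW codes into bytes, widening the code length from 9 bits as the table grows. For simplicity it assumes the encoder defines one new string per emitted code, so both sides can track the width in lock-step.

```python
def pack_codes(codes):
    """Pack LZW codes into bytes with a growing bit width (9 bits and up).
    Assumes one new table entry is defined per emitted code."""
    buf, nbits = 0, 0
    out = bytearray()
    next_code = 257                  # first free string-table slot
    for c in codes:
        # Width must cover the highest code defined so far (and EOF_CODE=256),
        # hence at least 9 bits; it becomes 10 once code 512 is defined, etc.
        width = max(9, (next_code - 1).bit_length())
        buf |= c << nbits            # append the code, LSB first
        nbits += width
        while nbits >= 8:            # flush whole bytes
            out.append(buf & 0xFF)
            buf >>= 8
            nbits -= 8
        next_code += 1               # the encoder defined one new string
    if nbits:                        # flush the final partial byte
        out.append(buf & 0xFF)
    return bytes(out)

print(pack_codes([65]))        # one 9-bit code occupies two bytes
```

A decoder would mirror the same width schedule, bumping its read width at exactly the same code counts.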

The “variable-length” LZW coding method does require an “increase in logic complexity” in the encoder and decoder [Welch 1984]. Fortunately, the technique improves compression by about 2 to 3%, and is definitely a welcome improvement.

An “accelerated loading” of strings into the dictionary is also possible, with marked improvement in compression [Horspool 1991]. The effect on computing speed is minimal compared to the benefit of the added compression.

For LZW coding, a hash data structure is best in terms of both speed and memory consumption. The Unix compress program (also known as LZC) uses hashing to implement LZW. The simple hash function in LZC (a shift followed by an XOR, i.e., SHL + XOR) is very good, and the design of the probing function also teaches the reader a great deal about hashing in general: the probing function of a hashing scheme should itself be a function of the previously computed hash value, to take advantage of the computations already performed by the first hash function.


***


References

1. J. Ziv and A. Lempel, “Compression of Individual Sequences via Variable-Rate Coding,” IEEE Transactions on Information Theory, IT-24 (1978), pp. 530-536.

2. T. A. Welch, “A Technique for High-Performance Data Compression,” IEEE Computer, June 1984.

3. R. N. Horspool, “Improving LZW,” Proceedings of the Data Compression Conference (DCC ’91), IEEE Computer Society Press, April 1991, pp. 332-341.

 

Source Code:

LZW - shows various LZW implementations via (1) a ternary search tree (TST) data structure; (2) Unix's LZC hashing (lzwh.c); and (3) a binary search tree or BST (lzwgt.c). Lzwh and lzwgt are better and faster than compress 4.3d. Lzwg is new and features an optional bitsize (default = 16) for the dictionary string table size CODE_MAX (default = 65536). Lzwg is standard LZW but resets the dictionary after some CODE_MAX + 4K codes have been transmitted. Lzwhc is a single-file LZW codec via hashing, faster than lzwg for larger dictionaries and bigger files. If you want the option to reset the dictionary or not, use lzwz.

Mark Nelson wrote about LZW in a Dr. Dobb's Journal article as early as 1989. The source code, which is based on Unix's compress (LZC) program, is also available. Needless to say, I learned LZW from this article, a rare gem in the data compression literature.