The Data Compression Guide




  • Source coding

Prefix-free Coding

Dynamic Prefix-free Coding

Dictionary-based Compression

Data Transformation Methods

Research, Sites 


Dictionary-based Compression


Most of the early algorithms consider only symbol-by-symbol (or order-0) encoding. A dictionary-based compression algorithm, on the other hand, focuses on groups of symbols or “strings” instead of individual bytes, making it a variable-context (order-k) model encoding. These strings are stored in a lexicon or a dictionary, and assigned distinct and consecutive integer codes. The dictionary can be one very-large array of symbols or an actual table of distinct strings.

Lempel-Ziv coding. The first of these dictionary-based algorithms generally emerged from the work of Jacob Ziv and Abraham Lempel. They created a superior compression algorithm which is now called LZ77, based upon their initials and the year they created it [Ziv and Lempel 1977]. This is the next major advance in data compression history after 1952’s Huffman coding. Indeed, 25 years had to pass after Huffman invented his technique just to arrive at this excellent “sliding-window” method. Decades later, LZ77 is still the main technique in the field of data compression since it is very effective for most kinds of files.

The LZ77 algorithm employs a fixed-length window buffer array which stores the actual bytes of the source file. Since this buffer cannot contain the whole file, it is imagined to “slide” through the file as the algorithm runs. The window is used to scan a match with an advanced “look-ahead” buffer (LAB) which holds the “pattern” or the next symbols to be encoded.

The window and  the look-ahead buffer  both appear—and slide—consecutively in the message. If a match is found, the index or position of the match in the window is emitted as well as the length of the match. Thus, two codes are used for output in this algorithm and extra care must be given in choosing the length of these codes.

Lempel-Ziv-Welch compression. A year later, LZ78 was invented and it was a great improvement to the LZ77 algorithm. The novel idea in the LZ78 algorithm is to abandon the sliding-window and instead create a dictionary of distinct strings while the algorithm runs. The popular implementation of LZ78 was invented by Welch [1984].  The algorithm is named “LZW”, adding Welch’s initial.

 The LZW algorithm is very powerful and fast.  A typical LZW implementation creates N dictionary strings, where N is a power of 2. Compression codes are assigned to each string as they are being encountered. Each time a string of symbols matches, the k-bit code is transmitted, i.e., k = log2(N).  Thus, LZW is a significant improvement in the Ziv-Lempel class of algorithms because it considers complete strings instead of just symbol-positions as in LZ77. More data compaction is possible if the input is buffered in blocks with a new dictionary employed for each block.