The Data Compression Guide




Prefix-free Coding

 

The most basic class of compression algorithms employs a symbol tree to create distinct codes for the symbols of the source message; for this reason, the symbol tree is also referred to as a code-tree. Using a tree is practical because a complete binary tree can represent all possible codes for the known symbols, with every symbol stored at a leaf node. For example, the Extended ASCII alphabet (of size 256) fits exactly in the leaves of one complete binary tree of depth 8.

Visualizing the code-tree offers immediate intuition: the codes need no separate "prefix" set of bits to distinguish them. The sequence of left and right edges along the path from the root down to a symbol's leaf, labelled 0 and 1 respectively, serves as that symbol's codeword, and because each leaf has exactly one such path, the codeword is unique. Codes in which no codeword is a prefix of any other codeword are called prefix-free codes, or simply prefix codes. Since no codeword is a prefix of another, every symbol is immediately decodable the moment the last bit of its codeword is read.
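The prefix-free property can be seen in a small sketch (the four codewords below are hypothetical, chosen by hand; any code-tree with these symbols at its leaves would serve):

```python
# A toy prefix-free code for a 4-symbol alphabet.  No codeword is a
# prefix of any other, so a bit stream can be cut into codewords
# greedily, with no lookahead.
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

def decode(bits, code):
    """Match bits against codewords; the first codeword that matches
    is the only possible one, because the code is prefix-free."""
    inverse = {cw: sym for sym, cw in code.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:          # a leaf of the code-tree was reached
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

encoded = "".join(code[s] for s in "abacd")
print(encoded)                 # prints "0100110111"
print(decode(encoded, code))   # prints "abacd"
```

Note that the decoder never backtracks: as soon as an accumulated bit pattern matches a codeword, that symbol is emitted and the buffer is cleared.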

Prefix-free coding methods have been devised that take advantage of this code-tree mapping. We can build the tree so that the more-probable symbols are placed near the root while the less-frequent ones occupy the bottom; shallower leaves then receive shorter codewords. Further, dynamic methods are also possible, since the code-tree can simply be updated at each encoding step and still provide a correct mapping of the symbols based on their frequency counts.

The most well-known of the prefix-free algorithms is Huffman coding [1952]. Classified as a "static" order-0 algorithm, it remains widely used in compression packages and file archivers. Prior to Huffman's method was Shannon-Fano coding, which creates symbol codes by building the tree from the top down. Huffman's technique is the exact opposite: build the bottom nodes first, then create the upper nodes.

All static algorithms (e.g., Shannon-Fano and Huffman coding) have three phases [Gagie 2004]. The first phase reads the source message and records the frequency counts of the individual symbols. Second, the coding method builds its code-tree. The goal is to optimize the placement of symbols in the tree: the most-frequent symbols must be assigned shorter codelengths than the less-frequent symbols. The third phase is the actual encoding of the message using the code-tree.
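The three phases can be sketched in Python (a minimal illustration only, not an optimized implementation; the nested-tuple tree representation and the tiebreak field are choices of this sketch):

```python
import heapq
from collections import Counter

def huffman_codes(message):
    """Sketch of static Huffman coding.  An internal tree node is a
    (left, right) pair; a leaf is a symbol."""
    # Phase 1: read the message and record order-0 frequency counts.
    freq = Counter(message)
    # Phase 2: build the code-tree bottom-up by repeatedly merging the
    # two least-frequent nodes.  The integer tiebreak keeps heap entries
    # comparable when counts are equal.
    heap = [(n, i, sym) for i, (sym, n) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        n1, _, t1 = heapq.heappop(heap)
        n2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, tiebreak, (t1, t2)))
        tiebreak += 1
    # Read the codewords off the finished tree: left edge = 0, right = 1.
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"   # lone-symbol message edge case
    walk(heap[0][2], "")
    return codes

# Phase 3: encode the message with the derived codewords.
codes = huffman_codes("abracadabra")
encoded = "".join(codes[s] for s in "abracadabra")
# The most frequent symbol ('a', count 5) gets the shortest codeword.
```

The bottom-up merging is exactly the opposite of Shannon-Fano's top-down splitting: the least-frequent subtrees are joined first, so they end up deepest in the tree with the longest codewords.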

Decoding in static methods likewise has three phases. First, we read the order-0 statistics of the source message. Next, we build the code-tree. Finally, we decode the compressed file. Since the codewords are prefix-free, the decoder is guaranteed to always arrive at a leaf node, which corresponds to exactly one symbol.