Adaptive Huffman Coding

Faller, Gallager, Knuth = Algorithm FGK

by Gerald R. Tamayo


The standard adaptive Huffman algorithm is called Algorithm FGK.  Faller (1973) and Gallager (1978) designed the first versions and Knuth (1985) improved the algorithm.  In 1987, J. S. Vitter presented a new method of generating dynamic Huffman codes. His method is theorized to produce minimal Huffman codes in bits. Vitter proves that his technique will, at the worst case, generate approximately < S + t bits, while Algorithm FGK produces approximately 2S + t bits, where S  is the static Huffman’s bit output and t  is the size of the input message [Vitter 1987].

In algorithm FGK, the Huffman tree is dynamically recomputed when necessary to maintain an efficient code tree. At any state in the decompression process, the decoder can only receive codes that have already a mapping with the source symbols, and there are two sets of symbols in this case: (1) the set of “already-seen” symbols and (2) the set of “unseen” symbols. Thus, the encoder must somehow relay to the decoder a code which signifies an unseen symbol.


A Zero-Weight Huffman Node

The intuitive approach of algorithm FGK is to create an extra node to the tree to represent the set of unseen symbols; this node is called the 0-node.  When a new symbol is encountered, this node is sent as output to indicate a new symbol, and then the new symbol is also emitted as a raw byte. The 0-node is appropriately named because its weight is zero; as such, it has always the longest code. Its sole purpose is to “signal” to the decoder that a new symbol follows.

  The 0-node is created when the algorithm starts and is always maintained by creating two children nodes from the 0-node: one node becomes the new symbol’s node and the other becomes the new 0-node. The first 0-node is the root, and the new symbol node and the new 0-node branch off from this root node.

This 0-node idea is indeed the best approach to rebuild the Huffman tree each time a new symbol is seen, since it allows very fast processing of the tree. The following is an example of the first tree (with only one symbol, and the 0-node at the left), which suggests that there was a four-byte “run” of the symbol ‘A’:

Figure 1.   The Zero Node


As the algorithm runs, the tree will become large but always with the zero node at the very bottom level of the tree and at the “left” of the smallest node. However, if we just proceed on adding new symbol nodes from the 0-node, then the symbol tree will be a linear, “unbalanced” binary tree. This means that most of the symbols will have very long codes, and will not produce an effective symbol tree.


The Sibling Property

  In dynamic coding, it is not enough to just have a symbol tree: the tree must be a “correct” Huffman tree. Thus, the tree is recomputed or updated to ensure a correct tree. Recomputation of the tree is done dynamically, and the decoder also maintains the same tree that the encoder creates.

  Algorithm FGK maintains a property to create a compact tree as much as possible. This is called the Sibling Property. Accordingly, the Sibling property ensures a Huffman tree in the fastest manner as possible; recomputation of the tree always maintains the Sibling property.

  The Sibling Property defines a binary tree to be a Huffman tree if and only if:

 

The ideal numbering of the nodes starts from the very bottom level of the tree and proceeds from left to right, and going upward level by level. The level numbering is not always achieved by Algorithm FGK [Vitter 1987]; but the siblings’ consecutive ranking and the parent’s higher rank over its children ensure a correct tree, imposing the “sorted” order (according to frequency count) of all the nodes.


Updating the Huffman Tree

  With the Sibling property, nodes are promoted up the tree when necessary to assign them a minimal number of bits; that is, if they are gaining weight, they are assigned shorter bit codes. Some nodes may go down according to the statistics of the source; if one node goes up, there is certainly one node which goes down because the two nodes are simply swapped in the tree positions. Node promotion involves constant swapping of nodes.

Basically, the FGK algorithm states that a received symbol’s node (current node) must be promoted (via the Update method) as the highest  numbered node among nodes that are of equal  weight. Simply swap the two nodes (i.e., the highest numbered node and the current node which must now become the new highest numbered node) and the tree maintains the Sibling property, as well as ensuring a correct tree. The parent of the current node then becomes the new current node, and the process continues until the root of the tree is reached.

  The node numbering can be compared to the list created by the static Huffman algorithm in which the nodes are sorted according to their weights, from the lowest to the highest count. In this instance, however, we always maintain the node “numbers” so as to keep track of the highest node that is of equal count with the current node.

  The pseudo-code for Algorithm FGK (which is essentially the Tree Update method) is as follows:

 

Update_TreeFGK ( accept symbol )

{

    p = pointer to symbol’s node;

    

    if ( p == NULL ){  /*  a new symbol!  */

    Create two leaf children of the 0-node,

      such that the right child is the new symbol

      node and the left child is the new 0-node;

    p = parent of the new symbol node;

    }


    /*  if the two nodes are siblings.  */

    if ( p’s parent == 0-node’s parent ){

    Swap p in the tree with the highest

      numbered “leaf” of equal weight;

    p’s weight = p’s weight + 1;

    p = parent of p;

    }

    

    while ( p != root of the tree ){

    Swap p in the tree with the highest

      numbered “node” of equal weight;

    p’s weight = p’s weight + 1;

    p = parent of p;

    }

}


Figure 2.  Algorithm FGK’s “Update” Procedure



Swapping the nodes involves their “parent” pointers as well as the “children” pointers of their parents. If they have the same parent (which happens when you have to move the current node to the right), then obviously you have to swap only their parent’s children pointers.

However, when the current node is the sibling of the 0-node, then the swapping of the current node should be performed only with a “leaf” node and not with an internal node (only one internal node is of equal count with the 0-node’s sibling: the sibling’s own parent!), so it suffices only to verify if the tested node is not the 0-node’s and the sibling’s parent.

Compression tests.  We conducted empirical tests on different kinds of files, e.g., text files, binary files, and those of the Calgary Corpus. Algorithm FGK performs better than the static Huffman algorithm in almost all files. However, bear in mind that the weights are still probabilistic  (i.e., uncertain) in dynamic Huffman coding, as compared to the static Huffman algorithm in which the set of weights truly represent the actual symbol probabilities for the whole data source. But since Algorithm FGK transmits shorter codes than static Huffman coding, we can say that Algorithm FGK easily “adapts” to the incoming symbols and somehow predicts the appearances of the symbols with amazing accuracy.

Adaptive Huffman algorithms are not considered very fast and, in fact, algorithm FGK is shown by its authors to be theoretically slower than the static Huffman algorithm at the worst case. The way the 0-node is created and maintained makes algorithm FGK extremely fast. Certainly, the main advantage that can be expected from a dynamic Huffman algorithm is the compression performance. As our implementation shows, a dynamic Huffman algorithm creates smaller compressed files for almost all kinds of files, outdoing the static Huffman algorithm.

  Our implementation here is a worthwhile example of dynamic Huffman encoding; it shows that a file need not be read first to get the source file statistics. Moreover, the frequency table is not necessary (i.e., it is not attached in the compressed file's header) since the encoding is done dynamically. Hence, the encoder and the decoder dynamically maintain the same compression model for the data source.

  This principle of maintaining the same compression model by the encoder and the decoder is one of the most important principles in the field of data compression. As long as the encoder and decoder maintain the same model (i.e., the same statistics and data), some very interesting functions can be performed by the algorithm and provide better compression results. More advanced algorithms maintain more sophisticated models (e. g., arithmetic coding and DMC) to achieve very good compression performance.

The Algorithm FGK compression and decompression programs are named FGKC.C and FGKD.C, respectively (see FGK.ZIP). They also use the include files ADHUF.C and FGK.C.  These adaptive Huffman implementations are the reason why we need exactly H_MAX * 2 + 1 node addresses in HUF.C, because we have to allocate for the 0-node and its parent.


References

1.)    D. E. Knuth, “Dynamic Huffman Coding,” Journal of Algorithms 6 (1985), pp. 163-180.

2.]   J. S. Vitter, “Design and Analysis of Dynamic Huffman Codes,” Journal of the Association for Computing Machinery, Vol. 34, No.4, October 1987, pp. 825-845.


Source Code:

FGK - a highly-readable FGK implementation in C (fgkc.c) which closely adheres to standard descriptions of the algorithm.