Adaptive Huffman Compression



Dynamic Huffman Coding 2 - Algorithm Vitter

by Gerald R. Tamayo

 

    In 1987,  J. S. Vitter presented a paper on dynamic Huffman coding (see “Design and Analysis of Dynamic Huffman Codes,” Journal of the Association for Computing Machinery, Vol. 34, No. 4, October 1987, pp. 825-845).  In the paper, he describes an algorithm that is better, though not faster, than Algorithm FGK.  He based his results on an extensive analysis of Algorithm FGK and thereby came up with a technique that produces shorter dynamic Huffman codes.

    The method, called Algorithm Lambda, “uses less than one extra bit per letter,” so it is greatly superior than the standard FGK. Vitter suggests a “floating tree” data structure to make the method run in linear time and be fast enough for actual compression work. A new node numbering method, called implicit numbering, is also introduced and is achieved by the floating tree structure.

 

Algorithm "V"

    Vitter’s technique is sometimes called Algorithm V. It is almost the same FGK algorithm but imposes some tight constraints on node “promotion” and node “numbering.”  Thus, the “update” operation for Algorithm V is highly different than Algorithm FGK’s update method.

    The new update procedure further maintains or enforces an invariant (*) in the way of numbering the nodes:

 (*)     For each weight w, all leaves of weight w precede (in the implicit numbering) all internal nodes of weight w.   [Vitter 1987]

 

The “implicit” numbering is the exact visual way the nodes are positioned in the tree: the nodes are numbered from left to right and nodes on the upper level are higher in number than the nodes in the bottom level.

    The leaves have their own blocks as well as the internal nodes. Thus, with the invariant, we can imagine a sequence of leaf L blocks and internal T blocks: LTLTLTLT, with each block of distinct and non-decreasing weights.

     The separation of the leaf and internal blocks is necessary to the efficient functioning of Vitter’s algorithm. This is due to Vitter’s invariant which enforces restrictions on how the nodes are promoted or swapped with other nodes. It is also very important to note that the Vitter technique never executes Algorithm FGK’s Update method; Vitter completely designed a new update method and a novel technique of promoting nodes, called the “slide and increment” procedure.


The New Update Method

     The following pseudo-code is the new Update method of Algorithm Vitter.  It makes heavy use of the SlideAndIncrement procedure, which clearly is a substantial improvement on the FGK algorithm. Also notice that the new method first performs a simple “interchange” (or “swap” as in FGK) if the current node is an “already-seen” symbol. After the interchange, the “slide and increment” operations are performed continuously for the ancestors of the current node:

 

Update_TreeVITTER( accept symbol )

{

    leafToIncrement = NULL;

    p = pointer to symbol’s node;

    if ( p == NULL ){     /* a new symbol! */

       Create two leaf children of the 0-node,

    such that the right child is the new             symbol node and the left child is
        the new 0-node;

    p = parent of the symbol node;

         leafToIncrement = the right child of p;

    }

    else {

        Swap p in the tree with the leader

    of its block;

        if ( p is the sibling of the 0-node ) {

            leafToIncrement = p;

            p = parent of p;

        }

    }

     while ( p != root of the tree ) {

   /* if possible, advance p to the next block. */
        SlideAndIncrement( p );

    }

    if ( leafToIncrement != NULL ) {

        SlideAndIncrement( leafToIncrement );

    }

}

Figure 1.   The New “Update” Method

 

    Notice that, in the new Update method, the main loop constantly promotes the nodes via the SlideAndIncrement procedure.


Slide and Increment

     One major feature of Vitter’s method is the “slide” operation. The method also looks for nodes of equal weight (w) and finds the highest among these nodes. But while Algorithm FGK “swaps” the current and the highest node, the new method explicitly positions or “slides” the current node into the highest node’s position. The highest node, as well as the nodes to its left (which are to the right of the current node), are moved to the left to fill in the former position of the current node, like the way a basic “insertion sort” is performed.

    Another important operation in Vitter’s algorithm (if the node is an internal node), is to “re-slide” the node to the next block of nodes of weight w+1, since the current node’s frequency will always be incremented to w+1. This is done to maintain the invariant that the internal nodes must be higher than the leaves. If not, the block would be ordered pictorially as “[t(w+1), L(w+1), T(w+1)]”, where t is the current internal node.

    As you can see, the current node is lower than the leaf nodes of weight w+1. The re-slide operation moves the current node so as to correct the internal nodes’ block: “[L(w+1)], [t(w+1), T(w+1)]”, making only one block for all internal nodes of weight w+1.  Note that the current internal node is the first node in the block of internal nodes. The re-sliding of the node is what makes Vitter’s technique so powerful. The following is the “slide and increment” operation in pseudo code:

 

SlideAndIncrement ( accept node p )

{

fp = parent of p;

wt = p’s weight;

if ( p is an internal node )

  Slide p in the tree higher than

the leaf nodes of weight wt+1;

else Slide p in the tree higher than

the nodes of weight wt;

p’s weight = p’s weight + 1;

if ( p is an internal node ) p = fp;

else p = new parent of p;

}

 Figure 2.   The “Slide And Increment” Procedure

 

    The purpose of the slide operation, as in FGK, is to promote the current node as the highest numbered equal node in the implicit numbering.

 

Implementation  

    Our algorithm Vitter implementation is similar to algorithm FGK.  Interestingly, if we implement the Vitter algorithm, we shall find that instead of  “the right child of p” to increment (when a new symbol is received), simply re-incrementing Vitter’s “internal 0-node just created” improves the compression ratio to a large degree.  This means that when a new symbol is received, do not increment its frequency the first time, but only its parent’s. This somehow improves compression, beginning in the early stages of the encoding process and thereby effecting a better tree for the succeeding states.

    The above technique is reinforced by the observation in [Cleary and Witten 1984] that an event or symbol that has occurred only once might be considered an “error” or an “anomaly” while an event or symbol that has already occurred twice or more “is likely to be repeated further.”  In this case, we do  not increment the new symbol the first time, but only its parent. But since the parent node is the one re-incremented, it means that the new symbol is also moved along with the parent node. After all, if there is no 0-node, the parent node would be the tree node of the new symbol; incrementing the parent node is therefore equivalent to incrementing the new symbol node. Results show that the modification discussed above greatly improves Algorithm Vitter.

    Note that the Vitter implementation presented here is only intended to be an “equivalent” algorithm. Our re-slide operation for an internal node also achieves Vitter’s invariant. Our code does not use a floating tree suggested by Vitter (i.e., we have no “right” and “parent” arrays; the operations on blocks is achieved via the numbers[ ] array) but the resulting dynamic Huffman codes are exactly the same. In fact, the source code is completely designed and tested to match the encodings that Vitter provided for the following strings: “abc”, “abcd”, “abcc”, and the string “abacabdabaceabacabdfg”.

 

References

1.]   J. S. Vitter, “Design and Analysis of Dynamic Huffman Codes,” Journal of the Association for Computing Machinery, Vol. 34, No. 4, October 1987, pp. 825-845.

2.]           J. G. Cleary and I. H. Witten, “Data Compression Using Adaptive Coding and Partial String Matching,” IEEE Transactions on Communications, Vol. COM-32, No. 4, April 1984.

 

Source Code: 

VITTERGT.ZIP - a highly-readable Algorithm Vitter implementation in C which closely adheres to Vitter's pseudo-code  on Algorithm V.