Shannon-Fano Coding


Another variable-length compression algorithm closely related to Huffman coding is the so-called Shannon-Fano coding. The technique differs from Huffman coding only in the way it builds the binary tree of symbol nodes, so it too must gather the order-0 statistics of the data source. Once the order-0 frequency counts of all the symbols are known, the Shannon-Fano algorithm creates the tree according to the following steps:

1. Partition the symbol nodes into two sets with almost equal total frequency counts, and assign a single bit to each subset (0 to one set, 1 to the other);

2. Repeat step 1 for each of the two newly-created sets until only one symbol remains in each set. The string of classification bits collected along the way for each symbol is exactly the Shannon-Fano code for that symbol (a short worked example follows this list).
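
As a small worked example, suppose a hypothetical source has five symbols with frequency counts A=15, B=7, C=6, D=6 and E=5 (39 counts in all). One grouping with almost equal totals is {A, E} = 20 versus {B, C, D} = 19: the first set is assigned bit 0 and the second set bit 1. Splitting each of these sets again in the same way gives the codes A=00, E=01, B=10, C=110 and D=111, so encoding the whole source takes 15*2 + 5*2 + 7*2 + 6*3 + 6*3 = 90 bits.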


The same symbol encoding process as in Huffman compression is employed for Shannon-Fano coding. However, the Shannon-Fano algorithm sometimes produces codes that are longer than the corresponding Huffman codes: Huffman coding provably generates optimal prefix codes for an order-0 data source, while Shannon-Fano coding is not guaranteed to do so.
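
For the same hypothetical counts used above, an optimal Huffman tree assigns A a 1-bit code and 3-bit codes to B, C, D and E, for a total of 15*1 + (7 + 6 + 6 + 5)*3 = 87 bits, three bits fewer than the 90 bits of the Shannon-Fano grouping shown earlier.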


Again, we provide here a complete C program implementation for Shannon-Fano coding. Implementing the Shannon-Fano tree-creation process is trickier than the Huffman version, because it must obtain nearly equal total frequency counts for the two sets (which is why the symbol nodes are sorted first). The most efficient solution is always the simplest, and it came to me by way of a “recursive” function to build the Shannon-Fano tree.



Symbol-Tree Creation


The tree-creation function only looks at three nodes: the parent node and the two “children” nodes. The parent node holds the initial list of symbols (stored in its next pointer), and the two children nodes receive the new subsets of symbols. After this process, the parent node no longer contains any symbol nodes; only the children nodes hold the properly distributed sets. If either child node contains more than one symbol node, we run the same function on that child node “recursively” until only one node remains in each child.
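
A rough sketch of the kind of node structure this description assumes is shown below; the field names are hypothetical and need not match the ones used in SHANNON.ZIP:

  typedef struct node node_t;
  struct node {
      int      symbol;   /* symbol value (meaningful only for leaf nodes)    */
      unsigned freq;     /* total frequency count of this node's symbol/set  */
      node_t  *left;     /* child subset assigned classification bit 0       */
      node_t  *right;    /* child subset assigned classification bit 1       */
      node_t  *next;     /* head of the list of symbol nodes to distribute   */
  };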


The task seems simple, but the distribution process that classifies the symbols into the two sets has to be done carefully. An efficient way to generate the two subsets is to first sort the initial list of symbols by frequency count.


Consider the usual ascending sort: if we always take the lowest-frequency symbol first, then the last node to be distributed is the highest-frequency node, and adding such a large count at the very end can leave the two subsets badly unbalanced, which is not desirable.


A better alternative is to sort the list in “non-increasing” order of frequency counts, from the highest node to the lowest. Each symbol node is then “popped” from the top of the list and entered into whichever subset has the smaller total frequency count so far. Always choosing the smaller child node keeps the two subsets nearly equal, since the nodes distributed last are the smallest and add only minimally to the running totals.
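
Applied to the hypothetical counts from the earlier example (sorted as A=15, B=7, C=6, D=6, E=5), this rule sends A to the first, still-empty subset; B, C and D then go to the second subset because its running total (0, then 7, then 13) stays below 15; finally E goes to the first subset because 15 is smaller than 19. The result is the nearly balanced grouping {A, E} = 20 versus {B, C, D} = 19 used above.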


This process is repeated until every new subset contains only one node. The best technique to implement this task for all nodes is a recursive function. The function should be as simple as possible, incorporating the children-creation process, the classification or distribution process, and then the proper linking of the actual symbol nodes to their corresponding parent nodes.
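
The sketch below shows one way such a recursive function might look, assuming the node structure given earlier and an insert() helper (discussed in the next section) that keeps a list sorted in non-increasing frequency order; it is an illustration only, not the exact code in SHANNON.ZIP:

  #include <stdlib.h>

  node_t *insert(node_t *list, node_t *n);   /* sorted-list insertion, see below */

  /* Distribute the symbol list held in parent->next between two new children.
     The list must already be sorted in non-increasing frequency order, and the
     function should only be called when the list holds at least two nodes.   */
  void shannon_fano_split(node_t *parent)
  {
      node_t *n, *child;

      /* create the two children; their next pointers will hold the new subsets */
      parent->left  = calloc(1, sizeof(node_t));
      parent->right = calloc(1, sizeof(node_t));

      /* pop each symbol node from the top of the parent's list and enter it
         into the child with the smaller total frequency count so far */
      while ((n = parent->next) != NULL) {
          parent->next = n->next;
          child = (parent->left->freq <= parent->right->freq)
                  ? parent->left : parent->right;
          child->next  = insert(child->next, n);
          child->freq += n->freq;
      }

      /* recurse while a child still holds more than one symbol node */
      if (parent->left->next  && parent->left->next->next)
          shannon_fano_split(parent->left);
      if (parent->right->next && parent->right->next->next)
          shannon_fano_split(parent->right);
  }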


Compact Shannon-Fano Coding Routines

Only slight changes have been made to the Huffman routines to make the Shannon-Fano encoding program work. The first is in the insert() function itself, which now accepts any list, since we now have two separate lists to insert into (and not just one global list). However, since the inserted node might become the “new” starting node of the list, the new head must be assigned back to the list variable. Hence, we define the insertion function to accept a list as well as the node to insert, and to return the (possibly new) head of the list.
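
A minimal sketch of such an insertion routine, assuming the lists are kept sorted in non-increasing frequency order (again, the details are illustrative rather than a copy of the routine in SHANNON.ZIP):

  /* Insert node n into list, keeping the list sorted by non-increasing
     frequency count, and return the (possibly new) head of the list.  */
  node_t *insert(node_t *list, node_t *n)
  {
      node_t *p;

      if (list == NULL || n->freq > list->freq) {
          n->next = list;               /* n becomes the new head */
          return n;
      }
      for (p = list; p->next != NULL && p->next->freq >= n->freq; p = p->next)
          ;                             /* walk to the insertion point */
      n->next = p->next;
      p->next = n;
      return list;                      /* head is unchanged */
  }

A caller then writes list = insert(list, n); so that the possibly new head is always assigned back to the list.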

The recursive function is very short and expressive (see SHANNON.ZIP). The only other difference from the Huffman program is the allocation of memory for the top (root node) pointer, as well as the proper linking of the created symbol list to this root node’s next pointer, which is where the list is stored.
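
In terms of the sketches above, the corresponding setup could look something like this, with symbol_list standing for the already-built, sorted list of symbol nodes:

  node_t *top = calloc(1, sizeof(node_t));  /* root node of the Shannon-Fano tree */
  top->next = symbol_list;                  /* the symbol list is stored here      */
  shannon_fano_split(top);                  /* build the whole tree recursively    */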


by Gerald R. Tamayo


 Source Code:

SHANNON.ZIP - contains a C implementation of "static" Shannon-Fano coding.