The most basic class of compression algorithms
employs a symbol-tree to create
distinct codes for all symbols of the
source message. As such, the symbol-tree is also referred to as a code-tree. Using a tree is practical
since a complete binary tree can represent all possible codes for the known
symbols, with every symbol occupying a “leaf” node of the tree. For example, the Extended ASCII alphabet (of size 256) fits exactly into
one complete binary tree of depth 8, one leaf per symbol.

Visualizing the code-tree offers immediate intuition
to the data compression researcher: the codes need no separator bits to
distinguish one from the next. The sequence of left and right edges of the binary tree, labelled 0 and 1 respectively,
represents the codeword for a chosen
symbol, since there is a distinct path
from the root of the tree down to that symbol’s leaf. Because every symbol sits at a leaf, no codeword is a prefix of any other codeword; codes with this property are called
prefix-free codes, or simply prefix codes. Since no codeword is a prefix of another,
every symbol is immediately
decodable.
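As a toy illustration (the three-symbol code below is invented for this example), a prefix-free table lets codewords be concatenated with no separators between them:

```python
# A tiny prefix-free code: no codeword is a prefix of another,
# so concatenated codewords need no separator bits.
codes = {"a": "0", "b": "10", "c": "11"}

bits = "".join(codes[s] for s in "abca")
print(bits)  # -> "010110"
```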

Prefix-free coding methods have been devised that
take advantage of this code-tree mapping. We can build the tree in such a way
that the more-probable symbols are placed near the root of the tree, while
the less-probable ones occupy the bottom of the tree. Further, dynamic methods are also possible, since
the code-tree can simply be “updated” at each encoding step and still provide a
correct mapping of symbols to codes based on their running frequency counts.

The most well-known of the
prefix-free algorithms is Huffman coding [1952]. Classified as a “static”
order-0 algorithm, it remains widely used, appearing as an entropy-coding stage in many compression
packages and file archivers. Huffman’s method was preceded by Shannon-Fano coding,
which creates symbol codes by building the tree from the top
down. Huffman’s technique is the exact opposite: build the bottom nodes first,
merging them upward until the root is created.
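A minimal sketch of the top-down Shannon-Fano construction (the function name and the tie-breaking rule for choosing the split point are choices made for this example): sort the symbols by frequency, cut the list where the two halves’ totals are closest to equal, label the halves 0 and 1, and recurse into each half.

```python
def shannon_fano(freqs):
    """Top-down code construction: split the frequency-sorted symbol
    list into two halves of (nearly) equal total weight, label the
    halves 0 and 1, and recurse into each half."""
    items = sorted(freqs.items(), key=lambda kv: -kv[1])
    codes = {}

    def split(group, prefix):
        if len(group) == 1:
            codes[group[0][0]] = prefix or "0"
            return
        total = sum(f for _, f in group)
        acc, cut, best = 0, 1, float("inf")
        for i in range(1, len(group)):  # find the most balanced cut
            acc += group[i - 1][1]
            diff = abs(total - 2 * acc)
            if diff < best:
                best, cut = diff, i
        split(group[:cut], prefix + "0")
        split(group[cut:], prefix + "1")

    split(items, "")
    return codes

print(shannon_fano({"a": 5, "b": 2, "c": 1, "d": 1}))
# -> {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
```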

All static algorithms (e.g.,
Shannon-Fano and Huffman coding) have three phases [Gagie 2004]. The first
phase involves reading the source
message and recording the frequency
counts of the individual symbols in the message. Second, the coding method builds its code-tree. The idea is to
create a code-tree that optimizes the “placement” of symbols in the tree: it is
important that the most-frequent symbols be assigned shorter codelengths than
the less-frequent symbols. The third phase is the actual encoding of the message using the code-tree.
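The three phases can be sketched end-to-end with a minimal static Huffman encoder (a sketch under simplifying assumptions: the code table is returned alongside the bitstring rather than written to a file header, and ties on the heap are broken arbitrarily by insertion order):

```python
import heapq
from collections import Counter
from itertools import count

def huffman_encode(message):
    # Phase 1: read the message and record order-0 frequency counts.
    freqs = Counter(message)
    # Phase 2: build the code-tree bottom-up by repeatedly merging
    # the two least-frequent nodes (Huffman's method).
    tie = count()  # unique tiebreaker keeps heap entries comparable
    heap = [(f, next(tie), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node
            walk(node[0], prefix + "0")  # left edge -> 0
            walk(node[1], prefix + "1")  # right edge -> 1
        else:
            codes[node] = prefix or "0"  # single-symbol edge case
    walk(heap[0][2], "")
    # Phase 3: encode the message with the code table.
    return "".join(codes[s] for s in message), codes

bits, codes = huffman_encode("aaaaabbcd")
# Codelengths come out as {a: 1, b: 2, c: 3, d: 3}, so len(bits) == 15.
```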

Decoding in static methods
has three phases too. First, we read
the order-0 statistics of the source message. Next, we build the code-tree. And finally, we decode the compressed file. Since the codewords are prefix-free,
the decoder is guaranteed to arrive at a leaf node, which identifies the next decoded symbol.
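The leaf-per-symbol guarantee can be sketched as a decoder that walks the code-tree bit by bit (the trie-of-dicts representation and the three-symbol code are choices made for this illustration):

```python
def build_tree(codes):
    """Turn a codeword table into a binary trie; leaves hold symbols."""
    root = {}
    for sym, code in codes.items():
        node = root
        for bit in code[:-1]:
            node = node.setdefault(bit, {})
        node[code[-1]] = sym
    return root

def decode(bits, root):
    out, node = [], root
    for bit in bits:
        node = node[bit]          # follow the 0/1 edge
        if not isinstance(node, dict):
            out.append(node)      # reached a leaf: emit its symbol
            node = root           # restart at the root for the next codeword
    return "".join(out)

tree = build_tree({"a": "0", "b": "10", "c": "11"})
print(decode("010110", tree))  # -> "abca"
```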