Welcome to The Data Compression Guide!

    Data compression is one of the most exciting areas of computer science.  Almost every type of computer user, from students to business enterprises, depends on data compression techniques to store as much data as possible and maximize the use of storage devices. Here, we discuss the basic encoding algorithms that have dominated the field of data compression for years.

    The first section introduces the theory of data compression. This section starts with a concise definition of terms; presents the subjects of information models, context, and entropy; and then proceeds to a classification of the various encoding methods.  A simple compression algorithm, called Run-Length Encoding (or RLE), is discussed right away with a complete program implementation. The inclusion of C source code is very helpful, since readers can immediately try the programs while studying how they work. It is also here that a Bit Stream Input/Output library is introduced. A bit I/O library, which reads and writes individual bits, is the most basic tool of the data compression programmer.
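
    To make the idea concrete before the full program, here is a minimal sketch of run-length encoding (an illustration, not the book's implementation): each run of identical bytes is replaced by a (count, byte) pair.

```c
#include <stddef.h>

/* Minimal run-length encoding sketch: each run becomes a (count, byte)
   pair. A one-byte count limits runs to 255; real implementations also
   need an escape scheme so that non-repeating data does not expand. */
size_t rle_encode(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i = 0, o = 0;
    while (i < n) {
        unsigned char b = in[i];
        size_t run = 1;
        while (i + run < n && in[i + run] == b && run < 255)
            run++;
        out[o++] = (unsigned char)run;  /* run length */
        out[o++] = b;                   /* the repeated byte */
        i += run;
    }
    return o;  /* number of output bytes */
}
```

    For example, the input "aaabbc" encodes to the three pairs (3, 'a'), (2, 'b'), (1, 'c').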

    Section 2 features a whole suite of popular prefix-free coding algorithms. These algorithms output a specific code for each individual symbol, and usually employ a “symbol tree” to create the symbol codes. Static Huffman coding and dynamic Huffman coding (Algorithm FGK and Algorithm Vitter) are both treated extensively. Here readers are introduced to “adaptive” Huffman coding, in which the Huffman codes are dynamically recreated for better and faster compression. The venerable Shannon-Fano coding (used in earlier commercial file compressors) is also discussed and implemented.
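
    As a small illustration of why prefix-free codes decode unambiguously, consider the hypothetical code 'a' = 0, 'b' = 10, 'c' = 11, in which no codeword is a prefix of another. A decoder simply walks the symbol tree bit by bit and emits a symbol each time it reaches a leaf:

```c
#include <stddef.h>

/* Decode a prefix-free code by walking its (tiny, hardcoded) symbol
   tree: 'a' = 0, 'b' = 10, 'c' = 11. The bits arrive here as a '0'/'1'
   string for readability; a real decoder would use a bit I/O library.
   Assumes well-formed input (every '1' is followed by another bit). */
size_t prefix_decode(const char *bits, char *out)
{
    size_t o = 0;
    while (*bits) {
        if (*bits == '0') {            /* leaf reached at once: 'a' */
            out[o++] = 'a';
            bits++;
        } else {                       /* '1': need one more bit */
            bits++;
            out[o++] = (*bits == '0') ? 'b' : 'c';
            bits++;
        }
    }
    out[o] = '\0';
    return o;
}
```

    The bit string "0101100" splits only one way, as 0|10|11|0|0, and so decodes to "abcaa".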

    Section 3 presents the dictionary-based algorithms of Ziv and Lempel, which are among the most powerful compression algorithms to date.  They are string compression algorithms in which a single code refers to a “group” of symbols. The Lempel-Ziv (LZ77) and Lempel-Ziv-Welch (LZW) compression algorithms are both carefully analyzed, with complete C source code implementations. LZW is implemented using a ternary search tree (or TST), a digital search trie (via the hash function of Unix's LZC), and a binary search tree (BST), the fastest of these three data structures for an LZW implementation.
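
    The LZW idea can be sketched independently of the dictionary data structure. The toy encoder below uses a plain linear-search dictionary for clarity only; the implementations in this section replace it with a TST, a hashed trie, or a BST:

```c
#include <stddef.h>

/* Minimal LZW encoder sketch over bytes. Codes 0..255 stand for the
   single bytes themselves; new codes for longer strings are assigned
   from 256 upward. A linear-search dictionary is used here purely for
   clarity; it is far too slow for real use. */
#define MAX_DICT 4096

static int dict_prefix[MAX_DICT];
static unsigned char dict_ch[MAX_DICT];
static int dict_size;

static int dict_find(int prefix, unsigned char c)
{
    for (int i = 256; i < dict_size; i++)
        if (dict_prefix[i] == prefix && dict_ch[i] == c)
            return i;
    return -1;
}

size_t lzw_encode(const unsigned char *in, size_t n, int *codes)
{
    size_t ncodes = 0;
    dict_size = 256;                 /* codes 0..255 = raw bytes */
    if (n == 0) return 0;
    int w = in[0];                   /* current string, as a code */
    for (size_t i = 1; i < n; i++) {
        unsigned char c = in[i];
        int wc = dict_find(w, c);
        if (wc >= 0) {
            w = wc;                  /* w+c is known: keep extending */
        } else {
            codes[ncodes++] = w;     /* emit the code for w */
            if (dict_size < MAX_DICT) {   /* add w+c as a new entry */
                dict_prefix[dict_size] = w;
                dict_ch[dict_size] = c;
                dict_size++;
            }
            w = c;                   /* restart from the new byte */
        }
    }
    codes[ncodes++] = w;
    return ncodes;
}
```

    On the repetitive input "abababab" the encoder emits only five codes (97, 98, 256, 258, 98), because each emitted code soon refers to an ever-longer “group” of symbols.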

    Section 4 covers data transformation methods such as the Run-Length Transform (RLT), the Move-To-Front (MTF) algorithm, and the Burrows-Wheeler Transform (BWT), all clearly explained and implemented in C. The BWT takes a file as input and produces as output a file of “highly redundant” bytes, which later compression stages can then exploit. The BWT has since generated considerable research activity and might interest you as well.
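
    As a taste of these transforms, here is a minimal Move-To-Front sketch (an illustration, not the section's implementation). Each byte is replaced by its current index in a list of all 256 byte values, and that value is then moved to the front of the list, so recently seen bytes map to small indices that a later entropy coder can exploit:

```c
#include <stddef.h>

/* Move-To-Front transform sketch: emit each byte's current index in a
   256-entry table, then move that byte to the front of the table.
   Repeated or clustered bytes (such as BWT output) become runs of
   small index values. */
void mtf_encode(const unsigned char *in, size_t n, unsigned char *out)
{
    unsigned char table[256];
    for (int i = 0; i < 256; i++)
        table[i] = (unsigned char)i;
    for (size_t i = 0; i < n; i++) {
        unsigned char b = in[i], j = 0;
        while (table[j] != b)
            j++;
        out[i] = j;                       /* emit current index */
        for (; j > 0; j--)                /* move b to the front */
            table[j] = table[j - 1];
        table[0] = b;
    }
}
```

    For the input "aab", the first 'a' emits its initial index 97, the second 'a' emits 0 because 'a' is now at the front, and 'b' still emits 98.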

    Readers who are computer science students will already be familiar with the Huffman compression algorithm, since this encoding technique is a staple topic in any data structures and algorithms course as a practical application of binary trees. For beginning students, this algorithm is their “introduction” to the field of data compression.

    Now, with the complete Huffman program presented here, student readers can examine and dissect a real Huffman compressor and perhaps come up with their own improvements to this classic algorithm. This “open-source” approach will lead them to implement the Huffman algorithm using advanced data structures for faster operations. The C programs are provided here in the hope that they will be helpful, particularly to students.

    You are encouraged to study data compression and come up with your own improved algorithms.  But do not re-invent the wheel: study the current implementations first, and only then try to improve them.

    When programming, it is good advice to take your time, rest a while, and have fun. Good luck!


Gerald R. Tamayo

compgt [at] gmail.com