The Data Compression Guide



  • Source coding

Prefix-free Coding

Dynamic Prefix-free Coding

Dictionary-based Compression

Data Transformation Methods

Research, Sites




    Beginning in the 1980s, the capacity and speed of storage devices have been tremendously improving. Today, new kinds of cheaper and more efficient memory devices are constantly emerging. Nevertheless, as files become too large, it is helpful if we could somehow encode them into a smaller size (when not in current use) and restore them back when needed later. That way, we can avail more of our storage media like diskettes, hard disks, CDs, USB Flash Disks, and tapes.

    The process of transforming a data source into a smaller size is called compression while restoring it back to its original form is called decompression. Typically, the data source is a string or sequence of bytes, or it might be a file or a group of files. For our purposes, we consider the data source to be a file though the techniques we study here apply the same to any source.

 Most data that we store in our computer devices are digital—each unit of information packed in binary form. This binary nature of the source is the subject of much research: how much information it really contains, and which better way to represent that information in a smaller number of binary digits, or bits. The compressed (or output) file is ultimately a concatenation of thousands—or even millions—of bit strings. Clearly, the cost of sending data over communications networks is minimized if the files are highly compressed. 

Definition of Terms

Data compression at its basic level does not only work on bytes; theoretically, we encode “symbols,” which may be 8-bit symbols or 16-bit symbols. These symbols, or numerical representations of objects, are distinct, and taken together the whole group is called the “alphabet.” In the English language, the 26 letters are the symbols and the group is called the English Alphabet. The whole string of symbols to compress is referred to as the “message” or “data  source.”

Consider an input message S of length m drawn from a source alphabet A of n distinct symbols, i.e., S = and A = {a0,...,an-1}. To compress a message symbol, we build a code (or codeword) for each of the symbols in A, thus creating another alphabet B. This alphabet of codes might be a standard alphabet or one which is deliberately created by the encoding algorithm according to the statistics of the message.

Normally, all distinct symbols of the source alphabet A have a one-to-one correspondence with the other members of the code alphabet B. That is, if A = {a0,...,an-1}, then B = {b0,...,bn-1}, where n is typically 256. This ensures that all the symbols can be decompressed using their complement codes.

We define compression methods that employ a one-to-one mapping technique as prefix-free coding algorithms, where no symbol is a prefix of any other symbols.  However, other coding techniques can create just one code for a “group” of symbols, enabling better compression. These methods are classified as dictionary-based algorithms.

    The process of creating and employing a certain code as a substitute for a message symbol is called encoding. In other words, the compression process is nothing more than a way of writing the data source using another alphabet. To be more effective, the written output must be smaller than the original data source. And in order for compression to be achieved, it follows that somehow most of the codes must be shorter in size than the original symbols. The size of the codes in bits is referred to as the codelength, or bit length. Limiting the codelengths is the main focus of data compression.

 Codes can be categorized in terms of “symbol-to-code” mapping, noting in particular the size in bits of the symbols and their corresponding codes [Lelewer and Hirschberg 1987].  These types of mapping are classified as block-block, block-variable, variable-block, and variable-variable coding.

 In block-block coding, the symbols and codes are of the same size. The Extended ASCII[1] codes and EBCDIC[2] codes are block-block codes and do not provide compression. On the other hand, block-variable coding maps individual symbols into codes of varying sizes—and provides compression.

 Dictionary-based algorithms are of the variable-block coding type since the number of symbols vary (creating “strings” of two or more symbols as much as possible), but the size of the codes (usually the indices in the dictionary) is fixed.

 A more advanced type is variable-variable coding. In practice, fixed-length dictionary codes can be turned into variable codes. The string codes are simply limited in size at the start of the coding process and gradually increased as the algorithm proceeds. After some finite amount of codes are defined, the codes stabilize into a block code.

Compression Ratio

    A common measure of a compression algorithm is its ability to produce smaller files. The so-called compression ratio is based on this measure. Authors have defined the term in several ways. Theoretically, as most purists note, the compression ratio should be the compressed size over the original size; others use original size / compressed size.

However, it persisted in the literature (and in actual compression programs like WinZIP) that this ratio correspond to the reduction in number of bytes, and it is usually expressed in percentage form. The percentage value is thus indicative of the bytes actually removed and not of the compressed size:

Comp. Ratio (%) = 

        (1 (compressed size/original size))  ×  100

    For example, if you have a 250-byte original file and the output file size is exactly 50 bytes, then your compression ratio is 80% and not 20%. In other words, the percentage of reduction is exactly 80%, and that the resulting compressed file is only 20% of the original file. Therefore, the higher the compression ratio, the smaller the compressed file size, which is the whole purpose of data compression.




[1]  American Standard Code for Information Interchange (ASCII)

[2]  Extended Binary Coded Decimal Interchange Code (EBCDIC)