1.3 Compression

Videos

The basics of compression

The first two videos, produced by Computerphile (University of Nottingham), explain compression in terms of lossless and lossy techniques. The second video is a continuation of the first, so you need to watch them in order; it is not so important for the exam, but it will aid your understanding of encoding systems.

The next video, which stands alone from the two above, explains lossless and lossy compression in another way.

Lossless Compression Techniques

Run Length Encoding (RLE)

RLE is a lossless algorithm that only offers decent compression ratios on specific types of data. This section is taken from the Prepressure and Wikipedia web pages.

How RLE works

RLE is probably the easiest compression algorithm there is. It replaces runs of identical data values within a file with a count and a single copy of the value. Suppose the following string of data (17 bytes) has to be compressed:

ABBBBBBBBBCDEEEEF

Using RLE compression, the compressed file takes up 10 bytes and could look like this (spaces added for readability only):

A *9B C D *4E F

As you can see, repetitive strings of data are replaced by a control character (*) followed by the number of repetitions and the repeated character itself. The control character is not fixed; it differs from implementation to implementation.

If the control character itself appears in the file, it must be escaped, which costs one extra character.

As you can see, RLE encoding is only effective if there are sequences of 4 or more repeating characters: three characters are needed to encode a run, so encoding a run of only two repeating characters would actually increase the file size.

It is important to know that there are many different run-length encoding schemes. The above example has just been used to demonstrate the basic principle of RLE encoding. Sometimes the implementation of RLE is adapted to the type of data that are being compressed.
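To make the principle concrete, here is a minimal Python sketch of the control-character scheme used in the example above. The function name, the '*' control character and the run-length threshold of 4 are taken from the worked example rather than from any particular file format, and the escaping issue described above is ignored.

def rle_encode(data, control="*"):
    """Replace runs of 4+ identical characters with <control><count><char>."""
    out = []
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1                                  # measure the run starting at i
        run = j - i
        if run >= 4:
            out.append(f"{control}{run}{data[i]}")  # e.g. nine B's -> *9B
        else:
            out.append(data[i] * run)               # short runs stay literal
        i = j
    return "".join(out)

print(rle_encode("ABBBBBBBBBCDEEEEF"))  # -> A*9BCD*4EF (10 characters)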

Advantages and disadvantages

This algorithm is very easy to implement and does not require much CPU horsepower. RLE compression is only efficient with files that contain lots of repetitive data. These can be text files if they contain lots of spaces for indenting, but line-art images that contain large white or black areas are far more suitable. Computer-generated colour images (e.g. architectural drawings) can also give fair compression ratios.

Example

For example, consider a screen containing plain black text on a solid white background. There will be many long runs of white pixels in the blank space, and many short runs of black pixels within the text. A hypothetical scan line, with B representing a black pixel and W representing white, might read as follows:

WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW

With a run-length encoding (RLE) data compression algorithm applied to the above hypothetical scan line, it can be rendered as follows:

12W1B12W3B24W1B14W

This can be interpreted as a sequence of twelve Ws, one B, twelve Ws, three Bs, etc.

The run-length code represents the original 67 characters in only 18. While the actual format used for the storage of images is generally binary rather than ASCII characters like this, the principle remains the same. Even binary data files can be compressed with this method; file format specifications often dictate repeated bytes in files as padding space. However, newer compression methods such as DEFLATE often use LZ77-based algorithms, a generalization of run-length encoding that can take advantage of runs of strings of characters (such as BWWBWWBWWBWW).
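As a minimal sketch of this count-then-character scheme, the snippet below uses Python's itertools.groupby to collapse the scan line into runs; the function name and variable names are illustrative only.

from itertools import groupby

def rle_counts(line):
    # Each group is a run of identical characters; emit "<length><char>".
    return "".join(f"{len(list(group))}{char}" for char, group in groupby(line))

scan = "W" * 12 + "B" + "W" * 12 + "BBB" + "W" * 24 + "B" + "W" * 14
print(rle_counts(scan))  # -> 12W1B12W3B24W1B14W (18 characters)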

Run-length encoding can be expressed in multiple ways to accommodate data properties as well as additional compression algorithms. For instance, one popular method encodes run lengths only for runs of two or more characters, using an "escape" symbol to identify runs, or using the character itself as the escape, so that any time a character appears twice it denotes a run. Applied to the previous example, this gives the following:

WW12BWW12BB3WW24BWW14

This would be interpreted as a run of twelve Ws, a B, a run of twelve Ws, a run of three Bs, etc. In data where runs are less frequent, this can significantly improve the compression rate.
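Here is a sketch of that doubled-character variant: a character written twice signals a run, and the number that follows gives its length. The function name is illustrative.

from itertools import groupby

def rle_doubled(line):
    out = []
    for char, group in groupby(line):
        run = len(list(group))
        if run >= 2:
            out.append(f"{char}{char}{run}")  # doubled character marks a run
        else:
            out.append(char)                  # lone characters stay as-is
    return "".join(out)

scan = "W" * 12 + "B" + "W" * 12 + "BBB" + "W" * 24 + "B" + "W" * 14
print(rle_doubled(scan))  # -> WW12BWW12BB3WW24BWW14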

One other matter is the application of additional compression algorithms. Even with the runs extracted, the frequencies of different characters may still be uneven, allowing for further compression; however, if the run lengths are written in the file in the locations where the runs occurred, the presence of these numbers interrupts the normal flow and makes it harder to compress. To overcome this, some run-length encoders separate the data and escape symbols from the run lengths, so that the two can be handled independently. For the example data, this would result in two outputs, the string "WWBWWBBWWBWW" and the numbers (12, 12, 3, 24, 14).
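A sketch of this separated form, following the same hypothetical splitting rule as the doubled-character scheme above:

from itertools import groupby

def rle_split(line):
    data, lengths = [], []
    for char, group in groupby(line):
        run = len(list(group))
        if run >= 2:
            data.append(char + char)  # doubled character marks a run...
            lengths.append(run)       # ...its length goes to a separate list
        else:
            data.append(char)
    return "".join(data), lengths

scan = "W" * 12 + "B" + "W" * 12 + "BBB" + "W" * 24 + "B" + "W" * 14
print(rle_split(scan))  # -> ('WWBWWBBWWBWW', [12, 12, 3, 24, 14])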

Entropy Encoding

There will be no specific exam question on this technique. It is included to show alternative methods of encoding using a lossless technique. The following section is taken from the History of Data Compression Algorithms website.

Entropy in data compression means the smallest number of bits needed, on average, to represent a symbol or literal. A basic entropy coder combines a statistical model and a coder. The input file is parsed and used to generate a statistical model consisting of the probabilities of each symbol appearing. The coder then uses the statistical model to determine what bit or byte codes to assign to each symbol, such that the most common symbols have the shortest codes and the least common symbols have the longest codes.
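As a sketch of the statistical-model half of this process, the snippet below estimates symbol probabilities with a simple frequency count; the function name and sample string are illustrative only.

from collections import Counter

def symbol_model(data):
    counts = Counter(data)
    # Most common first: these symbols should receive the shortest codes.
    return {sym: count / len(data) for sym, count in counts.most_common()}

print(symbol_model("ABRACADABRA"))
# -> {'A': 0.4545..., 'B': 0.1818..., 'R': 0.1818..., 'C': 0.0909..., 'D': 0.0909...}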

Shannon-Fano Coding

This is one of the earliest compression techniques, invented in 1949 by Claude Shannon and Robert Fano. This technique involves generating a binary tree to represent the probabilities of each symbol occurring. The symbols are ordered such that the most frequent symbols appear at the top of the tree and the least likely symbols appear at the bottom. 

The code for a given symbol is obtained by searching for it in the Shannon-Fano tree and appending to the code a value of 0 or 1 for each left or right branch taken, respectively. For example, if "A" is two branches to the left and one to the right, its code would be "001". Shannon-Fano coding does not always produce optimal codes because of the way it builds the binary tree from the top down. For this reason, Huffman coding is used instead, as it generates an optimal code for any given input.

The algorithm to generate Shannon-Fano codes is fairly simple: count how often each symbol occurs, sort the symbols with the most probable first, split the list into two halves whose total probabilities are as close to equal as possible, prepend 0 to the codes of one half and 1 to the codes of the other, and recurse on each half until every symbol has its own code.
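Below is a minimal Python sketch of that top-down procedure. The function name, the balance-finding loop and the sample string are illustrative choices, not a standard implementation.

from collections import Counter

def shannon_fano(symbols):
    """symbols: (symbol, weight) pairs, sorted by descending weight."""
    if len(symbols) <= 1:
        return {sym: "" for sym, _ in symbols}
    total = sum(weight for _, weight in symbols)
    # Find the split that best balances the weight of the two halves.
    best, best_diff, running = 1, None, 0
    for i in range(1, len(symbols)):
        running += symbols[i - 1][1]
        diff = abs(total - 2 * running)
        if best_diff is None or diff < best_diff:
            best, best_diff = i, diff
    codes = {}
    for sym, code in shannon_fano(symbols[:best]).items():
        codes[sym] = "0" + code   # codes in the left half gain a leading 0
    for sym, code in shannon_fano(symbols[best:]).items():
        codes[sym] = "1" + code   # codes in the right half gain a leading 1
    return codes

print(shannon_fano(Counter("ABRACADABRA").most_common()))
# -> {'A': '0', 'B': '10', 'R': '110', 'C': '1110', 'D': '1111'}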

Huffman Coding

Huffman Coding is another variant of entropy coding that works in a very similar manner to Shannon-Fano Coding, but the binary tree is built from the bottom up, which guarantees an optimal result.

The algorithm to generate Huffman codes shares its first steps with Shannon-Fano: count how often each symbol occurs and create a leaf node for each symbol. It then repeatedly merges the two lowest-probability nodes into a new parent node until a single tree remains; the path from the root to each leaf gives that symbol's code.
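Below is a minimal sketch of the bottom-up merge using Python's heapq module. The function name and sample string are illustrative, and a real encoder would also store the tree so that a decoder can rebuild it.

import heapq
from collections import Counter

def huffman(text):
    # Each heap entry: (weight, tiebreak, {symbol: code-built-so-far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # the two least-frequent nodes...
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))  # ...become one node
        tiebreak += 1
    return heap[0][2]

print(huffman("ABRACADABRA"))
# -> {'A': '0', 'C': '100', 'D': '101', 'B': '110', 'R': '111'}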

Below is a good overview video from Tom Scott looking at Huffman coding. Huffman coding is not mentioned in the syllabus, but it is an industry-standard way of efficiently compressing text.

Lossy Compression

Lossy compression is a technique that does not decompress digital data back to 100% of the original. Lossy methods can provide high degrees of compression and result in smaller compressed files, but some of the original pixels, sound waves or video frames are removed forever. Examples are the widely used JPEG image, MPEG video and MP3 audio formats.

The greater the compression, the smaller the file. However, heavy image compression becomes visible in photos printed very large, and people with excellent hearing can notice the difference between MP3 music and high-resolution audio files. Typically, the moving frames of video can tolerate a greater loss of pixels than still images.
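As a toy illustration of why the loss is permanent, the sketch below quantises some made-up 8-bit sample values by discarding their low bits; decompression can only approximate the originals.

samples = [200, 201, 198, 64, 65, 66]    # imaginary 8-bit sound samples

compressed = [s >> 4 for s in samples]   # keep only the top 4 bits of each
restored = [c << 4 for c in compressed]  # best-effort decompression

print(compressed)  # [12, 12, 12, 4, 4, 4]
print(restored)    # [192, 192, 192, 64, 64, 64] (the fine detail is gone for good)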

CODEC Challenge

Requirements (meet as many of these as you can to be crowned the winner)

There is a test file on this page (in the files section), but a different file will be used to test your program.