1.3 Compression
Specification
Show understanding of the need for and examples of the use of compression
Show understanding of lossy and lossless compression and justify the use of a given method in a given situation
Show understanding of how a text file, bitmap image, vector graphic and sound file can be compressed
Including the use of run-length encoding (RLE)
Videos
The basics of compression
The first two videos, produced by Computerphile (University of Nottingham), explain compression in terms of lossless and lossy methods. The second video continues from the first, so you need to watch them in order. The second video, while less important for the exam, will aid your understanding of encoding systems.
The video below on the left, which stands alone from the two above, explains lossless and lossy compression in another way.
Lossless Compression Techniques
Run Length Encoding (RLE)
RLE stands for run-length encoding. It is a lossless algorithm that only offers decent compression ratios for specific types of data. This section is adapted from the Prepressure and Wikipedia web pages.
How RLE works
RLE is probably the easiest compression algorithm there is. It replaces sequences of the same data value within a file with a count and a single copy of that value. Suppose the following string of data (17 bytes) has to be compressed:
ABBBBBBBBBCDEEEEF
Using RLE compression, the compressed file takes up 10 bytes and could look like this (spaces added here for readability only):
A *9B C D *4E F
As you can see, repetitive strings of data are replaced by a control character (*) followed by the number of repeated characters and the repeated character itself. The control character is not fixed; it can differ from implementation to implementation.
If the control character itself appears in the file, it has to be escaped, which costs one extra character.
RLE encoding is therefore only effective when there are runs of four or more repeating characters: an encoded run always takes three characters, so encoding a run of two would actually increase the file size.
It is important to know that there are many different run-length encoding schemes. The above example has just been used to demonstrate the basic principle of RLE encoding. Sometimes the implementation of RLE is adapted to the type of data that are being compressed.
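A minimal Python sketch of the scheme above, assuming the count records the full length of the run and ignoring, for simplicity, the escaping of the control character:

```python
def rle_encode(data, control="*", threshold=4):
    """Encode runs of `threshold` or more repeated characters as
    control + count + character; shorter runs pass through unchanged."""
    out = []
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i]:
            run += 1
        if run >= threshold:
            out.append(f"{control}{run}{data[i]}")
        else:
            out.append(data[i] * run)
        i += run
    return "".join(out)

def rle_decode(encoded, control="*"):
    """Reverse the encoding: expand control + count + character runs."""
    out = []
    i = 0
    while i < len(encoded):
        if encoded[i] == control:
            j = i + 1
            while encoded[j].isdigit():   # read the digits of the count
                j += 1
            out.append(encoded[j] * int(encoded[i + 1:j]))
            i = j + 1
        else:
            out.append(encoded[i])
            i += 1
    return "".join(out)

text = "ABBBBBBBBBCDEEEEF"      # 17 bytes
packed = rle_encode(text)       # "A*9BCD*4EF" -- 10 bytes
assert rle_decode(packed) == text
```

Note that the decoder must see exactly the same control character and count convention as the encoder: this is why the control character, while not fixed across implementations, must be fixed within one.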
Advantages and disadvantages
This algorithm is very easy to implement and does not require much CPU horsepower. RLE compression is only efficient with files that contain lots of repetitive data. These can be text files if they contain lots of spaces for indenting, but line-art images that contain large white or black areas are far more suitable. Computer-generated colour images (e.g. architectural drawings) can also give fair compression ratios.
Example
For example, consider a screen containing plain black text on a solid white background. There will be many long runs of white pixels in the blank space, and many short runs of black pixels within the text. A hypothetical scan line, with B representing a black pixel and W representing white, might read as follows:
WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW
With a run-length encoding (RLE) data compression algorithm applied to the above hypothetical scan line, it can be rendered as follows:
12W1B12W3B24W1B14W
This can be interpreted as a sequence of twelve Ws, one B, twelve Ws, three Bs, etc.
The run-length code represents the original 67 characters in only 18. While the actual format used for the storage of images is generally binary rather than ASCII characters like this, the principle remains the same. Even binary data files can be compressed with this method; file format specifications often dictate repeated bytes in files as padding space. However, newer compression methods such as DEFLATE often use LZ77-based algorithms, a generalization of run-length encoding that can take advantage of runs of strings of characters (such as BWWBWWBWWBWW).
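The scan-line example can be reproduced with a few lines of Python, encoding each run as its length followed by the pixel value:

```python
from itertools import groupby

def rle_scanline(line):
    # groupby splits the line into runs of identical characters;
    # each run is written as its length followed by the pixel value.
    return "".join(f"{len(list(group))}{ch}" for ch, group in groupby(line))

row = "W" * 12 + "B" + "W" * 12 + "BBB" + "W" * 24 + "B" + "W" * 14
assert len(row) == 67
assert rle_scanline(row) == "12W1B12W3B24W1B14W"   # 18 characters
```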
Run-length encoding can be expressed in multiple ways to accommodate data properties as well as additional compression algorithms. For instance, one popular method encodes run lengths for runs of two or more characters only, using an "escape" symbol to identify runs, or using the character itself as the escape, so that any time a character appears twice it denotes a run. Applied to the previous example, this would give the following:
WW12BWW12BB3WW24BWW14
This would be interpreted as a run of twelve Ws, a B, a run of twelve Ws, a run of three Bs, etc. In data where runs are less frequent, this can significantly improve the compression rate.
One other matter is the application of additional compression algorithms. Even with the runs extracted, the frequencies of different characters may be large, allowing for further compression; however, if the run lengths are written in the file in the locations where the runs occurred, the presence of these numbers interrupts the normal flow and makes it harder to compress. To overcome this, some run-length encoders separate the data and escape symbols from the run lengths, so that the two can be handled independently. For the example data, this would result in two outputs, the string "WWBWWBBWWBWW" and the numbers (12,12,3,24,14).
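Both variants can be sketched in Python: `rle_doubled` uses the character itself as the escape, and `rle_split` separates the symbol stream from the run lengths so that each can be compressed independently:

```python
from itertools import groupby

def rle_doubled(line):
    """Runs of two or more are written as the character doubled,
    followed by the run length; single characters pass through."""
    out = []
    for ch, group in groupby(line):
        n = len(list(group))
        out.append(f"{ch}{ch}{n}" if n >= 2 else ch)
    return "".join(out)

def rle_split(line):
    """Separate the symbol stream (with doubled-character escapes)
    from the run lengths, yielding two independent outputs."""
    symbols, lengths = [], []
    for ch, group in groupby(line):
        n = len(list(group))
        if n >= 2:
            symbols.append(ch + ch)
            lengths.append(n)
        else:
            symbols.append(ch)
    return "".join(symbols), lengths

row = "W" * 12 + "B" + "W" * 12 + "BBB" + "W" * 24 + "B" + "W" * 14
assert rle_doubled(row) == "WW12BWW12BB3WW24BWW14"
assert rle_split(row) == ("WWBWWBBWWBWW", [12, 12, 3, 24, 14])
```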
Entropy Encoding
There will be no specific exam question on this technique. It is included to show alternative methods of encoding using a lossless technique. The following section is taken from the history of data compression algorithms website.
Entropy in data compression means the smallest number of bits needed, on average, to represent a symbol or literal. A basic entropy coder combines a statistical model and a coder. The input file is parsed and used to generate a statistical model that consists of the probabilities of each symbol appearing. Then, the coder uses the statistical model to determine what bit or byte codes to assign to each symbol, such that the most common symbols have the shortest codes and the least common symbols have the longest codes.
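As a rough illustration (not required for the exam), the entropy of a string can be computed in a few lines of Python from its observed symbol frequencies; the 17-byte RLE example above works out at about 1.94 bits per symbol, far below the 8 bits each ASCII character actually occupies:

```python
from collections import Counter
from math import log2

def entropy(data):
    """Shannon entropy: the average number of bits needed per
    symbol, given the observed symbol frequencies."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# "ABBBBBBBBBCDEEEEF" -> about 1.94 bits per symbol, versus 8-bit ASCII
bits_per_symbol = entropy("ABBBBBBBBBCDEEEEF")
```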
Shannon-Fano Coding
This is one of the earliest compression techniques, invented in 1949 by Claude Shannon and Robert Fano. This technique involves generating a binary tree to represent the probabilities of each symbol occurring. The symbols are ordered such that the most frequent symbols appear at the top of the tree and the least likely symbols appear at the bottom.
The code for a given symbol is obtained by searching for it in the Shannon-Fano tree and appending to the code a value of 0 or 1 for each left or right branch taken, respectively. For example, if “A” is two branches to the left and one to the right, its code would be “001”. Shannon-Fano coding does not always produce optimal codes because of the way it builds the binary tree from the top down. For this reason, Huffman coding is used instead, as it generates an optimal code for any given input.
The algorithm to generate Shannon-Fano codes is fairly simple:
1. Parse the input, counting the occurrence of each symbol.
2. Determine the probability of each symbol using the symbol count.
3. Sort the symbols by probability, with the most probable first.
4. Generate leaf nodes for each symbol.
5. Divide the list in two, keeping the total probability of the left half roughly equal to that of the right half.
6. Prepend 0 and 1 to the left and right nodes' codes, respectively.
7. Recursively apply steps 5 and 6 to the left and right subtrees until each node is a leaf in the tree.
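The procedure above can be sketched in Python. The split rule used here (take symbols from the front of the sorted list until half of the total probability is reached) is one simple way to keep the two halves roughly balanced:

```python
def shannon_fano(freqs):
    """Assign Shannon-Fano codes given a dict of symbol -> count,
    by recursively dividing the sorted symbol list in two."""
    # Sort symbols by probability, most probable first.
    symbols = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    codes = {sym: "" for sym, _ in symbols}

    def divide(group):
        if len(group) < 2:
            return
        total = sum(count for _, count in group)
        running, split = 0, 1
        # Find a split point where the left half's total is roughly half.
        for i, (_, count) in enumerate(group[:-1], start=1):
            running += count
            split = i
            if running >= total / 2:
                break
        left, right = group[:split], group[split:]
        for sym, _ in left:      # prepend 0 to the left half's codes
            codes[sym] += "0"
        for sym, _ in right:     # prepend 1 to the right half's codes
            codes[sym] += "1"
        divide(left)
        divide(right)

    divide(symbols)
    return codes

# Frequencies of the 17-byte RLE example string.
codes = shannon_fano({"B": 9, "E": 4, "A": 1, "C": 1, "D": 1, "F": 1})
```

Because the appends happen as the recursion descends, each symbol's code is built left-to-right from the root, so no code is a prefix of any other.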
Huffman Coding
Huffman Coding is another variant of entropy coding that works in a very similar manner to Shannon-Fano Coding, but the binary tree is built from the bottom up to generate an optimal result.
The algorithm to generate Huffman codes shares its first steps with Shannon-Fano:
1. Parse the input, counting the occurrence of each symbol.
2. Determine the probability of each symbol using the symbol count.
3. Sort the symbols by probability, with the most probable first.
4. Generate a leaf node for each symbol, including its probability, and add them to a queue.
5. While more than one node remains in the queue:
a. Remove the two lowest-probability nodes from the queue.
b. Prepend 0 and 1 to the left and right nodes' codes, respectively.
c. Create a new node with a probability equal to the sum of the two nodes' probabilities.
d. Assign the first node to the left branch and the second node to the right branch.
e. Add the new node to the queue.
6. The last node remaining in the queue is the root of the Huffman tree.
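A minimal Python sketch of this queue-based procedure, using a heap as the priority queue (the tiebreaker counter simply stops the heap from trying to compare the dictionaries when weights are equal):

```python
import heapq
from itertools import count

def huffman(freqs):
    """Build Huffman codes from a dict of symbol -> count by
    repeatedly merging the two lowest-probability nodes."""
    tie = count()
    # Each heap entry: (weight, tiebreaker, {symbol: code-so-far}).
    heap = [(w, next(tie), {sym: ""}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # two lowest-probability nodes
        w2, _, right = heapq.heappop(heap)
        # Prepend 0 to the left node's codes and 1 to the right node's,
        # then merge them into a new node with the summed weight.
        merged = {sym: "0" + code for sym, code in left.items()}
        merged.update({sym: "1" + code for sym, code in right.items()})
        heapq.heappush(heap, (w1 + w2, next(tie), merged))
    # The last node in the queue is the root, holding every symbol's code.
    # (A single-symbol alphabet would get the empty code; real encoders
    # special-case this.)
    return heap[0][2]

freqs = {"B": 9, "E": 4, "A": 1, "C": 1, "D": 1, "F": 1}
codes = huffman(freqs)
```

Using the same frequencies as the 17-byte RLE example, the most common symbol (B) receives a 1-bit code and the whole string fits in 33 bits instead of 17 × 8 = 136.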
A good overview video from Tom Scott, looking at Huffman coding. This is not mentioned in the syllabus, but it is an industry standard way of efficiently compressing text.
Lossy Compression
A compression technique that does not restore digital data to 100% of the original after decompression. Lossy methods can provide high degrees of compression and result in smaller compressed files, but some of the original pixels, sound samples or video frames are removed forever. Examples are the widely used JPEG image, MPEG video and MP3 audio formats.
The greater the compression, the smaller the file. However, heavy image compression becomes visible in photos printed very large, and people with excellent hearing can notice a clear difference between MP3 music and high-resolution audio files (see audiophile). Typically, the moving frames of video can tolerate a greater loss of pixels than still images.
CODEC Challenge
Requirements (meet as many of these as you can to be crowned winner)
Must be able to read in any given text file (file will use standard ASCII text)
Must be able to output the new file in a format of your choice - it is your codec after all
A reverse process should be able to take in your encoded file and output the unchanged original
This means that carriage returns should be preserved
It should, of course, be lossless compression
Compress to the smallest possible file size, but as a single contained file.
You should attempt top-down design when coding your codec.
There should be an algorithm drawn up showing the process you will program
There is a test file on this page (in the files section), but a different file will be used to test your program.