Graphics, text and sound are all stored using a computer system. Below is a reminder of how computers store these types of information. We will then look at how compression is used.
There are many sensible situations in which a file might be compressed, including:
Sending it as an email attachment, because smaller files transfer more quickly and there may be a maximum attachment size
Uploading it to a web site, because larger files take longer to upload and there may be restrictions on the maximum file size that can be uploaded
Saving it to disc when disc space is short, or simply to save disc space
Images on a computer system are made up of thousands of small coloured dots, known as pixels.
Bitmap images are stored as an array of pixels.
A black and white bitmap image will store a 1 for a black pixel and 0 for a white pixel.
A colour bitmap image stores a longer binary number for each pixel, representing how much red, green and blue (RGB) is required in the colour of that pixel; the number of bits used per pixel is known as the colour depth.
Vector images do not store the data as pixels; instead they store a set of instructions for drawing geometric shapes.
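As a rough illustration of the storage needed for bitmaps of different colour depths, here is a minimal Python sketch; the image dimensions and the 24-bit colour depth are assumed example values, not figures from these notes.

```python
# Minimal sketch: estimating bitmap storage from pixel count and colour depth.
# The 100 x 100 dimensions and 24-bit colour depth are assumed example values.

width, height = 100, 100            # pixels (assumed)
pixels = width * height

# Black and white bitmap: 1 bit per pixel (1 = black, 0 = white)
bw_bits = pixels * 1

# Colour bitmap: 24-bit colour depth (8 bits each for red, green and blue)
colour_depth = 24                   # bits per pixel (assumed)
colour_bits = pixels * colour_depth

print(bw_bits // 8, "bytes for the black and white image")    # 1250 bytes
print(colour_bits // 8, "bytes for the 24-bit colour image")  # 30000 bytes
```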
Sampling is a method of converting an analogue sound signal into a digital file.
At specific intervals (determined by the sample rate, i.e. how frequently the signal is measured), a measurement of the amplitude of the signal is taken; the number of bits used to store each measurement is the bit depth.
The amplitude of each sound sample is converted into the equivalent binary number.
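To make this concrete, here is a minimal Python sketch of the storage a sampled recording needs; the sample rate, bit depth and duration are assumed example values.

```python
# Minimal sketch: size of an uncompressed sound recording.
# sample_rate, bit_depth and duration are assumed example values.

sample_rate = 44100    # amplitude measurements taken per second
bit_depth = 16         # bits used to store each measurement
duration = 10          # seconds of audio

total_bits = sample_rate * bit_depth * duration
print(total_bits // 8, "bytes")       # 882000 bytes for 10 seconds of mono audio

amplitude = 19027                     # one example measurement (assumed value)
print(format(amplitude, "016b"))      # the equivalent 16-bit binary number
```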
When characters are stored on a computer system, they are stored as a binary number.
A character set is a table that maps each character to a unique binary number, e.g. ASCII or Unicode.
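A short Python sketch of this mapping, using the built-in ord() and chr() functions (which expose the Unicode/ASCII code points):

```python
# Minimal sketch: each character is stored as a unique binary number.

for ch in "Cat":
    code = ord(ch)                        # the character's unique number
    print(ch, code, format(code, "08b"))  # e.g. 'C' -> 67 -> 01000011

print(chr(97))                            # the number 97 maps back to 'a'
```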
Lossy and lossless are two types of data compression used to compress digital files such as graphics. Data compression reduces the file size.
A certain method uses the following compression ratios: lossy 10:1 and lossless 10:9.
Based on a 200 KB image it is possible to calculate the resulting file size for each compression type:
Lossy: 200 ÷ 10 = 20 KB
Lossless: 200 ÷ 10 × 9 = 180 KB
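The same calculation written as a short Python sketch, using the 200 KB file size and the 10:1 and 10:9 ratios from the example above:

```python
# Worked example: applying a 10:1 (lossy) and a 10:9 (lossless) compression
# ratio to a 200 KB image.

original_kb = 200

lossy_kb = original_kb / 10          # 10:1 ratio -> 20.0 KB
lossless_kb = original_kb / 10 * 9   # 10:9 ratio -> 180.0 KB

print(lossy_kb, "KB after lossy compression")
print(lossless_kb, "KB after lossless compression")
```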
Of these compression types, lossy is not suitable for a text document, because the process is irreversible, meaning some of the text would be lost.
Files compressed with a lossy method can never be recovered exactly as they were before compression: when decompressed they do not give back the original data, i.e. data is lost.
Because lossy compression cannot be decompressed to yield the exact original data, it is not a good method of compression for critical data, such as textual data
It is most useful for digitally sampled analogue data, such as sound, video, graphics or images
Algorithms for lossy compression vary, but many use threshold level truncation. This means that a level is chosen beyond which all data is truncated, e.g. in a sound file the very high and very low frequencies, which the human ear cannot hear, may be truncated from the file.
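Real lossy codecs are far more sophisticated, but a minimal sketch of the threshold-truncation idea, using NumPy's FFT and assumed cutoff frequencies, might look like this:

```python
# Minimal sketch of threshold truncation for audio (not a real codec).
# Frequencies outside an assumed audible band are discarded, so the
# reconstructed signal is close to, but not identical to, the original.

import numpy as np

sample_rate = 44100
t = np.arange(0, 1.0, 1 / sample_rate)      # one second of samples
# Example signal: an audible 440 Hz tone plus an inaudible 21 kHz component.
signal = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 21000 * t)

spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), 1 / sample_rate)

low_cut, high_cut = 20, 18000               # assumed threshold levels in Hz
spectrum[(freqs < low_cut) | (freqs > high_cut)] = 0   # truncate the extremes

reconstructed = np.fft.irfft(spectrum, n=len(signal))  # lossy: the 21 kHz tone is gone
```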
Some examples of lossy data compression algorithms are JPEG, MPEG, and MP3.
With lossless compression, the original message can be decompressed back to its original form (all of the original data is recovered).
Lossless data compression works by finding repeated patterns in data and compressing those patterns in an efficient manner.
For this reason, lossless data compression is also referred to as redundancy reduction. Because redundancy reduction is dependent on patterns in the message, it does not work well on random messages.
Lossless data compression is ideal for text. Most of the algorithms for lossless compression are based on the LZ compression method developed by Lempel and Ziv.
One type of encoding which is very effective for files with long runs of repeating bits or characters is Run Length Encoding (RLE). RLE replaces each run of identical values with a single value and a count of how many times it repeats.
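A minimal Python sketch of run length encoding (working on characters rather than individual bits, for readability):

```python
# Minimal sketch of Run Length Encoding: each run of identical characters
# is replaced by a (character, run length) pair.

def rle_encode(data):
    encoded = []
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i]:
            run += 1
        encoded.append((data[i], run))
        i += run
    return encoded

def rle_decode(pairs):
    return "".join(ch * count for ch, count in pairs)

packed = rle_encode("AAAAABBBCCCCCCCC")
print(packed)                 # [('A', 5), ('B', 3), ('C', 8)]
print(rle_decode(packed))     # AAAAABBBCCCCCCCC - lossless, fully recovered
```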
The LZ algorithm, by contrast, uses a sliding dictionary method. The sliding dictionary method uses pointers within the compressed file that point back to previously occurring strings of data within the file.
Here is an example of a message which could be effectively encoded with the sliding dictionary method: "The word the is the most frequently used word in the English language." The string "the" need only be represented once; every later occurrence can simply point back to it.
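A toy LZ77-style sketch of the sliding dictionary idea (greatly simplified compared with real LZ implementations; the message is the example above, lower-cased so every "the" matches):

```python
# Toy sketch of the sliding dictionary idea behind LZ compression.
# Repeated strings are replaced by (distance, length) pointers back to an
# earlier occurrence; unmatched characters are stored as literals.

def lz_compress(text, min_match=3):
    tokens, i = [], 0
    while i < len(text):
        best_len, best_dist = 0, 0
        for j in range(i):                         # search the already-seen text
            length = 0
            while i + length < len(text) and text[j + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append((best_dist, best_len))   # pointer to an earlier string
            i += best_len
        else:
            tokens.append(text[i])                 # literal character
            i += 1
    return tokens

def lz_decompress(tokens):
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            dist, length = tok
            for _ in range(length):
                out.append(out[-dist])
        else:
            out.append(tok)
    return "".join(out)

msg = "the word the is the most frequently used word in the english language."
tokens = lz_compress(msg)
print(lz_decompress(tokens) == msg)   # True - repeated strings became pointers
```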
Huffman coding works by analysing the frequency of elements in the data. The elements with the highest frequency are assigned the shortest encodings (the fewest bits); elements with lower frequencies are assigned longer encodings (more bits).
Huffman coding could be used to compress sound files, particularly recordings containing the range of frequencies heard in a human voice.
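A minimal Python sketch of Huffman coding using the standard heapq module; the sample string is purely illustrative:

```python
# Minimal sketch of Huffman coding: the most frequent characters receive
# the shortest bit patterns, the least frequent receive the longest.

import heapq
from collections import Counter

def huffman_codes(text):
    freq = Counter(text)
    # Heap entries are (frequency, tie_breaker, tree); a tree is either a
    # single character (leaf) or a (left, right) pair (internal node).
    heap = [(count, n, ch) for n, (ch, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # the two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tie, (left, right)))
        tie += 1
    codes = {}
    def assign(node, prefix):
        if isinstance(node, tuple):
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"      # handles a one-character message
    assign(heap[0][2], "")
    return codes

codes = huffman_codes("this is an example of huffman coding")
print(sorted(codes.items(), key=lambda kv: len(kv[1])))  # most frequent characters get the shortest codes
```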
A simple algorithm for compressing text is:
1. Find all combinations of characters that occur more than once
2. Replace each combination of characters that occurs more than once with a single substitute symbol
The following sentence is to be compressed using the algorithm: The cat sat on the mat. (A worked Python sketch of the algorithm appears after the challenge answers below.)
Challenge 1 - What two-character combinations can you identify that would be replaced in the above sentence?
Challenge 2 - Can you think of a substitute symbol that could be used when compressing English sentences similar to the one above?
Challenge 3 - Can you justify why your chosen substitute symbol in Challenge 2 is appropriate?
Challenge 1
‘at’ appears in the sentence 3 times; ‘he’ occurs twice; ‘t ‘ (t space) occurs twice; ‘e ‘ (e space) occurs twice
Challenge 2
A suitable substitute would be a symbol not commonly found in English/Welsh language sentences, for example ~ or ¬
The substitute cannot be a letter or any punctuation mark that might legitimately occur in an English/Welsh sentence
Challenge 3
The justification is that the chosen character would not sensibly occur in a text file of English sentences.
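Putting the algorithm and the challenge answers together, here is a minimal Python sketch; the particular substitute symbols chosen (~, ¬, `, |) are assumed examples in the spirit of Challenge 2:

```python
# Minimal sketch of the simple substitution algorithm from the notes:
# repeated character combinations are replaced by single substitute symbols
# that would not normally appear in English/Welsh text.

sentence = "The cat sat on the mat."

# The repeated two-character combinations from Challenge 1, each paired with
# an assumed substitute symbol. Because 'at' and 'he' are replaced first,
# the overlapping 't ' and 'e ' patterns no longer occur in this sentence.
substitutions = {"at": "~", "he": "¬", "t ": "`", "e ": "|"}

compressed = sentence
for pattern, symbol in substitutions.items():
    compressed = compressed.replace(pattern, symbol)

decompressed = compressed
for pattern, symbol in substitutions.items():
    decompressed = decompressed.replace(symbol, pattern)

print(compressed)                  # "T¬ c~ s~ on t¬ m~." - shorter than the original
print(decompressed == sentence)    # True - the substitution is fully reversible
```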