Lossless Compression

Lossless compression

As you learned in the previous step, compression reduces the amount of memory required to store a data file, but the output of a compressed data file can differ from the output of the original, uncompressed file. With lossless compression, the original data can be reconstructed perfectly, so there is no loss of data.

Take the words raspberry pi as an example. With regular ASCII character encoding, each letter is encoded by an 8-bit number.

r - 01110010

a - 01100001

s - 01110011

p - 01110000

b - 01100010

e - 01100101

r - 01110010

y - 01111001

- 00100000

p - 01110000

i - 01101001

With 12 characters to encode, this results to 96 bits of data. However, you may notice that the character r is present three times and the character p is present twice; as these characters are used the most frequently, encoding them with fewer bits would decrease the amount of data: they could be encoded with only 4 bits each:

r - 1010

a - 01100001

s - 01110011

p - 0011

b - 01100010

e - 01100101

r - 1010

y - 01111001

- 00100000

p - 0011

i - 01101001

That’s a reduction of 20 bits. The original data can easily be recovered by looking up the new encoding, and translating the 4-bit numbers back into 8-bit numbers. In a text containing thousands of characters, applying this method of compression could significantly reduce the amount of memory required to store the data.

Why lossless compression?

Lossless compression algorithms are needed in cases where we want the reconstructed file output to be identical to the original output.

Lossless compression is generally used for text compression: data such as database records, spreadsheets, and word processing files. When encoding and recovering this type of data, it is crucial to preserve each character exactly, because very small differences can have huge impact. For example, a tiny difference in a text can create very different meaning:

Do not fire the missile. versus Do now fire the missile.
Let’s eat, grandpa! versus Let’s eat grandpa!

This type of compression is also used for certain types of audio and video files.

A simple compression algorithm

Run-length encoding (RLE) is a very simple lossless compression algorithm. It works by taking runs of repeated data and storing them as single values. A simple analogy for RLE is the directions you give to someone while they are driving a car.

Your instructions might be this:

At the next traffic light, turn left
At the next traffic light, continue straight on
At the next traffic light, continue straight on
At the next traffic light, continue straight on
At the next traffic light, turn right

Using RLE, this can be compressed to:

At the next traffic light, turn left
At the next three traffic lights, continue straight on
At the next traffic light, turn right

No data has been lost, as the original instructions can easily be recreated based on the compressed ones.

Activity: an image compression script

Let’s have a look at the emoji you created last week. It was comprised of a grid of black and yellow pixels:

bbbbyybbbb

bbyyyyyybb

byyyyyyyyb

byybyybyyb

yyyyyyyyyy

yybyyyybyy

byybbbbyyb

byyyyyyyyb

bbyyyyyybb

bbbbyybbbb

These can all be placed on a single line:

bbbbyybbbbbbyyyyyybbbyyyyyyyybbyybyybyybyyyyyyyyyyyybyyyybyybyybbbbyybbyyyyyyyybbbyyyyyybbbbbbyybbbb

We want to reduce the length of this line. Where there are runs of a single letter, for instance 4 ys in a row, instead of the run (yyyy), we can write the number of times the letter is repeated, plus the letter: 4y.

If we do this for all the runs in the emoji file, the file contains the following characters:

4b2y6b6y3b8y2b2y1b2y1b2y1b12y1b4y1b2y1b2y4b2y2b8y3b6y6b2y4b

That’s a removal of 59 characters — a reduction of 41%!

Let’s have a look at a Python implementation of RLE. You don’t need to understand exactly how the script works, just the inputs and outputs.

Either open this repl.it project, or copy and paste the code below into a new Python file.

from re import sub

def encode(text):

return sub(r'(.)\1*', lambda m: str(len(m.group(0))) + m.group(1),text)

def decode(text):

return sub(r'(\d+)(\D)', lambda m: m.group(2) * int(m.group(1)),text)

raw = 'bbbbyybbbbbb'

print(encode(raw))

Run the file. You should see the following output:

4b2y6b

Experiment with this RLE script by changing the raw variable: alter the value of raw so that you feed the function different strings. Here’s the emoji:

raw = 'bbbbyybbbbbbyyyyyybbbyyyyyyyybbyybyybyybyyyyyyyyyyyybyyyybyybyybbbbyybbyyyyyyyybbbyyyyyybbbbbbyybbbb'

Use the script’s decode function to decode some compressed data as well.
What do you think about this implementation of RLE? Are there any ways in which the algorithm could be improved? Could there be times when the algorithm actually increases file size rather than decreasing it?

Page updated

Report abuse