As you learned in the previous step, compression reduces the amount of memory required to store a data file, but the output of a compressed data file can differ from the output of the original, uncompressed file. With lossless compression, the original data can be reconstructed perfectly, so there is no loss of data.
Take the words raspberry pi as an example. With regular ASCII character encoding, each letter is encoded by an 8-bit number.
r - 01110010
a - 01100001
s - 01110011
p - 01110000
b - 01100010
e - 01100101
r - 01110010
r - 01110010
y - 01111001
- 00100000
p - 01110000
i - 01101001
With 12 characters to encode, this results to 96 bits of data. However, you may notice that the character r is present three times and the character p is present twice; as these characters are used the most frequently, encoding them with fewer bits would decrease the amount of data: they could be encoded with only 4 bits each:
r - 1010
a - 01100001
s - 01110011
p - 0011
b - 01100010
e - 01100101
r - 1010
r - 1010
y - 01111001
- 00100000
p - 0011
i - 01101001
That’s a reduction of 20 bits. The original data can easily be recovered by looking up the new encoding, and translating the 4-bit numbers back into 8-bit numbers. In a text containing thousands of characters, applying this method of compression could significantly reduce the amount of memory required to store the data.
Lossless compression algorithms are needed in cases where we want the reconstructed file output to be identical to the original output.
Lossless compression is generally used for text compression: data such as database records, spreadsheets, and word processing files. When encoding and recovering this type of data, it is crucial to preserve each character exactly, because very small differences can have huge impact. For example, a tiny difference in a text can create very different meaning:
Do not fire the missile. versus Do now fire the missile.
Let’s eat, grandpa! versus Let’s eat grandpa!
This type of compression is also used for certain types of audio and video files.
Run-length encoding (RLE) is a very simple lossless compression algorithm. It works by taking runs of repeated data and storing them as single values. A simple analogy for RLE is the directions you give to someone while they are driving a car.
Your instructions might be this:
At the next traffic light, turn left
At the next traffic light, continue straight on
At the next traffic light, continue straight on
At the next traffic light, continue straight on
At the next traffic light, turn right
Using RLE, this can be compressed to:
At the next traffic light, turn left
At the next three traffic lights, continue straight on
At the next traffic light, turn right
No data has been lost, as the original instructions can easily be recreated based on the compressed ones.
Let’s have a look at the emoji you created last week. It was comprised of a grid of black and yellow pixels:
bbbbyybbbb
bbyyyyyybb
byyyyyyyyb
byybyybyyb
yyyyyyyyyy
yybyyyybyy
byybbbbyyb
byyyyyyyyb
bbyyyyyybb
bbbbyybbbb
These can all be placed on a single line:
bbbbyybbbbbbyyyyyybbbyyyyyyyybbyybyybyybyyyyyyyyyyyybyyyybyybyybbbbyybbyyyyyyyybbbyyyyyybbbbbbyybbbb
We want to reduce the length of this line. Where there are runs of a single letter, for instance 4 ys in a row, instead of the run (yyyy), we can write the number of times the letter is repeated, plus the letter: 4y.
If we do this for all the runs in the emoji file, the file contains the following characters:
4b2y6b6y3b8y2b2y1b2y1b2y1b12y1b4y1b2y1b2y4b2y2b8y3b6y6b2y4b
That’s a removal of 59 characters — a reduction of 41%!
Let’s have a look at a Python implementation of RLE. You don’t need to understand exactly how the script works, just the inputs and outputs.
Either open this repl.it project, or copy and paste the code below into a new Python file.
from re import sub
def encode(text):
return sub(r'(.)\1*', lambda m: str(len(m.group(0))) + m.group(1),text)
def decode(text):
return sub(r'(\d+)(\D)', lambda m: m.group(2) * int(m.group(1)),text)
raw = 'bbbbyybbbbbb'
print(encode(raw))
Run the file. You should see the following output:
4b2y6b
Experiment with this RLE script by changing the raw variable: alter the value of raw so that you feed the function different strings. Here’s the emoji:
raw = 'bbbbyybbbbbbyyyyyybbbyyyyyyyybbyybyybyybyyyyyyyyyyyybyyyybyybyybbbbyybbyyyyyyyybbbyyyyyybbbbbbyybbbb'
Use the script’s decode function to decode some compressed data as well.
What do you think about this implementation of RLE? Are there any ways in which the algorithm could be improved? Could there be times when the algorithm actually increases file size rather than decreasing it?