Huffman coding uses so-called binary trees to reduce the amount of information in a message.
Let's look at an example with a very basic version of this data "compression" algorithm.
Suppose we have the following text:
NARANJA MANDARINA
The text consists of a string with 17 characters, which will occupy 17 bytes in memory.
Each of these characters takes up 1 byte and is represented in binary using 8 bits.
For example, the letter A is represented as 01000001, and consequently, its ASCII code is: 1*2^0 + 0*2^2 + ... + 0*2^5 + 1* 2^6 + 0*2^7=65
Let's look at a code snippet to display the text in binary:
def string2bits(s=''):
return [bin(ord(x))[2:].zfill(8) for x in s]
def bits2string(b=None):
return ''.join([chr(int(x, 2)) for x in b])
txt = "NARANJA MANDARINA"
b = string2bits(txt)
We obtain the following string of 136 bits (17 characters with 8 bits each).
0100111001000001010100100100000101001110010010100100000100100000010011010100000101001110010001000100000101010010010010010100111001000001
The Huffman encoding method is based on the creation of a binary tree. To achieve this, we follow these steps:
1º) We create a list of the characters that make up the text, ordered from highest to lowest frequency.
For this, we use Python's "dictionary" structure:
# crea un diccionario con cada letra y su cantidad de apariciones
cont = {}
txt = "NARANJA KIWI MANDARINA"
for i in txt:
if i in cont: # verifica si existe en el diccionario
cont[i] += 1
else:
cont[i] = 1 # primera vez que aparece
# ordena el diccionario de mayor repeticion a menor
cont=dict(sorted(cont.items(), key=lambda item: item[1],reverse=True))
print(cont)
We obtain the following:
{'A': 6, 'N': 4, 'I': 3, 'R': 2, ' ': 2, 'J': 1, 'K': 1, 'W': 1, 'M': 1, 'D': 1}
2º) We create the binary tree. To do this, we place the characters ordered from highest to lowest frequency on the right side of the tree, assigning 0 to the left branches and 1 to the right branches according to the following scheme:
This tree assigns each character a numerical combination based on the path from the root to the specific character.
For example, to reach R, we follow 001, and to reach A, we follow 0.
Thus, we obtain the following representation for the characters:
This table can be obtained as follows:
# crea el arbol binario
cadena = "1"
for i in cont:
cont[i] = cadena
cadena = "0" + cadena
print("arbol ", cont)
With all this, we can attempt to "compress" the character string. To do this, we obtain a new binary representation by replacing each character with its corresponding code from the previous table.
# reemplazo cada caracter por el elemento del arbol
comprimido_binario = ""
for i in range(0,len(txt)):
comprimido_binario = comprimido_binario + cont[txt[i]]
binario = "0b" + comprimido_binario
print("comprimido binario", comprimido_binario)
We obtain the following binary representation, which lenght is 50 bits.
01100110100011000010000011010000001100100000001011
The original text had a lenght of 136 bits
0100111001000001010100100100000101001110010010100100000100100000010011010100000101001110010001000100000101010010010010010100111001000001
We observe that to "decompress," we must traverse the "compressed" binary string from left to right. The appearance of a character other than 0 indicates the separation between two characters.
01100110100011000010000011010000001100100000001011