Huffman coding is a lossless data compression algorithm. The idea is to assign variable-length codes to input characters, lengths of the assigned codes are based on the frequencies of corresponding characters. The most frequent character gets the smallest code and the least frequent character gets the largest code.The variable-length codes assigned to input characters are Prefix Codes, means the codes (bit sequences) are assigned in such a way that the code assigned to one character is not the prefix of code assigned to any other character. This is how Huffman Coding makes sure that there is no ambiguity when decoding the generated bitstream.
Let us understand prefix codes with a counter example. Let there be four characters a, b, c and d, and their corresponding variable length codes be 00, 01, 0 and 1. This coding leads to ambiguity because code assigned to c is the prefix of codes assigned to a and b. If the compressed bit stream is 0001, the de-compressed output may be “cccd” or “ccb” or “acd” or “ab”.
There two major sections of a Huffman Encoding operation are building a Huffman Tree from input characters and traversing the Huffman Tree and assigning codes to the characters.
In order to understand the implementation, it's good for the user to be aware about the following for getting thorough idea of the project:
First, use this resource to get famililar with the steps required to generate a Huffman Encoded Symbol and how the decoding algorithm is required at the receiver end.
Second, this project requires basic knowledge of makefiles, as some header files which are based on data provided to Huffman Symbol generation function cannot be overlooked and they need to be available before the BUILD takes place.
In addition to this, irrespective of the operating system you are using (Windows or Linux Distribution), you need to have python installed on your system, else the required header files would either not be updated or even generated if it is the first iteration on your device.
The Collected Data (this will be elaborated in further sections) should be present on both the devices, i.e. the KL25Z (development board used in this project) and the Receiver Application. Without this file (collected_data.txt) the Huffman table cannot be generated and the compression cannot take place.
Currently, the KL25Z exchanges untouched debug statement data through UART0 with interrupts enabled. More details about the UART0 can be found in source files and reference manual. In addition, it uses circular buffers to perform this process which are directly linked to printf and scanf statements using system calls. Please refer systemcall files for a brief.
In order to encode the data being put over UART channel, the encoding needs to be performed each time a printf is called. But, before initiating the encoding process, if you referred the resource above, you would know that the Huffman Table is crucial for this step. This table requires feeding of raw data upon which the table would be based. It is important to note that, the characters that would be sent over debug statements, the raw input file must have at least one character for each of the element in the string, i.e. If you try to print "$" and your raw input does not have the same in the input file, you would get garbage values if you print that character. Moving ahead, the table would include the following for each symbol:
Character for which the encoding is taking place.
Binary Code for each character based on its occurring frequency in the raw file.
Number of bits for each symbol (The least occurring character would have highest number of bits in its binary code and vice-versa, WHY? That's the main idea behind this compression technique)
You can find the current composition of input file used to generate HuffmanSymbols.h in the images below. Once these encoded symbols are transferred over UART, there must be algorithm present over receiver end to decode this Huffman binary code. In order to have control of UART terminal from C code, it is important to use Linux. This is because Linux has a dedicated library for this purpose and offers decent support for the functionality of our project. The following steps take place on the Receiver end:
First, the UART terminal should be setup with appropriate parameters to read the incoming data. Without this you cannot figure out any error or mismatch in data due to ambiguous nature of data if wrong parameters are used.
Secondly, before running this program it is important to generate the same Huffman Table that was used to decode the data on the transmitter. Any minute difference in this table can lead to ambiguity in the debug statement data. To avoid this, it is important and must to use the same input raw file and header generating python script on the transmitter device. This would ensure an efficient decoding of the symbols being received.
You are almost there, after having the above two in action, now each time a byte of data is availabe to the receiver device, it will lookup for an equivalent symbol in the table generated in the HuffmanSymbols.h file and return the character for that code. Before starting the serial data receiver, it is important to test the encoding and decoding methods in this implementaion. For the same, the test function is embedded in the main program which has selected cases comprised of normal characters, numbers, special characters based on the raw input file. It is extremely important to note that the number of supported characters depends upon the file provided. And so does testing, it might look like many more cases can be added but it is directly dependent on the input file. Please refer the PC Side code in the repository to have deeper insights on how it has been implemented.