In its most general sense, computer data consists of a sequence of bytes. Huffman encoding is a lossless compression technique which exploits the fact that, as a general observation, not all byte values occur with equal frequency in the data. Given that this is the case, considerable savings can be realized by representing byte values using a variable number of bits – short bit runs for the more frequently occurring values and longer ones for the less frequent ones. We refer to these variable-length bit runs as Huffman codes. The technique for generating these codes is as follows:
- Establish the frequency of occurrence of each possible byte value – i.e. values in the range 0 – 255 for an 8-bit byte.
- Use this information to construct a binary tree – i.e. a tree data structure in which each node has no more than two children. The resulting tree has two kinds of nodes:
  - Leaf nodes – i.e. nodes with no children, which represent actual byte values in the data. We assign a weight to each such node – its frequency of occurrence in the raw data.
  - Non-leaf nodes – i.e. nodes which have children. The weight of such a node is the sum of the weights of its child nodes.
- A simple technique for building the Huffman tree goes like this:
- Begin by creating a list of leaf nodes corresponding to each byte value in your data. Associate the frequency of occurrence of each value with its leaf.
- Now pick the two nodes with the lowest frequencies, remove them from the list, make them the children of a new non-leaf node and add that node to the list.
- Continue the process until the list has only one node left – the root node.
- Having done this we can build the Huffman code for each leaf node in the tree. The procedure is as follows:
- Start from the root node and an empty Huffman code.
- Move down the tree until you reach a leaf node.
- For each leftward move add a 0 to the Huffman code. For each rightward move add a 1 to the Huffman code.
- Continue the process until there are no unencoded leaves left.
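The whole procedure – counting frequencies, repeatedly merging the two lowest-weight nodes, and then walking the finished tree to assign bits – can be sketched in Python (the article's own implementation is in Delphi; `build_codes` and the node representation here are illustrative choices, not part of the original):

```python
import heapq
from collections import Counter
from itertools import count

def build_codes(data: bytes) -> dict[int, str]:
    """Build a Huffman code for each byte value occurring in `data`."""
    freq = Counter(data)  # step 1: frequency of each byte value
    # Heap entries are (weight, tiebreaker, node). A node is either a
    # leaf (the byte value itself) or a (left, right) pair for a
    # non-leaf node whose weight is the sum of its children's weights.
    tie = count()
    heap = [(w, next(tie), b) for b, w in freq.items()]
    heapq.heapify(heap)
    # Repeatedly take the two lowest-weight nodes, make them children
    # of a new node, and put that node back, until one node remains.
    while len(heap) > 1:
        w1, _, n1 = heapq.heappop(heap)
        w2, _, n2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tie), (n1, n2)))
    _, _, root = heap[0]
    # Walk down from the root: add 0 for each leftward move and 1 for
    # each rightward move, recording the code when a leaf is reached.
    codes: dict[int, str] = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"  # guard: single distinct byte
    walk(root, "")
    return codes
```

Note that ties between equal-weight nodes may be broken differently from one implementation to another, so the individual bit patterns can differ from those in the example below; the code lengths, and hence the total compressed size, are the same for any optimal tree.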
This is best understood by looking at an example. Consider the string inflate deflate. The Huffman tree for this string is shown below.
The frequencies and the Huffman codes for each leaf node in this tree are tabulated below:
Character | Frequency | Huffman Code |
---|---|---|
e | 3 | 00 |
a | 2 | 010 |
f | 2 | 011 |
l | 2 | 100 |
t | 2 | 101 |
(space) | 1 | 1100 |
d | 1 | 1101 |
i | 1 | 1110 |
n | 1 | 1111 |
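With the table in hand, encoding is a direct lookup-and-concatenate over the characters of the string. A minimal Python sketch, with the codes hard-coded from the table above (`encode` is a hypothetical helper name):

```python
# Huffman codes taken from the table above.
CODES = {
    "e": "00", "a": "010", "f": "011", "l": "100", "t": "101",
    " ": "1100", "d": "1101", "i": "1110", "n": "1111",
}

def encode(text: str) -> str:
    """Concatenate the Huffman code of each character in `text`."""
    return "".join(CODES[ch] for ch in text)

# encode("inflate deflate") produces the 46-bit pattern shown below.
```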
Using these codes the original string can be represented by the binary pattern:
1110111101110001010100110011010001110001010100
The 46 bits required to store this pattern represent a saving of just over 60% on the 120 bits of the original 15-byte string. While this may sound very good, it should be noted that the technique does not always deliver a saving. Without the Huffman tree all we have in the compressed data is an amorphous mass of binary digits. In order to retrieve the original data we need to store some information regarding the tree from which the compressed data were constructed. This imposes an overhead that makes the method unsuitable for very short sequences of data. Click on the link below for a Delphi implementation of the algorithm.
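To illustrate why the tree (or an equivalent code table) must accompany the compressed bits, here is a sketch of decoding in Python. Because Huffman codes are prefix-free, the decoder can read bits one at a time and emit a character as soon as the accumulated bits match a code (`decode` and the table representation are illustrative assumptions; the linked implementation is in Delphi):

```python
# Character-to-code table from the example above.
CODES = {
    "e": "00", "a": "010", "f": "011", "l": "100", "t": "101",
    " ": "1100", "d": "1101", "i": "1110", "n": "1111",
}

def decode(bits: str, codes: dict[str, str]) -> str:
    """Decode a Huffman bit string using a character->code table."""
    reverse = {code: ch for ch, code in codes.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in reverse:  # prefix-free: first match is a whole symbol
            out.append(reverse[buf])
            buf = ""
    return "".join(out)
```

Applied to the 46-bit pattern above, this recovers the string inflate deflate – but only because the decoder was given the same table, which is exactly the overhead discussed in the paragraph above.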