# Huffman code

An algorithm for the construction of optimal (i.e., minimum redundancy) variable-length $D$- ary codes ( $D \geq 2$) for finite memoryless information sources with known statistics was proposed by D. Huffman in 1952 (see [a1], [a2]). (The algorithm refines the Morse alphabet, as it associates the shortest codewords to the most probable source letters.)

Let $A _ {K}$ be such a source with $K$ letters, $K \geq 2$, and let $p _ {i}$ be the probability that the source emits letter $a _ {i}$( $1 \leq i \leq K$). Let ${\mathcal C} = \{ w _ {1} \dots w _ {K} \}$ be a uniquely decipherable $D$- ary code for $A _ {K}$( i.e., a set of $K$ sequences $w _ {i}$ from an alphabet of $D$ symbols) and let $l _ {i}$ be the length (or number of $D$- ary symbols) of codeword $w _ {i}$( $1 \leq i \leq K$) associated to the letter $a _ {i}$. The difference

$$R ( {\mathcal C} ) = l ^ {*} ( {\mathcal C} ) - H ( P )$$

between the average length $l ^ {*} ( {\mathcal C} ) = \sum _ {i = 1 } ^ {K} p _ {i} l _ {i}$ of ${\mathcal C}$ and the source entropy (cf. also Entropy) $H ( P ) = - \sum _ {i = 1 } ^ {K} p _ {i} { \mathop{\rm log} } _ {D} p _ {i}$ is called the redundancy of ${\mathcal C}$. A Huffman code ${\mathcal H}$ for $A _ {K}$ minimizes $R ( {\mathcal C} )$( hence $l ^ {*} ( {\mathcal C} )$) over the set of all uniquely decipherable codes for $A _ {K}$.

In the binary case ( $D = 2$), the algorithm consists in the construction of a sequence of "reduced" sources $\{ A _ {K - 1 } \dots A _ {2} \}$, where $A _ {j - 1 }$ is obtained from $A _ {j}$ by merging the two least probable letters of $A _ {j}$ into one letter of $A _ {j - 1 }$( $K \geq j \geq 3$) and leaving all other letters unchanged (ties are broken arbitrarily). The probability of the new letter is the sum of the probabilities of the merged letters. The $K - 2$ mergings can be represented by a binary code tree in which each letter of $A _ {K}$ corresponds to a leaf and each codeword is the sequence of binary labels of the path from root to leaf. (If the code alphabet has $D > 2$ letters, each reduced source is obtained by merging the $D$ least probable letters of the previous source, except possibly in the first step, where $2 + R _ {D - 1 } ( K - 2 )$ letters have to be grouped together, $R _ {a} ( b )$ being the remainder of the division of $b$ by $a$. The resulting $D$- ary code tree is labeled and again the codewords are the sequences of labels from root to leaves.)

Owing to the arbitrary resolution of possible ties in grouping the least probable sequences at each step, for a given source $A _ {K}$ there may be several distinct Huffman codes, but two such codes are considered (essentially) different only if the corresponding sets $\{ l _ {1} \dots l _ {K} \}$ of codeword lengths do not coincide. However, the average length of all Huffman codes for $A _ {K}$ is the same.

More formally, consider the set ${\mathcal P} _ {K}$ of all $K$- order probability distributions (cf. Probability distribution) $P = \{ p _ {1} \dots p _ {K} \}$, $\sum _ {i = 1 } ^ {K} p _ {i} = 1$, $p _ {i} > 0$( $1 \leq i \leq K$). The informational divergence (also called the Kullback–Leibler information) between two probability distributions $P$, $Q$ in ${\mathcal P} _ {K}$ is the non-negative quantity $D ( P \mid Q ) = \sum _ {i = 1 } ^ {K} p _ {i} { \mathop{\rm log} } ( { {p _ {i} } / {q _ {i} } } )$. $D ( P \mid Q )$ vanishes if and only if $P \equiv Q$, hence it can be considered as an oriented pseudo-distance (cf. also Pseudo-metric) in ${\mathcal P} _ {K}$. Now, if $\{ l _ {1} \dots l _ {K} \}$ is the set of codeword lengths of a Huffman code ${\mathcal H}$ for $A _ {K}$, the redundancy $R ( {\mathcal H} )$ is equal to the informational divergence $D ( P \mid H )$ between the source probability distribution $P$ and the "dyadic" probability distribution $H = \{ 2 ^ {- l _ {1} } \dots 2 ^ {- l _ {K} } \}$.

The optimality of Huffman codes can now be restated by saying that to any $P \in {\mathcal P} _ {K}$( i.e., to any source $A _ {K}$), the algorithm associates (one of) the dyadic probability distribution(s) "closest" to $P$ in the sense of the pseudo-metric $D ( \cdot \mid \cdot )$.

In several practical situations the statistics of the source is not perfectly known or varies over time, and the necessity arises of updating the Huffman code when the new(ly known) $P$ becomes "closer" to a dyadic probability distribution different from the previous one (see [a3], [a4]).