# Entropy

An information-theoretical measure of the degree of indeterminacy of a random variable. If $\xi$ is a discrete random variable defined on a probability space $( \Omega , \mathfrak A , {\mathsf P} )$ and assuming values $x _ {1} , x _ {2} \dots$ with probability distribution $\{ {p _ {k} } : {1 , 2 ,\dots } \}$, $p _ {k} = {\mathsf P} \{ \xi = x _ {k} \}$, then the entropy is defined by the formula

$$\tag{1 } H ( \xi ) = - \sum _ { k= } 1 ^ \infty p _ {k} \mathop{\rm log} p _ {k}$$

(here it is assumed that $0 \mathop{\rm log} 0 = 0$). The base of the logarithm can be any positive number, but as a rule one takes logarithms to the base 2 or $e$, which corresponds to the choice of a bit or a nat (natural unit) as the unit of measurement.

If $\xi$ and $\eta$ are two discrete random variables taking values $x _ {1} , x _ {2} \dots$ and $y _ {1} , y _ {2} \dots$ with probability distributions $\{ {p _ {k} } : {k = 1 , 2 ,\dots } \}$ and $\{ {q _ {j} } : {j = 1 , 2 ,\dots } \}$, and if $\{ {p _ {k\mid } j } : {k = 1 , 2 , . . . } \}$ is the conditional distribution of $\xi$ assuming that $\eta = y _ {j}$, $j = 1 , 2 \dots$ then the (mean) conditional entropy $H ( \xi \mid \eta )$ of $\xi$ given $\eta$ is defined as

$$\tag{2 } H ( \xi \mid \eta ) = - \sum _ { j= } 1 ^ \infty q _ {j} \sum _ { k= } 1 ^ \infty p _ {k\mid } j \mathop{\rm log} p _ {k\mid } j .$$

Let $\xi = \{ {\xi _ {k} } : {k = \dots, - 1 , 0 , 1 ,\dots } \}$ be a stationary process with discrete time and discrete space of values such that $H ( \xi _ {1} ) < \infty$. Then the entropy (more accurately, the mean entropy) $\overline{H}\; ( \xi )$ of this stationary process is defined as the limit

$$\tag{3 } \overline{H}\; ( \xi ) = \lim\limits _ {n \rightarrow \infty } \frac{1}{n} H ( \xi ^ {n} ) ,$$

where $H ( \xi ^ {n} )$ is the entropy of the random variable $\xi ^ {n} = ( \xi _ {1} \dots \xi _ {n} )$. It is known that the limit on the right-hand side of (3) always exists and that

$$\tag{4 } \overline{H}\; ( \xi ) = \lim\limits _ {n \rightarrow \infty } H ( \xi _ {n} \mid \xi _ {1} \dots \xi _ {n-} 1 ) ,$$

where $H ( \xi _ {n} \mid \xi _ {1} \dots \xi _ {n-} 1 )$ is the conditional entropy of $\xi _ {n}$ given $\xi ^ {n-} 1 = ( \xi _ {1} \dots \xi _ {n-} 1 )$. The entropy of stationary processes has important applications in the theory of dynamical systems.

If $\mu$ and $\nu$ are two measures on a measurable space $( \Omega , \mathfrak A )$ and if $\mu$ is absolutely continuous relative to $\nu$ and $d \mu / d \nu$ is the corresponding Radon–Nikodým derivative, then the entropy $H ( d \mu / d \nu )$ of $\mu$ relative to $\nu$ is defined as the integral

$$\tag{5 } H \left ( \frac{d \mu }{d \nu } \right ) = \int\limits _ \Omega \mathop{\rm log} \frac{d \mu }{d \nu } \nu ( d \omega ) .$$

A special case of the entropy of one measure with respect to another is the differential entropy.

Of the many possible generalizations of the concept of entropy in information theory one of the most important is the following. Let $\xi$ and $\widetilde \xi$ be two random variables taking values in certain measurable spaces $( \mathfrak X , S _ {\mathfrak X} )$ and $( \widetilde{\mathfrak X} , S _ {\widetilde{\mathfrak X} } )$. Suppose that the distribution $p ( \cdot )$ of $\xi$ is given and let $W$ be a class of admissible joint distributions of the pair $( \xi , \widetilde \xi )$ in the set of all probability measures in the product $( \mathfrak X \times \widetilde{\mathfrak X} , S _ {\mathfrak X} \times S _ {\widetilde{\mathfrak X} } )$. Then the $W$- entropy (or the entropy for a given condition $W$ of exactness of reproduction of information (cf. Information, exactness of reproducibility of)) is defined as the quantity

$$\tag{6 } H _ {W} ( \xi ) = \inf I ( \xi , \widetilde \xi ) ,$$

where $I ( \xi , \widetilde \xi )$ is the amount of information (cf. Information, amount of) in $\widetilde \xi$ given $\xi$ and the infimum is taken over all pairs of random variables $( \xi , \widetilde \xi )$ such that the joint distribution $p ( \cdot , \cdot )$ of the pair $( \xi , \widetilde \xi )$ belongs to $W$ and $\xi$ has the distribution $p ( \cdot )$. The class $W$ of joint distributions $p ( \cdot , \cdot )$ is often given by means of a certain non-negative measurable real-valued function $\rho ( x , \widetilde{x} )$, $x \in X$, $\widetilde{x} \in \widetilde{X}$, a measure of distortion, in the following manner:

$$\tag{7 } W = \{ {p ( \cdot , \cdot ) } : { {\mathsf E} \rho ( \xi , \widetilde \xi ) \leq \epsilon } \} ,$$

where $\epsilon > 0$ is fixed. In this case the quantity defined by (6), where $W$ is given by (7), is called the $\epsilon$- entropy (or the rate as a function of the distortion) and is denoted by $H _ \epsilon ( \xi )$. For example, if $\xi = ( \xi _ {1} \dots \xi _ {n} )$ is a Gaussian random vector with independent components, if ${\mathsf E} \xi _ {k} = 0$, $k = 1 \dots n$, and if the function $\rho ( x , \widetilde{x} )$, $x = ( x _ {1} \dots x _ {n} )$, $\widetilde{x} = ( \widetilde{x} _ {1} \dots \widetilde{x} _ {n} )$, has the form

$$\rho ( x , \widetilde{x} ) = \sum _ { k= } 1 ^ { n } ( x _ {k} - \widetilde{x} _ {k} ) ^ {2} ,$$

then $H _ \epsilon ( \xi )$ can be found by the formula

$$H _ \epsilon ( \xi ) = \frac{1}{2} \sum _ { k= } 1 ^ { n } { \mathop{\rm log} \max } \left ( \frac{ {\mathsf E} \xi _ {k} ^ {2} } \lambda , 1 \right ) ,$$

where $\lambda$ is defined by

$$\sum _ { k= } 1 ^ { n } \min ( \lambda , {\mathsf E} \xi _ {k} ^ {2} ) = \epsilon .$$

If $\xi$ is a discrete random variable, if $( \mathfrak X , S _ {\mathfrak X} )$ and $( \widetilde{\mathfrak X} , S _ \widetilde{ {\mathfrak X}} )$ are the same, and if $\rho ( x , \widetilde{x} )$ has the form

$$\rho ( x , \widetilde{x} ) = \left \{ \begin{array}{l} 0 \ \textrm{ if } x = \widetilde{x} , \\ 1 \ \textrm{ if } x \neq \widetilde{x} , \end{array} \right .$$

then the $\epsilon$- entropy for $\epsilon = 0$ is equal to the ordinary entropy defined in (1), that is, $H _ {0} ( \xi ) = H ( \xi )$.

#### References

 [1] C. Shannon, "A mathematical theory of communication" Bell System Techn. J. , 27 (1948) pp. 379–423; 623–656 [2] R.G. Gallager, "Information theory and reliable communication" , Wiley (1968) [3] T. Berger, "Rate distortion theory" , Prentice-Hall (1971) [4] P. Billingsley, "Ergodic theory and information" , Wiley (1956)