Difference between revisions of "Information"

Latest revision as of 10:32, 16 July 2021

A basic concept in cybernetics. In cybernetics one studies machines and living organisms only from the point of view of their ability to absorb information given to them, to store information in a "memory", to transmit it over a communication channel, and to transform it into "signals". Cybernetics develops the intuitive picture of the information that certain data contain about certain quantities or phenomena.

In certain situations it is just as natural to compare various groups of data by the information contained in them as it is to compare plane figures by their "areas": independently of the manner of measuring areas, one can prove that a figure $ A $ does not have a larger area than $ B $ if $ A $ can be completely included in $ B $ (cf. Examples 1–3 below). The deeper fact that it is possible to express area by a number, and thereby to compare figures of arbitrary shape, is a result of an extensive mathematical theory. The analogue of this fundamental result in information theory is the statement that under definite, very wide, assumptions one may disregard the qualitative peculiarities of information and express its amount by a number. This number only describes the possibility of transmitting information over a communication channel and of storing it in machines with a memory.

Example 1. Specifying the position and velocity of a particle moving in a force field provides information on its position at any future moment of time; this information is, moreover, complete: its position can be exactly predicted. Specifying the energy of a particle also provides information, but this information is obviously incomplete.

Example 2. The equality

$$ \tag{1 } a = b $$

provides information about the relation between the variables $ a $ and $ b $. The equality

$$ \tag{2 } a ^ {2} = b ^ {2} $$

provides less information (since (1) implies (2), but they are not equivalent). Finally, the equality (for real numbers)

$$ \tag{3 } a ^ {3} = b ^ {3} , $$

is equivalent to (1) and provides the same information, i.e. (1) and (3) are different forms of specifying the same information.

Example 3. Results of measurements of some physical quantity, performed within certain errors, provide information on its exact value. By increasing the number of observations one changes this information.

Example 3a. The arithmetical average of results of observations also contains certain information about the quantity being measured. As is shown in mathematical statistics, if the errors have a normal probability distribution with known variance, then the arithmetical average contains all information.

Example 4. Suppose that the result of a measurement is a random variable $ \xi $. When $ \xi $ is transmitted over a communication channel it is distorted, so that at the receiving end of the channel one obtains the variable

$$ \eta = \xi + \theta $$

where $ \theta $ is independent of $ \xi $ (in the sense of probability theory). The "output" $ \eta $ provides information on the "input" $ \xi $, and it is natural to assume that this information is the smaller, the more "scattered" the values of $ \theta $ are.

In each of the examples given, data are compared according to whether the information they provide is more or less complete. In Examples 1–3 the meaning of this comparison is clear and leads to the analysis of the equivalence or non-equivalence of certain relations. In Examples 3a and 4 this meaning needs to be made more precise. This is provided in mathematical statistics and information theory (for which these examples are typical).

At the basis of information theory is a definition, suggested in 1948 by C.E. Shannon, of a measure of the amount of information contained in one random object (event, variable, function, etc.) with respect to another. It consists in expressing the amount of information by a number. It is most easily explained in the simplest case, when the random objects considered are random variables taking only a finite number of values. Let $ \xi $ be a random variable taking values $ x _ {1} , \dots, x _ {n} $ with probabilities $ p _ {1} , \dots, p _ {n} $ and let $ \eta $ be a random variable taking values $ y _ {1} , \dots, y _ {m} $ with probabilities $ q _ {1} , \dots, q _ {m} $. Then the information $ I ( \xi , \eta ) $ contained in $ \xi $ with respect to $ \eta $ is defined by the formula

$$ \tag{4 } I ( \xi , \eta ) = \ \sum _ {i , j } p _ {ij} \mathop{\rm log} _ {2} \ \frac{p _ {ij} }{p _ {i} q _ {j} } , $$

where $ p _ {ij} $ is the probability of joint occurrence of $ \xi = x _ {i} $ and $ \eta = y _ {j} $, and the logarithm is to base 2. The information $ I ( \xi , \eta ) $ has a number of properties that are naturally required for a measure of quantity of information. Thus, always $ I ( \xi , \eta ) \geq 0 $, and equality holds if and only if $ p _ {ij} = p _ {i} q _ {j} $ for all $ i $ and $ j $, i.e. if and only if $ \xi $ and $ \eta $ are independent random variables. Further, $ I ( \xi , \eta ) \leq I ( \eta , \eta ) $ and equality holds only if $ \eta $ is a function of $ \xi $ (e.g. $ \eta = \xi ^ {2} $, etc.). More surprising is the fact that $ I ( \xi , \eta ) = I ( \eta , \xi ) $.
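The definition can be tried out numerically. The following is a minimal sketch, not part of the original article: the function name `mutual_information` and the two example tables are assumptions made for illustration. It evaluates the double sum directly from a table of joint probabilities $ p _ {ij} $, skipping terms with $ p _ {ij} = 0 $ (which contribute nothing to the sum).

```python
import math

def mutual_information(joint):
    """Return I(xi, eta) in bits for a joint table joint[i][j] = p_ij."""
    p = [sum(row) for row in joint]         # marginal probabilities p_i of xi
    q = [sum(col) for col in zip(*joint)]   # marginal probabilities q_j of eta
    total = 0.0
    for i, row in enumerate(joint):
        for j, p_ij in enumerate(row):
            if p_ij > 0:                    # a term with p_ij = 0 contributes 0
                total += p_ij * math.log2(p_ij / (p[i] * q[j]))
    return total

# Hypothetical example tables (illustration only).
independent = [[0.25, 0.25], [0.25, 0.25]]  # p_ij = p_i * q_j everywhere
identical = [[0.5, 0.0], [0.0, 0.5]]        # eta always equals xi

print(mutual_information(independent))      # independent variables
print(mutual_information(identical))        # eta is a function of xi
```

For the independent table the result is zero, and for the table with $ \eta = \xi $ it equals one bit, in line with the properties listed above. Transposing a table (swapping the roles of $ \xi $ and $ \eta $) leaves the value unchanged.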

The quantity $ H ( \xi ) = I ( \xi , \xi ) = \sum _ {i} p _ {i} \mathop{\rm log} _ {2} ( 1 / p _ {i} ) $ is called the entropy of $ \xi $. The concept of entropy is basic in information theory. The amount of information and the entropy are related by

$$ \tag{5 } I ( \xi , \eta ) = \ H ( \xi ) + H ( \eta ) - H ( \xi , \eta ) , $$

where $ H ( \xi , \eta ) $ is the entropy of the pair $ ( \xi , \eta ) $, i.e.

$$ H ( \xi , \eta ) = \ \sum _ {i , j } p _ {ij} \mathop{\rm log} _ {2} \ \frac{1}{p _ {ij} } . $$

The entropy turns out to be the average number of binary symbols necessary for differentiation (or description) of the possible values of a random variable. This makes it possible to understand the role of the amount of information in "storing" information in machines with a memory. If $ \xi $ and $ \eta $ are independent random variables, then one needs on the average $ H ( \xi ) $ binary symbols to write down the values of $ \xi $, $ H ( \eta ) $ binary symbols for those of $ \eta $, and $ H ( \xi , \eta ) = H ( \xi ) + H ( \eta ) $ binary symbols for those of the pair $ ( \xi , \eta ) $. If $ \xi $ and $ \eta $ are dependent, then the average number of binary symbols necessary for writing down the pair $ ( \xi , \eta ) $ is less than $ H ( \xi ) + H ( \eta ) $, since $ H ( \xi , \eta ) = H ( \xi ) + H ( \eta ) - I ( \xi , \eta ) $ and $ I ( \xi , \eta ) > 0 $ in this case.
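As a small worked illustration (the particular distribution is chosen here and does not come from the original article), let $ \xi $ take the values 0 and 1 with probability $ 1 / 2 $ each and let $ \eta = \xi $. The only pairs of values with non-zero probability are $ ( 0 , 0 ) $ and $ ( 1 , 1 ) $, each with probability $ 1 / 2 $, so that

$$ H ( \xi ) = H ( \eta ) = 1 , \ \ H ( \xi , \eta ) = 2 \cdot \frac{1}{2} \mathop{\rm log} _ {2} 2 = 1 , \ \ I ( \xi , \eta ) = 1 + 1 - 1 = 1 . $$

Thus one binary symbol suffices on the average to write down the pair $ ( \xi , \eta ) $, instead of the $ H ( \xi ) + H ( \eta ) = 2 $ symbols needed when the two variables are written down separately; the saving is exactly $ I ( \xi , \eta ) $.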

Using deeper theorems, the role of the amount of information in problems of information transmission over communication channels can be explained. The basic information-theoretic characteristic of channels, their so-called capacity (cf. Transmission rate of a channel), is defined in terms of the concept of "information".

If $ \xi $ and $ \eta $ may take an infinite set of values, then by passing to the limit in the formula defining $ I ( \xi , \eta ) $ for variables with finitely many values one obtains:

$$ \tag{6 } I ( \xi , \eta ) = \int\limits \int\limits p ( x , y ) \mathop{\rm log} _ {2} \ \frac{p ( x , y ) }{p ( x) q ( y) } \ d x d y , $$

where $ p $ and $ q $ denote the corresponding probability densities. The entropies $ H ( \xi ) $ and $ H ( \eta ) $ do not exist in this case, but there is the formula, analogous to (5),

$$ \tag{7 } I ( \xi , \eta ) = h ( \xi ) + h ( \eta ) - h ( \xi , \eta ) , $$

where

$$ h ( \xi ) = \int\limits p ( x) \mathop{\rm log} _ {2} \frac{1}{p ( x) } d x $$

is the differential entropy of $ \xi $ ($ h ( \eta ) $ and $ h ( \xi , \eta ) $ are defined likewise).

Example 5. Suppose that under the conditions of Example 4 the random variables $ \xi $ and $ \theta $ have normal probability distributions with mean zero and with variances equal to, respectively, $ \sigma _ \xi ^ {2} $ and $ \sigma _ \theta ^ {2} $. Then, as may be inferred from (6) or (7): $ I ( \eta , \xi ) = I ( \xi , \eta ) = ( 1 / 2 ) \mathop{\rm log} _ {2} ( 1 + \sigma _ \xi ^ {2} / \sigma _ \theta ^ {2} ) $. Thus, the amount of information in the "received signal" $ \eta $ with respect to the "transmitted signal" $ \xi $ tends to zero as the level of "noise" $ \theta $ grows (i.e. as $ \sigma _ \theta ^ {2} \rightarrow \infty $), and grows without bound when the "noise" vanishes (i.e. as $ \sigma _ \theta ^ {2} \rightarrow 0 $).
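The value in Example 5 can be checked from (7). The sketch below relies on two standard facts that are not stated in this article and are taken here as assumptions: a normal distribution with variance $ \sigma ^ {2} $ has differential entropy $ ( 1 / 2 ) \mathop{\rm log} _ {2} ( 2 \pi e \sigma ^ {2} ) $, and a two-dimensional normal distribution with covariance matrix $ \Sigma $ has differential entropy $ ( 1 / 2 ) \mathop{\rm log} _ {2} ( ( 2 \pi e ) ^ {2} \mathop{\rm det} \Sigma ) $. Since $ \xi $ and $ \theta $ are independent, $ \eta = \xi + \theta $ has variance $ \sigma _ \xi ^ {2} + \sigma _ \theta ^ {2} $, the covariance of $ \xi $ and $ \eta $ is $ \sigma _ \xi ^ {2} $, and hence $ \mathop{\rm det} \Sigma = \sigma _ \xi ^ {2} ( \sigma _ \xi ^ {2} + \sigma _ \theta ^ {2} ) - \sigma _ \xi ^ {4} = \sigma _ \xi ^ {2} \sigma _ \theta ^ {2} $. Therefore

$$ h ( \xi ) = \frac{1}{2} \mathop{\rm log} _ {2} ( 2 \pi e \sigma _ \xi ^ {2} ) , \ \ h ( \eta ) = \frac{1}{2} \mathop{\rm log} _ {2} ( 2 \pi e ( \sigma _ \xi ^ {2} + \sigma _ \theta ^ {2} ) ) , \ \ h ( \xi , \eta ) = \frac{1}{2} \mathop{\rm log} _ {2} ( ( 2 \pi e ) ^ {2} \sigma _ \xi ^ {2} \sigma _ \theta ^ {2} ) , $$

and substituting into (7) gives

$$ I ( \xi , \eta ) = \frac{1}{2} \mathop{\rm log} _ {2} \frac{\sigma _ \xi ^ {2} ( \sigma _ \xi ^ {2} + \sigma _ \theta ^ {2} ) }{\sigma _ \xi ^ {2} \sigma _ \theta ^ {2} } = \frac{1}{2} \mathop{\rm log} _ {2} \left ( 1 + \frac{\sigma _ \xi ^ {2} }{\sigma _ \theta ^ {2} } \right ) . $$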

The case when the random variables $ \xi $ and $ \eta $ in Example 4 or 5 are stochastic functions (or, as one says, stochastic processes) $ \xi ( t) $ and $ \eta ( t) $, describing the variation of a quantity at the input, respectively output, of the channel, is of special interest. The amount of information in $ \eta ( t) $ with respect to $ \xi ( t) $ for a given level of noise (in acoustic terminology) may serve as a criterion of the quality of the channel itself.

In problems in mathematical statistics one also uses the concept of information (cf. Examples 3 and 3a). However, both in its formal definition and in the name it has been given, it differs from the concept defined above (in information theory). Statistics deals with a large number of results of observations and usually replaces the complete listing of them by certain combined characteristics. In this replacement information is sometimes lost, but under certain conditions the combined characteristics contain all the information contained in the complete data (this statement is explained at the end of Example 6 below). The concept of information was introduced into statistics by R.A. Fisher in 1921.

Example 6. Let $ \xi _ {1} , \dots, \xi _ {n} $ be the results of $ n $ independent observations of some quantity, normally distributed with probability density

$$ p ( x ; a ; \sigma ^ {2} ) = \ \frac{1}{\sigma \sqrt {2 \pi } } \mathop{\rm exp} \ \left \{ - \frac{( x - a ) ^ {2} }{2 \sigma ^ {2} } \right \} , $$

where the parameters $ a $ and $ \sigma ^ {2} $ (the mean and variance) are unknown and must be estimated using the results of observations. Sufficient statistics (i.e. functions of the results of observations containing complete information on the unknown parameters) for this case are provided by the arithmetical average

$$ \overline \xi \; = \frac{1}{n} \sum _ {i=1} ^ {n} \xi _ {i} , $$

and the so-called empirical variance

$$ s ^ {2} = \frac{1}{n} \sum _ {i=1} ^ {n} ( \xi _ {i} - \overline \xi \; ) ^ {2} . $$

If $ \sigma ^ {2} $ is known, then by itself $ \overline \xi \; $ is a sufficient statistic (cf. Example 3a).
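The two statistics are easy to compute. The following is a minimal sketch, not part of the original article; the function name `sufficient_statistics` and the sample values are hypothetical. It uses the divisor $ n $, as in the formula for $ s ^ {2} $ above, rather than the divisor $ n - 1 $ used for the unbiased variance estimate.

```python
def sufficient_statistics(observations):
    """Return the arithmetical average and the empirical variance (divisor n)."""
    n = len(observations)
    mean = sum(observations) / n
    s2 = sum((x - mean) ** 2 for x in observations) / n
    return mean, s2

# Hypothetical sample (illustration only).
sample = [1.0, 2.0, 3.0, 4.0]
mean, s2 = sufficient_statistics(sample)
print(mean, s2)  # 2.5 1.25
```

Any rearrangement of the sample gives the same pair of values: the statistics discard the order of the observations and the individual values while, for the normal family, keeping everything relevant to estimating $ a $ and $ \sigma ^ {2} $.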

The meaning of the term "complete information" can be clarified in the following way. Suppose one has a function $ \phi = \phi ( a , \sigma ^ {2} ) $ of the unknown parameters, and let $ \phi ^ {*} = \phi ^ {*} ( \xi _ {1} , \dots, \xi _ {n} ) $ be an estimator for it that is free of systematic errors (i.e. unbiased). Suppose that the quality of the estimator (its exactness) is measured (as is usual in problems in mathematical statistics) by the variance of the difference $ \phi ^ {*} - \phi $. Then there exists another estimator $ \phi ^ {**} $, depending not on the individual $ \xi _ {i} $ but only on $ \overline \xi \; $ and $ s ^ {2} $, that is not worse (in the sense of the criterion mentioned above) than $ \phi ^ {*} $. Fisher also proposed a measure of the (average) amount of information with respect to an unknown parameter contained in one observation. This concept is developed in the theory of statistical estimation.

For references, see Information, transmission of.

Comments

References

[a1]	C.E. Shannon, "A mathematical theory of communication", Bell System Techn. J., 27 (1948) pp. 379–423; 623–656
[a2]	C.E. Shannon, W. Weaver, "The mathematical theory of communication", Univ. Illinois Press (1949)
[a3]	T. Berger, "Rate distortion theory", Prentice-Hall (1970)

How to Cite This Entry:
Information. Encyclopedia of Mathematics. URL: http://encyclopediaofmath.org/index.php?title=Information&oldid=14072

This article was adapted from an original article by Yu.V. Prokhorov (originator), which appeared in Encyclopedia of Mathematics - ISBN 1402006098.

+$$
+h ( \xi )  =  \int\limits p ( x)   \mathop{\rm log} _ {2}
+\frac{1}{p ( x) }
+  d x
+$$
-is the [[Differential entropy|differential entropy]] of <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i05104073.png" /> (<img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i05104074.png" /> and <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i05104075.png" /> are defined likewise).
+is the [[Differential entropy|differential entropy]] of  $  \xi $ (
+$  h ( \eta ) $
+and  $  h ( \xi , \eta ) $
+are defined likewise).
-Example 5. Suppose that under the conditions of Example 4 the random variables <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i05104076.png" /> and <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i05104077.png" /> have normal probability distributions with mean zero and with variances equal to, respectively, <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i05104078.png" /> and <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i05104079.png" />. Then, as may be inferred from (6) or (7): <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i05104080.png" />. Thus, the amount of information in the  "received signal"  <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i05104081.png" /> with respect to the  "transmitted signal"  <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i05104082.png" /> tends to zero as the level of  "noise"  <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i05104083.png" /> grows (i.e. as <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i05104084.png" />), and grows without bound when the  "noise"  vanishes (i.e. as <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i05104085.png" />).
+Example 5. Suppose that under the conditions of Example 4 the random variables  $  \xi $
+and  $  \theta $
+have normal probability distributions with mean zero and with variances equal to, respectively,  $  \sigma _  \xi   ^ {2} $
+and  $  \sigma _  \theta   ^ {2} $.
+Then, as may be inferred from (6) or (7):  $  I ( \eta , \xi ) = I ( \xi , \eta ) = ( 1 / 2 )   \mathop{\rm log} _ {2} ( 1 + \sigma _  \xi   ^ {2} / \sigma _  \theta   ^ {2} ) $.
+Thus, the amount of information in the  "received signal"   $  \eta $
+with respect to the  "transmitted signal"   $  \xi $
+tends to zero as the level of  "noise"   $  \theta $
+grows (i.e. as  $  \sigma _  \theta   ^ {2} \rightarrow \infty $),
+and grows without bound when the  "noise"  vanishes (i.e. as  $  \sigma _  \theta   ^ {2} \rightarrow 0 $).
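The limiting behaviour in Example 5 can be illustrated by evaluating the stated formula directly. This is a minimal sketch: the function name `gaussian_information` and the chosen variances are assumptions made for illustration only.

```python
import math

def gaussian_information(var_signal, var_noise):
    """Evaluate (1/2) * log2(1 + var_signal / var_noise), in bits.

    var_signal plays the role of sigma_xi^2 (transmitted signal),
    var_noise the role of sigma_theta^2 (noise); both must be positive.
    """
    if var_signal <= 0 or var_noise <= 0:
        raise ValueError("variances must be positive")
    return 0.5 * math.log2(1.0 + var_signal / var_noise)

# Fixed signal variance, noise variance ranging from very small to very large.
noise_levels = [1e-6, 1e-3, 1.0, 1e3, 1e6]
values = [gaussian_information(1.0, v) for v in noise_levels]
```

The entries of `values` decrease as the noise variance increases: they become arbitrarily large as the noise variance tends to zero and approach zero as it tends to infinity.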
-The case when the random variables <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i05104086.png" /> and <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i05104087.png" /> in Example 4 or 5 are stochastic functions (or, as one says, stochastic processes) <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i05104088.png" /> and <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i05104089.png" />, describing the variation of a quantity at the input, respectively output, of the channel, is of special interest. The amount of information in <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i05104090.png" /> with respect to <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i05104091.png" /> for a given level of noise (in acoustic terminology) may serve as a criterion of the quality of the channel itself.
+The case when the random variables  $  \xi $
+and  $  \eta $
+in Example 4 or 5 are stochastic functions (or, as one says, stochastic processes)  $  \xi ( t) $
+and  $  \eta ( t) $,
+describing the variation of a quantity at the input, respectively output, of the channel, is of special interest. The amount of information in  $  \eta ( t) $
+with respect to  $  \xi ( t) $
+for a given level of noise (in acoustic terminology) may serve as a criterion of the quality of the channel itself.
 In problems in mathematical statistics one also uses the concept of information (cf. Examples 3 and 3a). However, both in its formal definition and in the name it has been given, it differs from the concept defined above (in information theory). Statistics deals with a large number of results of observations and usually replaces the complete listing of them by certain combined characteristics. In this replacement information is sometimes lost, but under certain conditions the combined characteristics contain all the information contained in the complete data (this statement is explained at the end of Example 6 below). The concept of information was introduced into statistics by R.A. Fisher in 1921.
-Example 6. Let <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i05104092.png" /> be the results of <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i05104093.png" /> independent observations of some quantity, normally distributed with probability density
+Example 6. Let  $  \xi _ {1} , \dots , \xi _ {n} $
+be the results of  $  n $
+independent observations of some quantity, normally distributed with probability density
+$$
+p ( x ;  a ;  \sigma  ^ {2} )  = \
+\frac{1}{\sigma \sqrt {2 \pi } }
+   \mathop{\rm exp} \
+\left \{ -
-<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i05104094.png" /></td> </tr></table>
+\frac{( x - a )  ^ {2} }{2 \sigma  ^ {2} }
-where the parameters <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i05104095.png" /> and <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i05104096.png" /> (the mean and variance) are unknown and must be estimated using the results of observations. Sufficient statistics (i.e. functions in the results of observations containing complete information on the unknown parameters) for this case are provided by the arithmetical average
+\right \} ,
+$$
-<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i05104097.png" /></td> </tr></table>
+where the parameters  $  a $
+and  $  \sigma  ^ {2} $ (
+the mean and variance) are unknown and must be estimated using the results of observations. [[Sufficient statistic]]s (i.e. functions of the results of observations containing complete information on the unknown parameters) for this case are provided by the arithmetic mean
+$$
+\overline \xi \;  =
+\frac{1}{n}
+ \sum _ {i=1}  ^ {n}  \xi _ {i} ,
+$$
 and the so-called empirical variance
-<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i05104098.png" /></td> </tr></table>
+$$
+s  ^ {2}  =
+\frac{1}{n}
+\sum _ {i=1}  ^ {n}  ( \xi _ {i} - \overline \xi \; )  ^ {2} .
+$$
-If <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i05104099.png" /> is known, then by itself <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i051040100.png" /> is a sufficient statistic (cf. Example 3a).
+If  $  \sigma  ^ {2} $
+is known, then by itself  $  \overline \xi \; $
+is a sufficient statistic (cf. Example 3a).
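The two statistics of Example 6 can be computed as follows. This is a minimal sketch using the definitions above (note the divisor $ n $, not $ n - 1 $, in the empirical variance); the function name `sufficient_statistics` and the sample values are hypothetical.

```python
def sufficient_statistics(observations):
    """Return the arithmetic mean and the empirical variance (divisor n)."""
    n = len(observations)
    if n == 0:
        raise ValueError("at least one observation is required")
    mean = sum(observations) / n
    s2 = sum((x - mean) ** 2 for x in observations) / n
    return mean, s2

# Hypothetical sample of n = 4 observations.
sample = [1.0, 2.0, 3.0, 4.0]
mean, s2 = sufficient_statistics(sample)

# Reordering the observations leaves both statistics unchanged.
mean_reordered, s2_reordered = sufficient_statistics([4.0, 1.0, 3.0, 2.0])
```

For this sample the mean is 2.5 and the empirical variance is 1.25; any estimator built only from these two numbers ignores the order and individual values of the observations.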
-The meaning of the term  "complete information"  can be clarified in the following way. Suppose one has a function of the unknown parameter <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i051040101.png" />, let <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i051040102.png" /> be an estimator for it that is free of systematic errors. Suppose that the quality of the estimator (its exactness) is measured (as is usual in problems in mathematical statistics) by the variance of the difference <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i051040103.png" />. Then there exists another estimator <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i051040104.png" />, not depending on the remaining <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i051040105.png" /> but only on <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i051040106.png" /> and <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i051040107.png" />, that is not worse (in the sense of the criterion mentioned above) than <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/i/i051/i051040/i051040108.png" />. Fisher has also proposed a measure of the (average) amount of information with respect to an unknown parameter contained in one observation. This concept is revealed in the theory of statistical estimation.
+The meaning of the term  "complete information"  can be clarified in the following way. Suppose one has a function of the unknown parameter  $  \phi = \phi ( a , \sigma  ^ {2} ) $,
+let  $  \phi  ^ {*} = \phi  ^ {*} ( \xi _ {1} , \dots , \xi _ {n} ) $
+be an estimator for it that is free of systematic errors. Suppose that the quality of the estimator (its exactness) is measured (as is usual in problems in mathematical statistics) by the variance of the difference  $  \phi  ^ {*} - \phi $.
+Then there exists another estimator  $  \phi  ^ {**} $,
+not depending on the individual  $  \xi _ {i} $
+but only on  $  \overline \xi \; $
+and  $  s  ^ {2} $,
+that is not worse (in the sense of the criterion mentioned above) than  $  \phi  ^ {*} $.
+Fisher also proposed a measure of the (average) amount of information with respect to an unknown parameter contained in one observation. This concept is developed in the theory of statistical estimation.
 For references, see [[Information, transmission of|Information, transmission of]].
 ====Comments====
 ====References====
 <table><TR><TD valign="top">[a1]</TD> <TD valign="top">  C.E. Shannon,   "A mathematical theory of communication"  ''Bell. System Techn. J.'' , '''27'''  (1948)  pp. 379–423; 623–656</TD></TR><TR><TD valign="top">[a2]</TD> <TD valign="top">  C.E. Shannon,   W. Weaver,   "The mathematical theory of communication" , Univ. Illinois Press  (1949)</TD></TR><TR><TD valign="top">[a3]</TD> <TD valign="top">  T. Berger,   "Rate distortion theory" , Prentice-Hall  (1970)</TD></TR></table>