# Information matrix

Fisher information

The covariance matrix of the informant (the gradient of the log-likelihood with respect to the parameter). For a dominated family of probability distributions $P^{t}(d\omega)$ (cf. Density of a probability distribution) with densities $p(\omega; t)$ with respect to a dominating measure $\mu$ that depend sufficiently smoothly on a vector (in particular, scalar) parameter $t = (t_{1}, \dots, t_{m}) \in \Theta$, the elements of the information matrix are defined, for $t = \theta$, as

$$\tag{1 } I _ {jk} ( \theta ) = \ \int\limits _ \Omega \left . \frac{\partial \mathop{\rm ln} p ( \omega ; t ) }{\partial t _ {j} } \cdot \frac{\partial \mathop{\rm ln} p ( \omega ; t ) }{\partial t _ {k} } \right | _ {t = \theta } p ( \omega ; \theta ) d \mu ,$$

where $j, k = 1, \dots, m$. For a scalar parameter $t$ the information matrix reduces to a single number, the variance (cf. Dispersion) of the informant.
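Definition (1) can be evaluated numerically. The following is a minimal sketch, assuming the normal family $N(\mu, \sigma^{2})$ with parameter $t = (\mu, \sigma)$ and Lebesgue measure as $\mu$; the function names `log_density` and `information_matrix`, the grid, and the finite-difference step are illustrative choices, not part of the article.

```python
import numpy as np

def log_density(x, t):
    """Log-density of the normal law N(mu, sigma^2), with t = (mu, sigma)."""
    mu, sigma = t
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2

def information_matrix(log_p, theta, x, h=1e-5):
    """Approximate the integral (1) by a Riemann sum on the uniform grid x.

    The informant (score) is obtained by central finite differences in t.
    """
    theta = np.asarray(theta, dtype=float)
    m = theta.size
    scores = np.empty((m, x.size))
    for j in range(m):
        step = np.zeros(m)
        step[j] = h
        scores[j] = (log_p(x, theta + step) - log_p(x, theta - step)) / (2 * h)
    weights = np.exp(log_p(x, theta)) * (x[1] - x[0])  # p(x; theta) dx
    return (scores * weights) @ scores.T

theta = (1.0, 2.0)  # illustrative values: mu = 1, sigma = 2
x = np.linspace(theta[0] - 12 * theta[1], theta[0] + 12 * theta[1], 200_001)
I = information_matrix(log_density, theta, x)
print(np.round(I, 6))
```

For this family the closed form is $I(\theta) = \mathop{\rm diag}(1/\sigma^{2}, 2/\sigma^{2})$, so the printed matrix should be close to $\mathop{\rm diag}(0.25, 0.5)$.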

The information matrix $I ( \theta )$ determines a non-negative quadratic differential form

$$\tag{2 } \sum _ {j , k } I _ {jk} ( \theta ) d t _ {j} d t _ {k} = \Delta _ \theta ,$$

endowing the family $\{ P ^ {t} \}$ with a Riemannian metric. If the space $\Omega$ of outcomes $\omega$ is finite, then

$$\Delta _ {P} = \ \sum _ { j } \frac{( d p _ {j} ) ^ {2} }{p _ {j} } ; \ \ p _ {j} = P ( \omega _ {j} ) ,\ \ \forall \omega _ {j} \in \Omega .$$
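For a finite outcome space the two expressions for the form can be compared directly. The sketch below assumes a three-outcome family $p = (t_{1}, t_{2}, 1 - t_{1} - t_{2})$ chosen only for illustration; `fisher_matrix` is a hypothetical helper that evaluates (1) as a finite sum.

```python
import numpy as np

def fisher_matrix(p, J):
    """I_jk = sum_i (dp_i/dt_j)(dp_i/dt_k) / p_i for a finite outcome space.

    p has shape (n_outcomes,), J = dp/dt has shape (n_outcomes, m).
    """
    return J.T @ (J / p[:, None])

theta = np.array([0.2, 0.3])  # illustrative parameter point
p = np.array([theta[0], theta[1], 1.0 - theta.sum()])
J = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # Jacobian of p in t
I = fisher_matrix(p, J)

dt = np.array([1e-3, -2e-3])     # a small parameter displacement
dp = J @ dt                      # induced displacement of the probabilities
delta_P = np.sum(dp ** 2 / p)    # sum_j (dp_j)^2 / p_j
delta_theta = dt @ I @ dt        # sum_{j,k} I_jk dt_j dt_k, as in (2)
print(I, delta_P, delta_theta)
```

Both quantities agree because $d p = J\, d t$, so $\sum_{j} (d p_{j})^{2} / p_{j}$ is the quadratic form (2) written in the coordinates $p_{j}$.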

The Fisher quadratic differential form (2) is the unique (up to a constant multiplier) quadratic differential form that is invariant under the category of statistical decision rules. Because of this fact it arises in the formulation of many statistical laws.

Any measurable mapping $f$ of the outcome space $\Omega$ generates a new smooth family of distributions $Q ^ {t} = P ^ {t} f ^ { - 1 }$ with information matrix $I ^ {Q} ( \theta )$, which is not greater than the initial one, i.e.

$$\sum _ {j , k } I _ {jk} ^ {Q} z _ {j} z _ {k} \leq \ \sum _ {j , k } I _ {jk} ^ {P} z _ {j} z _ {k} ,$$

for any $z_{1}, \dots, z_{m}$. The information matrix also has the property of additivity. If $I^{(i)}(\theta)$ is the information matrix for a family of densities $p_{i}(\omega^{(i)}; t)$, then the family

$$p ( \omega^{(1)}, \dots, \omega^{(N)} ; t ) = \prod_{i=1}^{N} p_{i} ( \omega^{(i)} ; t )$$

has information matrix $I_{N}(\theta) = \sum_{i} I^{(i)}(\theta)$. In particular, $I_{N}(\theta) = N I(\theta)$ for $N$ independent identically-distributed measurements.

The information matrix allows one to characterize the statistical accuracy of decision rules in the problem of estimating the parameter of a distribution law. The variance of any unbiased estimator $\tau(\omega) = \tau(\omega^{(1)}, \dots, \omega^{(N)})$ of a scalar parameter $t$ satisfies the Cramér-Rao inequality

$${\mathsf D}_{\theta} \tau \geq [ N I ( \theta ) ]^{-1} .$$
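The monotonicity, additivity and variance bound stated above can be checked on small finite families. The sketch below is illustrative only: the three-outcome family, the merging map `M`, and the Bernoulli example with success probability `s` are assumptions made for the demonstration, and `fisher_matrix` is a hypothetical helper.

```python
import numpy as np
from math import comb

def fisher_matrix(p, J):
    """Information matrix of a finite family: p has shape (n,), J = dp/dt has shape (n, m)."""
    return J.T @ (J / p[:, None])

theta = np.array([0.2, 0.3])

# Family P: three outcomes with probabilities (t1, t2, 1 - t1 - t2).
p = np.array([theta[0], theta[1], 1.0 - theta.sum()])
Jp = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
I_P = fisher_matrix(p, Jp)

# Monotonicity: the mapping f merges the first two outcomes, so Q = P f^{-1}.
M = np.array([[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
q, Jq = M @ p, M @ Jp
I_Q = fisher_matrix(q, Jq)
min_eig = np.linalg.eigvalsh(I_P - I_Q).min()  # non-negative up to rounding

# Additivity: one independent observation from P and one from Q.
joint = np.kron(p, q)                                        # product density
J_joint = np.kron(Jp, q[:, None]) + np.kron(p[:, None], Jq)  # product rule
I_joint = fisher_matrix(joint, J_joint)

# Variance bound: sample mean of N Bernoulli(s) trials, with I(s) = 1 / (s (1 - s)).
N, s = 10, 0.3
k = np.arange(N + 1)
pmf = np.array([comb(N, i) * s ** i * (1 - s) ** (N - i) for i in range(N + 1)])
var_mean = np.sum(pmf * (k / N - s) ** 2)  # exact variance of the sample mean
bound = s * (1 - s) / N                    # [N I(s)]^{-1}
print(min_eig, np.allclose(I_joint, I_P + I_Q), var_mean, bound)
```

In this Bernoulli example the sample mean attains the bound, so `var_mean` and `bound` coincide up to rounding; other unbiased estimators, such as the first observation alone, have larger variance.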

An analogous matrix inequality holds for estimators of a vector parameter. Its scalar consequence,

$$\tag{3} {\mathsf E}_{\theta} \sum_{j, k = 1}^{m} [ \tau_{j} ( \omega ) - \theta_{j} ] [ \tau_{k} ( \omega ) - \theta_{k} ] I_{jk} ( \theta ) \geq m N^{-1} ,$$

shows that an unbiased estimator cannot be arbitrarily accurate at any point of $\Theta$. For arbitrary (possibly biased) estimators this is no longer true. However, restrictions remain, e.g., for the average accuracy:

$$\tag{4} \mathfrak M_{\Theta^\prime} {\mathsf E}_{\theta} \langle \tau - \theta | I ( \theta ) | \tau - \theta \rangle \geq m N^{-1} + o ( N^{-1} ) ,$$

where the average $\mathfrak M$ on the left-hand side of (4) is with respect to the invariant volume $V$ of any compact subdomain $\Theta^\prime \subset \Theta$,

$$d V ( \theta ) = \ \sqrt { \mathop{\rm det} I ( \theta ) } d \theta _ {1} \dots d \theta _ {m} ;$$

the remainder depends on the dimension of $\Theta^\prime$. Inequality (4) is asymptotically sharp, and the maximum-likelihood estimator is asymptotically optimal in this sense.
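The invariance of the volume element $d V$ can be illustrated in the one-parameter case. The sketch below assumes a Bernoulli family, for which $I(p) = 1 / ( p ( 1 - p ) )$, and compares the volume of $\Theta^\prime = [a, b]$ computed in the coordinate $p$ with the volume computed in the logit coordinate $\varphi = \ln ( p / ( 1 - p ) )$, where $I(\varphi) = p ( 1 - p )$. The helper `volume` and the endpoints are illustrative.

```python
import numpy as np

def volume(sqrt_det_info, lo, hi, n=200_000):
    """Midpoint-rule approximation of the integral of sqrt(det I) over [lo, hi]."""
    edges = np.linspace(lo, hi, n + 1)
    mid = 0.5 * (edges[:-1] + edges[1:])
    return np.sum(sqrt_det_info(mid)) * (hi - lo) / n

def logit(p):
    return np.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

a, b = 0.1, 0.8  # illustrative compact subdomain of (0, 1)

# Coordinate p: sqrt(I(p)) = 1 / sqrt(p (1 - p)).
vol_p = volume(lambda p: 1.0 / np.sqrt(p * (1.0 - p)), a, b)

# Coordinate phi = logit(p): sqrt(I(phi)) = sqrt(p (1 - p)) with p = sigmoid(phi).
vol_phi = volume(lambda phi: np.sqrt(sigmoid(phi) * (1.0 - sigmoid(phi))),
                 logit(a), logit(b))

# Closed form of the same integral: 2 (arcsin sqrt(b) - arcsin sqrt(a)).
closed_form = 2.0 * (np.arcsin(np.sqrt(b)) - np.arcsin(np.sqrt(a)))
print(vol_p, vol_phi, closed_form)
```

The three numbers agree to within the quadrature error, which is what invariance of $\sqrt{ \mathop{\rm det} I ( \theta ) }\, d \theta$ under reparametrization predicts.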

At degenerate points, for which $\mathop{\rm det} I ( \theta ) = 0$, joint estimation of the parameters is difficult. If $\mathop{\rm det} I ( \theta ) = 0$ throughout a certain domain, then joint estimation is not possible at all. Following R.A. Fisher [1], one may say with appropriate qualification that the information matrix describes the amount of information (cf. Information, amount of) on the parameters of a distribution law that is contained in the random sample.
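The degenerate case described above can be seen in a small example. The sketch below assumes a two-outcome family whose probabilities depend on $t = (t_{1}, t_{2})$ only through the sum $t_{1} + t_{2}$; this family is chosen purely for illustration. The information matrix is then singular at every point, and its null direction is the combination of parameters that the data cannot distinguish.

```python
import numpy as np

def fisher_matrix(p, J):
    """Information matrix of a finite family: p has shape (n,), J = dp/dt has shape (n, m)."""
    return J.T @ (J / p[:, None])

theta = np.array([0.2, 0.3])
s = theta.sum()                            # the only identifiable combination
p = np.array([s, 1.0 - s])                 # outcome probabilities
J = np.array([[1.0, 1.0], [-1.0, -1.0]])   # dp/dt1 and dp/dt2 coincide

I = fisher_matrix(p, J)
det_I = np.linalg.det(I)
rank_I = np.linalg.matrix_rank(I)

# eigh returns eigenvalues in ascending order; the first eigenvector spans
# the direction in parameter space along which the distribution is constant.
eigvals, eigvecs = np.linalg.eigh(I)
null_direction = eigvecs[:, 0]
print(det_I, rank_I, null_direction)
```

Here the determinant vanishes, the rank is 1, and the null direction is proportional to $(1, -1)$: moving along it changes $t_{1}$ and $t_{2}$ without changing the distribution, so the two parameters cannot be estimated jointly.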

#### References

[1] R.A. Fisher, "Theory of statistical estimation", Trans. Cambridge Philos. Soc., 22 (1925) pp. 700–725

[2] J.R. Barra, "Notions fondamentales de statistique mathématique", Dunod (1971)

[3] N.N. Chentsov [N.N. Čencov], "Statistical decision rules and optimal inference", Amer. Math. Soc. (1982) (Translated from Russian)