# Sufficient statistic

for a family of probability distributions $\{ {P _ \theta } : {\theta \in \Theta } \}$ or for a parameter $\theta \in \Theta$

A statistic $X$ (a vector random variable) such that for any event $A$ there exists a version of the conditional probability $P _ \theta ( A \mid X = x )$ which does not depend on $\theta$. This is equivalent to the requirement that the conditional distribution, given $X = x$, of any other statistic $Y$ does not depend on $\theta$.

Knowledge of the sufficient statistic $X$ provides all the material needed for statistical inference about the parameter $\theta$, since no additional statistical data can add anything to the information about the parameter contained in the distribution of $X$. Mathematically, this property is expressed by a result of statistical decision theory: the set of decision rules based on a sufficient statistic forms an essentially complete class. The transition from the initial family of distributions to the family of distributions of the sufficient statistic is known as reduction of the statistical problem. The point of the reduction is a decrease (sometimes a very significant one) in the dimension of the observation space.

In practice, a sufficient statistic is found from the following factorization theorem. Let a family $\{ P _ \theta \}$ be dominated by a $\sigma$-finite measure $\mu$ and let $p _ \theta = d P _ \theta / d \mu$ be the density of $P _ \theta$ with respect to the measure $\mu$. A statistic $X$ is sufficient for the family $\{ P _ \theta \}$ if and only if

$$\tag{* } p _ \theta ( \omega ) = g _ \theta ( X ( \omega ) ) h ( \omega ) ,$$

where $g _ \theta$ and $h$ are non-negative measurable functions ($h$ does not depend on $\theta$). For discrete distributions the "counting" measure may be taken as $\mu$, and $p _ \theta ( \omega )$ in relation (*) is then the probability of the elementary event $\{ \omega \}$.

For example, let $X _ {1} , \dots , X _ {n}$ be a sequence of independent random variables, each of which assumes the value one with an unknown probability $\nu$ and the value zero with probability $1 - \nu$ (a Bernoulli scheme). Then

$$p _ \nu ( x _ {1} \dots x _ {n} ) = \prod _ {i = 1 } ^ { n } \nu ^ {x _ {i} } ( 1 - \nu ) ^ {1 - x _ {i} } = \nu ^ {\sum _ {i = 1 } ^ {n} x _ {i} } ( 1 - \nu ) ^ {n - \sum _ {i = 1 } ^ {n} x _ {i} } .$$

Equation (*) is satisfied if

$$X = \sum _ {i = 1 } ^ { n } X _ {i} ,\ g _ \nu ( t ) = \nu ^ {t} ( 1 - \nu ) ^ {n - t} ,\ h = 1 \ ( \theta = \nu ).$$

Thus, the empirical frequency

$$\widehat \nu = \frac{1}{n} \sum _ {i = 1 } ^ { n } X _ {i}$$

is a sufficient statistic for the unknown probability $\nu$ in the Bernoulli scheme, since it is a one-to-one function of the sufficient statistic $\sum _ {i = 1 } ^ {n} X _ {i}$.
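The defining property of sufficiency can be checked directly in the Bernoulli scheme by exact enumeration. The following is a minimal sketch, not part of the original article: the function name `conditional_given_sum` and the chosen values of $n$ and $\nu$ are illustrative assumptions. It computes the conditional distribution of the whole sample given the value of the sum for two different values of $\nu$ and compares the results.

```python
from fractions import Fraction
from itertools import product
from math import comb


def conditional_given_sum(n, nu):
    """Return {s: {x: P_nu(x | sum(x) = s)}} over all binary samples x of length n.

    Hypothetical helper for illustration; nu must satisfy 0 < nu < 1.
    """
    # Joint probability of each sample, as in the product formula above.
    joint = {}
    for x in product((0, 1), repeat=n):
        s = sum(x)
        joint[x] = nu**s * (1 - nu) ** (n - s)

    # Marginal distribution of the sum.
    marginal = {}
    for x, p in joint.items():
        marginal[sum(x)] = marginal.get(sum(x), 0) + p

    # Conditional distribution of the sample given the sum.
    cond = {}
    for x, p in joint.items():
        cond.setdefault(sum(x), {})[x] = p / marginal[sum(x)]
    return cond


n = 4  # illustrative sample size
cond_a = conditional_given_sum(n, Fraction(1, 5))
cond_b = conditional_given_sum(n, Fraction(7, 10))

# The conditional laws coincide for both values of nu ...
same = cond_a == cond_b
# ... and each is uniform on the samples having the given sum.
uniform = all(
    p == Fraction(1, comb(n, s)) for s, d in cond_a.items() for p in d.values()
)
print(same, uniform)
```

Exact rational arithmetic is used so that the comparison between the two values of $\nu$ is an equality rather than a floating-point approximation.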

Let $X _ {1} , \dots , X _ {n}$ be a sequence of independent, normally distributed random variables with unknown mean $\mu$ and unknown variance $\sigma ^ {2}$. The joint density of $X _ {1} , \dots , X _ {n}$ with respect to Lebesgue measure is given by the expression

$$\begin{aligned} p _ {\mu , \sigma ^ {2} } ( x _ {1} , \dots , x _ {n} ) &= ( 2 \pi \sigma ^ {2} ) ^ {- n / 2 } \exp \left [ - \frac{1}{2 \sigma ^ {2} } \sum _ {i = 1 } ^ { n } ( x _ {i} - \mu ) ^ {2} \right ] \\ &= ( 2 \pi \sigma ^ {2} ) ^ {- n / 2 } \exp \left ( - \frac{n \mu ^ {2} }{2 \sigma ^ {2} } - \frac{1}{2 \sigma ^ {2} } \sum _ {i = 1 } ^ { n } x _ {i} ^ {2} + \frac{\mu}{\sigma ^ {2} } \sum _ {i = 1 } ^ { n } x _ {i} \right ) , \end{aligned}$$

which depends on $x _ {1} , \dots , x _ {n}$ only through the quantities

$$\sum _ {i = 1 } ^ { n } x _ {i} ,\ \sum _ {i = 1 } ^ { n } x _ {i} ^ {2} .$$

For this reason the vector statistic

$$X = \left ( \sum _ {i = 1 } ^ { n } X _ {i} , \sum _ {i = 1 } ^ { n } X _ {i} ^ {2} \right )$$

is a sufficient statistic for the two-dimensional parameter $\theta = ( \mu , \sigma ^ {2} )$. The pair consisting of the sample mean

$$\widehat \mu = \frac{1}{n} \sum _ {i = 1 } ^ { n } X _ {i}$$

and sample variance

$${\widehat \sigma } {} ^ {2} = \frac{1}{n - 1} \sum _ {i = 1 } ^ { n } ( X _ {i} - \widehat \mu ) ^ {2} ,$$

is also a sufficient statistic, since the variables

$$\sum _ {i = 1 } ^ { n } X _ {i} ,\ \sum _ {i = 1 } ^ { n } X _ {i} ^ {2}$$

can be expressed in terms of $\widehat \mu$ and ${\widehat \sigma } {} ^ {2}$.
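The two claims of the normal example can be illustrated numerically. The sketch below is an illustration under stated assumptions rather than part of the original text: all function names and the sample values are hypothetical. It checks that the logarithm of the joint density can be computed from $( \sum x _ {i} , \sum x _ {i} ^ {2} )$ alone, and that this pair and $( \widehat \mu , {\widehat \sigma } {} ^ {2} )$ determine each other through $\sum X _ {i} = n \widehat \mu$ and $\sum X _ {i} ^ {2} = ( n - 1 ) {\widehat \sigma } {} ^ {2} + n {\widehat \mu } {} ^ {2}$.

```python
import math


def sums_from_sample(xs):
    """Sufficient statistic (sum of x_i, sum of x_i^2)."""
    return sum(xs), sum(x * x for x in xs)


def moments_from_sums(n, s1, s2):
    """Map (sum, sum of squares) to (sample mean, sample variance with divisor n - 1)."""
    mean = s1 / n
    var = (s2 - n * mean**2) / (n - 1)
    return mean, var


def sums_from_moments(n, mean, var):
    """Inverse map: recover (sum, sum of squares) from (sample mean, sample variance)."""
    return n * mean, (n - 1) * var + n * mean**2


def loglik_full(xs, mu, sigma2):
    """Log of the joint normal density, computed from the full sample."""
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma2))


def loglik_from_sums(n, s1, s2, mu, sigma2):
    """The same quantity, computed from the sufficient statistic only."""
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - n * mu**2 / (2 * sigma2) - s2 / (2 * sigma2) + mu * s1 / sigma2)


xs = [1.2, -0.7, 3.1, 0.4, 2.0]  # illustrative data, not from the article
n = len(xs)
s1, s2 = sums_from_sample(xs)
mean, var = moments_from_sums(n, s1, s2)
r1, r2 = sums_from_moments(n, mean, var)

print(math.isclose(loglik_full(xs, 0.5, 2.0), loglik_from_sums(n, s1, s2, 0.5, 2.0)))
print(math.isclose(r1, s1) and math.isclose(r2, s2))
```

Comparisons use `math.isclose` because the two expressions agree only up to floating-point rounding.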

Many sufficient statistics may exist for a given family of distributions. In particular, the totality of all observations (in the example discussed above, $X _ {1} \dots X _ {n}$) is a trivial sufficient statistic. However, of main interest are statistics which permit a real reduction of the statistical problem. A sufficient statistic is known as minimal or necessary if it is a function of any other sufficient statistic. A necessary sufficient statistic realizes the utmost possible reduction of a statistical problem. In the examples discussed above the obtained sufficient statistics are also necessary.

An important application of the concept of sufficiency is the method of improvement of unbiased estimators, based on the Rao–Blackwell–Kolmogorov theorem: If $X$ is a sufficient statistic for the family $\{ P _ \theta \}$, and if $X _ {1}$ is an arbitrary statistic assuming values in the vector space $\mathbf R ^ {d}$, then the inequality

$${\mathsf E} _ \theta g ( X _ {1} - {\mathsf E} _ \theta ( X _ {1} ) ) \geq \ {\mathsf E} _ \theta g ( {\widehat{X} } _ {1} - {\mathsf E} _ \theta ( {\widehat{X} } _ {1} ) ) ,\ \theta \in \Theta ,$$

where ${\widehat{X} } _ {1} = {\mathsf E} _ \theta ( X _ {1} \mid X )$ is the conditional expectation of the statistic $X _ {1}$ with respect to $X$ (which does not in fact depend on $\theta$, by virtue of the sufficiency of $X$), holds for any real continuous convex function $g$ on $\mathbf R ^ {d}$. Often the loss function $g$ is taken to be a positive-definite quadratic form on $\mathbf R ^ {d}$.
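The improvement can be seen concretely in the Bernoulli scheme with $d = 1$ and $g ( t ) = t ^ {2}$. The following sketch is an illustration only: the function name `rao_blackwell_bernoulli` and the values of $n$ and $\nu$ are assumptions made here. It takes the first observation as a crude unbiased estimator of $\nu$, conditions it on the sum by exact enumeration, and compares the two variances.

```python
from fractions import Fraction
from itertools import product


def rao_blackwell_bernoulli(n, nu):
    """Exact mean and variance of X_1 and of E(X_1 | sum) in a Bernoulli scheme.

    Hypothetical helper for illustration; nu must satisfy 0 < nu < 1.
    """
    samples = list(product((0, 1), repeat=n))
    prob = {x: nu ** sum(x) * (1 - nu) ** (n - sum(x)) for x in samples}

    # Conditional expectation of the first observation given the sum s.
    cond_mean = {}
    for s in range(n + 1):
        mass = sum(prob[x] for x in samples if sum(x) == s)
        cond_mean[s] = sum(x[0] * prob[x] for x in samples if sum(x) == s) / mass

    def mean_and_variance(estimator):
        m = sum(estimator(x) * prob[x] for x in samples)
        v = sum((estimator(x) - m) ** 2 * prob[x] for x in samples)
        return m, v

    raw = mean_and_variance(lambda x: x[0])
    improved = mean_and_variance(lambda x: cond_mean[sum(x)])
    return cond_mean, raw, improved


n, nu = 5, Fraction(3, 10)  # illustrative values
cond_mean, raw, improved = rao_blackwell_bernoulli(n, nu)

# The conditional expectation is s / n for every value of nu, i.e. the empirical frequency.
print(all(cond_mean[s] == Fraction(s, n) for s in range(n + 1)))
# Both estimators are unbiased, and conditioning does not increase the variance.
print(raw[0] == improved[0] == nu, improved[1] <= raw[1])
```

In this example the variance drops from $\nu ( 1 - \nu )$ to $\nu ( 1 - \nu ) / n$, which is the variance of the empirical frequency.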

A statistic $X$ is said to be a complete statistic if it follows from ${\mathsf E} _ \theta f ( X ) \equiv 0$, $\theta \in \Theta$, that $f ( X ) = 0$ almost surely with respect to $P _ \theta$, $\theta \in \Theta$. A corollary of the Rao–Blackwell–Kolmogorov theorem states that if a complete sufficient statistic $X$ exists, then it is the best unbiased estimator, uniformly in $\theta$, of its expectation $e ( \theta ) = {\mathsf E} _ \theta X$. The examples above describe such a situation. Thus, the empirical frequency $\widehat \nu$ is the uniformly best unbiased estimator of the probability $\nu$ in the Bernoulli scheme, while the sample mean $\widehat \mu$ and the sample variance ${\widehat \sigma } {} ^ {2}$ are the uniformly best unbiased estimators of the parameters $\mu$ and $\sigma ^ {2}$ of the normal distribution.

On the theoretical level it may be more convenient to deal with sufficient $\sigma$-algebras rather than with sufficient statistics. If $\{ {P _ \theta } : {\theta \in \Theta } \}$ is a family of distributions on a measurable space $( \Omega , {\mathcal A} )$, then a sub-$\sigma$-algebra ${\mathcal B} \subset {\mathcal A}$ is said to be sufficient for $\{ P _ \theta \}$ if for any event $A \in {\mathcal A}$ there exists a version of the conditional probability $P _ \theta ( A \mid {\mathcal B} )$ which does not depend on $\theta$. A statistic $X$ is sufficient if and only if the sub-$\sigma$-algebra $X ^ {-1} ( {\mathcal E} ) \subset {\mathcal A}$ generated by it is sufficient, where ${\mathcal E}$ denotes the $\sigma$-algebra of measurable sets in the space in which $X$ takes its values.

#### References

[1] P.R. Halmos, L.I. Savage, "Application of the Radon–Nikodym theorem to the theory of sufficient statistics" Ann. Math. Stat., 20 (1949) pp. 225–241

[2] A.N. Kolmogorov, "Unbiased estimators" Izv. Akad. Nauk SSSR Ser. Mat., 14 : 4 (1950) pp. 303–326 (In Russian) (English translation in: Selected Works, Vol. 2 (Probability Theory and Mathematical Statistics), Kluwer, 1992, pp. 369–394)

[3] C.R. Rao, "Linear statistical inference and its applications", Wiley (1973)