Sufficient statistic
 
for a family of probability distributions $ \{ {P _ \theta } : {\theta \in \Theta } \} $ or for a parameter $ \theta \in \Theta $

A statistic (a vector random variable) such that for any event $ A $ there exists a version of the conditional probability $ P _ \theta ( A \mid X = x ) $ which is independent of $ \theta $. This is equivalent to the requirement that the conditional distribution, given $ X= x $, of any other statistic $ Y $ is independent of $ \theta $.
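
For instance, in the Bernoulli scheme treated below, let $ X = X _ {1} + \dots + X _ {n} $ be the number of ones among $ n $ independent trials with success probability $ \nu \in ( 0 , 1 ) $. For every sequence $ ( x _ {1} , \dots, x _ {n} ) $ of zeros and ones with $ x _ {1} + \dots + x _ {n} = k $,

$$ P _ \nu ( X _ {1} = x _ {1} , \dots, X _ {n} = x _ {n} \mid X = k ) = \frac{ \nu ^ {k} ( 1 - \nu ) ^ {n - k} }{ \binom{n}{k} \nu ^ {k} ( 1 - \nu ) ^ {n - k} } = \binom{n}{k} ^ {-1} , $$

so that, given the value of $ X $, all configurations are equally likely, whatever the value of $ \nu $.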

The knowledge of the sufficient statistic $ X $ yields exhaustive material for statistical inferences about the parameter $ \theta $, since no complementary statistical data can add anything to the information about the parameter contained in the distribution of $ X $. This property is expressed mathematically by one of the results of the theory of statistical decision making: the set of decision rules based on a sufficient statistic forms an essentially complete class. The transition from the initial family of distributions to the family of distributions of the sufficient statistic is known as reduction of the statistical problem. The point of the reduction is a decrease (sometimes a very significant one) in the dimension of the observation space.

In practice, a sufficient statistic is found from the following factorization theorem. Let a family $ \{ P _ \theta \} $ be dominated by a $ \sigma $-finite measure $ \mu $ and let $ p _ \theta = d P _ \theta / d \mu $ be the density of $ P _ \theta $ with respect to the measure $ \mu $. A statistic $ X $ is sufficient for the family $ \{ P _ \theta \} $ if and only if

$$ \tag{* } p _ \theta ( \omega ) = g _ \theta ( X ( \omega ) ) h ( \omega ) , $$

where $ g _ \theta $ and $ h $ are non-negative measurable functions ($ h $ is independent of $ \theta $). For discrete distributions the "counting" measure may be taken as $ \mu $, and $ p _ \theta ( \omega ) $ in relation (*) has the meaning of the probability of the elementary event $ \{ \omega \} $.
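
In the discrete case, for example, sufficiency can be read off from (*) directly: if $ X ( \omega ) = t $ and $ P _ \theta ( X = t ) > 0 $, then

$$ P _ \theta ( \omega \mid X = t ) = \frac{ g _ \theta ( t ) h ( \omega ) }{ \sum _ {\omega ^ \prime : X ( \omega ^ \prime ) = t } g _ \theta ( t ) h ( \omega ^ \prime ) } = \frac{ h ( \omega ) }{ \sum _ {\omega ^ \prime : X ( \omega ^ \prime ) = t } h ( \omega ^ \prime ) } , $$

which does not involve $ \theta $.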

E.g., let $ X _ {1} , \dots, X _ {n} $ be a sequence of independent random variables which assume the value one with an unknown probability $ \nu $ and the value zero with probability $ 1 - \nu $ (a Bernoulli scheme). Then

$$ p _ \nu ( x _ {1} , \dots, x _ {n} ) = \prod _ {i = 1 } ^ { n } \nu ^ {x _ {i} } ( 1 - \nu ) ^ {1 - x _ {i} } = \nu ^ {\sum _ {i = 1 } ^ {n} x _ {i} } ( 1 - \nu ) ^ {n - \sum _ {i = 1 } ^ {n} x _ {i} } . $$

Equation (*) is satisfied if

$$ X = \sum _ {i = 1 } ^ { n } X _ {i} ,\ g _ \theta = p _ \theta ,\ h = 1 \ ( \theta = \nu ). $$
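
Written out, with $ t = x _ {1} + \dots + x _ {n} $ this means taking

$$ g _ \nu ( t ) = \nu ^ {t} ( 1 - \nu ) ^ {n - t} , \qquad h \equiv 1 , $$

so that $ p _ \nu ( x _ {1} , \dots, x _ {n} ) = g _ \nu ( X ( x _ {1} , \dots, x _ {n} ) ) $, as (*) requires.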

Thus, the empirical frequency

$$ \widehat \nu = \frac{1}{n} \sum _ {i = 1 } ^ { n } X _ {i} $$

is a sufficient statistic for the unknown probability $ \nu $ in the Bernoulli scheme.
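
The parameter-freeness of the conditional distribution is easy to check numerically. The following sketch (not part of the original article; the values of $ n $, $ k $, $ \nu $, the number of simulated samples and the use of NumPy are incidental choices) estimates $ P _ \nu ( X _ {1} = 1 \mid \sum _ {i} X _ {i} = k ) $ by simulation for several values of $ \nu $; all estimates come out near $ k / n $.

```python
import numpy as np

rng = np.random.default_rng(0)

def cond_prob_first_given_sum(nu, n=5, k=2, trials=200_000):
    """Monte Carlo estimate of P(X_1 = 1 | X_1 + ... + X_n = k)
    for independent Bernoulli(nu) variables X_1, ..., X_n."""
    x = rng.binomial(1, nu, size=(trials, n))
    kept = x[x.sum(axis=1) == k]   # samples on which the sufficient statistic equals k
    return kept[:, 0].mean()

# All three estimates are close to k/n = 0.4, whatever the value of nu:
# the conditional law given the sufficient statistic is parameter-free.
for nu in (0.2, 0.5, 0.8):
    print(nu, round(cond_prob_first_given_sum(nu), 3))
```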

Let $ X _ {1} , \dots, X _ {n} $ be a sequence of independent, normally distributed variables with unknown mean $ \mu $ and unknown variance $ \sigma ^ {2} $. The joint density of the distribution of $ X _ {1} , \dots, X _ {n} $ with respect to Lebesgue measure is given by the expression

$$ p _ {\mu , \sigma ^ {2} } ( x _ {1} , \dots, x _ {n} ) = ( 2 \pi \sigma ^ {2} ) ^ {- n / 2 } \exp \left [ - \frac{1}{2 \sigma ^ {2} } \sum _ {i = 1 } ^ { n } ( x _ {i} - \mu ) ^ {2} \right ] = $$

$$ = ( 2 \pi \sigma ^ {2} ) ^ {- n / 2 } \exp \left ( - \frac{n \mu ^ {2} }{2 \sigma ^ {2} } - \frac{1}{2 \sigma ^ {2} } \sum _ {i = 1 } ^ { n } x _ {i} ^ {2} + \frac{\mu}{\sigma ^ {2} } \sum _ {i = 1 } ^ { n } x _ {i} \right ) , $$

which depends on $ x _ {1} , \dots, x _ {n} $ only through the variables

$$ \sum _ {i = 1 } ^ { n } x _ {i} ,\ \sum _ {i = 1 } ^ { n } x _ {i} ^ {2} . $$
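
Indeed, writing $ t _ {1} = \sum _ {i = 1 } ^ {n} x _ {i} $ and $ t _ {2} = \sum _ {i = 1 } ^ {n} x _ {i} ^ {2} $ (notation introduced here for convenience), relation (*) holds with

$$ g _ {\mu , \sigma ^ {2} } ( t _ {1} , t _ {2} ) = ( 2 \pi \sigma ^ {2} ) ^ {- n / 2 } \exp \left ( - \frac{n \mu ^ {2} }{2 \sigma ^ {2} } - \frac{t _ {2} }{2 \sigma ^ {2} } + \frac{\mu t _ {1} }{\sigma ^ {2} } \right ) , \qquad h \equiv 1 . $$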

For this reason the vector statistic

$$ X = \left ( \sum _ {i = 1 } ^ { n } X _ {i} , \sum _ {i = 1 } ^ { n } X _ {i} ^ {2} \right ) $$

is a sufficient statistic for the two-dimensional parameter $ \theta = ( \mu , \sigma ^ {2} ) $. Here the pair consisting of the sample mean

$$ \widehat \mu = \frac{1}{n} \sum _ {i = 1 } ^ { n } X _ {i} $$

and sample variance

$$ \widehat \sigma {} ^ {2} = \frac{1}{n - 1} \sum _ {i = 1 } ^ { n } ( X _ {i} - \widehat \mu ) ^ {2} , $$

will also be a sufficient statistic, since the variables

$$ \sum _ {i = 1 } ^ { n } X _ {i} ,\ \sum _ {i = 1 } ^ { n } X _ {i} ^ {2} $$

can be expressed in terms of $ \widehat \mu $ and $ {\widehat \sigma } {} ^ {2} $.
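
Explicitly, the two sums are recovered from $ \widehat \mu $ and $ \widehat \sigma {} ^ {2} $ by

$$ \sum _ {i = 1 } ^ { n } X _ {i} = n \widehat \mu , \qquad \sum _ {i = 1 } ^ { n } X _ {i} ^ {2} = ( n - 1 ) \widehat \sigma {} ^ {2} + n \widehat \mu {} ^ {2} . $$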

Many sufficient statistics may exist for a given family of distributions. In particular, the totality of all observations (in the example discussed above, $ X _ {1} , \dots, X _ {n} $) is a trivial sufficient statistic. However, of main interest are statistics which permit a real reduction of the statistical problem. A sufficient statistic is known as minimal or necessary if it is a function of any other sufficient statistic. A necessary sufficient statistic achieves the greatest possible reduction of a statistical problem. In the examples discussed above the sufficient statistics obtained are also necessary.

An important application of the concept of sufficiency is the method of improvement of unbiased estimators, based on the Rao–Blackwell–Kolmogorov theorem: If $ X $ is a sufficient statistic for the family $ \{ P _ \theta \} $, and if $ X _ {1} $ is an arbitrary statistic assuming values in the vector space $ \mathbf R ^ {d} $, then the inequality

$$ {\mathsf E} _ \theta g ( X _ {1} - {\mathsf E} _ \theta ( X _ {1} ) ) \geq \ {\mathsf E} _ \theta g ( {\widehat{X} } _ {1} - {\mathsf E} _ \theta ( {\widehat{X} } _ {1} ) ) ,\ \theta \in \Theta , $$

where $ {\widehat{X} } _ {1} = {\mathsf E} _ \theta ( X _ {1} \mid X ) $ is the conditional expectation of the statistic $ X _ {1} $ with respect to $ X $ (which is in fact independent of $ \theta $ by virtue of the sufficiency of $ X $), holds for any real continuous convex function $ g $ on $ \mathbf R ^ {d} $. Often the loss function $ g $ is taken to be a positive-definite quadratic form on $ \mathbf R ^ {d} $.
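
As a concrete numerical illustration of the theorem in the Bernoulli scheme (a sketch only; the values of $ \nu $, $ n $ and the number of simulated samples are arbitrary, and NumPy is used for convenience): the crude unbiased estimator $ X _ {1} $ of $ \nu $ is improved by conditioning on the sufficient statistic $ \sum _ {i} X _ {i} $, which gives $ {\mathsf E} ( X _ {1} \mid \sum _ {i} X _ {i} ) = \widehat \nu $. The simulation confirms that both estimators are unbiased, while the conditioned one has variance about $ \nu ( 1 - \nu ) / n $ instead of $ \nu ( 1 - \nu ) $.

```python
import numpy as np

rng = np.random.default_rng(1)
nu, n, trials = 0.3, 10, 200_000

x = rng.binomial(1, nu, size=(trials, n))
crude = x[:, 0].astype(float)   # X_1: unbiased for nu, variance nu*(1 - nu)
improved = x.mean(axis=1)       # E[X_1 | sum of the X_i] = empirical frequency

print("means:    ", crude.mean(), improved.mean())  # both approximately nu = 0.3
print("variances:", crude.var(), improved.var())    # roughly 0.21 versus 0.021
```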

A statistic $ X $ is said to be a complete statistic if it follows from $ {\mathsf E} _ \theta f ( X) \equiv 0 $, $ \theta \in \Theta $, that $ f ( X) = 0 $ almost surely with respect to $ P _ \theta $, $ \theta \in \Theta $. A corollary of the Rao–Blackwell–Kolmogorov theorem states that if a complete sufficient statistic $ X $ exists, then it is the best unbiased estimator, uniformly in $ \theta $, of its expectation $ e ( \theta ) = {\mathsf E} _ \theta X $. The examples above describe such a situation. Thus, the empirical frequency $ \widehat \nu $ is the uniformly best unbiased estimator of the probability $ \nu $ in the Bernoulli scheme, while the sample mean $ \widehat \mu $ and the variance $ {\widehat \sigma } {} ^ {2} $ are the uniformly best unbiased estimators of the parameters $ \mu $ and $ \sigma ^ {2} $ of the normal distribution.
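
For example (a standard argument, not spelled out in the article), the statistic $ \widehat \nu $ of the Bernoulli scheme is complete: if for all $ \nu \in ( 0 , 1 ) $

$$ {\mathsf E} _ \nu f ( \widehat \nu ) = \sum _ {k = 0 } ^ { n } f ( k / n ) \binom{n}{k} \nu ^ {k} ( 1 - \nu ) ^ {n - k} = 0 , $$

then, after division by $ ( 1 - \nu ) ^ {n} $, the left-hand side is a polynomial in $ \nu / ( 1 - \nu ) $ vanishing identically, so every coefficient $ f ( k / n ) \binom{n}{k} $, and hence every value $ f ( k / n ) $, is zero. Together with the corollary above this justifies the claim that $ \widehat \nu $ is the uniformly best unbiased estimator of $ \nu $.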

On the theoretical level it may be more convenient to deal with sufficient $ \sigma $-algebras rather than with sufficient statistics. If $ \{ {P _ \theta } : {\theta \in \Theta } \} $ is a family of distributions on a probability space $ ( \Omega , {\mathcal A} ) $, then a sub-$ \sigma $-algebra $ {\mathcal B} \subset {\mathcal A} $ is said to be sufficient for $ \{ P _ \theta \} $ if for any event $ A \in {\mathcal A} $ there exists a version of the conditional probability $ P _ \theta ( A \mid {\mathcal B} ) $ which is independent of $ \theta $. A statistic $ X $ is sufficient if and only if the sub-$ \sigma $-algebra $ X ^ {-1} ( {\mathcal E} ) $ generated by it is sufficient, where $ {\mathcal E} $ denotes the $ \sigma $-algebra of the range space of $ X $.

References

[1] P.R. Halmos, L.I. Savage, "Application of the Radon–Nikodym theorem to the theory of sufficient statistics", Ann. Math. Stat., 20 (1949), pp. 225–241
[2] A.N. Kolmogorov, "Unbiased estimators", Izv. Akad. Nauk SSSR Ser. Mat., 14 : 4 (1950), pp. 303–326 (in Russian; English translation in: Selected Works, Vol. 2 (Probability Theory and Mathematical Statistics), Kluwer, 1992, pp. 369–394)
[3] C.R. Rao, "Linear statistical inference and its applications", Wiley (1973)

Comments

References

[a1] E.L. Lehmann, "Testing statistical hypotheses", Wiley (1986)
[a2] C.R. Rao, "Characterization problems in mathematical statistics", Wiley (1973), Chapt. 8 (translated from Russian)
How to Cite This Entry:
Sufficient statistic. Encyclopedia of Mathematics. URL: http://encyclopediaofmath.org/index.php?title=Sufficient_statistic&oldid=17205
This article was adapted from an original article by A.S. Kholevo (originator), which appeared in Encyclopedia of Mathematics - ISBN 1402006098.