Empirical distribution

sample distribution

A probability distribution determined from a random sample and used to estimate the true underlying distribution. Suppose that $ X_{1},\ldots,X_{n} $ are independent, identically distributed random variables with distribution function $ F $, and let $ X_{(1)} \leq \ldots \leq X_{(n)} $ be the corresponding order statistics. The empirical distribution corresponding to $ (X_{1},\ldots,X_{n}) $ is the discrete distribution that assigns the probability $ \dfrac{1}{n} $ to each value $ X_{k} $. The empirical distribution function $ F_{n} $ is the step function with steps that are multiples of $ \dfrac{1}{n} $ at the points $ X_{(1)},\ldots,X_{(n)} $:
$$
{F_{n}}(x) =
\begin{cases}
0, & \text{if} ~ x \leq X_{(1)}; \\
\dfrac{k}{n}, & \text{if} ~ X_{(k)} < x \leq X_{(k + 1)} ~ \text{and} ~ 1 \leq k \leq n - 1; \\
1, & \text{if} ~ x > X_{(n)}.
\end{cases}
$$
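For illustration, here is a minimal sketch in Python with NumPy of how $ F_{n} $ can be computed from a sample. It uses the common right-continuous convention $ {F_{n}}(x) = \frac{1}{n} \#\{k : X_{k} \leq x\} $, which coincides with the step function above everywhere except at the sample points themselves.

```python
import numpy as np

def ecdf(sample):
    """Return the empirical distribution function F_n of a 1-D sample
    (right-continuous convention: F_n(x) = (1/n) * #{k : X_k <= x})."""
    xs = np.sort(np.asarray(sample))
    n = xs.size

    def F_n(x):
        # searchsorted with side="right" counts the observations <= x.
        return np.searchsorted(xs, x, side="right") / n

    return F_n

# Usage: F_n of five standard normal draws, evaluated at 0.
rng = np.random.default_rng(0)
F_n = ecdf(rng.standard_normal(5))
print(F_n(0.0))
```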

For fixed values of $ X_{1},\ldots,X_{n} $, the function $ F_{n} $ has all the properties of an ordinary distribution function. For every fixed $ x \in \mathbf{R} $, the function $ {F_{n}}(x) $ is a random variable as a function of $ X_{1},\ldots,X_{n} $. Hence, the empirical distribution corresponding to a random sample $ (X_{1},\ldots,X_{n}) $ is given by the family $ ({F_{n}}(x))_{x \in \mathbf{R}} $ of random variables. Here, for a fixed $ x \in \mathbf{R} $, we have
$$
\mathsf{E} {F_{n}}(x) = F(x), \qquad
\mathsf{D} {F_{n}}(x) = \frac{1}{n} F(x) [1 - F(x)],
$$
where $ \mathsf{E} $ denotes expectation and $ \mathsf{D} $ variance, and
$$
\mathsf{P} \! \left\{ {F_{n}}(x) = \frac{k}{n} \right\} = \binom{n}{k} [F(x)]^{k} [1 - F(x)]^{n - k}, \qquad k = 0, 1, \ldots, n.
$$
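These formulas all follow from the fact that $ n {F_{n}}(x) $, the number of observations not exceeding $ x $, is binomially distributed with parameters $ n $ and $ F(x) $. A small simulation sketch checking the mean and variance (Python with NumPy; the sample is taken uniform on $ [0, 1] $, so that $ F(x) = x $):

```python
import numpy as np

# n * F_n(x) counts the observations <= x, so it is Binomial(n, F(x)).
rng = np.random.default_rng(1)
n, x, reps = 20, 0.3, 100_000
samples = rng.uniform(size=(reps, n))      # Uniform(0, 1): F(x) = x
F_n_x = (samples <= x).mean(axis=1)        # one realisation of F_n(x) per row

print(F_n_x.mean())            # ~ 0.3     (E F_n(x) = F(x))
print(F_n_x.var())             # ~ 0.0105  (D F_n(x) = F(x)(1 - F(x)) / n)
print(x * (1 - x) / n)
```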

In accordance with the law of large numbers, $ {F_{n}}(x) \to F(x) $ with probability $ 1 $ as $ n \to \infty $, for each $ x \in \mathbf{R} $. This means that $ {F_{n}}(x) $ is an unbiased and consistent estimator of the distribution function $ F(x) $. Moreover, the convergence is uniform in $ x $ with probability $ 1 $: if
$$
D_{n} \stackrel{\text{df}}{=} \sup_{x \in \mathbf{R}} |{F_{n}}(x) - F(x)|,
$$
then
$$
\mathsf{P} \! \left\{ \lim_{n \to \infty} D_{n} = 0 \right\} = 1
$$
(the Glivenko–Cantelli theorem).
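A short sketch of computing $ D_{n} $ for a concrete sample (Python with NumPy; again a uniform sample, so that $ F(x) = x $). For continuous $ F $, the supremum is attained at an order statistic, where $ F_{n} $ jumps from $ (k - 1)/n $ to $ k/n $, so only $ 2n $ differences need to be compared.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
xs = np.sort(rng.uniform(size=n))   # order statistics X_(1) <= ... <= X_(n)
F = xs                              # F(x) = x for Uniform(0, 1)
k = np.arange(1, n + 1)

# Compare F with both one-sided limits of F_n at each jump point.
D_n = np.maximum(k / n - F, F - (k - 1) / n).max()
print(D_n)                          # small, and shrinks roughly like 1 / sqrt(n)
```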

The quantity $ D_{n} $ is a measure of the proximity of $ {F_{n}}(x) $ to $ F(x) $. A.N. Kolmogorov found (in 1933) its limit distribution: for a continuous distribution function $ F $, we have
$$
\forall z \in \mathbf{R}_{> 0}: \qquad
\lim_{n \to \infty} \mathsf{P} \{ \sqrt{n} D_{n} < z \}
= K(z)
= \sum_{k = - \infty}^{\infty} (- 1)^{k} e^{- 2 k^{2} z^{2}}.
$$
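The limit function $ K $ is easy to evaluate numerically, since the terms of the series decay like $ e^{- 2 k^{2} z^{2}} $. A minimal sketch (Python; the cutoff of $ 100 $ terms is an arbitrary safe choice):

```python
import math

def kolmogorov_K(z, terms=100):
    """Kolmogorov's limit distribution K(z), obtained by truncating the
    series sum_{k = -inf}^{inf} (-1)^k exp(-2 k^2 z^2)."""
    if z <= 0:
        return 0.0
    return sum((-1) ** k * math.exp(-2 * k * k * z * z)
               for k in range(-terms, terms + 1))

print(kolmogorov_K(1.358))   # ~0.95, the familiar 5%-level critical value
```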

If $ F $ is not known, then to verify the hypothesis that it is a given continuous function $ F_{0} $, one uses tests based on statistics of type $ D_{n} $ (see Kolmogorov test; Kolmogorov–Smirnov test; Non-parametric methods in statistics).
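In practice such a test is available in standard statistical libraries; for instance, a sketch using SciPy's `kstest` (assuming SciPy is installed), which computes $ D_{n} $ for a hypothesized $ F_{0} $ together with the corresponding $ p $-value:

```python
import numpy as np
from scipy.stats import kstest

# Kolmogorov test of the hypothesis F = F_0, with F_0 the standard
# normal distribution function; the test statistic is D_n.
rng = np.random.default_rng(4)
X = rng.standard_normal(200)
result = kstest(X, "norm")
print(result.statistic, result.pvalue)  # large p-value: no evidence against F_0
```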

Moments and any other characteristics of an empirical distribution are called sample (or empirical) characteristics; for example, $ \displaystyle \bar{X} = \sum_{k = 1}^{n} \frac{X_{k}}{n} $ is the sample mean, $ \displaystyle s^{2} = \sum_{k = 1}^{n} \frac{\left( X_{k} - \bar{X} \right)^{2}}{n} $ is the sample variance, and $ \displaystyle \widehat{\alpha}_{r} = \sum_{k = 1}^{n} \frac{X_{k}^{r}}{n} $ is the sample moment of order $ r $.
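These are exactly the moments of the empirical distribution itself, so they reduce to plain averages over the sample; a brief sketch (Python with NumPy; note the variance is taken with denominator $ n $, as in the formula above):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal(1000)

x_bar  = X.mean()                   # sample mean
s2     = ((X - x_bar) ** 2).mean()  # sample variance (denominator n)
alpha3 = (X ** 3).mean()            # sample moment of order r = 3

print(x_bar, s2, alpha3)
```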

Sample characteristics serve as statistical estimators of the corresponding characteristics of the original distribution.

References

[1] L.N. Bol’shev, N.V. Smirnov, "Tables of mathematical statistics", Libr. math. tables, 46, Nauka (1983). (In Russian) (Processed by L.S. Bark and E.S. Kedrova)
[2] B.L. van der Waerden, "Mathematische Statistik", Springer (1957).
[3] A.A. Borovkov, "Mathematical statistics", Moscow (1984). (In Russian)

Comments

The use of the empirical distribution in statistics and the associated theory has been greatly developed in recent years. This has been surveyed in [a2]. For developments in strong convergence theory associated with the empirical distribution, see [a1].

References

[a1] M. Csörgö, P. Révész, "Strong approximation in probability and statistics", Acad. Press (1981).
[a2] G.R. Shorack, J.A. Wellner, "Empirical processes with applications to statistics", Wiley (1986).
[a3] M. Loève, "Probability theory", Princeton Univ. Press (1963), Sect. 16.3.
[a4] P. Gaenssler, W. Stute, "Empirical processes: a survey of results for independent and identically distributed random variables", Ann. Prob., 7 (1977), pp. 193–243.
How to Cite This Entry:
Empirical distribution. Encyclopedia of Mathematics. URL: http://encyclopediaofmath.org/index.php?title=Empirical_distribution&oldid=11280

This article was adapted from an original article by A.V. Prokhorov (originator), which appeared in Encyclopedia of Mathematics, ISBN 1402006098.