Empirical distribution
sample distribution
A probability distribution determined from a random sample and used to estimate the true (underlying) distribution. Suppose that $ X_{1},\ldots,X_{n} $ are independent and identically-distributed random variables with distribution function $ F $, and let $ X_{(1)} \leq \ldots \leq X_{(n)} $ be the corresponding order statistics. The empirical distribution corresponding to $ (X_{1},\ldots,X_{n}) $ is defined as the discrete distribution that assigns to every value $ X_{k} $ the probability $ \dfrac{1}{n} $. The empirical distribution function $ F_{n} $ is the step function that jumps by a multiple of $ \dfrac{1}{n} $ at each of the points $ X_{(1)},\ldots,X_{(n)} $: $$ {F_{n}}(x) = \begin{cases} 0, & \text{if} ~ x \leq X_{(1)}; \\ \dfrac{k}{n}, & \text{if} ~ X_{(k)} < x \leq X_{(k + 1)} ~ \text{and} ~ 1 \leq k \leq n - 1; \\ 1, & \text{if} ~ x > X_{(n)}. \end{cases} $$ Equivalently, $ {F_{n}}(x) $ is the proportion of observations strictly less than $ x $, so $ F_{n} $ is left-continuous.
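The definition of $ F_{n} $ can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper name `make_ecdf` and the four-point sample are made up for the example, and the left-continuous convention of the formula above (counting observations strictly less than $ x $) is assumed.

```python
from bisect import bisect_left


def make_ecdf(sample):
    """Return the empirical distribution function F_n of `sample`.

    Assumes the left-continuous convention of the text:
    F_n(x) = (number of observations strictly less than x) / n.
    """
    order_stats = sorted(sample)  # X_(1) <= ... <= X_(n)
    n = len(order_stats)
    if n == 0:
        raise ValueError("sample must be non-empty")

    def ecdf(x):
        # bisect_left returns how many order statistics lie strictly below x.
        return bisect_left(order_stats, x) / n

    return ecdf


# Hypothetical sample, chosen only for illustration.
F_n = make_ecdf([2.0, 0.5, 1.0, 3.5])
for x in (0.5, 0.75, 1.0, 3.5, 4.0):
    print(x, F_n(x))
```

Sorting once and using binary search makes each evaluation cost $ O(\log n) $; switching to `bisect_right` would give the right-continuous variant that some texts use instead.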
For fixed values of $ X_{1},\ldots,X_{n} $, the function $ F_{n} $ has all the properties of an ordinary distribution function. For every fixed $ x \in \mathbf{R} $, the function $ {F_{n}}(x) $ is a random variable as a function of $ X_{1},\ldots,X_{n} $. Hence, the empirical distribution corresponding to a random sample $ (X_{1},\ldots,X_{n}) $ is given by the family $ ({F_{n}}(x))_{x \in \mathbf{R}} $ of random variables. Here, for a fixed $ x \in \mathbf{R} $, we have $$ \mathsf{E} {F_{n}}(x) = F(x), \qquad \mathsf{D} {F_{n}}(x) = \frac{1}{n} F(x) [1 - F(x)] $$ and $$ \mathsf{P} \! \left\{ {F_{n}}(x) = \frac{k}{n} \right\} = \binom{n}{k} [F(x)]^{k} [1 - F(x)]^{n - k}. $$
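These formulas reflect the fact that $ n {F_{n}}(x) $ counts the observations lying below $ x $, each of which does so independently with probability $ F(x) $. The following sketch checks the mean and variance numerically from the binomial probabilities; the function name `ecdf_value_pmf` and the values $ n = 10 $, $ F(x) = 0.3 $ are assumptions made for the example.

```python
from math import comb


def ecdf_value_pmf(n, p):
    """Return [P{F_n(x) = k/n} for k = 0..n], where p = F(x)."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


# Hypothetical sample size and value of F(x), for illustration only.
n, p = 10, 0.3
pmf = ecdf_value_pmf(n, p)

# Mean and variance of F_n(x), computed directly from the probabilities.
mean = sum((k / n) * q for k, q in enumerate(pmf))
var = sum((k / n - mean) ** 2 * q for k, q in enumerate(pmf))

print("sum of probabilities:", sum(pmf))
print("mean:", mean, "expected:", p)
print("variance:", var, "expected:", p * (1 - p) / n)
```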
In accordance with the Law of Large Numbers, $ {F_{n}}(x) \to F(x) $ with probability $ 1 $ as $ n \to \infty $, for each $ x \in \mathbf{R} $. Together with the identity $ \mathsf{E} {F_{n}}(x) = F(x) $, this means that $ {F_{n}}(x) $ is an unbiased and consistent estimator of the distribution function $ F(x) $. Moreover, the empirical distribution function converges to $ F(x) $ uniformly in $ x $ with probability $ 1 $ as $ n \to \infty $, i.e., if $$ D_{n} \stackrel{\text{df}}{=} \sup_{x \in \mathbf{R}} |{F_{n}}(x) - F(x)|, $$ then the Glivenko–Cantelli Theorem states that $$ \mathsf{P} \! \left\{ \lim_{n \to \infty} D_{n} = 0 \right\} = 1. $$
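For a continuous $ F $, the supremum defining $ D_{n} $ is attained (or approached) at the order statistics, so it can be computed as $ \max_{1 \leq i \leq n} \max \{ i/n - F(X_{(i)}), \, F(X_{(i)}) - (i - 1)/n \} $. The sketch below uses this to illustrate the uniform convergence on simulated data. The names `sup_distance` and `uniform_cdf`, the choice of the uniform distribution on $ [0,1] $, and the seed are assumptions made for the example.

```python
import random


def sup_distance(sample, cdf):
    """D_n = sup_x |F_n(x) - F(x)| for a continuous distribution function `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        fx = cdf(x)
        # Just to the right of X_(i) the ECDF equals i/n; at X_(i) it equals (i-1)/n.
        d = max(d, i / n - fx, fx - (i - 1) / n)
    return d


def uniform_cdf(x):
    """Distribution function of the uniform law on [0, 1] (illustrative choice)."""
    return min(1.0, max(0.0, x))


rng = random.Random(0)  # fixed seed so that runs are repeatable
for n in (100, 1_000, 10_000):
    sample = [rng.random() for _ in range(n)]
    print(n, sup_distance(sample, uniform_cdf))
```

The printed distances typically shrink at the rate $ 1/\sqrt{n} $, although a single simulated run only illustrates the theorem and does not prove it.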
The quantity $ D_{n} $ is a measure of the proximity of $ {F_{n}}(x) $ to $ F(x) $. A.N. Kolmogorov found (in 1933) its limit distribution: if $ F $ is continuous, then $$ \forall z \in \mathbf{R}_{> 0}: \qquad \lim_{n \to \infty} \mathsf{P} \{ \sqrt{n} D_{n} < z \} = K(z) = \sum_{k = - \infty}^{\infty} (- 1)^{k} e^{- 2 k^{2} z^{2}}. $$
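The series for $ K(z) $ converges very quickly for moderate $ z $, because the terms decay like $ e^{- 2 k^{2} z^{2}} $. A minimal sketch of its evaluation follows; the function name `kolmogorov_cdf`, the tolerance, and the convention $ K(z) = 0 $ for $ z \leq 0 $ are assumptions of the example (the formula above is stated for $ z > 0 $ only).

```python
from math import exp


def kolmogorov_cdf(z, tol=1e-16):
    """Evaluate K(z) = sum over all integers k of (-1)^k exp(-2 k^2 z^2)."""
    if z <= 0:
        return 0.0  # assumed convention outside the range covered by the formula
    total = 1.0  # the k = 0 term
    k = 1
    while True:
        term = exp(-2.0 * k * k * z * z)
        # The terms for k and -k coincide, hence the factor 2.
        total += 2.0 * (-1) ** k * term
        if term < tol:
            break
        k += 1
    # Guard against rounding slightly outside [0, 1].
    return min(1.0, max(0.0, total))


for z in (0.5, 1.0, 1.36, 2.0):
    print(z, kolmogorov_cdf(z))
```

For very small $ z $ the alternating series needs many terms and loses accuracy through cancellation; in that range an equivalent series obtained from the theta-function transformation is usually preferred.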
If $ F $ is not known, then to verify the hypothesis that it is a given continuous function $ F_{0} $, one uses tests based on statistics of type $ D_{n} $ (see Kolmogorov test; Kolmogorov–Smirnov test; Non-parametric methods in statistics).
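A test of this type can be sketched as follows: compute $ D_{n} $ against the hypothesized $ F_{0} $ and compare $ \sqrt{n} D_{n} $ with the limit distribution $ K $. The sketch relies on the asymptotic approximation, so it is only reasonable for fairly large $ n $; the function names, the significance level $ 0.05 $, and the two example hypotheses are assumptions made for illustration.

```python
from math import exp, sqrt


def kolmogorov_statistic(sample, cdf0):
    """D_n = sup_x |F_n(x) - F_0(x)| for a continuous hypothesized cdf0."""
    xs = sorted(sample)
    n = len(xs)
    return max(
        max(i / n - cdf0(x), cdf0(x) - (i - 1) / n)
        for i, x in enumerate(xs, start=1)
    )


def kolmogorov_tail(z, tol=1e-16):
    """1 - K(z) = 2 * sum_{k >= 1} (-1)^(k-1) exp(-2 k^2 z^2), for z > 0."""
    if z <= 0:
        return 1.0
    total, k = 0.0, 1
    while True:
        term = exp(-2.0 * k * k * z * z)
        total += (-1) ** (k - 1) * term
        if term < tol:
            break
        k += 1
    return min(1.0, max(0.0, 2.0 * total))


def kolmogorov_test(sample, cdf0, alpha=0.05):
    """Return (D_n, asymptotic p-value, reject?) for H0: F = cdf0."""
    d_n = kolmogorov_statistic(sample, cdf0)
    p_value = kolmogorov_tail(sqrt(len(sample)) * d_n)
    return d_n, p_value, p_value < alpha


# Illustrative data: 200 evenly spread points on (0, 1).
points = [(i - 0.5) / 200 for i in range(1, 201)]
print(kolmogorov_test(points, lambda x: min(1.0, max(0.0, x))))       # uniform F_0
print(kolmogorov_test(points, lambda x: min(1.0, max(0.0, x)) ** 3))  # F_0(x) = x^3
```

For small samples, exact or tabulated critical values of $ D_{n} $ are normally used instead of the limit distribution.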
Moments and any other characteristics of an empirical distribution are called sample or empirical; for example, $ \displaystyle \bar{X} = \sum_{k = 1}^{n} \frac{X_{k}}{n} $ is the sample mean, $ \displaystyle s^{2} = \sum_{k = 1}^{n} \frac{\left( X_{k} - \bar{X} \right)^{2}}{n} $ is the sample variance, and $ \displaystyle \widehat{\alpha}_{r} = \sum_{k = 1}^{n} \frac{X_{k}^{r}}{n} $ is the sample moment of order $ r $.
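These sample characteristics translate directly into code. The sketch below follows the definitions above, including the divisor $ n $ (not $ n - 1 $) in the sample variance; the function names and the four-point data set are assumptions made for the example.

```python
def sample_moment(sample, r):
    """Sample moment of order r: the average of X_k ** r."""
    return sum(x**r for x in sample) / len(sample)


def sample_mean(sample):
    """Sample mean, i.e. the sample moment of order 1."""
    return sample_moment(sample, 1)


def sample_variance(sample):
    """Sample variance with divisor n, as in the definition of s^2 above."""
    mean = sample_mean(sample)
    return sum((x - mean) ** 2 for x in sample) / len(sample)


# Hypothetical data, chosen only for illustration.
data = [1.0, 2.0, 3.0, 4.0]
print(sample_mean(data))       # 2.5
print(sample_variance(data))   # 1.25
print(sample_moment(data, 2))  # 7.5
```

With these definitions the identity $ s^{2} = \widehat{\alpha}_{2} - \bar{X}^{2} $ holds, which the example data confirm: $ 7.5 - 2.5^{2} = 1.25 $.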
Sample characteristics serve as statistical estimators of the corresponding characteristics of the original distribution.
Comments
The use of the empirical distribution in statistics and the associated theory has been greatly developed in recent years. This has been surveyed in [a2]. For developments in strong convergence theory associated with the empirical distribution, see [a1].
References
[a1] | M. Csörgö, P. Révész, "Strong approximation in probability and statistics", Acad. Press (1981). |
[a2] | G.R. Shorack, J.A. Wellner, "Empirical processes with applications to statistics", Wiley (1986). |
[a3] | M. Loève, "Probability theory", Princeton Univ. Press (1963), Sect. 16.3. |
[a4] | P. Gaenssler, W. Stute, "Empirical processes: a survey of results for independent and identically distributed random variables", Ann. Prob., 7 (1979), pp. 193–243. |
Empirical distribution. Encyclopedia of Mathematics. URL: http://encyclopediaofmath.org/index.php?title=Empirical_distribution&oldid=41639