# Empirical distribution

sample distribution

A probability distribution that is determined from a random sample used for the estimation of a true distribution. Suppose that $X_{1},\ldots,X_{n}$ are independent and identically-distributed random variables with distribution function $F$, and let $X_{(1)} \leq \ldots \leq X_{(n)}$ be the corresponding order statistics. The empirical distribution corresponding to $(X_{1},\ldots,X_{n})$ is defined as the discrete distribution that assigns to every value $X_{k}$ the probability $\dfrac{1}{n}$. The empirical distribution function $F_{n}$ is the step-function with steps of multiples of $\dfrac{1}{n}$ at the points defined by $X_{(1)},\ldots,X_{(n)}$: $${F_{n}}(x) = \begin{cases} 0, & \text{if} ~ x \leq X_{(1)}; \\ \dfrac{k}{n}, & \text{if} ~ X_{(k)} < x \leq X_{(k + 1)} ~ \text{and} ~ 1 \leq k \leq n - 1; \\ 1, & \text{if} ~ x > X_{(n)}. \end{cases}$$
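The definition above can be sketched in a few lines of Python. This is only an illustration: the helper name `make_edf` is hypothetical, and the code follows the left-continuous convention of the displayed formula, in which $F_{n}(x)$ is the fraction of observations strictly less than $x$.

```python
from bisect import bisect_left


def make_edf(sample):
    """Return the empirical distribution function F_n of the sample.

    Uses the left-continuous convention of the text:
    F_n(x) = (number of observations strictly less than x) / n.
    """
    order_stats = sorted(sample)  # X_(1) <= ... <= X_(n)
    n = len(order_stats)
    if n == 0:
        raise ValueError("sample must be non-empty")

    def edf(x):
        # bisect_left returns the count of order statistics < x
        return bisect_left(order_stats, x) / n

    return edf


F_n = make_edf([2.0, 0.5, 1.0, 3.5])
values = [F_n(x) for x in (0.5, 0.75, 1.0, 2.5, 3.5, 4.0)]
```

When several observations coincide, the same counting rule produces a jump that is the corresponding multiple of $1/n$ at the tied value.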

For fixed values of $X_{1},\ldots,X_{n}$, the function $F_{n}$ has all the properties of an ordinary distribution function. For every fixed $x \in \mathbf{R}$, the function ${F_{n}}(x)$ is a random variable as a function of $X_{1},\ldots,X_{n}$. Hence, the empirical distribution corresponding to a random sample $(X_{1},\ldots,X_{n})$ is given by the family $({F_{n}}(x))_{x \in \mathbf{R}}$ of random variables. Here, for a fixed $x \in \mathbf{R}$, we have $$\mathsf{E} {F_{n}}(x) = F(x), \qquad \mathsf{D} {F_{n}}(x) = \frac{1}{n} F(x) [1 - F(x)]$$ and $$\mathsf{P} \! \left\{ {F_{n}}(x) = \frac{k}{n} \right\} = \binom{n}{k} [F(x)]^{k} [1 - F(x)]^{n - k}.$$
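The binomial formula implies the stated mean and variance, since $n F_{n}(x)$ counts how many of the $n$ observations fall below $x$. A small numerical check in Python is sketched here; the function name `edf_value_pmf` and the values $n = 10$, $F(x) = 0.3$ are illustrative assumptions, not part of the original entry.

```python
from math import comb, isclose


def edf_value_pmf(n, p):
    """Probabilities P{F_n(x) = k/n} for k = 0, ..., n, where p = F(x)."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


n, p = 10, 0.3  # illustrative sample size and value of F(x)
pmf = edf_value_pmf(n, p)

# Mean and variance of F_n(x) computed directly from its distribution
mean = sum((k / n) * w for k, w in enumerate(pmf))
var = sum((k / n - mean) ** 2 * w for k, w in enumerate(pmf))

# Compare with E F_n(x) = F(x) and D F_n(x) = F(x)[1 - F(x)] / n
mean_matches = isclose(mean, p, abs_tol=1e-12)
var_matches = isclose(var, p * (1 - p) / n, abs_tol=1e-12)
```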

By the strong law of large numbers, ${F_{n}}(x) \to F(x)$ with probability $1$ as $n \to \infty$, for each $x \in \mathbf{R}$. Together with the equality $\mathsf{E} {F_{n}}(x) = F(x)$, this means that ${F_{n}}(x)$ is an unbiased and consistent estimator of the distribution function $F(x)$. Moreover, the empirical distribution function converges to $F(x)$ uniformly in $x$ with probability $1$ as $n \to \infty$, i.e., if $$D_{n} \stackrel{\text{df}}{=} \sup_{x \in \mathbf{R}} |{F_{n}}(x) - F(x)|,$$ then the Glivenko–Cantelli Theorem states that $$\mathsf{P} \! \left\{ \lim_{n \to \infty} D_{n} = 0 \right\} = 1.$$
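For a continuous $F$, the supremum defining $D_{n}$ is attained at (or just beside) an order statistic, so it can be computed from finitely many values. The Python sketch below relies on that fact; the names `sup_distance` and `uniform_cdf`, the choice of the uniform distribution on $[0, 1]$, and the seed are assumptions made for illustration only.

```python
import random


def sup_distance(sample, cdf):
    """D_n = sup_x |F_n(x) - F(x)| for a continuous distribution function cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for k, x in enumerate(xs, start=1):
        f = cdf(x)
        # F_n jumps from (k - 1)/n to k/n at X_(k); check both sides of the jump
        d = max(d, k / n - f, f - (k - 1) / n)
    return d


def uniform_cdf(x):
    """Distribution function of the uniform law on [0, 1] (illustrative choice)."""
    return min(1.0, max(0.0, x))


rng = random.Random(0)  # fixed seed so repeated runs give the same samples
for n in (10, 100, 1000, 10000):
    sample = [rng.random() for _ in range(n)]
    print(n, round(sup_distance(sample, uniform_cdf), 4))
```

The printed distances should tend to shrink as $n$ grows, in line with the Glivenko–Cantelli Theorem, although any single run is only one realization.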

The quantity $D_{n}$ is a measure of the proximity of ${F_{n}}(x)$ to $F(x)$. A.N. Kolmogorov found (in 1933) its limit distribution: For a continuous distribution function $F(x)$, we have $$\forall z \in \mathbf{R}_{> 0}: \qquad \lim_{n \to \infty} \mathsf{P} \{ \sqrt{n} D_{n} < z \} = K(z) = \sum_{k = - \infty}^{\infty} (- 1)^{k} e^{- 2 k^{2} z^{2}}.$$
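The series for $K(z)$ converges quickly for moderate $z$, so a truncated sum gives a usable approximation. The following Python sketch is one way to evaluate it; the function name `kolmogorov_cdf`, the stopping tolerance, and the cutoff below $z = 0.2$ (where $K(z)$ is smaller than $10^{-12}$ and the alternating sum suffers from cancellation) are choices made here, not part of the original entry.

```python
from math import exp


def kolmogorov_cdf(z, tol=1e-16):
    """Approximate K(z) = sum over all integers k of (-1)^k exp(-2 k^2 z^2)."""
    if z < 0.2:
        # Assumption of this sketch: K(z) < 1e-12 here, so return 0 instead of
        # summing a slowly converging series dominated by rounding noise.
        return 0.0
    # The k and -k terms coincide: K(z) = 1 + 2 * sum_{k >= 1} (-1)^k exp(-2 k^2 z^2)
    total = 1.0
    k = 1
    while True:
        term = exp(-2.0 * k * k * z * z)
        if term < tol:
            break
        total += 2.0 * term if k % 2 == 0 else -2.0 * term
        k += 1
    # Clamp to [0, 1] to absorb rounding error
    return min(1.0, max(0.0, total))


k_half = kolmogorov_cdf(0.5)
k_one = kolmogorov_cdf(1.0)
k_crit = kolmogorov_cdf(1.36)  # close to 0.95
```

For an observed value $d$ of $D_{n}$, the quantity $1 - K(\sqrt{n}\, d)$ approximates, for large $n$, the probability of a deviation at least as large as $d$ when the sample really comes from $F$.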

If $F$ is not known, then to verify the hypothesis that it is a given continuous function $F_{0}$, one uses tests based on statistics of type $D_{n}$ (see Kolmogorov test; Kolmogorov–Smirnov test; Non-parametric methods in statistics).

Moments and any other characteristics of an empirical distribution are called sample or empirical; for example, $\displaystyle \bar{X} = \sum_{k = 1}^{n} \frac{X_{k}}{n}$ is the sample mean, $\displaystyle s^{2} = \sum_{k = 1}^{n} \frac{\left( X_{k} - \bar{X} \right)^{2}}{n}$ is the sample variance, and $\displaystyle \widehat{\alpha}_{r} = \sum_{k = 1}^{n} \frac{X_{k}^{r}}{n}$ is the sample moment of order $r$.
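These sample characteristics are simply the moments of the empirical distribution, so they can be computed directly from the formulas. A minimal Python sketch follows; the function names and the data are illustrative assumptions. Note that the variance uses the divisor $n$, as in the formula above, not $n - 1$.

```python
def sample_moment(sample, r):
    """Sample moment of order r: (1/n) * sum of X_k^r."""
    return sum(x**r for x in sample) / len(sample)


def sample_mean(sample):
    """Sample mean, i.e. the sample moment of order 1."""
    return sample_moment(sample, 1)


def sample_variance(sample):
    """Sample variance with divisor n (variance of the empirical distribution)."""
    m = sample_mean(sample)
    return sum((x - m) ** 2 for x in sample) / len(sample)


data = [1.0, 2.0, 3.0, 4.0]  # illustrative sample
mean = sample_mean(data)            # 2.5
variance = sample_variance(data)    # 1.25
second = sample_moment(data, 2)     # 7.5
```

As for any distribution, the variance equals the second moment minus the squared mean: here $7.5 - 2.5^{2} = 1.25$.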

Sample characteristics serve as statistical estimators of the corresponding characteristics of the original distribution.

How to Cite This Entry:
Empirical distribution. Encyclopedia of Mathematics. URL: http://encyclopediaofmath.org/index.php?title=Empirical_distribution&oldid=41639
This article was adapted from an original article by A.V. Prokhorov (originator), which appeared in Encyclopedia of Mathematics - ISBN 1402006098.