# Empirical distribution


sample distribution

A probability distribution that is determined from a random sample and used to estimate the true distribution. Suppose that $X_{1},\ldots,X_{n}$ are independent and identically distributed random variables with distribution function $F$, and let $X_{(1)} \leq \ldots \leq X_{(n)}$ be the corresponding order statistics. The empirical distribution corresponding to $(X_{1},\ldots,X_{n})$ is defined as the discrete distribution that assigns to every value $X_{k}$ the probability $\dfrac{1}{n}$. The empirical distribution function $F_{n}$ is the step function whose jumps, located at the points $X_{(1)},\ldots,X_{(n)}$, are multiples of $\dfrac{1}{n}$: $${F_{n}}(x) = \begin{cases} 0, & \text{if} ~ x \leq X_{(1)}; \\ \dfrac{k}{n}, & \text{if} ~ X_{(k)} < x \leq X_{(k + 1)} ~ \text{and} ~ 1 \leq k \leq n - 1; \\ 1, & \text{if} ~ x > X_{(n)}. \end{cases}$$
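The piecewise definition above amounts to counting the sample values strictly less than $x$ and dividing by $n$. A minimal Python sketch of that reading follows; the function name `empirical_cdf` is an illustrative choice, not part of the article, and the strict inequality mirrors the left-continuous convention used in the formula.

```python
from bisect import bisect_left

def empirical_cdf(sample):
    """Return F_n for the given sample, using the convention F_n(x) = #{X_k < x} / n."""
    order_stats = sorted(sample)   # X_(1) <= ... <= X_(n)
    n = len(order_stats)

    def f_n(x):
        # bisect_left returns the number of order statistics strictly below x.
        return bisect_left(order_stats, x) / n

    return f_n

f_n = empirical_cdf([3.0, 1.0, 2.0])
values = [f_n(x) for x in (1.0, 1.5, 2.0, 3.0, 3.1)]
```

With tied observations the same code produces a jump of size $m/n$ at a value that occurs $m$ times, which is why the jumps are described as multiples of $\dfrac{1}{n}$.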

For fixed values of $X_{1},\ldots,X_{n}$, the function $F_{n}$ has all the properties of an ordinary distribution function. For every fixed $x \in \mathbf{R}$, the function ${F_{n}}(x)$ is a random variable as a function of $X_{1},\ldots,X_{n}$. Hence, the empirical distribution corresponding to a random sample $(X_{1},\ldots,X_{n})$ is given by the family $({F_{n}}(x))_{x \in \mathbf{R}}$ of random variables. Here, for a fixed $x \in \mathbf{R}$, we have $$\mathsf{E} {F_{n}}(x) = F(x), \qquad \mathsf{D} {F_{n}}(x) = \frac{1}{n} F(x) [1 - F(x)]$$ and $$\mathsf{P} \! \left\{ {F_{n}}(x) = \frac{k}{n} \right\} = \binom{n}{k} [F(x)]^{k} [1 - F(x)]^{n - k}.$$
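The three formulas above are mutually consistent: for fixed $x$, the quantity $n {F_{n}}(x)$ is binomial with parameters $n$ and $F(x)$. The following sketch checks this numerically by computing the mean and variance of ${F_{n}}(x)$ directly from the binomial probabilities. The values `n = 20` and `p = 0.3` (standing in for $F(x)$) are arbitrary illustrative choices.

```python
from math import comb

def ecdf_value_pmf(n, p):
    """P{F_n(x) = k/n} for k = 0..n, where p plays the role of F(x)."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

n, p = 20, 0.3                      # illustrative values only
pmf = ecdf_value_pmf(n, p)

# Mean and variance of F_n(x), computed from its distribution.
mean = sum((k / n) * w for k, w in enumerate(pmf))
var = sum((k / n - mean) ** 2 * w for k, w in enumerate(pmf))

print(mean, p)                      # both should agree up to rounding
print(var, p * (1 - p) / n)         # both should agree up to rounding
```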

By the strong law of large numbers, ${F_{n}}(x) \to F(x)$ with probability $1$ as $n \to \infty$, for each $x \in \mathbf{R}$. Together with the equality $\mathsf{E} {F_{n}}(x) = F(x)$, this means that ${F_{n}}(x)$ is an unbiased and consistent estimator of the distribution function $F(x)$. Moreover, the empirical distribution function converges to $F(x)$ uniformly in $x$ with probability $1$ as $n \to \infty$, i.e., if $$D_{n} \stackrel{\text{df}}{=} \sup_{x \in \mathbf{R}} |{F_{n}}(x) - F(x)|,$$ then the Glivenko–Cantelli Theorem states that $$\mathsf{P} \! \left\{ \lim_{n \to \infty} D_{n} = 0 \right\} = 1.$$
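For a continuous $F$, the supremum defining $D_{n}$ is attained at or just beside an order statistic, so it can be computed as $\max_{k} \max \{ k/n - F(X_{(k)}), \; F(X_{(k)}) - (k - 1)/n \}$. The sketch below uses this finite formula and illustrates the uniform convergence on simulated data. The uniform distribution on $[0, 1]$, the seed, and the sample sizes are assumptions made only for the illustration.

```python
import random

def sup_distance(sample, cdf):
    """D_n = sup_x |F_n(x) - F(x)| for a continuous distribution function `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for k, x in enumerate(xs, start=1):
        fx = cdf(x)
        # F_n takes the values (k - 1)/n and k/n on either side of X_(k).
        d = max(d, k / n - fx, fx - (k - 1) / n)
    return d

def uniform_cdf(x):
    """Distribution function of the uniform law on [0, 1] (illustrative choice of F)."""
    return min(1.0, max(0.0, x))

rng = random.Random(0)              # fixed seed so the run is reproducible
distances = {
    n: sup_distance([rng.random() for _ in range(n)], uniform_cdf)
    for n in (100, 10_000)
}
print(distances)                    # D_n is typically of order 1 / sqrt(n)
```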

The quantity $D_{n}$ is a measure of the proximity of ${F_{n}}(x)$ to $F(x)$. A.N. Kolmogorov found (in 1933) its limit distribution: for a continuous distribution function $F(x)$, we have $$\forall z \in \mathbf{R}_{> 0}: \qquad \lim_{n \to \infty} \mathsf{P} \{ \sqrt{n} D_{n} < z \} = K(z) = \sum_{k = - \infty}^{\infty} (- 1)^{k} e^{- 2 k^{2} z^{2}}.$$
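Since the terms with indices $k$ and $-k$ coincide, the series can be written as $K(z) = 1 + 2 \sum_{k \geq 1} (-1)^{k} e^{-2 k^{2} z^{2}}$ and evaluated by truncation. The following sketch does so. The function name `kolmogorov_cdf` and the cut-off `terms=100` are assumptions of this example; the truncation is adequate unless $z$ is very close to $0$, where the series converges slowly.

```python
from math import exp

def kolmogorov_cdf(z, terms=100):
    """Truncated series for K(z) = sum over all integers k of (-1)^k exp(-2 k^2 z^2)."""
    if z <= 0:
        return 0.0
    total = 1.0                                    # the k = 0 term
    for k in range(1, terms + 1):
        # Terms for k and -k are equal, hence the factor 2.
        total += 2 * (-1) ** k * exp(-2 * k * k * z * z)
    # Guard against rounding pushing the result slightly outside [0, 1].
    return min(1.0, max(0.0, total))

for z in (0.5, 1.0, 1.36, 2.0):
    print(z, kolmogorov_cdf(z))
```

The value at $z = 1.36$ is close to $0.95$, which is the origin of the familiar 5% critical value for $\sqrt{n} D_{n}$.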

If $F$ is not known, then to verify the hypothesis that it is a given continuous function $F_{0}$, one uses tests based on statistics of type $D_{n}$ (see Kolmogorov test; Kolmogorov–Smirnov test; Non-parametric methods in statistics).
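As an illustration of such a test, the sketch below computes $D_{n}$ against a hypothesised continuous $F_{0}$ and approximates the significance level by the limit law, $\mathsf{P} \{ \sqrt{n} D_{n} \geq z \} \approx 1 - K(z)$. This is an asymptotic approximation only, the function name `kolmogorov_test` and the truncation at 100 terms are assumptions of the example, and exact small-sample tables are not reproduced here.

```python
from math import exp, sqrt

def kolmogorov_test(sample, cdf0, terms=100):
    """Return (D_n, approximate p-value) for the hypothesis F = cdf0 (cdf0 continuous)."""
    xs = sorted(sample)
    n = len(xs)
    d_n = max(
        max(k / n - cdf0(x), cdf0(x) - (k - 1) / n)
        for k, x in enumerate(xs, start=1)
    )
    z = sqrt(n) * d_n
    # 1 - K(z) = 2 * sum_{k >= 1} (-1)^(k - 1) * exp(-2 k^2 z^2), truncated.
    tail = 2 * sum((-1) ** (k - 1) * exp(-2 * k * k * z * z) for k in range(1, terms + 1))
    return d_n, min(1.0, max(0.0, tail))

def uniform_cdf(x):
    """Hypothesised F_0: uniform law on [0, 1] (illustrative choice)."""
    return min(1.0, max(0.0, x))

# An evenly spread sample agrees with F_0; a sample crowded near 0.9 does not.
spread = [(i + 0.5) / 100 for i in range(100)]
crowded = [0.9 + 0.001 * i for i in range(100)]
print(kolmogorov_test(spread, uniform_cdf))
print(kolmogorov_test(crowded, uniform_cdf))
```

A small approximate p-value is evidence against the hypothesis $F = F_{0}$.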

Moments and any other characteristics of an empirical distribution are called sample or empirical; for example, $\displaystyle \bar{X} = \sum_{k = 1}^{n} \frac{X_{k}}{n}$ is the sample mean, $\displaystyle s^{2} = \sum_{k = 1}^{n} \frac{\left( X_{k} - \bar{X} \right)^{2}}{n}$ is the sample variance, and $\displaystyle \widehat{\alpha}_{r} = \sum_{k = 1}^{n} \frac{X_{k}^{r}}{n}$ is the sample moment of order $r$.
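These sample characteristics translate directly into code. The sketch below follows the formulas as written, in particular the divisor $n$ (not $n - 1$) in the sample variance; the function names and the data are illustrative.

```python
def sample_moment(sample, r):
    """Sample moment of order r: the mean of X_k ** r."""
    return sum(x**r for x in sample) / len(sample)

def sample_mean(sample):
    """Sample mean, i.e. the sample moment of order 1."""
    return sample_moment(sample, 1)

def sample_variance(sample):
    """Sample variance with divisor n, as in the formula for s^2."""
    m = sample_mean(sample)
    return sum((x - m) ** 2 for x in sample) / len(sample)

data = [1.0, 2.0, 3.0, 4.0]         # illustrative data
print(sample_mean(data))            # 2.5
print(sample_variance(data))        # 1.25
print(sample_moment(data, 2))       # 7.5
```

With this divisor, $s^{2} = \widehat{\alpha}_{2} - \bar{X}^{2}$, exactly as for the variance of any distribution; here $7.5 - 2.5^{2} = 1.25$.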

Sample characteristics serve as statistical estimators of the corresponding characteristics of the original distribution.
