Statistical estimator

A function of random variables that can be used in estimating unknown parameters of a theoretical probability distribution. Methods of the theory of statistical estimation form the basis of the modern theory of errors; the unknown parameters are usually physical constants to be measured, while the random variables are the results of direct measurements subject to random errors. For example, if $X _ {1}, \dots, X _ {n}$ are independent, identically normally distributed random variables (the results of equally accurate measurements subject to independent normally distributed random errors), then for the unknown mean value $a$ (the value of the physical constant being measured approximately) the arithmetical mean

$$\tag{1 } \overline{X}\; = \frac{X _ {1} + \dots + X _ {n} }{n}$$

is taken as the statistical estimator.

A statistical estimator as a function of random variables is most frequently given by formulas, the choice of which is prescribed by practical requirements. A distinction must be made here between point and interval estimators.

Point estimators.

A point estimator is a statistical estimator whose value can be represented geometrically as a point in the same space as the values of the unknown parameters (the dimension of the space is equal to the number of parameters to be estimated). It is point estimators that are used in practice as approximate values for unknown physical quantities. For the sake of simplicity, it is supposed from here on that a single real-valued parameter is to be estimated; in this case, a point estimator is a function of the results of observations, and takes numerical values.

A point estimator is said to be unbiased if its mathematical expectation coincides with the parameter being estimated, i.e. if the statistical estimation is free of systematic errors. The arithmetical mean (1) is an unbiased statistical estimator for the mathematical expectation of identically-distributed random variables $X _ {i}$ (not necessarily normal). At the same time, the sample variance

$$\tag{2 } \widehat{s} {} ^ {2} = \frac{( X _ {1} - \overline{X}\; ) ^ {2} + \dots + ( X _ {n} - \overline{X}\; ) ^ {2} }{n}$$

is a biased statistical estimator for the variance $\sigma ^ {2} = {\mathsf D} X _ {i}$, since ${\mathsf E} {\widehat{s} } {} ^ {2} = ( 1- 1/n) \sigma ^ {2}$; the function

$$s ^ {2} = \frac{n}{n-1} {\widehat{s} } {} ^ {2}$$

is usually taken as the unbiased statistical estimator for $\sigma ^ {2}$.
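As a small numerical illustration of the relation between (2) and $s ^ {2}$, the following Python sketch computes both estimators for one sample. The function names and the data are illustrative assumptions, not part of the original text.

```python
from fractions import Fraction

def biased_variance(xs):
    """Sample variance (2): mean squared deviation from the sample mean."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n

def unbiased_variance(xs):
    """Estimator s^2 = n/(n-1) times (2); needs at least two observations."""
    n = len(xs)
    return biased_variance(xs) * n / (n - 1)

# Hypothetical measurements, kept as exact fractions to avoid rounding noise.
sample = [Fraction(v) for v in (2, 4, 4, 4, 5, 5, 7, 9)]
print(biased_variance(sample))    # 4
print(unbiased_variance(sample))  # 32/7
```

The factor $n/(n-1)$ only rescales the estimate; for large $n$ the two values are practically indistinguishable.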

As a measure of the accuracy of the unbiased statistical estimator $\alpha$ for a parameter $a$ one most often uses the variance ${\mathsf D} \alpha$.

The unbiased statistical estimator with the smallest variance is called the best. In the example quoted, the arithmetical mean (1) is the best statistical estimator. However, if the probability distribution of the random variables $X _ {i}$ is different from normal, then (1) need not be the best statistical estimator. For example, if the results of the observations of $X _ {i}$ are uniformly distributed in an interval $( b, c)$, then the best statistical estimator for the mathematical expectation $a = ( b+ c)/2$ will be half the sum of the boundary values:

$$\tag{3 } \alpha = \frac{\min X _ {i} + \max X _ {i} }{2} .$$

The criterion ordinarily used for comparing the accuracy of different statistical estimators is the relative efficiency, i.e. the ratio of the variance of the best estimator to the variance of the given unbiased estimator. For example, if the results of the observations of $X _ {i}$ are uniformly distributed, then the variances of the estimators (1) and (3) are expressed by the formulas

$${\mathsf D} \overline{X}\; = \frac{( c- b) ^ {2} }{12n}$$

and

$$\tag{4 } {\mathsf D} \alpha = \frac{( c- b) ^ {2} }{2( n+ 1) ( n+ 2) } .$$

Since (3) is the best estimator, the relative efficiency of the estimator (1) in the given case is

$$e _ {n} ( \overline{X}\; ) = \frac{6n}{( n+ 1)( n+ 2) } \sim \frac{6}{n} .$$
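The efficiency formula can be checked directly from the two variance formulas. The sketch below is only an illustration: the function names are hypothetical, and integer end-points $b$, $c$ are assumed so that exact fractions can be used.

```python
from fractions import Fraction

def var_mean_uniform(n, b, c):
    """Variance of the arithmetical mean (1) for n uniform observations on (b, c)."""
    return Fraction((c - b) ** 2, 12 * n)

def var_midrange_uniform(n, b, c):
    """Variance (4) of the half-sum of the extreme values (3)."""
    return Fraction((c - b) ** 2, 2 * (n + 1) * (n + 2))

def relative_efficiency(n):
    """e_n of the mean relative to the estimator (3); it does not depend on b, c."""
    return var_midrange_uniform(n, 0, 1) / var_mean_uniform(n, 0, 1)

for n in (1, 2, 10, 100):
    print(n, relative_efficiency(n))
```

For $n = 1$ and $n = 2$ the two estimators coincide, so the efficiency equals one; it then decays like $6/n$.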

For a large number of observations $n$, it is usually required that the chosen statistical estimator tends in probability to the true value of the parameter $a$, i.e. that for every $\epsilon > 0$,

$$\lim\limits _ {n \rightarrow \infty } {\mathsf P} \{ | \alpha - a | > \epsilon \} = 0;$$

such statistical estimators are called consistent (for example, any unbiased estimator with variance tending to zero, when $n \rightarrow \infty$, is consistent; see also Consistent estimator). Insofar as the order of tendency to the limit is of significance, the asymptotically best estimators are the asymptotically efficient statistical estimators, i.e. those for which

$$\frac{ {\mathsf E} ( \alpha - a) }{\sqrt { {\mathsf E} ( \alpha - a) ^ {2} } } \rightarrow 0 \ \textrm{ and } \ e _ {n} ( \alpha ) \rightarrow 1,$$

when $n \rightarrow \infty$. For example, if $X _ {1} \dots X _ {n}$ are identically normally distributed, then (2) is an asymptotically efficient estimator for the unknown parameter $\sigma ^ {2} = {\mathsf D} X _ {i}$, since, when $n \rightarrow \infty$, the variance of $\widehat{s} {} ^ {2}$ and that of the best estimator $\widehat{s} {} ^ {2} n/( n- 1)$ are asymptotically equivalent:

$$\frac{ {\mathsf D} {\widehat{s} } {} ^ {2} }{ {\mathsf D} [ {\widehat{s} } {} ^ {2} n/( n- 1)] } = \frac{( n- 1) ^ {2} }{n ^ {2} } ,\ \ {\mathsf D} [ {\widehat{s} } {} ^ {2} n/( n- 1)] = \frac{2 \sigma ^ {4} }{n-1} ,$$

and, moreover,

$${\mathsf E} ( {\widehat{s} } {} ^ {2} - \sigma ^ {2} ) = \frac{- \sigma ^ {2} }{n} .$$

Of prime importance in the theory of statistical estimation and its applications is the fact that the quadratic deviation of a statistical estimator for a parameter $a$ is bounded from below by a certain quantity (R. Fisher proposed that this quantity be characterized by the amount of information regarding the unknown parameter $a$ contained in the results of the observations). For example, if $X _ {1} \dots X _ {n}$ are independent and identically distributed, with probability density $p( x; a)$, and if $\alpha = \phi ( X _ {1} \dots X _ {n} )$ is a statistical estimator for a certain function $g( a)$ of the parameter $a$, then in a broad class of cases

$$\tag{5 } {\mathsf E} [ \alpha - g( a)] ^ {2} \geq \frac{nb ^ {2} ( a) I( a) + [ g ^ \prime ( a) + b ^ \prime ( a)] ^ {2} }{nI( a) } ,$$

where

$$b( a) = {\mathsf E} [ \alpha - g( a)] \ \textrm{ and } \ \ I( a) = {\mathsf E} \left [ \frac{\partial \mathop{\rm ln} p( X; a) }{\partial a } \right ] ^ {2} .$$

The function $b( a)$ is called the bias, while the quantity inverse to the right-hand side of inequality (5) is called the Fisher information, with respect to the function $g( a)$, contained in the results of the observations. In particular, if $\alpha$ is an unbiased statistical estimator of the parameter $a$, then

$$g( a) \equiv a,\ b( a) \equiv 0 ,$$

and

$$\tag{6 } {\mathsf E} [ \alpha - g( a)] ^ {2} = {\mathsf D} \alpha \geq \frac{1}{nI( a) } ,$$

whereby the information $nI( a)$ in this instance is proportional to the number of observations (the function $I( a)$ is called the information contained in one observation).

The basic conditions under which the inequalities (5) and (6) hold are smoothness of the estimator $\alpha$ as a function of $X _ {i}$, and the requirement that the set of points $x$ where $p( x; a) = 0$ does not depend on the parameter $a$. The latter condition is not fulfilled, for example, in the case of a uniform distribution, and the variance of the estimator (3) therefore does not satisfy inequality (6) (according to (4), this variance is a quantity of order $n ^ {-2}$, while, according to inequality (6), it cannot have an order of smallness higher than $n ^ {-1}$).

The inequalities (5) and (6) also hold for discretely distributed random variables $X _ {i}$: In defining the information $I( a)$, the density $p( x; a)$ must be replaced by the probability of the event $\{ X = x \}$.
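For a discrete illustration of inequality (6), the sketch below evaluates $I( a)$ for a single Bernoulli observation (a hypothetical example, not taken from the text) by summing over the two outcomes, and compares the bound $1/( nI( a))$ with the variance of the sample mean, which is an unbiased estimator of the success probability $a$.

```python
from fractions import Fraction

def bernoulli_pmf(x, a):
    """P{X = x} for a Bernoulli variable with success probability a."""
    return a if x == 1 else 1 - a

def score(x, a):
    """d/da ln P{X = x}: 1/a for x = 1 and -1/(1 - a) for x = 0."""
    return 1 / a if x == 1 else -1 / (1 - a)

def fisher_information(a):
    """I(a) = E[(d/da ln P{X = x})^2], summed over the two outcomes."""
    return sum(bernoulli_pmf(x, a) * score(x, a) ** 2 for x in (0, 1))

def information_bound(a, n):
    """Right-hand side 1/(n I(a)) of inequality (6)."""
    return 1 / (n * fisher_information(a))

a, n = Fraction(1, 4), 20
print(fisher_information(a))    # 16/3
print(information_bound(a, n))  # 3/320
print(a * (1 - a) / n)          # variance of the sample mean: 3/320
```

In this example the bound is attained, so the sample mean is the best estimator in the sense used above.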

If the variance of an unbiased statistical estimator $\alpha ^ {*}$ for the parameter $a$ coincides with the right-hand side of inequality (6), then $\alpha ^ {*}$ is the best estimator. The converse assertion, generally speaking, is not true: the variance of the best statistical estimator can exceed $[ nI( a)] ^ {-1}$. However, as $n \rightarrow \infty$, the variance of the best estimator, ${\mathsf D} \alpha ^ {*}$, is asymptotically equivalent to the right-hand side of (6), i.e. $n {\mathsf D} \alpha ^ {*} \rightarrow 1/I( a)$. In this way, using the Fisher information, it is possible to define the asymptotic efficiency of an unbiased statistical estimator $\alpha$, by setting

$$\tag{7 } e _ \infty ( \alpha ) = \ \lim\limits _ {n \rightarrow \infty } \frac{ {\mathsf D} \alpha ^ {*} }{ {\mathsf D} \alpha } = \ \lim\limits _ {n \rightarrow \infty } \frac{1}{nI( a) {\mathsf D} \alpha } .$$

One information approach to the theory of statistical estimators which proves to be particularly fruitful is that where the density (in the discrete instance, the probability) of the joint distribution of the random variables $X _ {1} \dots X _ {n}$ can be represented in the form of the product of two functions $h( x _ {1} \dots x _ {n} ) q[ y( x _ {1} \dots x _ {n} ); a]$, the first of which does not depend on $a$ while the second is the density of the distribution of a certain random variable $Z = y( X _ {1} \dots X _ {n} )$, called a sufficient statistic.

One of the most frequently used methods of finding point estimators is the method of moments (cf. Moments, method of (in probability theory)). According to this method, the theoretical distribution, which depends on unknown parameters, is put in correspondence with the discrete sample distribution, which is defined by the results of observations of $X _ {i}$ and which is the probability distribution of an imaginary random variable taking the values $X _ {1}, \dots, X _ {n}$ with identical probabilities equal to $1/n$ (the sample distribution can be seen as a point estimator for the theoretical distribution). The statistical estimators for the moments of the theoretical distribution are taken to be the corresponding moments of the sample distribution; for example, for the mathematical expectation $a$ and variance $\sigma ^ {2}$, the method of moments provides the following statistical estimators: the sample mean (1) and the sample variance (2). The unknown parameters are usually expressed (exactly or approximately) in the form of functions of several moments of the theoretical distribution. By replacing theoretical moments in these functions by sample moments, the required statistical estimators are obtained. This method, which in practice often reduces to comparatively simple calculations, generally gives a statistical estimator of low asymptotic efficiency (see the above example of the estimator of the mathematical expectation of a uniform distribution).

Another method for finding statistical estimators, which is more complete from the theoretical point of view, is the maximum-likelihood method. According to this method, the likelihood function $L( a)$ is considered, which is a function of the unknown parameter $a$, and which is obtained by substituting the random variables $X _ {i}$ for the arguments in the density $p( x _ {1}, \dots, x _ {n} ; a)$ of the joint distribution; if the $X _ {i}$ are independent and identically distributed with probability density $p( x; a)$, then

$$L( a) = p( X _ {1} ; a) \dots p( X _ {n} ; a)$$

(if the $X _ {i}$ are discretely distributed, then in defining the likelihood function $L$ the density should be replaced by the probability of the events $\{ X _ {i} = x _ {i} \}$). The value $\alpha$ for which $L( \alpha )$ is largest is used as the maximum-likelihood estimator for the unknown parameter $a$ (instead of $L$, the so-called logarithmic likelihood function is often considered: $l( \alpha ) = \mathop{\rm ln} L( \alpha )$; owing to the monotone nature of the logarithm, the maximum points of $L( \alpha )$ and $l( \alpha )$ coincide).

The basic merit of maximum-likelihood estimators lies in the fact that, given certain general conditions, they are consistent, asymptotically efficient and approximately normally distributed. These properties mean that if $\alpha$ is a maximum-likelihood estimator, then, when $n \rightarrow \infty$,

$${\mathsf E} \alpha \sim a \ \textrm{ and } \ \ {\mathsf E} ( \alpha - a) ^ {2} \sim {\mathsf D} \alpha \sim \sigma _ {n} ^ {2} ( a) = \frac{1}{ {\mathsf E} \left [ \frac{d}{da} l ( a) \right ] ^ {2} }$$

(if the $X _ {i}$ are independent and identically distributed, then $\sigma _ {n} ^ {2} ( a) = [ nI( a)] ^ {-1}$). Thus, for the distribution function of a normalized statistical estimator $( \alpha - a)/ \sigma _ {n} ( a)$, the limit relation

$$\tag{8 } \lim\limits _ {n \rightarrow \infty } {\mathsf P} \left \{ \frac{\alpha - a }{\sigma _ {n} ( a) } < x \right \} = \ \frac{1}{\sqrt {2 \pi } } \int\limits _ {- \infty } ^ { x } e ^ {- t ^ {2} /2 } dt \equiv \ \Phi ( x)$$

holds.

The advantages of the maximum-likelihood estimator justify the amount of calculation involved in seeking the maximum of the function $L$ (or $l$). In certain cases, the amount of calculation is greatly reduced as a result of the following properties: firstly, if $\alpha ^ {*}$ is a statistical estimator for which inequality (6) becomes an equality, then the maximum-likelihood estimator is unique and coincides with $\alpha ^ {*}$; secondly, if a sufficient statistic $Z$ exists, then the maximum-likelihood estimator is a function of $Z$.

For example, let $X _ {1} \dots X _ {n}$ be independent and normally distributed, and such that

$$p( x; a, \sigma ) = \ \frac{1}{\sigma \sqrt {2 \pi } } \mathop{\rm exp} \left \{ - \frac{1}{2 \sigma ^ {2} } ( x - a) ^ {2} \right \} ,$$

then

$$l( a, \sigma ) = \mathop{\rm ln} L( a, \sigma ) = - \frac{n}{2} \mathop{\rm ln} ( 2 \pi ) - n \mathop{\rm ln} \sigma - \frac{1}{2 \sigma ^ {2} } \sum _ {i=1} ^ {n} ( X _ {i} - a) ^ {2} .$$

The coordinates $a = a _ {0}$ and $\sigma = \sigma _ {0}$ of the maximum point of the function $l( a, \sigma )$ satisfy the system of equations

$$\frac{\partial l }{\partial a } \equiv \ \frac{1}{\sigma ^ {2} } \sum ( X _ {i} - a) = 0,$$

$$\frac{\partial l }{\partial \sigma } \equiv - \frac{n}{\sigma ^ {3} } \left [ \sigma ^ {2} - \frac{1}{n} \sum ( X _ {i} - a) ^ {2} \right ] = 0.$$

Thus, $a _ {0} = \overline{X}\; = \sum X _ {i} /n$, $\sigma _ {0} ^ {2} = {\widehat{s} } {} ^ {2} = \sum ( X _ {i} - \overline{X}\; ) ^ {2} /n$, and in the given case (1) and (2) are maximum-likelihood estimators, whereby $\overline{X}\;$ is the best statistical estimator of the parameter $a$, normally distributed (${\mathsf E} \overline{X}\; = a$, ${\mathsf D} \overline{X}\; = \sigma ^ {2} /n$), while ${\widehat{s} } {} ^ {2}$ is an asymptotically efficient statistical estimator of the parameter $\sigma ^ {2}$, distributed approximately normally for large $n$ (${\mathsf E} {\widehat{s} } {} ^ {2} \sim \sigma ^ {2}$, ${\mathsf D} {\widehat{s} } {} ^ {2} \sim 2 \sigma ^ {4} /n$). The two estimators are independent and together form a sufficient statistic for $( a, \sigma )$.
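The normal example can be reproduced numerically. The following sketch evaluates the logarithmic likelihood function, computes the closed-form maximum point, and checks that small perturbations of either parameter lower the likelihood. The data and the function names are illustrative assumptions.

```python
import math

def log_likelihood(xs, a, sigma):
    """l(a, sigma) for independent normal observations xs."""
    n = len(xs)
    ss = sum((x - a) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi) - n * math.log(sigma) - ss / (2 * sigma ** 2)

def normal_mle(xs):
    """Closed-form maximum point: the sample mean (1) and the square root of (2)."""
    n = len(xs)
    a0 = sum(xs) / n
    sigma0 = math.sqrt(sum((x - a0) ** 2 for x in xs) / n)
    return a0, sigma0

xs = [4.9, 5.3, 5.1, 4.7, 5.0, 5.4]   # hypothetical measurements
a0, sigma0 = normal_mle(xs)
best = log_likelihood(xs, a0, sigma0)
# Nearby parameter values should never give a larger log-likelihood.
for da, ds in [(0.05, 0.0), (-0.05, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    assert log_likelihood(xs, a0 + da, sigma0 + ds) < best
print(a0, sigma0)
```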

As a further example, suppose that

$$p( x; a) = \frac{1}{\pi [ 1+( x- a) ^ {2} ] } .$$

This density gives a satisfactory description of the distribution of one of the coordinates of the particles reaching a plane screen and emanating from a point outside the screen ( $a$ is the coordinate of the projection of the source onto the screen, and is presumed to be unknown). The mathematical expectation of this distribution does not exist, since the corresponding integral is divergent. For this reason it is not possible to find a statistical estimator of $a$ by means of the method of moments. The formal use of the arithmetical mean (1) as a statistical estimator is meaningless, since $\overline{X}\;$ is distributed in the given instance with the same density $p( x; a)$ as every single result of the observations. For estimation of $a$ it is possible to make use of the property that the distribution in question is symmetric relative to the point $x= a$, where $a$ is the median of the theoretical distribution. By slightly modifying the method of moments, the sample median $\mu$ can be used as a statistical estimator. When $n \geq 3$, it is unbiased for $a$ and if $n$ is large, $\mu$ is distributed approximately normally with variance

$${\mathsf D} \mu \sim \frac{\pi ^ {2} }{4n} .$$

At the same time,

$$l( a) = - n \mathop{\rm ln} \pi - \sum _ {i=1} ^ {n} \mathop{\rm ln} [ 1 + ( X _ {i} - a) ^ {2} ],$$

thus $nI( a) = n/2$ and, according to (7), the asymptotic efficiency $e _ \infty ( \mu )$ is equal to $8/ \pi ^ {2} \approx 0.811$. Thus, for the sample median $\mu$ to be as accurate a statistical estimator for $a$ as the maximum-likelihood estimator $\alpha$, the number of observations has to be increased by a factor of $\pi ^ {2} /8 \approx 1.23$, i.e. by roughly a quarter. If the cost of the experiment is high, then the estimator $\alpha$ should be used to determine $a$; in the given case it is defined as the root of the equation

$$\frac{\partial l }{\partial a } \equiv 2 \sum _ {i=1} ^ {n} \frac{X _ {i} - a }{1 + ( X _ {i} - a) ^ {2} } = 0.$$

As a first approximation, $\alpha _ {0} = \mu$ is used, and this equation is then solved by successive approximation using the formula

$$\alpha _ {k+1} = \alpha _ {k} + \frac{4}{n} \sum _ {i=1} ^ {n} \frac{X _ {i} - \alpha _ {k} }{1 + ( X _ {i} - \alpha _ {k} ) ^ {2} } .$$
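The successive approximation just described can be sketched in a few lines. The stopping tolerance, the iteration cap and the data below are assumptions made for the illustration; the scheme is not guaranteed to converge for every sample, which is why the loop is capped.

```python
import statistics

def score_sum(xs, alpha):
    """Sum of (x - alpha) / (1 + (x - alpha)^2); zero at a root of the likelihood equation."""
    return sum((x - alpha) / (1.0 + (x - alpha) ** 2) for x in xs)

def scoring_step(xs, alpha):
    """One successive approximation: alpha + (4/n) * score_sum."""
    return alpha + 4.0 / len(xs) * score_sum(xs, alpha)

def cauchy_mle(xs, tol=1e-10, max_iter=200):
    """Iterate from the sample median until the step is smaller than tol.

    tol and max_iter are illustrative choices, not values from the text.
    """
    alpha = statistics.median(xs)
    for _ in range(max_iter):
        new_alpha = scoring_step(xs, alpha)
        if abs(new_alpha - alpha) < tol:
            return new_alpha
        alpha = new_alpha
    return alpha

xs = [-3.0, 0.5, 1.5, 2.0, 2.5, 3.5, 7.0]   # hypothetical data, symmetric about 2
print(cauchy_mle(xs))
```

For a sample that is symmetric about some point, that point is both the sample median and a root of the likelihood equation, so the iteration stops at once; for asymmetric samples several steps are usually needed.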

Interval estimators.

An interval estimator is a statistical estimator which is represented geometrically as a set of points in the parameter space. An interval estimator can be seen as a set of point estimators. This set depends on the results of observations, and is consequently random; every interval estimator is therefore (partly) characterized by the probability with which this estimator will "cover" the unknown parameter point. This probability, in general, depends on unknown parameters; therefore, as a characteristic of the reliability of an interval estimator a confidence coefficient is used; this is the lowest possible value of the given probability. Interesting statistical conclusions can be drawn for only those interval estimators which have a confidence coefficient close to one.

If a single parameter $a$ is estimated, then an interval estimator is usually a certain interval $( \beta , \gamma )$ (the so-called confidence interval), the end-points $\beta$ and $\gamma$ of which are functions of the observations; the confidence coefficient $\omega$ in the given case is defined as the lower bound, over all possible values of the parameter $a$, of the probability of the simultaneous realization of the two events $\{ \beta < a \}$ and $\{ \gamma > a \}$:

$$\omega = \inf _ { a } {\mathsf P} \{ \beta < a < \gamma \} .$$

If the mid-point $( \beta + \gamma )/2$ of such an interval is taken as a point estimator for the parameter $a$, then it can be claimed, with probability not less than $\omega$, that the absolute error of this statistical estimator does not exceed half the length of the interval, $( \gamma - \beta )/2$. In other words, if one is guided by this rule of estimation of the absolute error, then an erroneous conclusion will be obtained on the average in less than $100( 1- \omega )\%$ of the cases. Given a fixed confidence coefficient $\omega$, the most suitable are the shortest confidence intervals, for which the mathematical expectation of the length ${\mathsf E} ( \gamma - \beta )$ attains its lowest value.

If the distribution of random variables $X _ {i}$ depends only on one unknown parameter $a$, then the construction of the confidence interval is usually realized by the use of a certain point estimator $\alpha$. For the majority of cases of practical interest, the distribution function ${\mathsf P} \{ \alpha < x \} = F( x; a)$ of a sensibly chosen statistical estimator $\alpha$ depends monotonically on the parameter $a$. Under these conditions, when seeking an interval estimator it makes sense to insert $x = \alpha$ in $F( x; a)$ and to determine the roots $a _ {1} = a _ {1} ( \alpha , \omega )$ and $a _ {2} = a _ {2} ( \alpha , \omega )$ of the equations

$$\tag{9 } F( \alpha ; a _ {1} ) = \ \frac{1 - \omega }{2} \ \textrm{ and } \ \ F( \alpha + 0; a _ {2} ) = \frac{1 + \omega }{2} ,$$

where

$$F( x+ 0; a) = \lim\limits _ {\Delta \rightarrow 0 } F( x + \Delta ^ {2} ; a)$$

(for continuous distributions $F( x+ 0; a) = F( x; a)$). The points with coordinates $a _ {1} ( \alpha ; \omega )$ and $a _ {2} ( \alpha ; \omega )$ bound the confidence interval with confidence coefficient $\omega$. It is reasonable to expect that such a simply constructed interval differs in many cases from the optimal (shortest) interval. However, if $\alpha$ is an asymptotically efficient statistical estimator for $a$, then, given a sufficiently large number of observations, such an interval estimator differs from the optimal, although in practice the difference is immaterial. This is particularly true for maximum-likelihood estimators, since they are asymptotically normally distributed (see (8)). In cases where solving the equations (9) is difficult, the interval estimator is calculated approximately, using a maximum-likelihood point estimator and the relation (8):

$$\beta \approx \beta ^ {*} = \alpha - x \sigma _ {n} ( \alpha ) \ \textrm{ and } \ \ \gamma \approx \gamma ^ {*} = \alpha + x \sigma _ {n} ( \alpha ) ,$$

where $x$ is the root of the equation $\Phi ( x) = ( 1+ \omega )/2$.
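The approximate end-points $\beta ^ {*}$ and $\gamma ^ {*}$ are easy to compute once $\sigma _ {n}$ is available. The sketch below uses the standard-library `statistics.NormalDist` to invert $\Phi$; the numerical example (a Cauchy location parameter with $nI( a) = n/2$, a sample size of 50 and an estimate of 1.3) is hypothetical.

```python
from statistics import NormalDist

def approximate_interval(alpha, sigma_n, omega):
    """Approximate confidence interval (beta*, gamma*) based on relation (8).

    alpha   -- maximum-likelihood point estimate
    sigma_n -- the value sigma_n(alpha), supplied by the caller
    omega   -- desired confidence coefficient, 0 < omega < 1
    """
    x = NormalDist().inv_cdf((1.0 + omega) / 2.0)   # root of Phi(x) = (1 + omega)/2
    return alpha - x * sigma_n, alpha + x * sigma_n

# Hypothetical example: Cauchy location, where n I(a) = n/2, so sigma_n = sqrt(2/n).
n, alpha = 50, 1.3
beta, gamma = approximate_interval(alpha, (2.0 / n) ** 0.5, 0.95)
print(beta, gamma)
```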

If $n \rightarrow \infty$, then the true confidence coefficient of the interval estimator $( \beta ^ {*} , \gamma ^ {*} )$ tends to $\omega$. In a more general case, the distribution of the results of observations $X _ {i}$ depends on various parameters $a, b , . . .$. Then the above rules for the construction of confidence intervals often prove to be not feasible, since the distribution of a point estimator $\alpha$ depends, as a rule, not only on $a$, but also on other parameters. However, in cases of practical interest the statistical estimator $\alpha$ can be replaced by a function of the observations $X _ {i}$ and an unknown parameter $a$, the distribution of which does not depend (or "nearly does not depend" ) on all unknown parameters. An example of such a function is a normalized maximum-likelihood estimator $( \alpha - a)/ \sigma _ {n} ( a, b , . . . )$; if in the denominator the arguments $a, b , . . .$ are replaced by maximum-likelihood estimators $\alpha , \beta \dots$ then the limit distribution will remain the same as in formula (8). The approximate confidence intervals for each parameter in isolation can therefore be constructed in the same way as in the case of a single parameter.

As has already been noted, if $X _ {1}, \dots, X _ {n}$ are independent and identically normally distributed random variables, then $\overline{X}\;$ and $s ^ {2}$ are the best statistical estimators for the parameters $a$ and $\sigma ^ {2}$, respectively. The distribution function of the statistical estimator $\overline{X}\;$ is expressed by the formula

$${\mathsf P} \{ \overline{X}\; < x \} = \Phi \left [ \frac{\sqrt n ( x- a) } \sigma \right ]$$

and, consequently, it depends not only on $a$ but also on $\sigma$. At the same time, the distribution of the so-called Student statistic

$$\frac{\sqrt n ( \overline{X}\; - a) }{s} = \tau$$

does not depend on $a$ or $\sigma$, and

$${\mathsf P} \{ | \tau | \leq t \} = \omega _ {n-1} ( t) = C _ {n-1} \int\limits _ { 0 } ^ { t } \left ( 1+ \frac{\nu ^ {2} }{n-1} \right ) ^ {-n/2 } d \nu ,$$

where the constant $C _ {n-1}$ is chosen so that the equality $\omega _ {n-1} ( \infty ) = 1$ is satisfied. Thus, the confidence coefficient $\omega _ {n-1} ( t)$ corresponds to the confidence interval

$${\overline{X}\; - } \frac{st}{\sqrt n } < a < {\overline{X}\; + } \frac{st}{\sqrt n } .$$
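A minimal sketch of this construction follows. It evaluates the Student function by Simpson's rule applied to the integral defining it, with the normalizing constant written through the gamma function (a standard closed form, not given in the text), and returns the interval together with its confidence coefficient. The data and the value $t = 2.571$ are illustrative.

```python
import math

def student_omega(t, dof, steps=2000):
    """P{|tau| <= t} for `dof` degrees of freedom (dof = n - 1 in the text).

    Integrates C * (1 + v^2/dof)^(-(dof + 1)/2) over [0, t] by Simpson's rule
    (steps must be even); C makes the total probability equal to one.
    """
    c = 2.0 * math.gamma((dof + 1) / 2) / (math.sqrt(math.pi * dof) * math.gamma(dof / 2))
    f = lambda v: (1.0 + v * v / dof) ** (-(dof + 1) / 2)
    h = t / steps
    total = f(0.0) + f(t)
    total += sum((4 if i % 2 else 2) * f(i * h) for i in range(1, steps))
    return c * total * h / 3.0

def student_interval(xs, t):
    """Interval mean -/+ s t / sqrt(n) and its confidence coefficient."""
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    half = s * t / math.sqrt(n)
    return mean - half, mean + half, student_omega(t, n - 1)

print(student_interval([4.9, 5.3, 5.1, 4.7, 5.0, 5.4], 2.571))
```

With six observations (five degrees of freedom) the value $t = 2.571$ gives a confidence coefficient close to 0.95.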

The distribution of the estimator $s ^ {2}$ depends only on $\sigma ^ {2}$, while the distribution function of $s ^ {2}$ is defined by the formula

$${\mathsf P} \left \{ s ^ {2} < \frac{\sigma ^ {2} x }{n-1} \right \} = G _ {n-1} ( x) = D _ {n-1} \int\limits _ { 0 } ^ { x } v ^ {( n- 3)/2 } e ^ {-v/2 } dv,$$

where the constant $D _ {n-1}$ is defined by the condition $G _ {n-1} ( \infty ) = 1$ (the so-called $\chi ^ {2}$-distribution with $n- 1$ degrees of freedom, cf. Chi-squared distribution). Since the distribution function ${\mathsf P} \{ s ^ {2} < y \} = G _ {n-1} ( ( n- 1) y/ \sigma ^ {2} )$ depends monotonically on $\sigma$ (it decreases as $\sigma$ increases), rule (9) can be used to construct an interval estimator. Thus, if $x _ {1}$ and $x _ {2}$ are the roots of the equations $G _ {n-1} ( x _ {1} ) = ( 1- \omega )/2$ and $G _ {n-1} ( x _ {2} ) = ( 1+ \omega )/2$, then the confidence coefficient $\omega$ corresponds to the confidence interval

$$\frac{( n- 1) s ^ {2} }{x _ {2} } < \sigma ^ {2} < \frac{( n- 1) s ^ {2} }{x _ {1} } .$$

Hence it follows that the confidence interval for the relative error is defined by the inequalities

$$\frac{x _ {1} }{n-1} - 1 < \frac{s ^ {2} - \sigma ^ {2} }{\sigma ^ {2} } < \frac{x _ {2} }{n-1} - 1.$$
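When tables are not at hand, the roots $x _ {1}$ and $x _ {2}$ can be found numerically. The sketch below is one possible way, assuming at least two degrees of freedom: the distribution function is evaluated by Simpson's rule from the integral defining it (with the constant written through the gamma function, a standard closed form), and each root is located by bisection. All names and the numerical example are hypothetical.

```python
import math

def chi2_cdf(x, k, steps=2000):
    """G_k(x) = D_k * integral_0^x v^(k/2 - 1) e^(-v/2) dv for k >= 2 (Simpson's rule)."""
    if x <= 0:
        return 0.0
    d = 1.0 / (2 ** (k / 2) * math.gamma(k / 2))
    f = lambda v: v ** (k / 2 - 1) * math.exp(-v / 2)
    h = x / steps
    total = f(0.0) + f(x)
    total += sum((4 if i % 2 else 2) * f(i * h) for i in range(1, steps))
    return d * total * h / 3.0

def chi2_root(p, k):
    """Solve G_k(x) = p by bisection; the bracket is widened until it contains the root."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, k) < p:
        hi *= 2.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if chi2_cdf(mid, k) < p else (lo, mid)
    return (lo + hi) / 2.0

def variance_interval(s2, n, omega):
    """Confidence interval for sigma^2 from the unbiased estimate s2 of n observations."""
    x1 = chi2_root((1 - omega) / 2, n - 1)
    x2 = chi2_root((1 + omega) / 2, n - 1)
    return (n - 1) * s2 / x2, (n - 1) * s2 / x1

print(variance_interval(0.0667, 6, 0.95))
```

The interval is markedly asymmetric about $s ^ {2}$ for small $n$, which reflects the skewness of the $\chi ^ {2}$-distribution.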

Detailed tables of the Student distribution function $\omega _ {n-1} ( t)$ and of the $\chi ^ {2}$-distribution $G _ {n-1} ( x)$ can be found in most textbooks on mathematical statistics.

Until now it has been supposed that the distribution function of the results of observations is known up to values of various parameters. However, in practice the form of the distribution function is often unknown. In this case, when estimating the parameters, the so-called non-parametric methods in statistics can prove useful (i.e. methods which do not depend on the initial probability distribution). Suppose, for example, that the median $m$ of a theoretical continuous distribution of independent random variables $X _ {1} \dots X _ {n}$ has to be estimated (for symmetric distributions, the median coincides with the mathematical expectation, provided, of course, that it exists). Let $Y _ {1} \leq \dots \leq Y _ {n}$ be the same variables $X _ {i}$ arranged in ascending order. Then, if $k$ is an integer which satisfies the inequalities $1 \leq k \leq n/2$,

$${\mathsf P} \{ Y _ {k} < m < Y _ {n-k+1} \} = 1- 2 \sum _ {r=0} ^ {k-1} \binom{n}{r} \left ( \frac{1}{2} \right ) ^ {n} = \omega _ {n,k} .$$

Thus, $( Y _ {k} , Y _ {n-k+1} )$ is an interval estimator for $m$ with confidence coefficient $\omega = \omega _ {n,k}$. This conclusion holds for any continuous distribution of the random variables $X _ {i}$.
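Since the confidence coefficient $\omega _ {n,k}$ involves only binomial coefficients, it can be computed exactly. A minimal sketch, with hypothetical function names and data:

```python
from fractions import Fraction
from math import comb

def median_confidence(n, k):
    """omega_{n,k} = 1 - 2 * sum_{r<k} C(n, r) / 2^n, for 1 <= k <= n/2."""
    if not 1 <= k <= n / 2:
        raise ValueError("k must satisfy 1 <= k <= n/2")
    return 1 - 2 * Fraction(sum(comb(n, r) for r in range(k)), 2 ** n)

def median_interval(xs, k):
    """Order statistics (Y_k, Y_{n-k+1}) and their confidence coefficient."""
    ys = sorted(xs)
    n = len(ys)
    return ys[k - 1], ys[n - k], median_confidence(n, k)

xs = [3.1, 0.4, 2.2, 5.9, 1.7, 4.4, 2.8, 3.6, 0.9, 6.3]   # hypothetical sample
print(median_interval(xs, 2))   # (0.9, 5.9, Fraction(501, 512))
```

Increasing $k$ shortens the interval at the price of a lower confidence coefficient.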

It has already been noted that a sample distribution is a point estimator for an unknown theoretical distribution. Moreover, the sample distribution function $F _ {n} ( x)$ is an unbiased estimator for a theoretical distribution function $F( x)$. Here, as A.N. Kolmogorov demonstrated, the distribution of the statistic

$$\lambda _ {n} = \sqrt n \max _ {- \infty < x < \infty } | F _ {n} ( x) - F( x) |$$

does not depend on the unknown theoretical distribution and, when $n \rightarrow \infty$, tends to a limit distribution $K( y)$, which is called a Kolmogorov distribution. Thus, if $y$ is the solution of the equation $K( y) = \omega$, then it can be claimed, with probability $\omega$, that the graph of the theoretical distribution function $F( x)$ is completely "covered" by the strip enclosed between the graphs of the functions $F _ {n} ( x) \pm y/ \sqrt n$ (when $n \geq 20$, the difference between the exact and limit distributions of the statistic $\lambda _ {n}$ is immaterial). An interval estimator of this type is called a confidence region. See also Interval estimator.
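The statistic $\lambda _ {n}$ and the limit distribution can be evaluated as follows. The alternating series used for $K( y)$ is the standard one and is not quoted in the text above; the truncation at 100 terms, the function names and the data are assumptions of this sketch, and the truncated series is only accurate for moderate and large $y$.

```python
import math

def kolmogorov_statistic(xs, cdf):
    """lambda_n = sqrt(n) * sup_x |F_n(x) - F(x)| for a continuous cdf F."""
    ys = sorted(xs)
    n = len(ys)
    d = 0.0
    for i, y in enumerate(ys):
        # F_n jumps from i/n to (i + 1)/n at y; the supremum is reached at a jump.
        d = max(d, abs((i + 1) / n - cdf(y)), abs(cdf(y) - i / n))
    return math.sqrt(n) * d

def kolmogorov_cdf(y, terms=100):
    """Limit distribution K(y) = 1 - 2 * sum_{k>=1} (-1)^(k-1) exp(-2 k^2 y^2)."""
    if y <= 0:
        return 0.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * y * y) for k in range(1, terms + 1))
    return 1.0 - 2.0 * s

def band_half_width(n, y):
    """Half-width y / sqrt(n) of the strip around F_n."""
    return y / math.sqrt(n)

uniform_cdf = lambda x: min(max(x, 0.0), 1.0)   # hypothetical theoretical distribution
print(kolmogorov_statistic([0.12, 0.35, 0.41, 0.68, 0.93], uniform_cdf))
print(kolmogorov_cdf(1.36))
```

The value $y = 1.36$ gives $K( y)$ close to 0.95, so a strip of half-width $1.36/ \sqrt n$ corresponds to a confidence coefficient of about 0.95 for large $n$.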

Statistical estimators in the theory of errors.

The theory of errors is an area of mathematical statistics devoted to the numerical determination of unknown variables by means of results of measurements. Owing to the random nature of measurement errors, and possibly of the actual phenomenon being studied, these results are not all equally correct: when measurements are repeated, some results are encountered more frequently, some less frequently.

The theory of errors is based on a mathematical model according to which the totality of all conceivable results of the measurements is treated as the set of values of a certain random variable. The theory of statistical estimators is therefore of considerable importance. The conclusions drawn from the theory of errors are of a statistical character. The sense and content of these conclusions (and indeed of the conclusions of the theory of statistical estimation) become clear only in the light of the law of large numbers (an example of this approach is the statistical interpretation of the sense of the confidence coefficient discussed above).

When the result $X$ of a measurement is treated as a random variable, three basic types of measurement errors are distinguished: systematic, random and gross (qualitative descriptions of these errors are given under Errors, theory of). Here, the difference $X- a$ is called the error of the measurement of the unknown variable $a$; the mathematical expectation of this difference, ${\mathsf E} ( X- a) = b$, is called the systematic error (if $b= 0$, then the measurements are said to be free of systematic errors), while the difference $\delta = X- a- b$ is called the random error (${\mathsf E} \delta = 0$). Thus, if $n$ independent measurements of the variable $a$ are taken, then their results can be written in the form of the equalities

$$\tag{10 } X _ {i} = a + b + \delta _ {i} ,\ \ i = 1 \dots n,$$

where $a$ and $b$ are constants, while $\delta _ {i}$ are random variables. In a more general case

$$\tag{11 } X _ {i} = a + ( b + \beta _ {i} ) + \delta _ {i} ,\ \ i = 1 \dots n,$$

where $\beta _ {i}$ are random variables which do not depend on $\delta _ {i}$, and which are equal to zero with probability very close to one (every other value $\beta _ {i} \neq 0$ is therefore improbable). The values $\beta _ {i}$ are called the gross errors (or outliers).

The problem of estimating (and eliminating) systematic errors does not normally fall within the limits of mathematical statistics. Two exceptions to this rule are the method of standards, in which $b$ is estimated from a series of measurements of a known value $a$ (in this method $b$ is the quantity to be estimated, while $a$ is known), and analysis of variance (dispersion analysis), in which the systematic divergence between several series of measurements is estimated.

The fundamental problem in the theory of errors is to find a statistical estimator for an unknown variable $a$ and to estimate the accuracy of the measurements. If the systematic error is eliminated $( b= 0)$ and the observations do not contain gross errors, then according to (10), $X _ {i} = a + \delta _ {i}$, and in this case the problem of estimating $a$ reduces to the problem of finding a statistical estimator, optimal in one sense or another, for the mathematical expectation of the identically distributed random variables $X _ {i}$. As shown above, the form of such a statistical (point or interval) estimator depends essentially on the distribution law of the random errors. If this law is known up to several unknown parameters, then the maximum-likelihood method can be used to find an estimator for $a$; otherwise, a statistical estimator for the unknown distribution function of the random errors $\delta _ {i}$ has to be found, using the results of the observations of $X _ {i}$ (the "non-parametric" interval estimator of this function is shown above). In practice, two statistical estimators $\overline{X}\; \approx a$ and $s ^ {2} \approx {\mathsf D} \delta _ {i}$ often suffice (see (1) and (2)). If $\delta _ {i}$ are identically normally distributed, then these statistical estimators are the best; in other cases, these estimators can prove to be quite inefficient.

The appearance of outliers (gross errors) complicates the problem of estimating the parameter $a$. The proportion of observations in which $\beta _ {i} \neq 0$ is usually small, while the mathematical expectation of non-zero $| \beta _ {i} |$ is significantly higher than $\sqrt { {\mathsf D} \delta _ {i} }$ (gross errors arise as a result of random miscalculation, incorrect reading of the measuring equipment, etc.). Results of measurements which contain gross errors are often easily spotted, as they differ greatly from the other results. Under these conditions, the most advisable means of identifying (and eliminating) gross errors is to carry out a direct analysis of the measurements, to check carefully that all experiments were carried out under the same conditions, to make a "double note" of the results, etc. Statistical methods of finding gross errors are only to be used in cases of doubt.

The simplest of these methods is the statistical detection of a single outlier, when either $Y _ {1} = \min X _ {i}$ or $Y _ {n} = \max X _ {i}$ is open to doubt (it is assumed that in the equalities (11) $b= 0$ and that the distribution law of the variables $\delta _ {i}$ is known). In order to establish whether the suspicion of an outlier is justified, a joint interval estimator (a prediction region) for the pair $Y _ {1} , Y _ {n}$ is calculated under the assumption that all $\beta _ {i}$ are equal to zero. If this region "covers" the point with coordinates $( Y _ {1} , Y _ {n} )$, then the suspicion has to be considered statistically unjustified; otherwise, the hypothesis of the presence of an outlier has to be accepted (the rejected observation is then usually discarded, since the value of the outlier cannot be reliably estimated from a single observation).

For example, let $a$ be unknown, let $b= 0$ and let $\delta _ {i}$ be independent and identically normally distributed (the variance is unknown). If all $\beta _ {i} = 0$, then the distribution of the random variable

$$Z = \frac{\max | X _ {i} - \overline{X}\; | }{\widehat{s} }$$

does not depend on unknown parameters (the statistical estimators $\overline{X}\;$ and $\widehat{s}$ are calculated, using all $n$ observations, according to the formulas (1) and (2)). For large values of $z$,

$${\mathsf P} \{ Z > z \} \approx n \left [ 1 - \omega _ {n-2} \left ( z \sqrt { \frac{n- 2 }{n- 1- z ^ {2} } } \right ) \right ] ,$$

where $\omega _ {r} ( t)$ is the Student distribution function, as defined above. Thus, with confidence coefficient

$$\tag{12 } \omega \approx 1 - n \left [ 1 - \omega _ {n-2} \left ( z \sqrt { \frac{n- 2 }{n- 1- z ^ {2} } } \right ) \right ]$$

it can be claimed that in the absence of an outlier the inequality $Z < z$ is satisfied, or, put another way,

$$\overline{X}\; - z \widehat{s} < Y _ {1} < Y _ {n} < \overline{X}\; + z \widehat{s} .$$

(The error in the estimation of the confidence coefficient by means of formula (12) does not exceed $( 1- \omega ) ^ {2} /2$.) Therefore, if all results of the measurements $X _ {i}$ fall within the limits $\overline{X}\; \pm z \widehat{s}$, then there are no grounds for supposing that any measurement contains an outlier.
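The outlier check can be sketched as follows. The Student function is again evaluated by Simpson's rule from the integral defining it, with the constant written through the gamma function; the helper names and the data are hypothetical. Since $Z$ can never exceed $\sqrt {n- 1}$, the tail probability is taken to be zero beyond that point, and the approximation is capped at one because it is meant for large $z$ only.

```python
import math

def max_deviation_statistic(xs):
    """Z = max |X_i - mean| / s_hat, with s_hat^2 the sample variance (2)."""
    n = len(xs)
    mean = sum(xs) / n
    s_hat = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return max(abs(x - mean) for x in xs) / s_hat

def student_omega(t, dof, steps=2000):
    """P{|tau| <= t} for the Student distribution, by Simpson's rule (steps even)."""
    c = 2.0 * math.gamma((dof + 1) / 2) / (math.sqrt(math.pi * dof) * math.gamma(dof / 2))
    f = lambda v: (1.0 + v * v / dof) ** (-(dof + 1) / 2)
    h = t / steps
    total = f(0.0) + f(t)
    total += sum((4 if i % 2 else 2) * f(i * h) for i in range(1, steps))
    return c * total * h / 3.0

def outlier_tail(z, n):
    """Approximate P{Z > z} for large z and n >= 3 observations."""
    if z * z >= n - 1:
        return 0.0          # Z cannot exceed sqrt(n - 1)
    t = z * math.sqrt((n - 2) / (n - 1 - z * z))
    return min(1.0, n * (1.0 - student_omega(t, n - 2)))

xs = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 12.5]   # hypothetical measurements
z = max_deviation_statistic(xs)
print(z, outlier_tail(z, len(xs)))
```

A small tail probability for the observed value of $Z$ indicates that the most extreme measurement is hard to reconcile with the absence of gross errors.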
