# Statistical estimation

One of the fundamental parts of mathematical statistics, concerned with estimating various characteristics of a distribution from random observations drawn from it.

### Example 1.

Let $X _ {1} \dots X _ {n}$ be independent random variables (or observations) with a common unknown distribution ${\mathcal P}$ on the straight line. The empirical (sample) distribution ${\mathcal P} _ {n} ^ \star$, which ascribes the weight $1/n$ to every random point $X _ {i}$, is a statistical estimator for ${\mathcal P}$. The empirical moments

$$a _ \nu = \int\limits x ^ \nu d {\mathcal P} _ {n} ^ \star = \ \frac{1}{n} \sum _ {i=1} ^ {n} X _ {i} ^ \nu$$

serve as estimators for the moments $\alpha _ \nu = \int x ^ \nu d {\mathcal P}$. In particular,

$$\overline{X}\; = \frac{1}{n} \sum _ {i=1} ^ {n} X _ {i}$$

is an estimator for the mean, and

$$s ^ {2} = \frac{1}{n} \sum _ {i=1} ^ {n} ( X _ {i} - \overline{X}\; ) ^ {2}$$

is an estimator for the variance.
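The estimators of Example 1 can be computed directly from their definitions. The following is a minimal sketch; the function names and the data are illustrative assumptions, not part of the original text.

```python
from typing import Sequence


def empirical_moment(sample: Sequence[float], nu: int) -> float:
    """nu-th empirical moment a_nu = (1/n) * sum of X_i ** nu."""
    return sum(x ** nu for x in sample) / len(sample)


def sample_mean(sample: Sequence[float]) -> float:
    """Sample mean, i.e. the first empirical moment a_1."""
    return empirical_moment(sample, 1)


def sample_variance(sample: Sequence[float]) -> float:
    """s^2 with divisor n (not n - 1), as in the formula above."""
    mean = sample_mean(sample)
    return sum((x - mean) ** 2 for x in sample) / len(sample)


data = [1.0, 2.0, 3.0, 4.0]  # made-up observations
print(sample_mean(data))      # 2.5
print(sample_variance(data))  # 1.25
# s^2 can also be written through empirical moments: a_2 - a_1 ** 2
print(empirical_moment(data, 2) - sample_mean(data) ** 2)  # 1.25
```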

## Basic concepts.

In the general theory of estimation, an observation $X$ is a random element with values in a measurable space $( \mathfrak X , \mathfrak A)$, whose unknown distribution belongs to a given family of distributions $P$. The family of distributions can always be parametrized and written in the form $\{ { {\mathcal P} _ \theta } : {\theta \in \Theta } \}$. Here the form of dependence on the parameter and the set $\Theta$ are assumed to be known. The problem of estimating, from an observation $X$, an unknown parameter $\theta$ or the value $g( \theta )$ of a function $g$ at the point $\theta$ consists of constructing a function $\theta ^ \star ( X)$ of the observations that gives a sufficiently good approximation of $\theta$ (respectively, of $g( \theta )$).

Estimators are compared in the following way. Let a non-negative loss function $w( y _ {1} ; y _ {2} )$ be defined on $\Theta \times \Theta$ (respectively, on $g( \Theta ) \times g( \Theta )$), meaning that the use of $\theta ^ \star$ when the actual value is $\theta$ leads to a loss $w( \theta ^ \star ; \theta )$. The mean loss, that is, the risk function $R _ {w} ( \theta ^ \star ; \theta ) = {\mathsf E} _ \theta w( \theta ^ \star ; \theta )$, is taken as a measure of the quality of the statistic $\theta ^ \star$ as an estimator of $\theta$ given the loss function $w$. A partial order relation is thereby introduced on the set of estimators: an estimator $T _ {1}$ is preferable to an estimator $T _ {2}$ if $R _ {w} ( T _ {1} ; \theta ) \leq R _ {w} ( T _ {2} ; \theta )$ for all $\theta \in \Theta$. In particular, an estimator $T$ of the parameter $\theta$ is said to be inadmissible (in relation to the loss function $w$) if an estimator $T ^ \prime$ exists such that $R _ {w} ( T ^ \prime ; \theta ) \leq R _ {w} ( T; \theta )$ for all $\theta \in \Theta$, with strict inequality for some $\theta$. Under this method of comparison many estimators prove to be incomparable and, moreover, the choice of a loss function is to a large extent arbitrary.
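When the risk function has no convenient closed form, it can be approximated by simulation. The sketch below is an illustration under assumed conditions (normal observations with unit variance, quadratic loss, hypothetical function names); it compares the sample mean with the sample median as estimators of the location parameter.

```python
import random
import statistics
from typing import Callable, Sequence

Estimator = Callable[[Sequence[float]], float]


def monte_carlo_risk(estimator: Estimator, theta: float, n: int,
                     trials: int, rng: random.Random) -> float:
    """Approximate R_w(T; theta) = E_theta w(T; theta) for w = squared error."""
    total = 0.0
    for _ in range(trials):
        # Assumed model: n independent N(theta, 1) observations.
        sample = [rng.gauss(theta, 1.0) for _ in range(n)]
        total += (estimator(sample) - theta) ** 2
    return total / trials


rng = random.Random(0)  # fixed seed so the run is reproducible
risk_mean = monte_carlo_risk(statistics.fmean, 0.0, 25, 4000, rng)
risk_median = monte_carlo_risk(statistics.median, 0.0, 25, 4000, rng)
print(risk_mean, risk_median)
```

In this model the risk of the sample mean is exactly $1/n$ for every $\theta$, and the simulated risk of the median comes out larger, so the mean is preferable here in the sense of the partial order above.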

It is sometimes possible to find estimators that are optimal within a certain narrower class of estimators. Unbiased estimators form an important class. If the initial experiment is invariant relative to a certain group of transformations, it is natural to restrict to estimators that do not disrupt the symmetry of the problem (see Equivariant estimator).

Estimators can be compared by their behaviour at "worst" points: An estimator $T _ {0}$ of $\theta$ is called a minimax estimator relative to the loss function $w$ if

$$\sup _ \theta R _ {w} ( T _ {0} ; \theta ) = \ \inf _ { T } \sup _ \theta R _ {w} ( T; \theta ) ,$$

where the lower bound is taken over all estimators $T = T( X)$.

In the Bayesian formulation of the problem (cf. Bayesian approach), the unknown parameter is regarded as a value of a random variable with a priori distribution $Q$ on $\Theta$. In this case, the best estimator $T _ {0}$ relative to the loss function $w$ is defined by the relation

$$r _ {w} ( T _ {0} ) = \ {\mathsf E} w( T _ {0} ; \theta ) = \ \int\limits _ \Theta {\mathsf E} _ \theta w( T _ {0} ; \theta ) Q( d \theta ) = \ \inf _ { T } \int\limits _ \Theta {\mathsf E} _ \theta w( T; \theta ) Q( d \theta ) ,$$

where the lower bound is taken over all estimators $T = T( X)$.

There is a distinction between parametric estimation problems, in which $\Theta$ is a subset of a finite-dimensional Euclidean space, and non-parametric problems. In parametric problems one usually considers loss functions in the form $l( | \theta _ {1} - \theta _ {2} | )$, where $l$ is a non-negative, non-decreasing function on $\mathbf R ^ {+}$. The most frequently used quadratic loss function $| \theta _ {1} - \theta _ {2} | ^ {2}$ plays an important part.

If $T = T( X)$ is a sufficient statistic for the family $\{ { {\mathcal P} _ \theta } : {\theta \in \Theta } \}$, then it is often possible to restrict to estimators $\theta ^ \star = h( T)$. Thus, if $\Theta \subset \mathbf R ^ {k}$, $w( \theta _ {1} ; \theta _ {2} ) = l( | \theta _ {1} - \theta _ {2} | )$, where $l$ is a convex function, and $\theta ^ \star$ is any estimator for $\theta$, then an estimator $h( T)$ exists that is not worse than $\theta ^ \star$; if $\theta ^ \star$ is unbiased, $h( T)$ can also be chosen unbiased (Blackwell's theorem). If $T$ is a complete sufficient statistic for the family $\{ {\mathcal P} _ \theta \}$ and $\theta ^ \star$ is an unbiased estimator for $g( \theta )$, then an unbiased estimator of the form $h( T)$ with minimum variance in the class of unbiased estimators exists (the Lehmann–Scheffé theorem).

As a rule, it is assumed that in parametric estimation problems the elements of the family $\{ { {\mathcal P} _ \theta } : {\theta \in \Theta } \}$ are absolutely continuous with respect to a certain $\sigma$-finite measure $\mu$ and that the density $d {\mathcal P} _ \theta /d \mu = p( x; \theta )$ exists. If $p( x; \theta )$ is a sufficiently smooth function of $\theta$ and the Fisher information matrix

$$I( \theta ) = \ \int\limits _ { \mathfrak X } \frac{dp}{d \theta } ( x, \theta ) \left ( \frac{dp}{d \theta } ( x,\ \theta ) \right ) ^ {T} \frac{\mu ( dx) }{p( x; \theta ) }$$

exists, the estimation problem is said to be regular. For regular problems, the accuracy of the estimation is bounded from below by the Cramér–Rao inequality: If $\Theta \subset \mathbf R ^ {1}$, then for any estimator $T$,

$${\mathsf E} _ \theta | T- \theta | ^ {2} \geq \ \frac{( 1+ ( db / {d \theta } ) ( \theta )) ^ {2} }{I( \theta ) } + b ^ {2} ( \theta ) ,\ \ b( \theta ) = {\mathsf E} _ \theta T- \theta .$$

## Examples of estimation problems.

### Example 2.

The most widespread formulation is that in which a sample of size $n$ is observed: $X _ {1} \dots X _ {n}$ are independent identically-distributed variables taking values in a measurable space $( \mathfrak X , \mathfrak A)$ with common distribution density $f( x, \theta )$ relative to a measure $\nu$, and $\theta \in \Theta$. In regular problems, if $I( \theta )$ is the Fisher information in one observation, then the Fisher information of the whole sample is $I _ {n} ( \theta ) = nI( \theta )$. The Cramér–Rao inequality takes the form

$${\mathsf E} _ \theta | T- \theta | ^ {2} \geq \ \frac{( 1+ ( db / {d \theta } )( \theta )) ^ {2} }{nI( \theta ) } + b ^ {2} ( \theta ),\ \ T = T( X _ {1} \dots X _ {n} ).$$
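The bound can be checked exactly in a simple case. The sketch below assumes, for illustration only, Bernoulli observations with success probability $\theta$ (a family not discussed in the text), for which $I( \theta ) = 1/( \theta ( 1- \theta ))$ with respect to counting measure on $\{ 0, 1 \}$; the unbiased sample proportion then attains the bound $1/( nI( \theta ))$.

```python
from math import comb


def fisher_information_bernoulli(theta: float) -> float:
    """I(theta) = sum over x in {0, 1} of (dp/dtheta)^2 / p, p(x; theta) = theta^x (1 - theta)^(1 - x)."""
    # dp/dtheta equals +1 at x = 1 and -1 at x = 0, so each square is 1.
    return 1.0 / theta + 1.0 / (1.0 - theta)


def risk_of_sample_proportion(theta: float, n: int) -> float:
    """Exact E_theta |T - theta|^2 for T = (number of successes) / n."""
    return sum(
        comb(n, k) * theta ** k * (1.0 - theta) ** (n - k) * (k / n - theta) ** 2
        for k in range(n + 1)
    )


theta, n = 0.3, 20  # illustrative values
# T is unbiased, so b(theta) = 0 and the bound reduces to 1 / (n * I(theta)).
bound = 1.0 / (n * fisher_information_bernoulli(theta))
risk = risk_of_sample_proportion(theta, n)
print(bound, risk)  # the two values agree up to rounding error
```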

2.1. Let $X _ {j}$ be normal random variables with distribution density

$$\frac{1}{\sqrt {2 \pi } \sigma } \mathop{\rm exp} \left \{ - \frac{( x- a) ^ {2} }{2 \sigma ^ {2} } \right \} .$$

Let the unknown parameter be $\theta = ( a, \sigma ^ {2} )$; $\overline{X}\;$ and $s ^ {2}$ can serve as estimators for $a$ and $\sigma ^ {2}$, and $( \overline{X}\; , s ^ {2} )$ is then a sufficient statistic. The estimator $\overline{X}\;$ is unbiased, while $s ^ {2}$ is biased. If $\sigma ^ {2}$ is known, $\overline{X}\;$ is an unbiased estimator of minimal variance, and is a minimax estimator relative to the quadratic loss function.
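The bias of $s ^ {2}$ can be made visible by simulation: for a sample of size $n$ one has ${\mathsf E} s ^ {2} = ( n- 1) \sigma ^ {2} /n$. The following sketch uses arbitrary illustrative parameter values and a fixed seed; it is a numerical check, not a proof.

```python
import random
from typing import Sequence


def s_squared(sample: Sequence[float]) -> float:
    """Variance estimator with divisor n, as in Example 1."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / n


rng = random.Random(1)
a, sigma, n, trials = 0.0, 2.0, 5, 20000  # illustrative values only

# Average of s^2 over many independent normal samples of size n.
average_s2 = sum(
    s_squared([rng.gauss(a, sigma) for _ in range(n)]) for _ in range(trials)
) / trials

expected = (n - 1) / n * sigma ** 2  # (n - 1) / n * sigma^2 = 3.2, below sigma^2 = 4
print(average_s2, expected)
```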

2.2. Let $X _ {j}$ be normal random variables in $\mathbf R ^ {k}$ with density

$$\frac{1}{( 2 \pi ) ^ {k/2} } \mathop{\rm exp} \left \{ - \frac{| x- \theta | ^ {2} }{2} \right \} , \ \theta \in \mathbf R ^ {k} .$$

The statistic $\overline{X}\;$ is an unbiased estimator of $\theta$; if $k \leq 2$ it is admissible relative to the quadratic loss function, whereas if $k > 2$ it is inadmissible.
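The inadmissibility for $k > 2$ is usually demonstrated with the James–Stein estimator $( 1 - ( k- 2)/( n | \overline{X}\; | ^ {2} )) \overline{X}\;$, which the text above does not name; it is introduced here as a standard illustration. The sketch below estimates both risks by simulation at one assumed parameter point, with a fixed seed and hypothetical function names.

```python
import random
from typing import List


def james_stein(xbar: List[float], n: int) -> List[float]:
    """Shrink the sample mean towards the origin (James-Stein form, k > 2)."""
    k = len(xbar)
    norm_sq = sum(c * c for c in xbar)
    shrink = 1.0 - (k - 2) / (n * norm_sq)
    return [shrink * c for c in xbar]


def squared_error(estimate: List[float], theta: List[float]) -> float:
    """Quadratic loss |estimate - theta|^2."""
    return sum((e - t) ** 2 for e, t in zip(estimate, theta))


rng = random.Random(2)
k, n, trials = 10, 1, 5000  # illustrative values only
theta = [1.0] * k           # assumed true parameter

risk_mean = risk_js = 0.0
for _ in range(trials):
    # The sample mean of n observations is N(theta, I / n) in each coordinate.
    xbar = [rng.gauss(t, 1.0 / n ** 0.5) for t in theta]
    risk_mean += squared_error(xbar, theta) / trials
    risk_js += squared_error(james_stein(xbar, n), theta) / trials

print(risk_mean, risk_js)  # the shrinkage estimator has the smaller simulated risk
```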

2.3. Let $X _ {j}$ be random variables in $\mathbf R ^ {1}$ with unknown distribution density $f$ belonging to a given family $F$ of densities. For a sufficiently broad class $F$, this is a non-parametric problem. The problem of estimating $f( x _ {0} )$ at a point $x _ {0}$ is a problem of estimating the functional $g( f) = f( x _ {0} )$.

### Example 3.

The linear regression model. The variables

$$X _ {i} = \sum _ {\alpha = 1 } ^ { p } a _ {\alpha i } \theta _ \alpha + \xi _ {i}$$

are observed; the $\xi _ {i}$ are random disturbances, $i = 1 \dots n$; the matrix $\| a _ {\alpha i } \|$ is known; and the parameter $( \theta _ {1} \dots \theta _ {p} )$ must be estimated.

### Example 4.

A segment of a stationary Gaussian process $x( t)$, $0 \leq t \leq T$, with rational spectral density $| \sum _ {j=0} ^ {m} a _ {j} \lambda ^ {j} | ^ {2} \cdot | \sum _ {j=0} ^ {n} b _ {j} \lambda ^ {j} | ^ {-2}$ is observed; the unknown parameters $\{ a _ {j} \}$, $\{ b _ {j} \}$ are to be estimated.

## Methods of producing estimators.

The most widely used method, the maximum-likelihood method, takes as the estimator the point $\widehat \theta ( X)$ at which the random function $p( X; \theta )$ attains its maximum, the so-called maximum-likelihood estimator. If $\Theta \subset \mathbf R ^ {k}$, the maximum-likelihood estimators are to be found among the roots of the likelihood equation

$$\frac{d}{d \theta } \mathop{\rm ln} p( X; \theta ) = 0.$$
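As an illustration, the likelihood equation can be solved numerically and compared with a closed-form root. The sketch below assumes independent observations with exponential density $\theta e ^ {- \theta x }$, $x > 0$ (a family chosen here only for the example), for which the equation reads $n/ \theta - \sum X _ {i} = 0$; the helper names and data are hypothetical.

```python
from typing import Callable, Sequence


def score_exponential(theta: float, sample: Sequence[float]) -> float:
    """d/dtheta ln p(X; theta) for i.i.d. density theta * exp(-theta * x)."""
    return len(sample) / theta - sum(sample)


def bisect_root(f: Callable[[float], float], lo: float, hi: float,
                tol: float = 1e-12) -> float:
    """Root of a continuous f with f(lo) > 0 > f(hi), found by bisection."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


sample = [0.5, 1.5, 2.0, 4.0]  # made-up observations
theta_hat = bisect_root(lambda t: score_exponential(t, sample), 1e-6, 100.0)
closed_form = len(sample) / sum(sample)  # n / sum(X_i) = 4 / 8 = 0.5
print(theta_hat, closed_form)
```

Here the log-likelihood is concave in $\theta$, so the unique root of the likelihood equation is indeed the maximum point; in general each root has to be checked.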

In example 3, the method of least squares (cf. Least squares, method of) recommends that the minimum point of the function

$$m( \theta ) = \sum _ {i=1} ^ {n} \left ( X _ {i} - \sum _ \alpha a _ {\alpha i } \theta _ \alpha \right ) ^ {2}$$

be used as the estimator.
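Setting the partial derivatives of $m( \theta )$ to zero gives the normal equations $\sum _ \alpha ( \sum _ {i} a _ {\beta i } a _ {\alpha i } ) \theta _ \alpha = \sum _ {i} a _ {\beta i } X _ {i}$. The sketch below solves them by Gaussian elimination using only the standard library; it assumes the matrix of the system is non-singular, and the function names and data are illustrative.

```python
from typing import List


def solve_linear_system(matrix: List[List[float]], rhs: List[float]) -> List[float]:
    """Gaussian elimination with partial pivoting (assumes a non-singular matrix)."""
    p = len(rhs)
    aug = [row[:] + [rhs[r]] for r, row in enumerate(matrix)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, p):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, p + 1):
                aug[r][c] -= factor * aug[col][c]
    theta = [0.0] * p
    for r in range(p - 1, -1, -1):  # back substitution
        tail = sum(aug[r][c] * theta[c] for c in range(r + 1, p))
        theta[r] = (aug[r][p] - tail) / aug[r][r]
    return theta


def least_squares(a: List[List[float]], x: List[float]) -> List[float]:
    """Minimum point of m(theta); a[alpha][i] is the known matrix, x the observations."""
    p, n = len(a), len(x)
    gram = [[sum(a[b][i] * a[al][i] for i in range(n)) for al in range(p)]
            for b in range(p)]
    rhs = [sum(a[b][i] * x[i] for i in range(n)) for b in range(p)]
    return solve_linear_system(gram, rhs)


a = [[1.0, 1.0, 1.0, 1.0], [0.0, 1.0, 2.0, 3.0]]  # intercept and slope columns
x = [1.0, 3.0, 5.0, 7.0]  # equals 1 + 2 * t exactly, with no disturbance
theta_hat = least_squares(a, x)
print(theta_hat)  # close to [1.0, 2.0]
```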

Another method is to take a Bayesian estimator $T$ relative to a loss function $w$ and an a priori distribution $Q$, although the initial formulation is not Bayesian. For example, if $\Theta = \mathbf R ^ {k}$, it is possible to estimate $\theta$ by means of

$$\frac{\int\limits _ {- \infty } ^ \infty \theta p ( X; \theta ) d \theta }{\int\limits _ {- \infty } ^ \infty p( X; \theta ) d \theta } .$$

This is a Bayesian estimator relative to the quadratic loss function and a uniform a priori distribution.
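For $k = 1$ the ratio of integrals can be evaluated numerically. The sketch below assumes, purely for illustration, normal observations with unit variance and unknown mean $\theta$, truncates the integrals to a finite interval and applies the trapezoidal rule; under these assumptions the estimator coincides with the sample mean, which gives a convenient check.

```python
import math
from typing import Sequence


def likelihood(sample: Sequence[float], theta: float) -> float:
    """p(X; theta) up to a constant factor, for independent N(theta, 1) observations."""
    return math.exp(-0.5 * sum((x - theta) ** 2 for x in sample))


def posterior_mean_flat_prior(sample: Sequence[float], lo: float, hi: float,
                              steps: int = 20000) -> float:
    """Ratio of the two integrals over [lo, hi], each by the trapezoidal rule."""
    h = (hi - lo) / steps
    num = den = 0.0
    for j in range(steps + 1):
        theta = lo + j * h
        weight = 0.5 if j in (0, steps) else 1.0  # trapezoid end-point weights
        p = likelihood(sample, theta)
        num += weight * theta * p
        den += weight * p
    return num / den  # the step h and constant factors cancel in the ratio


sample = [0.2, 1.1, -0.4, 0.9]  # made-up observations
estimate = posterior_mean_flat_prior(sample, -20.0, 20.0)
print(estimate, sum(sample) / len(sample))  # both close to 0.45
```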

The method of moments (cf. Moments, method of (in probability theory)) consists of the following. Let $\Theta \subset \mathbf R ^ {k}$, and suppose that there are $k$ "good" estimators $a _ {1} ( X) \dots a _ {k} ( X)$ for $\alpha _ {1} ( \theta ) \dots \alpha _ {k} ( \theta )$. Estimators by the method of moments are solutions of the system $\alpha _ {i} ( \theta ) = a _ {i}$, $i = 1 \dots k$. Empirical moments are frequently chosen as the $a _ {i}$ (see example 1).
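A small worked case with $k = 2$: for the gamma family with shape $\kappa$ and scale $\beta$ (chosen here only as an illustration) the first two moments are $\alpha _ {1} = \kappa \beta$ and $\alpha _ {2} = \kappa \beta ^ {2} + ( \kappa \beta ) ^ {2}$, and the system $\alpha _ {i} ( \theta ) = a _ {i}$ can be solved in closed form. The function name and data below are hypothetical.

```python
from typing import Sequence, Tuple


def gamma_method_of_moments(sample: Sequence[float]) -> Tuple[float, float]:
    """Solve alpha_1 = a_1, alpha_2 = a_2 for the gamma shape and scale."""
    n = len(sample)
    a1 = sum(sample) / n                 # first empirical moment
    a2 = sum(x * x for x in sample) / n  # second empirical moment
    variance = a2 - a1 ** 2              # equals shape * scale ** 2
    scale = variance / a1
    shape = a1 / scale
    return shape, scale


data = [1.0, 2.0, 3.0, 6.0]  # made-up positive observations
shape, scale = gamma_method_of_moments(data)
print(shape, scale)
# The fitted family reproduces both empirical moments (here a_1 = 3, a_2 = 12.5):
print(shape * scale, shape * scale ** 2 + (shape * scale) ** 2)
```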

If the sample $X _ {1} \dots X _ {n}$ is observed, then (see example 1) as an estimator for $g( {\mathcal P})$ it is possible to choose $g( {\mathcal P} _ {n} ^ \star )$. If the function $g( {\mathcal P} _ {n} ^ \star )$ is not defined (for example, $g( {\mathcal P}) = ( d {\mathcal P} /d \lambda )( x)$, where $\lambda$ is Lebesgue measure), appropriate modifications $g _ {n} ( {\mathcal P} _ {n} ^ \star )$ are chosen. For example, for an estimator of the density a histogram or an estimator of the form

$$\int\limits \phi _ {n} ( x- y) d {\mathcal P} _ {n} ^ \star ( y)$$

is used.
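Since ${\mathcal P} _ {n} ^ \star$ puts weight $1/n$ on each observation, the integral above is the average $\frac{1}{n} \sum _ {i} \phi _ {n} ( x - X _ {i} )$. The sketch below takes $\phi _ {n}$ to be a Gaussian kernel with bandwidth $h$, which is one common choice and an assumption of this example; the function name and data are illustrative.

```python
import math
from typing import Sequence


def kernel_density_estimate(x: float, sample: Sequence[float],
                            bandwidth: float) -> float:
    """(1/n) * sum of phi_n(x - X_i), with phi_n(u) = K(u / h) / h and Gaussian K."""
    def phi_n(u: float) -> float:
        z = u / bandwidth
        return math.exp(-0.5 * z * z) / (bandwidth * math.sqrt(2.0 * math.pi))

    return sum(phi_n(x - xi) for xi in sample) / len(sample)


sample = [0.0, 0.5, 2.0]  # made-up observations
for point in (-1.0, 0.25, 2.0):
    print(point, kernel_density_estimate(point, sample, 0.5))
```

The result is a genuine probability density in $x$ (non-negative and integrating to one), because each term $\phi _ {n} ( x - X _ {i} )$ is one.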

## Asymptotic behaviour of estimators.

To be explicit, a problem such as Example 2 is examined, in which $\Theta \subset \mathbf R ^ {k}$. It is to be expected that as $n \rightarrow \infty$, "good" estimators will come arbitrarily close to the characteristic being estimated. A sequence of estimators $\theta _ {n} ^ \star ( X _ {1} \dots X _ {n} )$ is called a consistent sequence of estimators of $\theta$ if $\theta _ {n} ^ \star \rightarrow \theta$ in $P _ \theta$-probability for all $\theta$. The above methods of producing estimators lead, under broad hypotheses, to consistent estimators (cf. Consistent estimator). The estimators in example 1 are consistent. For regular estimation problems, maximum-likelihood estimators and Bayesian estimators are asymptotically normal with mean $\theta$ and covariance matrix $( nI( \theta )) ^ {-1}$. Under such conditions, these estimators are asymptotically locally minimax relative to a broad class of loss functions, and they can be considered as being asymptotically optimal (see Asymptotically-efficient estimator).
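Consistency can be observed numerically by estimating $P _ \theta \{ | \theta _ {n} ^ \star - \theta | > \epsilon \}$ for growing $n$. The sketch below does this for the sample mean of normal observations with unit variance; the model, the tolerance $\epsilon = 0.2$, the seed and the function name are assumptions made for the illustration.

```python
import random


def exceedance_frequency(n: int, theta: float, eps: float,
                         trials: int, rng: random.Random) -> float:
    """Monte Carlo estimate of P_theta(|sample mean - theta| > eps)."""
    count = 0
    for _ in range(trials):
        mean = sum(rng.gauss(theta, 1.0) for _ in range(n)) / n
        if abs(mean - theta) > eps:
            count += 1
    return count / trials


rng = random.Random(3)  # fixed seed for reproducibility
frequencies = [
    exceedance_frequency(n, 0.0, 0.2, 500, rng) for n in (10, 100, 1000)
]
print(frequencies)  # decreasing towards 0 as n grows
```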

## Interval estimation.

A random subset $E = E( X)$ of the set $\Theta$ is called a confidence region for the parameter $\theta$ with confidence coefficient $\gamma$ if $P _ \theta \{ E \ni \theta \} = \gamma$ (or $\geq \gamma$). Many confidence regions with a given $\gamma$ usually exist, and the problem is to choose the one possessing certain optimal properties (for example, the interval of minimum length, if $\Theta \subset \mathbf R ^ {1}$). Under the conditions of example 2.1, let $\sigma = 1$. Then the interval

$$\left [ \overline{X}\; - \frac \lambda {\sqrt n } , \overline{X}\; + \frac \lambda {\sqrt n } \right ] ,\ \ 1 - \gamma = \sqrt { \frac{2} \pi } \int\limits _ \lambda ^ \infty \mathop{\rm exp} \left \{ - \frac{u ^ {2} }{2} \right \} du ,$$

is a confidence interval with confidence coefficient $\gamma$ (see Interval estimator).
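The condition on $\lambda$ says that $1 - \gamma$ is the probability that a standard normal variable exceeds $\lambda$ in absolute value, so $\lambda$ is the $( 1 + \gamma )/2$ quantile of the standard normal distribution. A minimal sketch using the Python standard library follows; the function name and data are illustrative.

```python
import math
from statistics import NormalDist
from typing import Sequence, Tuple


def confidence_interval(sample: Sequence[float], gamma: float) -> Tuple[float, float]:
    """Interval [mean - lambda / sqrt(n), mean + lambda / sqrt(n)] for sigma = 1."""
    # lambda is the (1 + gamma) / 2 quantile of the standard normal distribution.
    lam = NormalDist().inv_cdf(0.5 * (1.0 + gamma))
    mean = sum(sample) / len(sample)
    half_width = lam / math.sqrt(len(sample))
    return mean - half_width, mean + half_width


low, high = confidence_interval([0.2, 1.1, -0.4, 0.9], 0.95)  # made-up data
print(low, high)
```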
