# Maximum-likelihood method

One of the fundamental general methods for constructing estimators of unknown parameters in statistical estimation theory.

Suppose that an observation $ X $ has a distribution $ {\mathsf P} _ \theta $ depending on an unknown parameter $ \theta \in \Theta \subseteq \mathbf R ^ {k} $, and that the task is to estimate $ \theta $. Assuming that all measures $ {\mathsf P} _ \theta $ are absolutely continuous relative to a common measure $ \nu $, the likelihood function is defined by

$$ L ( \theta ) = \ \frac{d {\mathsf P} _ \theta }{d \nu } ( X ) . $$

The maximum-likelihood method recommends taking as an estimator for $ \theta $ the statistic $ \widehat \theta $ defined by

$$ L ( \widehat \theta ) = \ \max _ {\theta \in \Theta ^ {c} } \ L ( \theta ) . $$

$ \widehat \theta $ is called the maximum-likelihood estimator. In a broad class of cases the maximum-likelihood estimator is the solution of a likelihood equation

$$ \tag{1 } \frac \partial {\partial \theta _ {i} } \mathop{\rm log} L ( \theta ) = 0 ,\ \ i = 1 \dots k ,\ \ \theta = ( \theta _ {1} \dots \theta _ {k} ) . $$

Example 1. Let $ X = ( X _ {1} \dots X _ {n} ) $ be a sequence of independent random variables (observations) with common distribution $ {\mathsf P} _ \theta $, $ \theta \in \Theta $. If there is a density

$$ f ( x , \theta ) = \ \frac{d {\mathsf P} _ \theta }{dm} ( x) $$

relative to some measure $ m $, then

$$ L ( \theta ) = \ \prod _ { j=1} ^ { n } f ( X _ {j} , \theta ) $$

and the equations (1) take the form

$$ \tag{2 } \sum _ { j=1} ^ { n } \frac \partial {\partial \theta _ {i} } \mathop{\rm log} f ( X _ {j} , \theta ) = 0 ,\ \ i = 1 \dots k . $$
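When equations (2) have no convenient closed form, they can be solved numerically. The following is a minimal sketch, not a general-purpose solver: it assumes the one-parameter density $ f ( x , \theta ) = \theta e ^ {- \theta x } $, $ x > 0 $ (an illustrative choice that does not appear in this entry), for which the left-hand side of (2) is strictly decreasing in $ \theta $, so that bisection applies. The function names and the simulated sample are hypothetical.

```python
import random

def score(theta, xs):
    """Left-hand side of equation (2) for the assumed density
    f(x, theta) = theta * exp(-theta * x), x > 0."""
    # d/dtheta log f(x, theta) = 1/theta - x
    return sum(1.0 / theta - x for x in xs)

def solve_likelihood_equation(xs, lo=1e-6, hi=1e6, iters=200):
    """Bisection on the score; valid here because the score is
    strictly decreasing in theta and changes sign on [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid, xs) > 0:
            lo = mid   # root lies to the right of mid
        else:
            hi = mid   # root lies to the left of mid
    return 0.5 * (lo + hi)

rng = random.Random(0)
true_theta = 2.0
sample = [rng.expovariate(true_theta) for _ in range(5000)]

theta_hat = solve_likelihood_equation(sample)
closed_form = len(sample) / sum(sample)  # explicit root of (2) for this model
print(theta_hat, closed_form)
```

For this particular model the root of (2) is also available explicitly as $ n / \sum _ {j} X _ {j} $, which serves as a check on the numerical solution.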

Example 2. In Example 1, let $ {\mathsf P} _ \theta $ be the normal distribution with density

$$ \frac{1}{\sigma \sqrt {2 \pi } } \mathop{\rm exp} \left \{ - \frac{( x - a ) ^ {2} }{2 \sigma ^ {2} } \right \} , $$

where $ x \in \mathbf R ^ {1} $, $ \theta = ( a , \sigma ^ {2} ) $, $ - \infty < a < \infty $, $ \sigma ^ {2} > 0 $. Equations (2) become

$$ \frac{1}{\sigma ^ {2} } \sum _ { j=1} ^ { n } ( X _ {j} - a ) = 0 , $$

$$ \frac{1}{2 \sigma ^ {4} } \sum _ { j=1} ^ { n } ( X _ {j} - a ) ^ {2} - \frac{n}{2 \sigma ^ {2} } = 0 ; $$

and the maximum-likelihood estimator is given by

$$ \widehat{a} = \overline{X}\; = \frac{1}{n} \sum _ { j=1} ^ { n } X _ {j} ,\ \ \widehat \sigma {} ^ {2} = \frac{1}{n} \sum _ { j=1} ^ { n } ( X _ {j} - \overline{X}\; ) ^ {2} . $$
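As a numerical check of Example 2, the sketch below computes the sample mean and the variance estimator with divisor $ n $ and substitutes them into the two likelihood equations, which should vanish up to rounding error. The function names, the seed and the simulated sample are illustrative assumptions.

```python
import random

def normal_mle(xs):
    """Maximum-likelihood estimates (a_hat, sigma2_hat) for a normal sample."""
    n = len(xs)
    a_hat = sum(xs) / n
    sigma2_hat = sum((x - a_hat) ** 2 for x in xs) / n  # divisor n, not n - 1
    return a_hat, sigma2_hat

def score(a, sigma2, xs):
    """Left-hand sides of the two likelihood equations of Example 2."""
    n = len(xs)
    d_a = sum(x - a for x in xs) / sigma2
    d_sigma2 = sum((x - a) ** 2 for x in xs) / (2 * sigma2 ** 2) - n / (2 * sigma2)
    return d_a, d_sigma2

rng = random.Random(1)
sample = [rng.gauss(3.0, 2.0) for _ in range(1000)]  # a = 3, sigma^2 = 4

a_hat, sigma2_hat = normal_mle(sample)
d_a, d_sigma2 = score(a_hat, sigma2_hat, sample)
print(a_hat, sigma2_hat, d_a, d_sigma2)
```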

Example 3. In Example 1, let $ X _ {j} $ take the values $ 0 $ and $ 1 $ with probabilities $ 1 - \theta $, $ \theta $, respectively. Then

$$ L ( \theta ) = \ \prod _ { j=1} ^ { n } \theta ^ {X _ {j} } ( 1 - \theta ) ^ {1 - X _ {j} } , $$

and the maximum-likelihood estimator is $ \widehat \theta = \overline{X}\; $.
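The estimator in Example 3 can be verified directly. Writing $ S = \sum _ {j=1} ^ {n} X _ {j} $ (a shorthand introduced here only for this computation), one has

$$ \mathop{\rm log} L ( \theta ) = S \mathop{\rm log} \theta + ( n - S ) \mathop{\rm log} ( 1 - \theta ) ,\ \ \frac{d}{d \theta } \mathop{\rm log} L ( \theta ) = \frac{S}{\theta } - \frac{n - S }{1 - \theta } . $$

For $ 0 < S < n $ the derivative vanishes only at $ \theta = S / n = \overline{X}\; $, and it is positive to the left of this point and negative to the right, so the point is a maximum. If $ S = 0 $ or $ S = n $, then $ L ( \theta ) $ is monotone on $ [ 0 , 1 ] $ and attains its maximum at the endpoint $ \theta = 0 $ or $ \theta = 1 $, respectively, which again equals $ \overline{X}\; $, provided these endpoints are admitted as parameter values.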

Example 4. Let the observation $ X = X _ {t} $ be a diffusion process with stochastic differential

$$ d X _ {t} = \theta a _ {t} ( X _ {t} ) d t + d W _ {t} ,\ \ X _ {0} = 0 ,\ 0 \leq t \leq T , $$

where $ W _ {t} $ is a Wiener process and $ \theta $ is an unknown one-dimensional parameter. Here (see [3]),

$$ \mathop{\rm log} L ( \theta ) = \ \theta \int\limits _ { 0 } ^ { T } a _ {t} ( X _ {t} ) d X _ {t} - \frac{\theta ^ {2} }{2} \int\limits _ { 0 } ^ { T } a _ {t} ^ {2} ( X _ {t} ) d t , $$

$$ \widehat \theta = \frac{\int\limits _ { 0 } ^ { T } a _ {t} ( X _ {t} ) d X _ {t} }{\int\limits _ { 0 } ^ { T } a _ {t} ^ {2} ( X _ {t} ) d t } . $$
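The estimator of Example 4 can be illustrated on a simulated path. The sketch below is only an approximation under stated assumptions: the path is generated by the Euler–Maruyama scheme with step `dt`, the two integrals are replaced by left-endpoint (Itô) sums, and the drift coefficient $ a _ {t} ( x ) = - x $ is an illustrative choice not taken from this entry. All names and parameter values are hypothetical.

```python
import math
import random

def simulate_and_estimate(theta, a, T=400.0, dt=0.01, seed=0):
    """Simulate dX_t = theta * a(t, X_t) dt + dW_t, X_0 = 0, on [0, T]
    and return the discretized maximum-likelihood estimate of theta."""
    rng = random.Random(seed)
    steps = int(round(T / dt))
    x, t = 0.0, 0.0
    num = 0.0  # approximates the stochastic integral of a_t(X_t) dX_t
    den = 0.0  # approximates the integral of a_t(X_t)^2 dt
    for _ in range(steps):
        a_val = a(t, x)
        dx = theta * a_val * dt + math.sqrt(dt) * rng.gauss(0.0, 1.0)
        num += a_val * dx          # integrand taken at the left endpoint (Itô sum)
        den += a_val * a_val * dt
        x += dx
        t += dt
    return num / den

# Assumed drift coefficient a_t(x) = -x and true parameter theta = 1.
theta_hat = simulate_and_estimate(theta=1.0, a=lambda t, x: -x)
print(theta_hat)
```

The estimate fluctuates around the true value with an error that shrinks as $ T $ grows; the discretization step mainly affects how well the sums approximate the integrals.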

There is no definitive justification for the optimality of the maximum-likelihood method. The widespread belief in its efficiency rests partly on the great success with which it has been applied to numerous concrete problems, and partly on rigorously established asymptotic optimality properties. For example, in Example 1, under broad assumptions, the estimator $ \widehat \theta _ {n} $ based on $ n $ observations satisfies $ \widehat \theta _ {n} \rightarrow \theta $ as $ n \rightarrow \infty $ with $ {\mathsf P} _ \theta $-probability $ 1 $. If the Fisher information

$$ I ( \theta ) = \ \int\limits \frac{| f _ \theta ^ { \prime } ( x , \theta ) | ^ {2} }{f ( x , \theta ) } m ( dx ) $$

exists, then the difference $ \sqrt n ( \widehat \theta _ {n} - \theta ) $ is asymptotically normal with parameters $ ( 0 , I ^ {-1} ( \theta ) ) $, and $ \widehat \theta _ {n} $ has, in a well-defined sense, an asymptotically minimal mean-square deviation from $ \theta $ (see [4], [5]).
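The asymptotic normality statement can be checked empirically in the setting of Example 3, where $ I ( \theta ) = 1 / ( \theta ( 1 - \theta ) ) $. The following Monte Carlo sketch, with hypothetical names and arbitrarily chosen values of $ \theta $, $ n $ and the number of replications, compares the sample mean and variance of $ \sqrt n ( \widehat \theta _ {n} - \theta ) $ with $ 0 $ and $ I ^ {-1} ( \theta ) $.

```python
import math
import random

def bernoulli_mle(n, theta, rng):
    """Sample mean of n independent Bernoulli(theta) observations."""
    return sum(1 for _ in range(n) if rng.random() < theta) / n

rng = random.Random(2)
theta, n, reps = 0.3, 400, 2000

# Replications of the normalized estimation error sqrt(n) * (theta_hat_n - theta).
z = [math.sqrt(n) * (bernoulli_mle(n, theta, rng) - theta) for _ in range(reps)]

mean_z = sum(z) / reps
var_z = sum((v - mean_z) ** 2 for v in z) / reps
inv_fisher = theta * (1 - theta)  # I(theta)^{-1} for the Bernoulli model
print(mean_z, var_z, inv_fisher)
```

A comparison of the first two moments is of course weaker than a check of normality itself; a histogram or a quantile plot of `z` would be the natural next step.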

#### References

[1] H. Cramér, "Mathematical methods of statistics", Princeton Univ. Press (1946)

[2] S. Zacks, "The theory of statistical inference", Wiley (1975)

[3] R.S. Liptser, A.N. Shiryaev, "Statistics of random processes", 1, Springer (1977) (Translated from Russian)

[4] I.A. Ibragimov, R.Z. Has'minskii, "Statistical estimation: asymptotic theory", Springer (1981) (Translated from Russian)

[5] E.L. Lehmann, "Theory of point estimation", Wiley (1983)

**How to Cite This Entry:**

Maximum-likelihood method.

*Encyclopedia of Mathematics.* URL: http://encyclopediaofmath.org/index.php?title=Maximum-likelihood_method&oldid=54910