Difference between revisions of "Maximum-likelihood method"
Revision as of 08:00, 6 June 2020
One of the fundamental general methods for constructing estimators of unknown parameters in statistical estimation theory.
Suppose that the task is to estimate an unknown parameter $ \theta \in \Theta \subseteq \mathbf R ^ {k} $ from an observation $ X $ with distribution $ {\mathsf P} _ \theta $. Assuming that all measures $ {\mathsf P} _ \theta $ are absolutely continuous relative to a common measure $ \nu $, the likelihood function is defined by
$$ L ( \theta ) = \ \frac{d {\mathsf P} _ \theta }{d \nu } ( X ) . $$
The maximum-likelihood method recommends taking as an estimator for $ \theta $ the statistic $ \widehat \theta $ defined by
$$ L ( \widehat \theta ) = \ \max _ {\theta \in \Theta ^ {c} } \ L ( \theta ) . $$
The statistic $ \widehat \theta $ is called the maximum-likelihood estimator. In a broad class of cases the maximum-likelihood estimator is a solution of the likelihood equations
$$ \tag{1} \frac{\partial}{\partial \theta _ {i} } \mathop{\rm log} L ( \theta ) = 0 ,\ \ i = 1 , \dots , k ,\ \ \theta = ( \theta _ {1} , \dots , \theta _ {k} ) . $$
Example 1. Let $ X = ( X _ {1} , \dots , X _ {n} ) $ be a sequence of independent random variables (observations) with common distribution $ {\mathsf P} _ \theta $, $ \theta \in \Theta $. If there is a density
$$ f ( x , \theta ) = \ \frac{d {\mathsf P} _ \theta }{dm} ( x) $$
relative to some measure $ m $, then
$$ L ( \theta ) = \ \prod _ {j=1} ^ {n} f ( X _ {j} , \theta ) $$
and the equations (1) take the form
$$ \tag{2} \sum _ {j=1} ^ {n} \frac{\partial}{\partial \theta _ {i} } \mathop{\rm log} f ( X _ {j} , \theta ) = 0 ,\ \ i = 1 , \dots , k . $$
Example 2. In Example 1, let $ {\mathsf P} _ \theta $ be the normal distribution with density
$$ \frac{1}{\sigma \sqrt {2 \pi } } \mathop{\rm exp} \left \{ - \frac{( x - a ) ^ {2} }{2 \sigma ^ {2} } \right \} , $$
where $ x \in \mathbf R ^ {1} $, $ \theta = ( a , \sigma ^ {2} ) $, $ - \infty < a < \infty $, $ \sigma ^ {2} > 0 $. Equations (2) become
$$ \frac{1}{\sigma ^ {2} } \sum _ {j=1} ^ {n} ( X _ {j} - a ) = 0 , $$
$$ \frac{1}{2 \sigma ^ {4} } \sum _ {j=1} ^ {n} ( X _ {j} - a ) ^ {2} - \frac{n}{2 \sigma ^ {2} } = 0 ; $$
and the maximum-likelihood estimator is given by
$$ \widehat{a} = \overline{X} = \frac{1}{n} \sum _ {j=1} ^ {n} X _ {j} ,\ \ \widehat \sigma {} ^ {2} = \frac{1}{n} \sum _ {j=1} ^ {n} ( X _ {j} - \overline{X} ) ^ {2} . $$
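The closed-form estimators of Example 2 can be checked numerically. The following is a minimal sketch, not part of the original article: the function names `normal_mle` and `normal_score` and the sample values are illustrative assumptions. It computes $ \widehat{a} $ and $ \widehat \sigma {} ^ {2} $ and evaluates the left-hand sides of the two likelihood equations at that point, where both should vanish up to rounding error.

```python
def normal_mle(xs):
    """Maximum-likelihood estimates (a_hat, sigma2_hat) for a normal sample."""
    n = len(xs)
    a_hat = sum(xs) / n
    # Note the divisor n (not n - 1): this is the maximum-likelihood estimate.
    sigma2_hat = sum((x - a_hat) ** 2 for x in xs) / n
    return a_hat, sigma2_hat


def normal_score(xs, a, sigma2):
    """Left-hand sides of the two likelihood equations of Example 2."""
    n = len(xs)
    d_a = sum(x - a for x in xs) / sigma2
    d_sigma2 = sum((x - a) ** 2 for x in xs) / (2 * sigma2 ** 2) - n / (2 * sigma2)
    return d_a, d_sigma2


# Hypothetical sample, chosen only for illustration.
sample = [2.1, 1.9, 3.4, 2.8, 2.3]
a_hat, sigma2_hat = normal_mle(sample)
score_a, score_sigma2 = normal_score(sample, a_hat, sigma2_hat)
print(a_hat, sigma2_hat, score_a, score_sigma2)
```

Dividing by $ n $ rather than $ n - 1 $ is what distinguishes the maximum-likelihood estimate of the variance from the usual unbiased sample variance.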
Example 3. In Example 1, let $ X _ {j} $ take the values $ 0 $ and $ 1 $ with probabilities $ 1 - \theta $, $ \theta $, respectively. Then
$$ L ( \theta ) = \ \prod _ {j=1} ^ {n} \theta ^ {X _ {j} } ( 1 - \theta ) ^ {1 - X _ {j} } , $$
and the maximum-likelihood estimator is $ \widehat \theta = \overline{X}\; $.
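As an informal check of Example 3, the log-likelihood can be maximized by brute force over a grid and compared with the sample mean. This is only a sketch under assumptions: the sample, the grid spacing and the name `bernoulli_log_likelihood` are illustrative and do not come from the original article.

```python
import math


def bernoulli_log_likelihood(xs, theta):
    """log L(theta) for 0/1 observations, valid for 0 < theta < 1."""
    ones = sum(xs)
    zeros = len(xs) - ones
    return ones * math.log(theta) + zeros * math.log(1.0 - theta)


# Hypothetical sample with 5 ones out of 8 observations.
sample = [1, 0, 1, 1, 0, 1, 0, 1]
theta_hat = sum(sample) / len(sample)  # closed-form estimator: the sample mean

# Brute-force maximization over the grid 0.001, 0.002, ..., 0.999.
grid = [i / 1000 for i in range(1, 1000)]
grid_argmax = max(grid, key=lambda t: bernoulli_log_likelihood(sample, t))
print(theta_hat, grid_argmax)
```

The grid search is used only because it makes no assumption about the shape of $ L $; for this model the log-likelihood is concave and the likelihood equation can be solved directly.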
Example 4. Let the observation $ X = X _ {t} $ be a diffusion process with stochastic differential
$$ d X _ {t} = \theta a _ {t} ( X _ {t} ) \, d t + d W _ {t} ,\ \ X _ {0} = 0 ,\ 0 \leq t \leq T , $$
where $ W _ {t} $ is a Wiener process and $ \theta $ is an unknown one-dimensional parameter. Here (see [3]),
$$ \mathop{\rm log} L ( \theta ) = \ \theta \int\limits _ { 0 } ^ { T } a _ {t} ( X _ {t} ) d X _ {t} - \frac{\theta ^ {2} }{2} \int\limits _ { 0 } ^ { T } a _ {t} ^ {2} ( X _ {t} ) d t , $$
$$ \widehat \theta = \frac{\int\limits _ { 0 } ^ { T } a _ {t} ( X _ {t} ) d X _ {t} }{\int\limits _ { 0 } ^ { T } a _ {t} ^ {2} ( X _ {t} ) d t } . $$
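The estimator of Example 4 can be illustrated on a simulated path. The sketch below rests on several assumptions that are not in the original article: the equation is read as $ d X _ {t} = \theta a _ {t} ( X _ {t} ) \, d t + d W _ {t} $, the path is generated by an Euler–Maruyama discretization, the coefficient $ a _ {t} ( x ) = - x $ is chosen purely for illustration, and both integrals are replaced by left-endpoint sums. The names `simulate_path` and `diffusion_mle` are hypothetical.

```python
import math
import random


def simulate_path(theta, a, T, steps, rng):
    """Euler-Maruyama approximation of dX = theta * a(t, X) dt + dW with X_0 = 0."""
    dt = T / steps
    xs = [0.0]
    for k in range(steps):
        x = xs[-1]
        noise = math.sqrt(dt) * rng.gauss(0.0, 1.0)  # Wiener increment over dt
        xs.append(x + theta * a(k * dt, x) * dt + noise)
    return xs


def diffusion_mle(xs, a, T):
    """Discretized ratio of int a dX to int a^2 dt along the path xs."""
    steps = len(xs) - 1
    dt = T / steps
    numerator = 0.0
    denominator = 0.0
    for k in range(steps):
        a_k = a(k * dt, xs[k])  # left endpoint, as in an Ito sum
        numerator += a_k * (xs[k + 1] - xs[k])
        denominator += a_k * a_k * dt
    return numerator / denominator


def a(t, x):
    """Illustrative coefficient a_t(x) = -x (an assumption, not from the article)."""
    return -x


rng = random.Random(1)  # fixed seed so that the run is reproducible
path = simulate_path(theta=1.0, a=a, T=200.0, steps=20000, rng=rng)
theta_hat = diffusion_mle(path, a, T=200.0)
print(theta_hat)
```

On a single simulated path the estimate fluctuates around the true value; with this choice of coefficient the spread shrinks as the observation horizon $ T $ grows.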
There is no definitive argument for the optimality of the maximum-likelihood method; the widespread belief in its efficiency rests partly on the great success with which it has been applied to numerous concrete problems, and partly on rigorously established asymptotic optimality properties. For example, in Example 1, under broad assumptions, the maximum-likelihood estimator $ \widehat \theta _ {n} $ based on $ n $ observations satisfies $ \widehat \theta _ {n} \rightarrow \theta $ with $ {\mathsf P} _ \theta $-probability $ 1 $ as $ n \rightarrow \infty $. If the Fisher information
$$ I ( \theta ) = \ \int\limits \frac{| f _ \theta ^ { \prime } ( x , \theta ) | ^ {2} }{f ( x , \theta ) } m ( dx ) $$
exists, then the difference $ \sqrt n ( \widehat \theta _ {n} - \theta ) $ is asymptotically normal with parameters $ ( 0 , I ^ {-1} ( \theta ) ) $, and $ \widehat \theta _ {n} $, in a well-defined sense, has an asymptotically-minimal mean-square deviation from $ \theta $ (see [4], [5]).
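The asymptotic normality statement can be probed by simulation in the setting of Example 3, where $ \widehat \theta _ {n} = \overline{X} $ and the Fisher information relative to counting measure is $ I ( \theta ) = 1 / ( \theta ( 1 - \theta ) ) $. The following Monte Carlo sketch is an illustration under assumed values of $ \theta $, $ n $ and the number of replications; the name `scaled_errors` is hypothetical. It compares the empirical variance of $ \sqrt n ( \widehat \theta _ {n} - \theta ) $ with $ I ^ {-1} ( \theta ) $.

```python
import math
import random


def scaled_errors(theta, n, replications, rng):
    """Simulate sqrt(n) * (theta_hat_n - theta) for Bernoulli(theta) samples."""
    errors = []
    for _ in range(replications):
        xs = [1 if rng.random() < theta else 0 for _ in range(n)]
        theta_hat = sum(xs) / n  # maximum-likelihood estimator of Example 3
        errors.append(math.sqrt(n) * (theta_hat - theta))
    return errors


rng = random.Random(0)  # fixed seed so that the run is reproducible
theta = 0.3
errors = scaled_errors(theta, n=200, replications=2000, rng=rng)

mean_error = sum(errors) / len(errors)
variance_error = sum((e - mean_error) ** 2 for e in errors) / len(errors)
inverse_fisher = theta * (1.0 - theta)  # I(theta)^(-1) for the Bernoulli model
print(mean_error, variance_error, inverse_fisher)
```

The empirical mean should be close to $ 0 $ and the empirical variance close to $ I ^ {-1} ( \theta ) $; a histogram of `errors` would show the approximately normal shape, which this sketch does not test.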
References
[1] H. Cramér, "Mathematical methods of statistics", Princeton Univ. Press (1946)
[2] S. Zacks, "The theory of statistical inference", Wiley (1975)
[3] R.S. Liptser, A.N. Shiryaev, "Statistics of random processes", 1, Springer (1977) (Translated from Russian)
[4] I.A. Ibragimov, R.Z. Has'minskii, "Statistical estimation: asymptotic theory", Springer (1981) (Translated from Russian)
[5] E.L. Lehmann, "Theory of point estimation", Wiley (1983)
Maximum-likelihood method. Encyclopedia of Mathematics. URL: http://encyclopediaofmath.org/index.php?title=Maximum-likelihood_method&oldid=47805