''expectation-maximization algorithm''
 
An iterative optimization procedure [[#References|[a4]]] for computing
 
\begin{equation} \tag{a1} \theta ^ { * } = \operatorname { arg } \operatorname { max } _ { \theta \in \Theta } \int f ( \theta , \phi ) d \phi, \end{equation}
where $f ( \theta , \phi )$ is a non-negative real-valued function which is integrable as a function of $\phi$ for each $\theta$. The EM algorithm was developed for statistical inference in problems with incomplete data or problems that can be formulated as such (e.g., with latent-variable modeling) and is a very popular method of computing maximum likelihood, restricted maximum likelihood, penalized maximum likelihood, and maximum posterior estimates. The name, EM algorithm, stems from this context, where "E" stands for the expectation (i.e., integration) step and "M" stands for the maximization step. In particular, the algorithm starts with an initial value $\theta ^ { ( 0 ) } \in \Theta$ and iterates the following two steps until convergence, for $t = 0,1 , \ldots$:
  
<table border="0" cellpadding="0" cellspacing="0" style="background-color:black;"> <tr><td> <table border="0" cellpadding="4" cellspacing="1" style="background-color:black;"> <tr> <td colname="1" colspan="1" style="background-color:white;">E-step:</td> </tr> <tr> <td colname="1" colspan="1" style="background-color:white;">Compute $Q ( \theta | \theta ^ { ( t ) } ) = \int \operatorname { log } f ( \theta , \phi ) f ( \phi | \theta ^ { ( t ) } ) d \phi$,</td> </tr> <tr> <td colname="1" colspan="1" style="background-color:white;">where $f ( \phi | \theta ) = f ( \theta , \phi ) / \int f ( \theta , \phi ) d \phi$;</td> </tr> <tr> <td colname="1" colspan="1" style="background-color:white;">M-step:</td> </tr> <tr> <td colname="1" colspan="1" style="background-color:white;">Determine $\theta ^ { ( t + 1 ) }$ by maximizing $Q ( \theta | \theta ^ { ( t ) } )$,</td> </tr> <tr> <td colname="1" colspan="1" style="background-color:white;">that is, find $\theta ^ { ( t + 1 ) }$ so that</td> </tr> <tr> <td colname="1" colspan="1" style="background-color:white;">$Q ( \theta ^ { ( t + 1 ) } | \theta ^ { ( t ) } ) \geq Q ( \theta | \theta ^ { ( t ) } )$ for all $\theta \in \Theta$.</td> </tr> </table>
  
 
</td></tr> </table>
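
In outline, an implementation simply alternates these two steps. The following Python sketch is schematic only: the routines <code>e_step</code> and <code>m_step</code>, the stopping rule, and the iteration cap are place-holders to be supplied for a concrete model, not part of the algorithm's definition.

<pre>
import numpy as np

def em(theta0, e_step, m_step, tol=1e-8, max_iter=1000):
    """Generic EM skeleton: e_step(theta) returns whatever summary of
    Q(. | theta) the model needs (e.g. expected sufficient statistics);
    m_step(summary) returns the parameter value maximizing Q."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        summary = e_step(theta)                               # E-step at theta^(t)
        theta_new = np.asarray(m_step(summary), dtype=float)  # M-step
        if np.max(np.abs(theta_new - theta)) < tol:           # simple stopping rule
            return theta_new
        theta = theta_new
    return theta
</pre>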
 
  
 
===Maximum-likelihood estimation with missing observations.===
 
The standard notation and terminology for the EM algorithm stem from this important class of applications. In this case the function $f ( \theta , \phi )$ is the complete-data likelihood function, which is written $L ( \theta | Y _ { \text{com} } )$, where $Y _ { \text{com} } = ( Y _ { \text{obs} } , Y _ { \text{mis} } )$ is the complete data which consists of the observed data, $Y _ { \text{obs}}$, and the missing data, $Y _ { \text{mis} }$, $\theta$ is a model parameter, and $\phi = Y _ { \text{mis} }$. One is interested in computing the value of $\theta$ that maximizes $L ( \theta | Y _ { \text{obs} } ) = \int L ( \theta | Y _ { \text{com} } ) d Y_{\text{mis}}$, the maximum-likelihood estimate. Here it is assumed that $Y _ { \text{mis} }$ is missing at random [[#References|[a26]]]; more complicated missing data mechanisms can also be accounted for in the formulation of $L ( \theta | Y _ { \text{com} } )$. More generally, one can augment $Y _ { \text{obs}}$ to $Y _ { \operatorname { aug } }$, the augmented data, via a many-to-one mapping, $Y _ { \text{obs} } = \mathcal{M} ( Y _ { \text{aug} } )$, with the understanding that the augmented-data likelihood, $L ( \theta | Y _ { \text { aug } } )$, is linked to $L ( \theta | Y _ { \text{obs} } )$ via $L ( \theta | Y _ { \text{obs}} ) = \int _ { \mathcal{M} ( Y _ { \text { aug } } ) = Y _ { \text { obs } } } L ( \theta | Y _ { \text { aug } } ) d Y _ { \text { aug } }$.
  
 
The EM algorithm builds on the intuitive notion that:
 
 
i) if the missing data were observed, the model parameters could easily be estimated (e.g., by complete-data maximum likelihood);

ii) if the model parameters were known, the missing data could easily be imputed (i.e., predicted) by its conditional expectation. These two observations correspond to the M-step and the E-step, respectively, with the proviso that it is not the missing data but rather the complete-data log-likelihood that should be imputed by its conditional expectation. In particular, the E-step reduces to computing
 
\begin{equation*} Q ( \theta | \theta ^ { ( t ) } ) = \mathsf{E} \left[ \operatorname { log } L ( \theta | Y _ { \text{aug} } ) | Y _ { \text{obs} } , \theta ^ { ( t ) } \right], \end{equation*}
  
the conditional expectation of the complete-data log-likelihood. If the complete-data model is from an exponential family, then $\operatorname { log } L ( \theta | Y _ { \text{aug} } )$ is linear in a set of complete-data sufficient statistics. In this common case, which includes multivariate normal, Poisson, multinomial, and exponential models (among others), computing $Q ( \theta | \theta ^ { ( t ) } )$ involves routine calculations. The M-step then involves computing the maximum-likelihood estimates as if there were no missing data, by using the imputed complete-data sufficient statistics from the E-step as inputs.
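
As a small illustration of this recipe, consider an incomplete multinomial model in the spirit of the classical genetic-linkage example: the complete-data cell probabilities are $( 1 / 2 , \theta / 4 , ( 1 - \theta ) / 4 , ( 1 - \theta ) / 4 , \theta / 4 )$, but the first two cells are observed only through their sum $y _ { 1 }$, alongside $y _ { 2 } , y _ { 3 } , y _ { 4 }$. Since the complete-data log-likelihood is linear in the cell counts, the E-step merely imputes the expected split of $y _ { 1 }$ and the M-step is the complete-data maximum-likelihood estimate. The following Python sketch uses illustrative counts.

<pre>
# EM for an incomplete multinomial model (illustrative counts).
y = [125.0, 18.0, 20.0, 34.0]   # observed counts (y1, y2, y3, y4)
theta = 0.5                     # starting value

for _ in range(200):
    # E-step: expected part of y1 falling in the theta/4 cell
    x2 = y[0] * (theta / 4.0) / (0.5 + theta / 4.0)
    # M-step: complete-data MLE with x2 imputed
    theta_new = (x2 + y[3]) / (x2 + y[1] + y[2] + y[3])
    converged = abs(theta_new - theta) < 1e-10
    theta = theta_new
    if converged:
        break

print(theta)   # approximately 0.6268 for these counts
</pre>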
  
 
===Linear inverse problems with positivity restrictions.===
 
 
Consider the system of equations
 
\begin{equation*} g _ { j } = \sum _ { i = 1 } ^ { M } f _ { i } h _ {i j } , j = 1 , \ldots , N, \end{equation*}
  
where $g _ { j } > 0$, $h_{i j} \geq 0$, $\sum _ { j } h _ { i j } > 0$ are known, and one wishes to solve for $f _ { i } > 0$. Such problems are omnipresent in scientific and engineering applications and can be solved with the EM algorithm [[#References|[a30]]]. First one notes that, without loss of generality, one may assume that $\sum _ { j } g _ { j }  = \sum _ { i } f _ { i } = \sum _ { j } h _ { i j } = 1$, that is, $g = ( g _ { 1 } , \dots , g _ { N } )$, $f = ( f _ { 1 } , \ldots , f _ { M } )$ and $h _ { i } = ( h _ { i 1 } , \dots , h _ { i N } )$ for $i = 1 , \dots , M$ are discrete probability measures (see [[#References|[a30]]] for details). Thus, $f$ can be viewed as the mixing weights in a finite mixture model, i.e., $g _ { j } = \sum _ { i } f _ { i } h _ { i j }$. A finite mixture model is a hierarchical model in that one can suppose that an integer $z \in \{ 1 , \dots , M \}$ is chosen according to $f$, and then data are generated conditional on $z$ according to $h _ { z }$. The marginal distribution of the data generated is $g$. If one considers $z$ to be missing data, one can derive an EM algorithm of the form
  
\begin{equation*} f _ { i } ^{( t + 1 ) }= f _ { i }^{ ( t )} \sum _ { j } \left( \frac { h _ { i j } } { \sum _ { k } f _ { k }^{ ( t )} h _ { k j } } \right) g _ { j } , \quad t = 1,2 ,\dots . \end{equation*}
  
Y. Vardi and D. Lee [[#References|[a30]]] demonstrate that this iteration converges to the value that minimizes the [[Kullback–Leibler information|Kullback–Leibler information]] divergence between $g$ and $\sum _ { i } f _ { i } h _ { i }$ over all non-negative $f$, which is the desired non-negative solution if it exists.
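
A direct, vectorized transcription of this update reads as follows; the array names, the uniform starting value, and the fixed iteration count are illustrative choices, and $g$, $f$ and the rows $h _ { i }$ are assumed to be normalized as described above.

<pre>
import numpy as np

def em_linear_inverse(g, H, n_iter=500):
    """Multiplicative EM update for g_j = sum_i f_i h_{ij} with f >= 0.
    g has shape (N,), H has shape (M, N) with rows h_i; g, f and the rows
    of H are assumed to sum to one."""
    M = H.shape[0]
    f = np.full(M, 1.0 / M)          # uniform starting value
    for _ in range(n_iter):
        mix = f @ H                  # sum_k f_k h_{kj}, shape (N,)
        f = f * (H @ (g / mix))      # f_i * sum_j h_{ij} g_j / mix_j
    return f
</pre>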
  
===Multivariate $t$-model, a latent variable model.===
The previous example illustrates an important principle in the application of the EM algorithm. Namely, one is often initially interested in optimizing a function $f ( \theta )$ and purposely embeds $f ( \theta )$ in a function, $f ( \theta , \phi )$, with a larger domain, such that $f ( \theta ) = \int f ( \theta , \phi ) d \phi$, in order to use the EM algorithm. In the previous example, this was accomplished by identifying $\phi = Y _ { \text{mis} }$ with $z$. Similar strategies are often fruitful in various latent variable models. Consider, for example, the multivariate $t$-model, which is useful for robust estimation (cf. [[Robust statistics|Robust statistics]]; see also, e.g., [[#References|[a9]]], [[#References|[a11]]]),
  
\begin{equation} \tag{a2} f ( y | \mu , \Sigma , \nu ) \propto | \Sigma | ^ { - 1 / 2 } [ \nu + ( y - \mu ) ^ { T } \Sigma ^ { - 1 } ( y - \mu ) ] ^ { - ( \nu + p ) / 2 }, \end{equation}
  
where $y , \mu \in \mathbf{R} ^ { p }$, $\Sigma$ is a positive-definite $( p \times p )$-matrix, $\theta = ( \mu , \Sigma )$, and $\nu \in \mathbf{R} ^ { + }$ is the known number of degrees of freedom. Given a random sample $\{ y _ { i } : i = 1 , \dots , n \} = Y _ { \text{obs} }$, assumed to be from (a2), one wishes to find the maximizer of $L ( \mu , \Sigma | Y _ { \text{obs} } ) = \prod _ { i = 1 } ^ { n } f ( y _ { i } | \mu , \Sigma , \nu )$, which is known to have no general closed-form solution. In order to apply the EM algorithm, one embeds $L ( \mu , \Sigma | Y _ { \text{obs} } )$ into a larger model $L ( \mu , \Sigma | Y _ { \text{aug} } )$. This is accomplished via the well-known representation
  
\begin{equation} \tag{a3} t = \mu + \frac { \Sigma ^ { 1 / 2 } Z } { \sqrt { q } }, \end{equation}
  
where $t$ follows the [[T-distribution|$t$-distribution]] in (a2), $Z \sim N _ { p } ( 0 , I )$, and $q \sim \chi _ { \nu } ^ { 2 } / \nu$ is a chi-squared variable with $\nu$ degrees of freedom divided by its mean $\nu$, independent of $Z$. In (a3), one sees that the distribution of $t$ conditional on $q$ is multivariate normal. Thus, if one had observed $q$, maximum-likelihood estimation would be easy. One therefore defines $Y _ { \text{aug} } = \{ ( y _ { i } , q _ { i } ) : i = 1 , \ldots , n \}$ and $L ( \mu , \Sigma | Y _ { \text{aug} } ) = \prod _ { i = 1 } ^ { n } f ( y _ { i } | \mu , \Sigma , \nu , q _ { i } ) f ( q _ { i } | \nu )$, which is easy to maximize as a function of $\mu$ and $\Sigma$ due to the conditional normality of $t$. Thus, since $\operatorname { log } L ( \mu , \Sigma | Y _ { \text{aug} } )$ is linear in $( q _ { 1 } , \dots , q _ { n } )$, the $( t + 1 )$st iteration of the EM algorithm has a simple E-step that computes
  
\begin{equation*} w _ { i } ^ { ( t + 1 ) } = \mathsf{E} ( q _ { i } | y _ { i } , \mu ^ { ( t ) } , \Sigma ^ { ( t ) } ) = \frac { \nu + p } { \nu + d _ { i } ^ { ( t ) } } , i = 1 , \dots , n, \end{equation*}
  
where $d _ { i } ^ { ( t ) } = ( y _ { i } - \mu ^ { ( t ) } ) ^ { T } [ \Sigma ^ { ( t ) } ] ^ { - 1 } ( y _ { i } - \mu ^ { ( t ) } )$. The M-step then maximizes $Q ( \theta | \theta ^ { ( t ) } )$ to obtain
  
\begin{equation*} \mu ^ { ( t + 1 ) } = \frac { \sum _ { i } w _ { i } ^ { ( t + 1 ) } y _ { i } } { \sum _ { i } w _ { i } ^ { ( t + 1 ) } }, \end{equation*}
  
 
and
 
\begin{equation} \tag{a4} \Sigma ^ { ( t + 1 ) } = \frac { 1 } { n } \sum _ { i } w _ { i } ^ { ( t + 1 ) } ( y _ { i } - \mu ^ { ( t + 1 ) } ) ( y _ { i } - \mu ^ { ( t + 1 ) } ) ^ { T }. \end{equation}
  
Given the weights $( w _ { 1 } ^ { ( t + 1 ) } , \ldots , w _ { n } ^ { ( t + 1 ) } )$, the M-step corresponds to weighted least squares; thus this EM algorithm, which often goes by the name iteratively reweighted least squares (IRLS), is trivial to program and use and, as will be seen in the next section, exhibits unusually stable convergence.
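
For concreteness, the complete iteration (the E-step weights followed by the weighted mean and (a4)) can be sketched as follows; the starting values, the fixed iteration count, and the variable names are illustrative, and $\nu$ is treated as known, as in the text.

<pre>
import numpy as np

def em_multivariate_t(Y, nu, n_iter=200):
    """EM (IRLS) for the multivariate t-model with known degrees of freedom nu.
    Y has shape (n, p); returns the fitted (mu, Sigma)."""
    n, p = Y.shape
    mu = Y.mean(axis=0)
    Sigma = np.cov(Y, rowvar=False, bias=True)
    for _ in range(n_iter):
        # E-step: w_i = (nu + p) / (nu + d_i), d_i the Mahalanobis distance
        R = Y - mu
        d = np.einsum('ij,jk,ik->i', R, np.linalg.inv(Sigma), R)
        w = (nu + p) / (nu + d)
        # M-step: weighted mean and the update (a4)
        mu = (w @ Y) / w.sum()
        R = Y - mu
        Sigma = (R * w[:, None]).T @ R / n
    return mu, Sigma
</pre>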
  
 
==Properties of the EM algorithm.==
 
The EM algorithm has enjoyed wide popularity in many scientific fields from the 1970s onwards (e.g., [[#References|[a18]]], [[#References|[a23]]]). This is primarily due to easy implementation and stable convergence. As the previous paragraph illustrates, the EM algorithm is often easy to program and use. Although the algorithm may take many iterations to converge relative to other optimization routines (e.g., Newton–Raphson), each iteration is often easy to program and quick to compute. Moreover, the EM algorithm is less sensitive to poor starting values, and can be easier to use with many parameters since the iterations necessarily remain in the parameter space and no second derivatives are required. Finally, the EM algorithm has the very important property that the objective function is increased at each iteration. That is, by the definition of $f ( \phi | \theta )$,
  
\begin{equation} \tag{a5} \operatorname { log } \int f ( \theta , \phi ) d \phi = \operatorname { log } f ( \theta , \phi ) - \operatorname { log } f ( \phi | \theta ) = Q ( \theta | \theta ^ { ( t ) } ) - \int \operatorname { log } f ( \phi | \theta ) f ( \phi | \theta ^ { ( t ) } ) d \phi, \end{equation}
  
where the second equality follows by averaging over $\phi$ according to $f ( \phi | \theta ^ { ( t ) } )$. Since the first term of (a5) is maximized by $\theta ^ { ( t + 1 ) }$, and the second is minimized by $\theta ^ { ( t ) }$ (under the assumption that the support of $f ( \phi | \theta )$ does not depend on $\theta$), one obtains
  
\begin{equation*} \operatorname { log } \int f ( \theta ^ { ( t + 1 ) } , \phi ) d \phi \geq \operatorname { log } \int f ( \theta ^ { (  t  ) } , \phi ) d \phi \end{equation*}
  
for $t = 0,1 , \ldots$. This property not only contributes to the stability of the algorithm, but also is very valuable for diagnosing implementation errors.
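
In practice this ascent property doubles as a debugging device: one evaluates $\operatorname { log } \int f ( \theta ^ { ( t ) } , \phi ) d \phi$ at every iteration and flags any decrease beyond numerical round-off. A minimal sketch, assuming user-supplied routines for the EM update and the observed-data objective:

<pre>
def run_em_checked(theta, em_update, loglik, n_iter=100, slack=1e-8):
    """Run EM while verifying the ascent property; a decrease of the
    observed-data log-likelihood beyond numerical slack almost always
    signals an implementation error."""
    ll = loglik(theta)
    for t in range(n_iter):
        theta = em_update(theta)
        ll_new = loglik(theta)
        if ll_new < ll - slack:
            raise RuntimeError("log-likelihood decreased at iteration %d" % t)
        ll = ll_new
    return theta
</pre>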
  
Although the EM algorithm is not guaranteed to converge to even a local mode (it can converge to a saddle point [[#References|[a25]]] or even a local minimum [[#References|[a1]]]), this can easily be avoided in practice by using several  "overdispersed"  starting values. (Details of convergence properties are developed in, among others, [[#References|[a4]]], [[#References|[a32]]], [[#References|[a2]]], [[#References|[a21]]].) Running the EM algorithm with several starting values is also recommended because it can help one to find multiple local modes of $\int f ( \theta , \phi ) d \phi$, an important advantage for statistical analysis.
  
 
==Extensions and enhancements.==
 
 
The advantages of the EM algorithm are diminished when either the E-step or the M-step are difficult to compute or the algorithm is very slow to converge. There are a number of extensions to the EM algorithm that can be successful in dealing with these difficulties.
 
For example, the ECM algorithm ([[#References|[a20]]], [[#References|[a17]]]) is useful when $Q ( \theta | \theta ^ { ( t ) } )$ is difficult to maximize as a function of $\theta$, but can be easily maximized when $\theta$ is constrained to one of several subspaces of the original parameter space. One assumes that the aggregation of these subspaces is space-filling in the sense that the sequence of conditional maximizations searches the entire parameter space. Thus, ECM replaces the M-step with a sequence of CM-steps (i.e., conditional maximizations) while maintaining the convergence properties of the EM algorithm, including monotone convergence.
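
Schematically, an ECM iteration keeps the E-step and then cycles through the CM-steps; in the following sketch the partition of $\theta$ into blocks and the routines <code>e_step</code> and <code>cm_steps</code> are model-specific place-holders rather than part of the general algorithm.

<pre>
# Sketch of the ECM structure: the single M-step is replaced by a cycle of
# conditional maximizations (CM-steps), each maximizing Q over one block of
# the parameter with the remaining blocks held fixed.
def ecm_iteration(theta, e_step, cm_steps):
    q = e_step(theta)             # usual E-step at the current theta
    for cm in cm_steps:           # each cm updates one block of theta
        theta = cm(theta, q)      # conditional maximization of Q
    return theta
</pre>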
  
 
The situation is somewhat more difficult when the E-step is difficult to compute, since numerical integration can be very expensive computationally. The Monte-Carlo EM algorithm [[#References|[a31]]] suggests using Monte-Carlo integration in the E-step. In some cases, Markov chain Monte-Carlo integration methods have been used successfully (e.g., [[#References|[a15]]], [[#References|[a16]]], [[#References|[a22]]], [[#References|[a3]]]). The nested EM algorithm [[#References|[a27]]] offers a general strategy for efficient implementation of Markov chain Monte Carlo within the E-step of the EM algorithm.
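
Schematically, a Monte-Carlo E-step replaces the integral defining $Q ( \theta | \theta ^ { ( t ) } )$ by an average over draws of the missing data from $f ( \phi | \theta ^ { ( t ) } )$; in the following sketch the sampler and the complete-data log-likelihood are model-specific place-holders.

<pre>
# Sketch of a Monte-Carlo E-step: Q(theta | theta_t) is approximated by an
# average of complete-data log-likelihoods over draws of the missing data
# phi_s ~ f(phi | theta_t).  draw_missing and complete_loglik are assumed
# supplied for the model at hand.
def monte_carlo_q(theta, theta_t, draw_missing, complete_loglik, m=1000):
    draws = [draw_missing(theta_t) for _ in range(m)]
    return sum(complete_loglik(theta, phi) for phi in draws) / m
</pre>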
 
Much emphasis in recent research is on speeding up the EM algorithm without sacrificing its stability or simplicity. A very fruitful method is through efficient data augmentation, that is, simple and  "small"  augmentation schemes, where  "small"  is measured by the so-called Fisher information — see [[#References|[a23]]] and the accompanying discussion for details. Briefly, the smaller the augmentation, the better $Q ( \theta | \theta ^ { * } )$ approximates $\int f ( \theta , \phi ) d \phi$ and thus the faster the algorithm; here, $\theta ^ { * }$ is the limit of the EM sequence — note that $Q ( \theta | \theta ^ { * } )$ and $\int f ( \theta , \phi ) d \phi$ share the same stationary point(s) [[#References|[a4]]]. Thus, the SAGE [[#References|[a5]]], the ECME [[#References|[a11]]], and the more general AECM [[#References|[a23]]] algorithms build on the ECM algorithm by allowing this approximation to vary from CM-step to CM-step, in order to obtain a better approximation for some CM-steps as long as this does not complicate the resulting CM-step. X.L. Meng and D.A. van Dyk [[#References|[a23]]] also introduce a working parameter into $f ( \theta , \phi )$, that is, they define $f ( \theta , \phi , \alpha )$ such that
  
\begin{equation*} \int f ( \theta , \phi ) d \phi = \int f ( \theta , \phi , \alpha ) d \phi \end{equation*}
  
for all $\alpha$ in some class $A$. They then choose $\alpha$ in order to optimize the approximation in terms of the rate of convergence of the resulting algorithm while maintaining the simplicity and stability of the EM algorithm. This results in simple, stable, and fast algorithms in a number of statistical applications. For example, in the $t$-model this procedure results in replacing the update for $\Sigma$ given in (a4) with
  
\begin{equation*} \Sigma ^ { ( t + 1 ) } = \frac { \sum _ { i } w _ { i } ^ { ( t + 1 ) } ( y _ { i } - \mu ^ { ( t + 1 ) } ) ( y _ { i } - \mu ^ { ( t + 1 ) } ) ^ { T } } { \sum _ { i } w _ { i } ^ { ( t + 1 ) } }. \end{equation*}
  
 
The rest of the algorithm remains the same; this simple change clearly maintains the simplicity of the algorithm, but it can result in a dramatic reduction in computational time compared to the standard IRLS algorithm.
 
The parameter-expanded EM algorithm or PXEM algorithm [[#References|[a12]]] also works with a working parameter, but rather than conditioning on the optimal $\alpha$ it marginalizes out $\alpha$ by fitting it in each iteration. Detailed discussion and comparison of such conditional augmentation and marginal augmentation in the more general context of stochastic simulations (e.g., the Gibbs sampler) can be found in [[#References|[a24]]], [[#References|[a13]]], and [[#References|[a28]]].
  
 
Other acceleration techniques have been developed by combining the EM algorithm with various numerical methods, such as Aitken acceleration (e.g., [[#References|[a14]]]), Newton–Raphson [[#References|[a7]]], quasi-Newton methods (e.g., [[#References|[a8]]]), and conjugate-gradient acceleration [[#References|[a6]]]. These methods, however, typically sacrifice monotone convergence and therefore require extra programming and special monitoring.
 
 
A final pair of extensions is designed to compute
 
\begin{equation*} H \equiv - \left. \frac { \partial ^ { 2 } } { \partial \theta \, \partial \theta ^ { T } } \operatorname { log } \int f ( \theta , \phi ) d \phi \right| _ { \theta = \theta ^ { * } }, \end{equation*}
  
whose inverse gives a large-sample variance of $\theta ^ { * }$ in statistical analysis. The supplemented EM [[#References|[a19]]] and supplemented ECM [[#References|[a29]]] algorithms combine the analytically computable
  
\begin{equation*} \left. \frac { \partial ^ { 2 } } { \partial \theta \, \partial \theta ^ { T } } Q ( \theta | \theta ^ { * } ) \right| _ { \theta = \theta ^ { * } } \end{equation*}
  
with numerical differentiation of the EM mapping in order to compute $H$ via a simple matrix identity.
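
The numerical ingredient is the Jacobian of the EM mapping $\theta \mapsto \theta ^ { ( t + 1 ) }$ at $\theta ^ { * }$, which can be approximated by forward differences as in the following sketch (the step size and function names are illustrative); the matrix identity that combines this Jacobian with the second derivative of $Q$ above to yield $H$ is given in [[#References|[a19]]].

<pre>
import numpy as np

def em_map_jacobian(em_update, theta_star, eps=1e-5):
    """Forward-difference approximation to the Jacobian of the EM mapping
    theta -> em_update(theta), evaluated at theta_star; em_update is assumed
    to return a NumPy array of the same shape as its argument."""
    theta_star = np.asarray(theta_star, dtype=float)
    d = theta_star.size
    base = np.asarray(em_update(theta_star), dtype=float)
    DM = np.empty((d, d))
    for j in range(d):
        step = np.zeros(d)
        step[j] = eps
        DM[:, j] = (np.asarray(em_update(theta_star + step)) - base) / eps
    return DM
</pre>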
  
 
====References====
 
<table><tr><td valign="top">[a1]</td> <td valign="top">  O. Arslan,  P.D.L. Constable,  J.T. Kent,  "Domains of convergence for the EM algorithm: A cautionary tale in a location estimation problem"  ''Statist. Comput.'' , '''3'''  (1993)  pp. 103–108</td></tr><tr><td valign="top">[a2]</td> <td valign="top">  R.A. Boyles,  "On the convergence of the EM algorithm"  ''J.R. Statist. Soc.'' , '''B45'''  (1983)  pp. 47–50</td></tr><tr><td valign="top">[a3]</td> <td valign="top">  J.S.K. Chan,  A.Y.C. Kuk,  "Maximum likelihood estimation for probit-linear mixed models with correlated random effects"  ''Biometrics'' , '''53'''  (1997)  pp. 86–97</td></tr><tr><td valign="top">[a4]</td> <td valign="top">  A.P. Dempster,  N.M. Laird,  D.B. Rubin,  "Maximum likelihood estimation from incomplete-data via the EM algorithm (with discussion)"  ''J. R. Statist. Soc.'' , '''B39'''  (1977)  pp. 1–38</td></tr><tr><td valign="top">[a5]</td> <td valign="top">  J.A. Fessler,  A.O. Hero,  "Space-alternating generalized expectation-maximization algorithm"  ''IEEE Trans. Signal Processing'' , '''42'''  (1994)  pp. 2664–77</td></tr><tr><td valign="top">[a6]</td> <td valign="top">  M. Jamshidian,  R.I. Jennrich,  "Conjugate gradient acceleration of the EM algorithm"  ''J. Amer. Statist. Assoc.'' , '''88'''  (1993)  pp. 221–228</td></tr><tr><td valign="top">[a7]</td> <td valign="top">  K. Lange,  "A gradient algorithm locally equivalent to the EM algorithm"  ''J. R. Statist. Soc.'' , '''B57'''  (1995)  pp. 425–438</td></tr><tr><td valign="top">[a8]</td> <td valign="top">  K. Lange,  "A quasi-Newtonian acceleration of the EM algorithm"  ''Statistica Sinica'' , '''5'''  (1995)  pp. 1–18</td></tr><tr><td valign="top">[a9]</td> <td valign="top">  K. Lange,  R.J.A. Little,  J.M.G. Taylor,  "Robust statistical modeling using the t-distribution"  ''J. Amer. Statist. Assoc.'' , '''84'''  (1989)  pp. 881–896</td></tr><tr><td valign="top">[a10]</td> <td valign="top">  C. Liu,  D.B. Rubin,  "The ECME algorithm: a simple extension of EM and ECM with fast monotone convergence"  ''Biometrika'' , '''81'''  (1994)  pp. 633–48</td></tr><tr><td valign="top">[a11]</td> <td valign="top">  C. Liu,  D.B. Rubin,  "ML estimation of the t-distribution using EM and its extensions, ECM and ECME"  ''Statistica Sinica'' , '''5'''  (1995)  pp. 19–40</td></tr><tr><td valign="top">[a12]</td> <td valign="top">  C. Liu,  D.B. Rubin,  Y.N. Wu,  "Parameter expansion to accelerate EM: the PX-EM algorithm"  ''Biometrika''  (1998)  pp. 755–770</td></tr><tr><td valign="top">[a13]</td> <td valign="top">  J.S. Liu,  Y. Wu,  "Parameter expansion for data augmentation"  ''J. Amer. Statist. Assoc.'' , '''to appear'''  (1999)</td></tr><tr><td valign="top">[a14]</td> <td valign="top">  T.A. Louis,  "Finding the observed information matrix when using the EM algorithm"  ''J. R. Statist. Soc.'' , '''B44'''  (1982)  pp. 226–233</td></tr><tr><td valign="top">[a15]</td> <td valign="top">  C.E. McCulloch,  "Maximum likelihood variance components estimation for binary data"  ''J. Amer. Statist. Assoc.'' , '''89'''  (1994)  pp. 330–335</td></tr><tr><td valign="top">[a16]</td> <td valign="top">  C.E. McCulloch,  "Maximum likelihood algorithms for generalized linear mixed models"  ''J. Amer. Statist. Assoc.'' , '''92'''  (1997)  pp. 162–170</td></tr><tr><td valign="top">[a17]</td> <td valign="top">  X.L. Meng,  "On the rate of convergence of the ECM algorithm"  ''Ann. Math. Stat.'' , '''22'''  (1994)  pp. 
326–339</td></tr><tr><td valign="top">[a18]</td> <td valign="top">  X.L. Meng,  S. Pedlow,  "EM: A bibliographic review with missing articles" , ''Proc. Statist. Comput. Sect. 24-27 Washington, D.C.'' , Amer. Statist. Assoc.  (1992)</td></tr><tr><td valign="top">[a19]</td> <td valign="top">  X.L. Meng,  D.B. Rubin,  "Using EM to obtain asymptotic variance-covariance matrices: the SEM algorithm"  ''J. Amer. Statist. Assoc.'' , '''86'''  (1991)  pp. 899–909</td></tr><tr><td valign="top">[a20]</td> <td valign="top">  X.L. Meng,  D.B. Rubin,  "Maximum likelihood estimation via the ECM algorithm: a general framework"  ''Biometrika'' , '''80'''  (1993)  pp. 267–78</td></tr><tr><td valign="top">[a21]</td> <td valign="top">  X.L. Meng,  D.B. Rubin,  "On the global and componentwise rates of convergence of the EM algorithm"  ''LALG'' , '''199'''  (1994)  pp. 413–425</td></tr><tr><td valign="top">[a22]</td> <td valign="top">  X.L. Meng,  S. Schilling,  "Fitting full-information item factor models and an empirical investigation of bridge sampling"  ''J. Amer. Statist. Assoc.'' , '''91'''  (1996)  pp. 1254–1267</td></tr><tr><td valign="top">[a23]</td> <td valign="top">  X.L. Meng,  D.A. van Dyk,  "The EM algorithm — an old folk song sung to a fast new tune (with discussion)"  ''J. R. Statist. Soc.'' , '''B59'''  (1997)  pp. 511–567</td></tr><tr><td valign="top">[a24]</td> <td valign="top">  X.L. Meng,  D.A. van Dyk,  "Seeking efficient data augmentation schemes via conditional and marginal augmentation"  ''Biometrika''  (1999)  pp. 301–320</td></tr><tr><td valign="top">[a25]</td> <td valign="top">  G.D. Murray,  "Comments on the paper by A.P. Dempster, N.M. Laird and D.B. Rubin"  ''J. R. Statist. Soc.'' , '''B39'''  (1977)  pp. 27–28</td></tr><tr><td valign="top">[a26]</td> <td valign="top">  D.B. Rubin,  "Inference and missing data"  ''Biometrika'' , '''63'''  (1976)  pp. 581–592</td></tr><tr><td valign="top">[a27]</td> <td valign="top">  D.A. van Dyk,  "Nesting EM Algorithms for computational efficiency"  ''Statistica Sinica'' , '''to appear'''  (2000)</td></tr><tr><td valign="top">[a28]</td> <td valign="top">  D.A. van Dyk,  X.L. Meng,  "The art of data augmentation"  ''Statist. Sci.'' , '''Submitted'''  (1999)</td></tr><tr><td valign="top">[a29]</td> <td valign="top">  D.A. van Dyk,  X.L. Meng,  D.B. Rubin,  "Maximum likelihood estimation via the ECM algorithm: computing the asymptotic variance"  ''Statistica Sinica'' , '''5'''  (1995)  pp. 55–75</td></tr><tr><td valign="top">[a30]</td> <td valign="top">  Y. Vardi,  D. Lee,  "From image deblurring to optimal investments: maximum likelihood solutions for positive linear inverse problems (with discussion)"  ''J. R. Statist. Soc.'' , '''B55'''  (1993)  pp. 569–598</td></tr><tr><td valign="top">[a31]</td> <td valign="top">  G.C.G. Wei,  M.A. Tanner,  "A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithm"  ''J. Amer. Statist. Assoc.'' , '''85'''  (1990)  pp. 699–704</td></tr><tr><td valign="top">[a32]</td> <td valign="top">  C.F.J. Wu,  "On the convergence properties of the EM algorithm"  ''Ann. Math. Stat.'' , '''11'''  (1983)  pp. 95–103</td></tr></table>

Latest revision as of 17:03, 1 July 2020

expectation-maximization algorithm

An iterative optimization procedure [a4] for computing

\begin{equation} \tag{a1} \theta ^ { * } = \operatorname { arg } \operatorname { max } _ { \theta \in \Theta } \int f ( \theta , \phi ) d \phi, \end{equation}

where $f ( \theta , \phi )$ is a non-negative real-valued function which is integrable as a function of $\phi$ for each $\theta$. The EM algorithm was developed for statistical inference in problems with incomplete data or problems that can be formulated as such (e.g., with latent-variable modeling) and is a very popular method of computing maximum likelihood, restricted maximum likelihood, penalized maximum likelihood, and maximum posterior estimates. The name "EM algorithm" stems from this context, where "E" stands for the expectation (i.e., integration) step and "M" stands for the maximization step. In particular, the algorithm starts with an initial value $\theta ^ { ( 0 ) } \in \Theta$ and, for $t = 0,1 , \ldots$, iterates the following two steps until convergence:

E-step: Compute $Q ( \theta | \theta ^ { ( t ) } ) = \int \operatorname { log } f ( \theta , \phi ) f ( \phi | \theta ^ { ( t ) } ) d \phi$, where $f ( \phi | \theta ) = f ( \theta , \phi ) / \int f ( \theta , \phi ) d \phi$.

M-step: Determine $\theta ^ { ( t + 1 ) }$ by maximizing $Q ( \theta | \theta ^ { ( t ) } )$, that is, find $\theta ^ { ( t + 1 ) }$ so that $Q ( \theta ^ { ( t + 1 ) } | \theta ^ { ( t ) } ) \geq Q ( \theta | \theta ^ { ( t ) } )$ for all $\theta \in \Theta$.
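
In schematic form, the two steps can be organized in a short driver routine. The following is a minimal Python sketch (not from the original article); the routines e_step, m_step and loglik are illustrative placeholders that must be supplied for the problem at hand.

<pre>
# Minimal EM driver (illustrative sketch, not from the article).  The user
# supplies e_step(theta), returning whatever the M-step needs (e.g. the
# expected complete-data sufficient statistics that define Q(. | theta)),
# m_step(expectations), returning the maximizer of Q(. | theta), and
# loglik(theta), the objective log of the integral of f(theta, phi) d phi.
def em(theta0, e_step, m_step, loglik, tol=1e-8, max_iter=1000):
    theta, old_ll = theta0, loglik(theta0)
    for _ in range(max_iter):
        expectations = e_step(theta)   # E-step
        theta = m_step(expectations)   # M-step
        new_ll = loglik(theta)
        # The objective never decreases; a decrease signals a coding error.
        assert new_ll >= old_ll - 1e-10, "objective decreased"
        if new_ll - old_ll < tol:
            break
        old_ll = new_ll
    return theta
</pre>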

The usefulness of the EM algorithm is apparent when both of these steps can be accomplished with minimal analytic and computational effort but either the direct maximization or the integration in (a1) is difficult. The attractive convergence properties of the algorithm, along with its many generalizations and extensions, are discussed below. First, however, the algorithm is illustrated via several applications.

Applications.

Maximum-likelihood estimation with missing observations.

The standard notation and terminology for the EM algorithm stem from this important class of applications. In this case the function $f ( \theta , \phi )$ is the complete-data likelihood function, which is written $L ( \theta | Y _ { \text{com} } )$, where $Y _ { \text{com} } = ( Y _ { \text{obs} } , Y _ { \text{mis} } )$ is the complete data which consists of the observed data, $Y _ { \text{obs}}$, and the missing data, $Y _ { \text{mis} }$, $\theta$ is a model parameter, and $\phi = Y _ { \text{mis} }$. One is interested in computing the value of $\theta$ that maximizes $L ( \theta | Y _ { \text{obs} } ) = \int L ( \theta | Y _ { \text{com} } ) d Y_{\text{mis}}$, the maximum-likelihood estimate. Here it is assumed that $Y _ { \text{mis} }$ is missing at random [a26]; more complicated missing data mechanisms can also be accounted for in the formulation of $L ( \theta | Y _ { \text{com} } )$. More generally, one can augment $Y _ { \text{obs}}$ to $Y _ { \operatorname { aug } }$, the augmented data, via a many-to-one mapping, $Y _ { \text{obs} } = \mathcal{M} ( Y _ { \text{aug} } )$, with the understanding that the augmented-data likelihood, $L ( \theta | Y _ { \text { aug } } )$, is linked to $L ( \theta | Y _ { \text{obs} } )$ via $L ( \theta | Y _ { \text{obs}} ) = \int _ { \mathcal{M} ( Y _ { \text { aug } } ) = Y _ { \text { obs } } } L ( \theta | Y _ { \text { aug } } ) d Y _ { \text { aug } }$.

The EM algorithm builds on the intuitive notion that:

i) if there were no missing data, maximum-likelihood estimation would be easy; and

ii) if the model parameters were known, the missing data could easily be imputed (i.e., predicted) by its conditional expectation. These two observations correspond to the M-step and the E-step, respectively, with the proviso that not the missing data, but rather the complete-data log-likelihood should be imputed by its conditional expectation. In particular, the E-step reduces to computing

\begin{equation*} Q ( \theta | \theta ^ { ( t ) } ) = \mathsf{E} \left[ \operatorname { log } L ( \theta | Y _ { \text{aug} } ) | Y _ { \text{obs} } , \theta ^ { ( t ) } \right], \end{equation*}

the conditional expectation of the complete-data log-likelihood. If the complete-data model is from an exponential family, then $\operatorname { log } L ( \theta | Y _ { \text{aug} } )$ is linear in a set of complete-data sufficient statistics. In this common case, which includes multivariate normal, Poisson, multinomial, and exponential models (among others), computing $Q ( \theta | \theta ^ { ( t ) } )$ involves only routine calculations. The M-step then computes the maximum-likelihood estimates as if there were no missing data, using the imputed complete-data sufficient statistics from the E-step as inputs.
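
As a small hedged illustration of this exponential-family case (an example chosen here for concreteness, not one discussed in the article), consider lifetimes that are exponentially distributed with rate $\theta$, where some observations are right-censored at known times: the complete-data log-likelihood is linear in the sufficient statistic $\sum _ { i } y _ { i }$, the E-step imputes each censored lifetime by $c _ { i } + 1 / \theta ^ { ( t ) }$ (by memorylessness), and the M-step is the complete-data maximum-likelihood estimate $n / \sum _ { i } y _ { i }$.

<pre>
import numpy as np

# Hedged illustration: EM for exponential lifetimes with rate theta when some
# observations are right-censored at known times c (an incomplete-data
# problem).  The E-step imputes E[Y | Y > c, theta] = c + 1/theta, and the
# M-step is the complete-data MLE n / sum(imputed data).
def em_censored_exponential(y_obs, c_cens, theta=1.0, n_iter=200):
    y_obs = np.asarray(y_obs, dtype=float)    # fully observed lifetimes
    c_cens = np.asarray(c_cens, dtype=float)  # censoring times of the rest
    n = len(y_obs) + len(c_cens)
    for _ in range(n_iter):
        imputed = c_cens + 1.0 / theta                 # E-step
        theta = n / (y_obs.sum() + imputed.sum())      # M-step
    return theta

# The fixed point agrees with the closed-form MLE:
# (number of uncensored observations) / (sum of all observed and censoring times).
</pre>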

Linear inverse problems with positivity restrictions.

Consider the system of equations

\begin{equation*} g _ { j } = \sum _ { i = 1 } ^ { M } f _ { i } h _ {i j } , j = 1 , \ldots , N, \end{equation*}

where $g _ { j } > 0$, $h_{i j} \geq 0$, $\sum _ { j } h _ { ij } > 0$ are known, and one wishes to solve for $f _ { i } > 0$. Such problems are omnipresent in scientific and engineering applications and can be solved with the EM algorithm [a30]. First one notes that, without loss of generality, one may assume that $\sum _ { j } g _ { j } = \sum _ { i } f _ { i } = \sum _ { j } h _ { i j } = 1$, that is, $g = ( g _ { 1 } , \dots , g _ { N } )$, $f = ( f _ { 1 } , \ldots , f _ { M } )$ and $h _ { i } = ( h _ { i 1 } , \dots , h _ { i N } )$ for $i = 1 , \dots , M$ are discrete probability measures (see [a30] for details). Thus, $f$ can be viewed as the vector of mixing weights in a finite mixture model whose marginal distribution is $g$. A finite mixture model is a hierarchical model: one can suppose that an integer $z \in \{ 1 , \dots , M \}$ is chosen according to $f$, and that data are then generated conditional on $z$ according to $h _ { z }$; the marginal distribution of the generated data is $g$. If one considers $z$ to be missing data, one can derive an EM algorithm of the form

\begin{equation*} f _ { i } ^ { ( t + 1 ) } = f _ { i } ^ { ( t ) } \sum _ { j } \left( \frac { h _ { i j } } { \sum _ { k } f _ { k } ^ { ( t ) } h _ { k j } } \right) g _ { j } , \quad t = 1,2 , \ldots . \end{equation*}

Y. Vardi and D. Lee [a30] demonstrate that this iteration converges to the value that minimizes the Kullback–Leibler information divergence between $g$ and $\sum _ { i } f _ { i } h _ { i }$ over all non-negative $f$, which is the desired non-negative solution if it exists.
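
This multiplicative iteration translates directly into a short Python sketch; the variable names below are illustrative, and the inputs are rescaled to the probability normalization described above.

<pre>
import numpy as np

# Sketch of the multiplicative EM update above.  H has rows h_i (an M x N
# array), g has length N; both are rescaled to the probability normalization
# described in the text.
def em_linear_inverse(H, g, n_iter=500):
    H = np.asarray(H, dtype=float)
    g = np.asarray(g, dtype=float)
    g = g / g.sum()
    H = H / H.sum(axis=1, keepdims=True)
    f = np.full(H.shape[0], 1.0 / H.shape[0])  # uniform starting value
    for _ in range(n_iter):
        fitted = f @ H                  # sum_k f_k^(t) h_kj, for each j
        f = f * (H @ (g / fitted))      # f_i^(t+1) = f_i^(t) sum_j h_ij g_j / fitted_j
    return f
</pre>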

Multivariate $t$-model, a latent variable model.

The previous example illustrates an important principle in the application of the EM algorithm. Namely, one is often initially interested in optimizing a function $f ( \theta )$ and purposely embeds $f ( \theta )$ in a function, $f ( \theta , \phi )$, with a larger domain, such that $f ( \theta ) = \int f ( \theta , \phi ) d \phi$, in order to use the EM algorithm. In the previous example, this was accomplished by identifying $\phi = Y _ { \text{mis} }$ with $z$. Similar strategies are often fruitful in various latent variable models. Consider, for example, the multivariate $t$-model, which is useful for robust estimation (cf. Robust statistics; see also, e.g., [a9], [a11]),

\begin{equation} \tag{a2} f ( y | \mu , \Sigma , \nu ) \propto \end{equation}

\begin{equation*} \propto | \Sigma | ^ { - 1 / 2 } [ \nu + ( y - \mu ) ^ { T } \Sigma ^ { - 1 } ( y - \mu ) ] ^ { - ( \nu + p ) / 2 }, \end{equation*}

where $y , \mu \in \mathbf{R} ^ { p }$, $\Sigma$ is a positive-definite $( p \times p )$-matrix, $\theta = ( \mu , \Sigma )$, and $\nu \in \mathbf{R} ^ { + }$ is the known degrees of freedom. Given a random sample $\{ y _ { i } : i = 1 , \dots , n \} = Y _ { \operatorname{obs} }$, assumed to be from (a2), one wishes to find the maximizer of $L ( \mu , \Sigma | Y _ { \text{obs} } ) = \prod _ { i = 1 } ^ { n } f ( y _ { i } | \mu , \Sigma , \nu )$, which is known to have no general closed-form solution. In order to apply the EM algorithm, one embeds $L ( \mu , \Sigma | Y _ { \operatorname{obs} } )$ into a larger model $L ( \mu , \Sigma | Y _ { \text{aug} } )$. This is accomplished via the well-known representation

\begin{equation} \tag{a3} t = \mu + \frac { \Sigma ^ { 1 / 2 } Z } { \sqrt { q } }, \end{equation}

where $t$ follows the $t$-distribution in (a2), $Z \sim N _ { p } ( 0 , I )$, and $q \sim \chi _ { \nu } ^ { 2 } / \nu$ is a chi-square variable with $\nu$ degrees of freedom divided by its mean, independent of $Z$. From (a3), one sees that the distribution of $t$ conditional on $q$ is multivariate normal. Thus, if one had observed $q$, maximum-likelihood estimation would be easy. One therefore defines $Y _ { \text{aug} } = \{ ( y _ { i } , q _ { i } ) : i = 1 , \ldots , n \}$ and $L ( \mu , \Sigma | Y _ { \text{aug} } ) = \prod _ { i = 1 } ^ { n } f ( y _ { i } | \mu , \Sigma , \nu , q _ { i } ) f ( q _ { i } | \nu )$, which is easy to maximize as a function of $\mu$ and $\Sigma$ due to the conditional normality of $t$. Since $\operatorname { log } L ( \mu , \Sigma | Y _ { \text{aug} } )$ is linear in $( q _ { 1 } , \dots , q _ { n } )$, the $( t + 1 )$st iteration of the EM algorithm has a simple E-step that computes

\begin{equation*} w _ { i } ^ { ( t + 1 ) } = \mathsf{E} ( q _ { i } | y _ { i } , \mu ^ { ( t ) } , \Sigma ^ { ( t ) } ) = \frac { \nu + p } { \nu + d _ { i } ^ { ( t ) } } , i = 1 , \dots , n, \end{equation*}

where $d _ { i } ^ { ( t ) } = ( y _ { i } - \mu ^ { ( t ) } ) ^ { T } [ \Sigma ^ { ( t ) } ] ^ { - 1 } ( y _ { i } - \mu ^ { ( t ) } )$. The M-step then maximizes $Q ( \theta | \theta ^ { ( t ) } )$ to obtain

\begin{equation*} \mu ^ { ( t + 1 ) } = \frac { \sum _ { i } w _ { i } ^ { ( t + 1 ) } y _ { i } } { \sum _ { i } w _ { i } ^ { ( t + 1 ) } }, \end{equation*}

and

\begin{equation} \tag{a4} \Sigma ^ { ( t + 1 ) } = \frac { 1 } { n } \sum _ { i } w _ { i } ^ { ( t + 1 ) } ( y _ { i } - \mu ^ { ( t + 1 ) } ) ( y _ { i } - \mu ^ { ( t + 1 ) } ) ^ { T }. \end{equation}

Given the weights $( w _ { 1 } ^ { ( t + 1 ) } , \ldots , w _ { n } ^ { ( t + 1 ) } )$, the M-step corresponds to weighted least squares; thus this EM algorithm, which often goes by the name iteratively reweighted least squares (IRLS), is trivial to program and use and, as will be seen in the next section, exhibits unusually stable convergence.
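
The E-step weights and the M-step updates above translate directly into code. A minimal sketch (with $Y$ an $n \times p$ data matrix, the degrees of freedom $\nu$ known, and illustrative names throughout) is:

<pre>
import numpy as np

# Sketch of the EM (IRLS) iteration for the multivariate t-model with known
# degrees of freedom nu; Y is an (n x p) data matrix.
def em_multivariate_t(Y, nu, n_iter=200):
    Y = np.asarray(Y, dtype=float)
    n, p = Y.shape
    mu = Y.mean(axis=0)
    Sigma = np.cov(Y, rowvar=False)
    for _ in range(n_iter):
        R = Y - mu
        d = np.einsum('ij,jk,ik->i', R, np.linalg.inv(Sigma), R)  # d_i^(t)
        w = (nu + p) / (nu + d)                                   # E-step weights
        mu = (w[:, None] * Y).sum(axis=0) / w.sum()               # M-step: mu^(t+1)
        R = Y - mu
        Sigma = (w[:, None] * R).T @ R / n                        # M-step: update (a4)
    return mu, Sigma
</pre>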

Properties of the EM algorithm.

The EM algorithm has enjoyed wide popularity in many scientific fields from the 1970s onwards (e.g., [a18], [a23]). This is primarily due to easy implementation and stable convergence. As the previous paragraph illustrates, the EM algorithm is often easy to program and use. Although the algorithm may take many iterations to converge relative to other optimization routines (e.g., Newton–Raphson), each iteration is often easy to program and quick to compute. Moreover, the EM algorithm is less sensitive to poor starting values, and can be easier to use with many parameters since the iterations necessarily remain in the parameter space and no second derivatives are required. Finally, the EM algorithm has the very important property that the objective function is increased at each iteration. That is, by the definition of $f ( \phi | \theta )$,

\begin{equation} \tag{a5} \operatorname { log } \int f ( \theta , \phi ) d \phi = \operatorname { log } f ( \theta , \phi ) - \operatorname { log } f ( \phi | \theta ) = \end{equation}

\begin{equation*} = Q ( \theta | \theta ^ { ( t ) } ) - \int \operatorname { log } f ( \phi | \theta ) f ( \phi | \theta ^ { ( t ) } ) d \phi, \end{equation*}

where the second equality follows by averaging over $\phi$ according to $f ( \phi | \theta ^ { ( t ) } )$. Since the first term of (a5) is maximized by $\theta ^ { ( t + 1 ) }$, and the second is minimized by $\theta ^ { ( t ) }$ (under the assumption that the support of $f ( \phi | \theta )$ does not depend on $\theta$), one obtains

\begin{equation*} \operatorname { log } \int f ( \theta ^ { ( t + 1 ) } , \phi ) d \phi \geq \operatorname { log } \int f ( \theta ^ { ( t ) } , \phi ) d \phi \end{equation*}

for $t = 0,1 , \ldots$. This property not only contributes to the stability of the algorithm, but also is very valuable for diagnosing implementation errors.

Although the EM algorithm is not guaranteed to converge even to a local mode (it can converge to a saddle point [a25] or even to a local minimum [a1]), this can easily be avoided in practice by using several "overdispersed" starting values. (Details of the convergence properties are developed in, among others, [a4], [a32], [a2], [a21].) Running the EM algorithm from several starting values is also recommended because it can help one to find multiple local modes of $\int f ( \theta , \phi ) d \phi$, an important advantage for statistical analysis.
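
A sketch of this multiple-starting-value strategy, with run_em, loglik and draw_start as hypothetical placeholders for a problem-specific EM routine, the objective, and a rule generating overdispersed initial values:

<pre>
import numpy as np

# Run EM from several overdispersed starting values and compare the limits.
# run_em, loglik and draw_start are hypothetical problem-specific routines.
def em_multi_start(run_em, loglik, draw_start, n_starts=10, seed=0):
    rng = np.random.default_rng(seed)
    fits = [run_em(draw_start(rng)) for _ in range(n_starts)]
    values = [loglik(theta) for theta in fits]
    best = fits[int(np.argmax(values))]
    # Distinct limits with comparable objective values point to multiple modes.
    return best, fits, values
</pre>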

Extensions and enhancements.

The advantages of the EM algorithm are diminished when either the E-step or the M-step are difficult to compute or the algorithm is very slow to converge. There are a number of extensions to the EM algorithm that can be successful in dealing with these difficulties.

For example, the ECM algorithm ([a20], [a17]) is useful when $Q ( \theta | \theta ^ { ( t ) } )$ is difficult to maximize as a function of $\theta$, but can be easily maximized when $\theta$ is constrained to one of several subspaces of the original parameter space. One assumes that the aggregation of these subspaces is space-filling in the sense that the sequence of conditional maximizations searches the entire parameter space. Thus, ECM replaces the M-step with a sequence of CM-steps (i.e., conditional maximizations) while maintaining the convergence properties of the EM algorithm, including monotone convergence.
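
Schematically, with the parameter split into two blocks $\theta = ( \theta _ { 1 } , \theta _ { 2 } )$, the single M-step is replaced by a cycle of conditional maximizations; in the sketch below, cm_step_1 and cm_step_2 are hypothetical problem-specific maximizers of $Q$ over one block with the other held fixed.

<pre>
# Schematic CM-cycle replacing a single M-step (ECM).  Each cm_step maximizes
# Q(. | theta^(t)) over its own block with the other block held fixed.
def ecm_update(theta1, theta2, expectations, cm_step_1, cm_step_2):
    theta1 = cm_step_1(expectations, theta2)   # CM-step 1: maximize over theta1
    theta2 = cm_step_2(expectations, theta1)   # CM-step 2: maximize over theta2
    return theta1, theta2
</pre>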

The situation is more challenging when the E-step is difficult to compute, since numerical integration can be computationally very expensive. The Monte Carlo EM algorithm [a31] replaces the exact E-step with Monte Carlo integration. In some cases, Markov chain Monte Carlo integration methods have been used successfully (e.g., [a15], [a16], [a22], [a3]). The nested EM algorithm [a27] offers a general strategy for efficient implementation of Markov chain Monte Carlo within the E-step of the EM algorithm.
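
Schematically, the Monte Carlo E-step approximates $Q ( \theta | \theta ^ { ( t ) } )$ by an average of $\operatorname { log } f ( \theta , \phi )$ over draws $\phi \sim f ( \phi | \theta ^ { ( t ) } )$. In the sketch below, draw_phi and log_f are hypothetical problem-specific routines (the draws may equally come from a Markov chain Monte Carlo sampler when direct simulation is infeasible), and the M-step then maximizes the returned approximation, e.g. with a numerical optimizer.

<pre>
import numpy as np

# Monte Carlo approximation of the E-step: Q(. | theta_t) is replaced by an
# average of log f(theta, phi) over simulated phi ~ f(phi | theta_t).
def monte_carlo_q(theta_t, draw_phi, log_f, n_draws=1000, seed=0):
    rng = np.random.default_rng(seed)
    draws = [draw_phi(theta_t, rng) for _ in range(n_draws)]
    return lambda theta: np.mean([log_f(theta, phi) for phi in draws])
</pre>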

Much emphasis in recent research is on speeding up the EM algorithm without sacrificing its stability or simplicity. A very fruitful method is through efficient data augmentation, that is, simple and "small" augmentation schemes, where "small" is measured by the so-called Fisher information — see [a23] and the accompanying discussion for details. Briefly, the smaller the augmentation, the better $Q ( \theta | \theta ^ { * } )$ approximates $\int f ( \theta , \phi ) d \phi$ and thus the faster the algorithm; here, $\theta ^ { * }$ is the limit of the EM sequence — note that $Q ( \theta | \theta ^ { * } )$ and $\int f ( \theta , \phi ) d \phi$ share the same stationary point(s) [a4]. Thus, the SAGE [a5], the ECME [a11], and the more general AECM [a23] algorithms build on the ECM algorithm by allowing this approximation to vary from CM-step to CM-step, in order to obtain a better approximation for some CM-steps as long as this does not complicate the resulting CM-step. X.L. Meng and D.A. van Dyk [a23] also introduce a working parameter into $f ( \theta , \phi )$, that is, they define $f ( \theta , \phi , \alpha )$ such that

\begin{equation*} \int f ( \theta , \phi ) d \phi = \int f ( \theta , \phi , \alpha ) d \phi \end{equation*}

for all $\alpha$ in some class $A$. They then choose $\alpha$ to optimize the approximation in terms of the rate of convergence of the resulting algorithm while maintaining the simplicity and stability of the EM algorithm. This yields simple, stable, and fast algorithms in a number of statistical applications. For example, in the $t$-model this procedure replaces the update for $\Sigma$ given in (a4) with

\begin{equation*} \Sigma ^ { ( t + 1 ) } = \frac { \sum _ { i } w _ { i } ^ { ( t + 1 ) } ( y _ { i } - \mu ^ { ( t + 1 ) } ) ( y _ { i } - \mu ^ { ( t + 1 ) } ) ^ { T } } { \sum _ { i } w _ { i } ^ { ( t + 1 ) } }. \end{equation*}

The rest of the algorithm remains the same; this simple change clearly maintains the simplicity of the algorithm, yet it can result in a dramatic reduction in computational time compared to the standard IRLS algorithm.
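
Relative to the illustrative IRLS sketch given above for the $t$-model, this amounts to a one-line change in the update of $\Sigma$:

<pre>
# In the t-model sketch above, replace the update (a4)
#     Sigma = (w[:, None] * R).T @ R / n
# by its rescaled version:
Sigma = (w[:, None] * R).T @ R / w.sum()
</pre>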

The parameter-expanded EM, or PX-EM, algorithm [a12] also works with a working parameter, but rather than conditioning on the optimal $\alpha$ it marginalizes out $\alpha$ by fitting it in each iteration. A detailed discussion and comparison of such conditional augmentation and marginal augmentation in the more general context of stochastic simulation (e.g., the Gibbs sampler) can be found in [a24], [a13], and [a28].

Other acceleration techniques have been developed by combining the EM algorithm with various numerical methods, such as Aitken acceleration (e.g., [a14]), Newton–Raphson [a7], quasi-Newton methods (e.g., [a8]), and conjugate-gradient acceleration [a6]. These methods, however, typically sacrifice monotone convergence and therefore require extra programming and special monitoring.

A final pair of extensions are designed to compute

\begin{equation*} H \equiv - \left. \frac { \partial ^ { 2 } } { \partial \theta \partial \theta ^ { T } } \operatorname { log } \int f ( \theta , \phi ) d \phi \right| _ { \theta = \theta ^ { * } }, \end{equation*}

whose inverse gives a large-sample variance of $\theta ^ { * }$ in statistical analysis. The supplemented EM [a19] and supplemented ECM [a29] algorithms combine the analytically computable

\begin{equation*} \left. \frac { \partial ^ { 2 } } { \partial \theta \partial \theta ^ { T } } Q ( \theta | \theta ^ { * } ) \right| _ { \theta = \theta ^ { * } } \end{equation*}

with numerical differentiation of the EM mapping in order to compute $H$ via a simple matrix identity.
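
For a one-parameter problem the identity is particularly transparent: if $\mathcal{M}$ denotes the mapping defined by one complete EM iteration and $i _ { \text{oc} } = - \partial ^ { 2 } Q ( \theta | \theta ^ { * } ) / \partial \theta ^ { 2 } | _ { \theta = \theta ^ { * } }$, then $H = i _ { \text{oc} } ( 1 - D \mathcal{M} ( \theta ^ { * } ) )$, where $D \mathcal{M} ( \theta ^ { * } )$ is the derivative of the EM mapping at the limit (the fraction of missing information). The sketch below, offered only as an illustration under these assumptions, estimates that derivative by a central difference; em_map and i_oc are hypothetical problem-specific inputs.

<pre>
# One-parameter sketch of the supplemented-EM idea (an illustration under the
# assumptions stated above, not the general matrix version).  em_map(theta)
# performs one full EM iteration; i_oc is the analytically computed
# -d^2 Q(theta | theta_star)/d theta^2 evaluated at theta_star.
def sem_information(em_map, theta_star, i_oc, h=1e-5):
    dM = (em_map(theta_star + h) - em_map(theta_star - h)) / (2.0 * h)
    H = i_oc * (1.0 - dM)   # observed information; 1/H is a large-sample variance
    return H
</pre>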

References

[a1] O. Arslan, P.D.L. Constable, J.T. Kent, "Domains of convergence for the EM algorithm: A cautionary tale in a location estimation problem" Statist. Comput. , 3 (1993) pp. 103–108
[a2] R.A. Boyles, "On the convergence of the EM algorithm" J.R. Statist. Soc. , B45 (1983) pp. 47–50
[a3] J.S.K. Chan, A.Y.C. Kuk, "Maximum likelihood estimation for probit-linear mixed models with correlated random effects" Biometrics , 53 (1997) pp. 86–97
[a4] A.P. Dempster, N.M. Laird, D.B. Rubin, "Maximum likelihood estimation from incomplete-data via the EM algorithm (with discussion)" J. R. Statist. Soc. , B39 (1977) pp. 1–38
[a5] J.A. Fessler, A.O. Hero, "Space-alternating generalized expectation-maximization algorithm" IEEE Trans. Signal Processing , 42 (1994) pp. 2664–2677
[a6] M. Jamshidian, R.I. Jennrich, "Conjugate gradient acceleration of the EM algorithm" J. Amer. Statist. Assoc. , 88 (1993) pp. 221–228
[a7] K. Lange, "A gradient algorithm locally equivalent to the EM algorithm" J. R. Statist. Soc. , B57 (1995) pp. 425–438
[a8] K. Lange, "A quasi-Newtonian acceleration of the EM algorithm" Statistica Sinica , 5 (1995) pp. 1–18
[a9] K. Lange, R.J.A. Little, J.M.G. Taylor, "Robust statistical modeling using the t-distribution" J. Amer. Statist. Assoc. , 84 (1989) pp. 881–896
[a10] C. Liu, D.B. Rubin, "The ECME algorithm: a simple extension of EM and ECM with fast monotone convergence" Biometrika , 81 (1994) pp. 633–648
[a11] C. Liu, D.B. Rubin, "ML estimation of the t-distribution using EM and its extensions, ECM and ECME" Statistica Sinica , 5 (1995) pp. 19–40
[a12] C. Liu, D.B. Rubin, Y.N. Wu, "Parameter expansion to accelerate EM: the PX-EM algorithm" Biometrika (1998) pp. 755–770
[a13] J.S. Liu, Y. Wu, "Parameter expansion for data augmentation" J. Amer. Statist. Assoc. , to appear (1999)
[a14] T.A. Louis, "Finding the observed information matrix when using the EM algorithm" J. R. Statist. Soc. , B44 (1982) pp. 226–233
[a15] C.E. McCulloch, "Maximum likelihood variance components estimation for binary data" J. Amer. Statist. Assoc. , 89 (1994) pp. 330–335
[a16] C.E. McCulloch, "Maximum likelihood algorithms for generalized linear mixed models" J. Amer. Statist. Assoc. , 92 (1997) pp. 162–170
[a17] X.L. Meng, "On the rate of convergence of the ECM algorithm" Ann. Statist. , 22 (1994) pp. 326–339
[a18] X.L. Meng, S. Pedlow, "EM: A bibliographic review with missing articles" , Proc. Statist. Comput. Sect., Amer. Statist. Assoc., Washington, D.C. (1992) pp. 24–27
[a19] X.L. Meng, D.B. Rubin, "Using EM to obtain asymptotic variance-covariance matrices: the SEM algorithm" J. Amer. Statist. Assoc. , 86 (1991) pp. 899–909
[a20] X.L. Meng, D.B. Rubin, "Maximum likelihood estimation via the ECM algorithm: a general framework" Biometrika , 80 (1993) pp. 267–278
[a21] X.L. Meng, D.B. Rubin, "On the global and componentwise rates of convergence of the EM algorithm" Linear Algebra Appl. , 199 (1994) pp. 413–425
[a22] X.L. Meng, S. Schilling, "Fitting full-information item factor models and an empirical investigation of bridge sampling" J. Amer. Statist. Assoc. , 91 (1996) pp. 1254–1267
[a23] X.L. Meng, D.A. van Dyk, "The EM algorithm — an old folk song sung to a fast new tune (with discussion)" J. R. Statist. Soc. , B59 (1997) pp. 511–567
[a24] X.L. Meng, D.A. van Dyk, "Seeking efficient data augmentation schemes via conditional and marginal augmentation" Biometrika (1999) pp. 301–320
[a25] G.D. Murray, "Comments on the paper by A.P. Dempster, N.M. Laird and D.B. Rubin" J. R. Statist. Soc. , B39 (1977) pp. 27–28
[a26] D.B. Rubin, "Inference and missing data" Biometrika , 63 (1976) pp. 581–592
[a27] D.A. van Dyk, "Nesting EM Algorithms for computational efficiency" Statistica Sinica , to appear (2000)
[a28] D.A. van Dyk, X.L. Meng, "The art of data augmentation" Statist. Sci. , Submitted (1999)
[a29] D.A. van Dyk, X.L. Meng, D.B. Rubin, "Maximum likelihood estimation via the ECM algorithm: computing the asymptotic variance" Statistica Sinica , 5 (1995) pp. 55–75
[a30] Y. Vardi, D. Lee, "From image deblurring to optimal investments: maximum likelihood solutions for positive linear inverse problems (with discussion)" J. R. Statist. Soc. , B55 (1993) pp. 569–598
[a31] G.C.G. Wei, M.A. Tanner, "A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithm" J. Amer. Statist. Assoc. , 85 (1990) pp. 699–704
[a32] C.F.J. Wu, "On the convergence properties of the EM algorithm" Ann. Statist. , 11 (1983) pp. 95–103
How to Cite This Entry:
EM algorithm. Encyclopedia of Mathematics. URL: http://encyclopediaofmath.org/index.php?title=EM_algorithm&oldid=15013
This article was adapted from an original article by D.A. van Dyk and X.L. Meng (originator), which appeared in Encyclopedia of Mathematics - ISBN 1402006098. See original article