EM algorithm
expectationmaximization algorithm
An iterative optimization procedure [a4] for computing
\begin{equation} \tag{a1} \theta ^ { * } = \operatorname { arg } \operatorname { max } _ { \theta \in \Theta } \int f ( \theta , \phi ) d \phi, \end{equation}
where $f ( \theta , \phi )$ is a nonnegative real valued function which is integrable as a function of $\phi$ for each $\theta$. The EM algorithm was developed for statistical inference in problems with incomplete data or problems that can be formulated as such (e.g., with latentvariable modeling) and is a very popular method of computing maximum likelihood, restricted maximum likelihood, penalized maximum likelihood, and maximum posterior estimates. The name, EM algorithm, stems from this context, where "E" stands for the expectation (i.e., integration) step and "M" stands for the maximization step. In particular, the algorithm starts with an initial value $\theta ^ { ( 0 ) } \in \Theta$ and iterates until convergence the two following steps for $t = 0,1 , \ldots$:

The usefulness of the EM algorithm is apparent when both of these steps can be accomplished with minimal analytic and computation effort but either the direct maximization or the integration in (a1) is difficult. The attractive convergence properties of the algorithm along with its many generalizations and extensions will be discussed below. First, however, an illustration of the algorithm via several applications is given.
Applications.
Maximumlikelihood estimation with missing observations.
The standard notation and terminology for the EM algorithm stem from this important class of applications. In this case the function $f ( \theta , \phi )$ is the completedata likelihood function, which is written $L ( \theta  Y _ { \text{com} } )$, where $Y _ { \text{com} } = ( Y _ { \text{obs} } , Y _ { \text{mis} } )$ is the complete data which consists of the observed data, $Y _ { \text{obs}}$, and the missing data, $Y _ { \text{mis} }$, $\theta$ is a model parameter, and $\phi = Y _ { \text{mis} }$. One is interested in computing the value of $\theta$ that maximizes $L ( \theta  Y _ { \text{obs} } ) = \int L ( \theta  Y _ { \text{com} } ) d Y_{\text{mis}}$, the maximumlikelihood estimate. Here it is assumed that $Y _ { \text{mis} }$ is missing at random [a26]; more complicated missing data mechanisms can also be accounted for in the formulation of $L ( \theta  Y _ { \text{com} } )$. More generally, one can augment $Y _ { \text{obs}}$ to $Y _ { \operatorname { aug } }$, the augmented data, via a manytoone mapping, $Y _ { \text{obs} } = \mathcal{M} ( Y _ { \text{aug} } )$, with the understanding that the augmenteddata likelihood, $L ( \theta  Y _ { \text { aug } } )$, is linked to $L ( \theta  Y _ { \text{obs} } )$ via $L ( \theta  Y _ { \text{obs}} ) = \int _ { \mathcal{M} ( Y _ { \text { aug } } ) = Y _ { \text { obs } } } L ( \theta  Y _ { \text { aug } } ) d Y _ { \text { aug } }$.
The EM algorithm builds on the intuitive notion that:
i) if there were no missing data, maximumlikelihood estimation would be easy; and
ii) if the model parameters were known, the missing data could easily be imputed (i.e., predicted) by its conditional expectation. These two observations correspond to the Mstep and the Estep, respectively, with the proviso that not the missing data, but rather the completedata loglikelihood should be imputed by its conditional expectation. In particular, the Estep reduces to computing
\begin{equation*} Q ( \theta  \theta ^ { ( t ) } ) = \mathsf{E} \left[ \operatorname { log } L ( \theta  Y _ { \text{aug} } )  Y _ { \text{obs} } , \theta ^ { ( t ) } \right], \end{equation*}
the conditional expectation of the completedata loglikelihood. If the completedata model is from an exponential family, then $\operatorname { log } L ( \theta  Y _ { aug } )$ is linear in a set of completedata sufficient statistics. In this common case which includes multivariate normal, Poisson, multinomial, and exponential models (among others), computing $Q ( \theta  \theta ^ { ( t ) } )$ involves routine calculations. The Mstep then involves computing the maximumlikelihood estimates as if there were no missing data, by using the imputed completedata sufficient statistics from the Estep as inputs.
Linear inverse problems with positivity restrictions.
Consider the system of equations
\begin{equation*} g _ { j } = \sum _ { i = 1 } ^ { M } f _ { i } h _ {i j } , j = 1 , \ldots , N, \end{equation*}
where $g _ { j } > 0$, $h_{i j} \geq 0$, $\sum _ { j } h _ { ij } > 0$ are known, and one wishes to solve for $f _ { i } > 0$. Such problems are omnipresent in scientific and engineering applications and can be solved with the EM algorithm [a30]. First one notes that, without loss of generality, one may assume that $\sum _ { j } g _ { j } = \sum _ { i } f _ { i } = \sum _ { j } h _ { i j } = 1$, that is, $g = ( g _ { 1 } , \dots , g _ { N } )$, $f = ( f _ { 1 } , \ldots , f _ { M } )$ and $h _ { i } = ( h _ { i 1 } , \dots , h _ { i N } )$ for $i = 1 , \dots , M$ are discrete probability measures (see [a30] for details). Thus, $f$ can be viewed as the mixing weights in a finite mixture model, i.e., $g$. A finite mixture model is a hierarchical model in that one can suppose that an integer $z \in ( 1 , \dots , M )$ is chosen according to $f$, and then data are generated conditional on $z$ according to $h _ { z }$. The marginal distribution of the data generated is $g$. If one considers $z$ to be missing data, one can derive an EM algorithm of the form
\begin{equation*} f _ { i } ^{( t + 1 ) }= f _ { i }^{ ( t )} \sum _ { j } \left( \frac { h _ { i j } } { \sum _ { k } f _ { k }^{ ( t )} h _ { k j } ) } \right) g _ { j } , t = 1,2 ,\dots \end{equation*}
Y. Vardi and D. Lee [a30] demonstrate that this iteration converges to the value that minimizes the Kullback–Leibler information divergence between $g$ and $\sum _ { i } f _ { i } h _ { i }$ over all nonnegative $f$, which is the desired nonnegative solution if it exists.
Multivariate $t$model, a latent variable model.
The previous example illustrates an important principle in the application of the EM algorithm. Namely, one is often initially interested in optimizing a function $f ( \theta )$ and purposely embeds $f ( \theta )$ in a function, $f ( \theta , \phi )$, with a larger domain, such that $f ( \theta ) = \int f ( \theta , \phi ) d \phi$, in order to use the EM algorithm. In the previous example, this was accomplished by identifying $\phi = Y _ { \text{mis} }$ with $z$. Similar strategies are often fruitful in various latent variable models. Consider, for example, the multivariate $t$model, which is useful for robust estimation (cf. Robust statistics; see also, e.g., [a9], [a11]),
\begin{equation} \tag{a2} f ( y  \mu , \Sigma , \nu ) \propto \end{equation}
\begin{equation*} \propto \ \Sigma \ ^ {  1 / 2 } [ \nu + ( y  \mu ) ^ { T } \Sigma ^ {  1 } ( y  \mu ) ] ^ {  ( \nu + p ) / 2 }, \end{equation*}
where $y , \mu \in \mathbf{R} ^ { p }$, $\Sigma$ is a positivedefinite $( p \times p )$matrix, $\theta = ( \mu , \Sigma )$, and $\nu \in \mathbf{R} ^ { + }$ is the known degree of freedom. Given a random sample $\{ y _ { i } : i = 1 , \dots , n \} = Y _ { \operatorname{obs} }$, assumed to be from (a2), one wishes to find the maximizer of $L ( \mu , \Sigma  Y _ { \text{obs} } ) = \prod _ { i = 1 } ^ { n } f ( y _ { i }  \mu , \Sigma , \nu )$, which is known to have no general closedform solution. In order to apply the EM algorithm, one embeds $L ( \mu , \Sigma  Y _ { \operatorname{obs} } )$ into a larger model $L ( \mu , \Sigma  Y _ { \text{aug} } )$. This is accomplished via the wellknown representation
\begin{equation} \tag{a3} t = \mu + \frac { \Sigma ^ { 1 / 2 } Z } { \sqrt { q } }, \end{equation}
where $t$ follows the $t$distribution in (a2), $Z \sim N _ { p } ( 0 , I )$, and $q \sim X _ { \nu } ^ { 2 } / \nu$ is a mean chisquare variable with $\nu$ degrees of freedom independent of $Z$. In (a3), one sees that the distribution of $t$ conditional on $q$ is multivariate normal. Thus, if one had observed $q$, maximumlikelihood estimation would be easy. One therefore defines $Y _ { \text{aug} } = \{ ( y _ { i } , q _ { i } ) : i = 1 , \ldots , n \}$ and $L ( \mu , \Sigma  Y _ { \operatorname{aug} } ) = \prod _ { i = 1 } ^ { n } f ( y _ { i }  \mu , \Sigma , \nu , q _ { i } ) f ( q _ { i }  \nu )$, which is easy to maximize as a function of $\mu$ and $\Sigma$ due to the conditional normality of $t$. Thus, since $\operatorname { log } L ( \mu , \Sigma  Y _ {\text{ aug} } )$ is linear in $( q _ { 1 } , \dots , q _ { n } )$, the $( t + 1 )$st iteration of the EM algorithm has a simple Estep that computes
\begin{equation*} w _ { i } ^ { ( t + 1 ) } = \mathsf{E} ( q _ { i }  y _ { i } , \mu ^ { ( t ) } , \Sigma ^ { ( t ) } ) = \frac { \nu + p } { \nu + d _ { i } ^ { ( t ) } } , i = 1 , \dots , n, \end{equation*}
where $d _ { i } ^ { ( t ) } = ( y _ { i }  \mu ^ { ( t ) } ) ^ { T } [ \Sigma ^ { ( t ) } ] ^ {  1 } ( y _ { i }  \mu ^ { ( t ) } )$. The Mstep then maximizes $Q ( \theta  \theta ^ { ( t ) } )$ to obtain
\begin{equation*} \mu ^ { ( t + 1 ) } = \frac { \sum _ { i } w _ { i } ^ { ( t + 1 ) } y _ { i } } { \sum _ { i } w _ { i } ^ { ( t + 1 ) } }, \end{equation*}
and
\begin{equation} \tag{a4} \Sigma ^ { ( t + 1 ) } = \frac { 1 } { n } \sum _ { i } w _ { i } ^ { ( t + 1 ) } ( y _ { i }  \mu ^ { ( t + 1 ) } ) ( y _ { i }  \mu ^ { ( t + 1 ) } ) ^ { T }. \end{equation}
Given the weights $( w _ { i } ^ { ( t + 1 ) } , \ldots , w _ { n } ^ { ( t + 1 ) } )$, the Mstep corresponds to weighted least squares, thus this EM algorithm, which often goes by the name iteratively reweighted least squares (IRLS), is trivial to program and use, and, as will be seen in the next section, exhibits unusually stable convergence.
Properties of the EM algorithm.
The EM algorithm has enjoyed wide popularity in many scientific fields from the 1970s onwards (e.g., [a18], [a23]). This is primarily due to easy implementation and stable convergence. As the previous paragraph illustrates, the EM algorithm is often easy to program and use. Although the algorithm may take many iterations to converge relative to other optimization routines (e.g., Newton–Raphson), each iteration is often easy to program and quick to compute. Moreover, the EM algorithm is less sensitive to poor starting values, and can be easier to use with many parameters since the iterations necessarily remain in the parameter space and no second derivatives are required. Finally, the EM algorithm has the very important property that the objective function is increased at each iteration. That is, by the definition of $f ( \phi  \theta )$,
\begin{equation} \tag{a5} \operatorname { log } \int f ( \theta , \phi ) d \phi = \operatorname { log } f ( \theta , \phi )  \operatorname { log } f ( \phi  \theta ) = \end{equation}
\begin{equation*} = Q ( \theta  \theta ^ { ( t ) } )  \int \operatorname { log } f ( \phi  \theta ) f ( \phi  \theta ^ { ( t ) } ) d \phi, \end{equation*}
where the second equality follows by averaging over $\phi$ according to $f ( \phi  \theta ^ { ( t ) } )$. Since the first term of (a5) is maximized by $\theta ^ { ( t + 1 ) }$, and the second is minimized by $\theta ^ { ( t ) }$ (under the assumption that the support of $f ( \phi  \theta )$ does not depend on $\theta$), one obtains
\begin{equation*} \operatorname { log } \int f ( \theta ^ { ( t + 1 ) } , \phi ) d \phi \geq \operatorname { log } \int f ( \theta ^ { ( t ) } , \phi ) d \phi \end{equation*}
for $t = 0,1 , \ldots$. This property not only contributes to the stability of the algorithm, but also is very valuable for diagnosing implementation errors.
Although the EM algorithm is not guaranteed to converge to even a local mode (it can converge to a saddle point [a25] or even a local minimum [a1]), this can easily be avoided in practice by using several "overdispersed" starting values. (Details of convergence properties are developed in, a.o., [a4], [a32], [a2], [a21].) Running the EM algorithm with several starting values is also recommended because it can help one to find multiple local modes of $\int f ( \theta , \phi ) d \phi$, an important advantage for statistical analysis.
Extensions and enhancements.
The advantages of the EM algorithm are diminished when either the Estep or the Mstep are difficult to compute or the algorithm is very slow to converge. There are a number of extensions to the EM algorithm that can be successful in dealing with these difficulties.
For example, the ECM algorithm ([a20], [a17]) is useful when $Q ( \theta  \theta ^ { ( t ) } )$ is difficult to maximize as a function of $\theta$, but can be easily maximized when $\theta$ is constrained to one of several subspaces of the original parameter space. One assumes that the aggregation of these subspaces is spacefilling in the sense that the sequence of conditional maximizations searches the entire parameter space. Thus, ECM replaces the Mstep with a sequence of CMsteps (i.e., conditional maximizations) while maintaining the convergence properties of the EM algorithm, including monotone convergence.
The situation is somewhat more difficult when the Estep is difficult to compute, since numerical integration can be very expensive computationally. The MonteCarlo EM algorithm [a31] suggests using MonteCarlo integration in the Estep. In some cases, Markov chain MonteCarlo integration methods have been used successfully (e.g., [a15], [a16], [a22], [a3]). The nested EM algorithm [a27] offers a general strategy for efficient implementation of Markov chain Monte Carlo within the Estep of the EM algorithm.
Much emphasis in recent research is on speeding up the EM algorithm without sacrificing its stability or simplicity. A very fruitful method is through efficient data augmentation, that is, simple and "small" augmentation schemes, where "small" is measured by the socalled Fisher information — see [a23] and the accompanying discussion for details. Briefly, the smaller the augmentation, the better $Q ( \theta  \theta ^ { * } )$ approximates $\int f ( \theta , \phi ) d \phi$ and thus the faster the algorithm; here, $\theta ^ { * }$ is the limit of the EM sequence — note that $Q ( \theta  \theta ^ { * } )$ and $\int f ( \theta , \phi ) d \phi$ share the same stationary point(s) [a4]. Thus, the SAGE [a5], the ECME [a11], and the more general the AECM [a23] algorithms build on the ECM algorithm by allowing this approximation to vary from CMstep to CMstep, in order to obtain a better approximation for some CMsteps as long as this does not complicate the resulting CMstep. X.L. Meng and D.A. van Dyk [a23] also introduce a working parameter into $f ( \theta , \phi )$, that is, they define $f ( \theta , \phi , \alpha )$ such that
\begin{equation*} \int f ( \theta , \phi ) d \phi = \int f ( \theta , \phi , \alpha ) d \phi \end{equation*}
for all $\alpha$ in some class $A$. They then choose $\alpha$ in order to optimize the approximation in terms of the rate of convergence of the resulting algorithm while maintaining the simplicity and stability of the EM algorithm. This results in simple, stable, and fast algorithms in a number of statistical applications. For example, in the $t$model this procedure results in replacing the update for $\Sigma$ given in (a4) with
\begin{equation*} \Sigma ^ { ( t + 1 ) } = \frac { \sum _ { i } w _ { i } ^ { ( t + 1 ) } ( y _ { i }  \mu ^ { ( t + 1 ) } ) ( y _ { i }  \mu ^ { ( t + 1 ) } ) ^ { T } } { \sum _ { i } w _ { i } ^ { ( t + 1 ) } }. \end{equation*}
The rest of the algorithm remains the same and this simple change clearly maintains the simplicity of the algorithm but can result in a dramatic reduction in computational time compared to the standard IRLS algorithm.
The parameterexpanded EM algorithm or PXEM algorithm [a12] also works with a working parameter, but rather than conditioning on the optimal $\alpha$ it marginalizes out $\alpha$ by fitting it in each iteration. Detailed discussion and comparison of such conditional augmentation and marginal augmentation in the more general context of stochastic simulations (e.g., the Gibbs sampler) can be found in [a24], [a13], and [a28].
Other acceleration techniques have been developed by combining the EM algorithm with various numerical methods, such as Aitken acceleration (e.g., [a14]), Newton–Raphson [a7], quasiNewton methods (e.g., [a8]), and conjugategradient acceleration [a6]. These methods, however, typically sacrifice monotone convergence and therefore require extra programming and special monitoring.
A final pair of extensions are designed to compute
\begin{equation*} H \equiv  \frac { \partial ^ { 2 } } { \partial \theta . \partial \theta } \int f ( \theta , \phi ) d \phi  _ { \theta = \theta ^ { * } }, \end{equation*}
whose inverse gives a largesample variance of $\theta ^ { * }$ in statistical analysis. The supplemented EM [a19] and supplemented ECM [a29] algorithms combine the analytically computable
\begin{equation*} \frac { \partial ^ { 2 } } { \partial \theta _ { . } \partial \theta } Q ( \theta  \theta ^ { * } ) = \theta ^ { * } \end{equation*}
with numerical differentiation of the EM mapping in order to compute $H$ via a simple matrix identity.
References
[a1]  O. Arslan, P.D.L. Constable, J.T. Kent, "Domains of convergence for the EM algorithm: A cautionary tale in a location estimation problem" Statist. Comput. , 3 (1993) pp. 103–108 
[a2]  R.A. Boyles, "On the convergence of the EM algorithm" J.R. Statist. Soc. , B45 (1983) pp. 47–50 
[a3]  J.S.K. Chan, A.Y.C. Kuk, "Maximum likelihood estimation for probitlinear mixed models with correlated random effects" Biometrics , 53 (1997) pp. 86–97 
[a4]  A.P. Dempster, N.M. Laird, D.B. Rubin, "Maximum likelihood estimation from incompletedata via the EM algorithm (with discussion)" J. R. Statist. Soc. , B39 (1977) pp. 1–38 
[a5]  J.A. Fessler, A.O. Hero, "Spacealternating generalized expectationmaximization algorithm" IEEE Trans. Signal Processing , 42 (1994) pp. 2664–77 
[a6]  M. Jamshidian, R.I. Jennrich, "Conjugate gradient acceleration of the EM algorithm" J. Amer. Statist. Assoc. , 88 (1993) pp. 221–228 
[a7]  K. Lange, "A gradient algorithm locally equivalent to the EM algorithm" J. R. Statist. Soc. , B57 (1995) pp. 425–438 
[a8]  K. Lange, "A quasiNewtonian acceleration of the EM algorithm" Statistica Sinica , 5 (1995) pp. 1–18 
[a9]  K. Lange, R.J.A. Little, J.M.G. Taylor, "Robust statistical modeling using the tdistribution" J. Amer. Statist. Assoc. , 84 (1989) pp. 881–896 
[a10]  C. Liu, D.B. Rubin, "The ECME algorithm: a simple extension of EM and ECM with fast monotone convergence" Biometrika , 81 (1994) pp. 633–48 
[a11]  C. Liu, D.B. Rubin, "ML estimation of the tdistribution using EM and its extensions, ECM and ECME" Statistica Sinica , 5 (1995) pp. 19–40 
[a12]  C. Liu, D.B. Rubin, Y.N. Wu, "Parameter expansion to accelerate EM: the PXEM algorithm" Biometrika (1998) pp. 755–770 
[a13]  J.S. Liu, Y. Wu, "Parameter expansion for data augmentation" J. Amer. Statist. Assoc. , to appear (1999) 
[a14]  T.A. Louis, "Finding the observed information matrix when using the EM algorithm" J. R. Statist. Soc. , B44 (1982) pp. 226–233 
[a15]  C.E. McCulloch, "Maximum likelihood variance components estimation for binary data" J. Amer. Statist. Assoc. , 89 (1994) pp. 330–335 
[a16]  C.E. McCulloch, "Maximum likelihood algorithms for generalized linear mixed models" J. Amer. Statist. Assoc. , 92 (1997) pp. 162–170 
[a17]  X.L. Meng, "On the rate of convergence of the ECM algorithm" Ann. Math. Stat. , 22 (1994) pp. 326–339 
[a18]  X.L. Meng, S. Pedlow, "EM: A bibliographic review with missing articles" , Proc. Statist. Comput. Sect. 2427 Washington, D.C. , Amer. Statist. Assoc. (1992) 
[a19]  X.L. Meng, D.B. Rubin, "Using EM to obtain asymptotic variancecovariance matrices: the SEM algorithm" J. Amer. Statist. Assoc. , 86 (1991) pp. 899–909 
[a20]  X.L. Meng, D.B. Rubin, "Maximum likelihood estimation via the ECM algorithm: a general framework" Biometrika , 80 (1993) pp. 267–78 
[a21]  X.L. Meng, D.B. Rubin, "On the global and componentwise rates of convergence of the EM algorithm" LALG , 199 (1994) pp. 413–425 
[a22]  X.L. Meng, S. Schilling, "Fitting fullinformation item factor models and an empirical investigation of bridge sampling" J. Amer. Statist. Assoc. , 91 (1996) pp. 1254–1267 
[a23]  X.L. Meng, D.A. van Dyk, "The EM algorithm — an old folk song sung to a fast new tune (with discussion)" J. R. Statist. Soc. , B59 (1997) pp. 511–567 
[a24]  X.L. Meng, D.A. van Dyk, "Seeking efficient data augmentation schemes via conditional and marginal augmentation" Biometrika (1999) pp. 301–320 
[a25]  G.D. Murray, "Comments on the paper by A.P. Dempster, N.M. Laird and D.B. Rubin" J. R. Statist. Soc. , B39 (1977) pp. 27–28 
[a26]  D.B. Rubin, "Inference and missing data" Biometrika , 63 (1976) pp. 581–592 
[a27]  D.A. van Dyk, "Nesting EM Algorithms for computational efficiency" Statistica Sinica , to appear (2000) 
[a28]  D.A. van Dyk, X.L. Meng, "The art of data augmentation" Statist. Sci. , Submitted (1999) 
[a29]  D.A. van Dyk, X.L. Meng, D.B. Rubin, "Maximum likelihood estimation via the ECM algorithm: computing the asymptotic variance" Statistica Sinica , 5 (1995) pp. 55–75 
[a30]  Y. Vardi, D. Lee, "From image deblurring to optimal investments: maximum likelihood solutions for positive linear inverse problems (with discussion)" J. R. Statist. Soc. , B55 (1993) pp. 569–598 
[a31]  G.C.G. Wei, M.A. Tanner, "A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithm" J. Amer. Statist. Assoc. , 85 (1990) pp. 699–704 
[a32]  C.F.J. Wu, "On the convergence properties of the EM algorithm" Ann. Math. Stat. , 11 (1983) pp. 95–103 
EM algorithm. Encyclopedia of Mathematics. URL: http://encyclopediaofmath.org/index.php?title=EM_algorithm&oldid=50489