%%% Title of object: Mixture Models
%%% Canonical Name: MixtureModels
%%% Type: Definition
%%% Created on: 2010-09-21 08:54:04
%%% Modified on: 2010-09-22 06:52:26
%%% Creator: Mixwilli
%%% Modifier: jkimmel
%%% Author: Mixwilli
%%%
%%% Classification: msc:62E10, msc:62F10, msc:62G05, msc:62H30
%%% Keywords: Cluster Analysis, EM Algorithm, Likelihood Estimation, Bayesian Classification, Number of Components, Identifiability, Robustness, Spurious Estimates
%%% Preamble:
\documentclass[10pt]{article}
% this is the default PlanetMath preamble. as your knowledge
% of TeX increases, you will probably want to edit this, but
% it should be fine as is for beginners.
% almost certainly you want these
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{amsfonts}
% used for TeXing text within eps files
%\usepackage{psfrag}
% need this for including graphics (\includegraphics)
%\usepackage{graphicx}
% for neatly defining theorems and propositions
%\usepackage{amsthm}
% making logically defined graphics
%\usepackage{xypic}
% there are many more packages, add them here as you need them
% define commands here
%%% Content:
\begin{document}
\author{Wilfried Seidel \\ Helmut-Schmidt-Universit\"{a}t, D-22039
Hamburg, Germany}
%\date{}
\title{Mixture Models}
%
\maketitle
%
\section{Introduction}
%
Mixture distributions are convex combinations of ``component''
distributions. In statistics, these are standard tools for modelling
heterogeneity in the
sense that different elements of a sample may belong to different
components. However, they may also be used
simply as flexible instruments for achieving a good fit to data
when standard distributions fail. As good software for fitting
mixtures is available, these play an increasingly
important role in nearly every field of statistics.
It is convenient to explain finite mixtures (i.e. finite convex
combinations) as theoretical models for cluster analysis, but of
course the range of
applicability is not at all restricted to the
clustering context. Suppose that a feature vector $X$ is observed in a
heterogeneous population, which consists of $k$ homogeneous subpopulations,
the ``components''. It is assumed that for $i = 1, \dots, k$, $X$ is
distributed in the $i$-th component according to a
(discrete or continuous) density $f(x, \, \theta_{i})$ (the ``component
density''), and all component densities belong to a common parametric
family $\{f(x, \theta), \; \theta \in \Theta\}$, the ``component model''.
The relative proportion of the $i$-th component in the whole population
is $p_{i}$, $p_{1} + \dots + p_{k} = 1$. Now suppose that an item is
drawn randomly from the population. Then it belongs to the
$i$-th component with probability $p_{i}$, and the conditional probability
that $X$ falls in some set $A$ is $\mbox{Pr}\,(X \in A \;| \; \theta_{i})$,
calculated
from the density $f(x, \, \theta_{i})$. Consequently, the marginal probability
is
$$\mbox{Pr}\,(X \in A \; | \; P) = p_{1} \,
\mbox{Pr}\,(X \in A \; | \; \theta_{1}) +
\dots + p_{k} \, \mbox{Pr}\,(X \in A \; | \; \theta_{k})$$
with density
%
\begin{equation}
\label{mixture density}
f(x, \, P) = p_{1} f(x, \, \theta_{1}) + \dots + p_{k} f(x, \, \theta_{k}),
\end{equation}
%
a ``simple finite mixture'' with parameter $P = ((p_{1}, \dots, p_{k}),
(\theta_{1}, \dots , \theta_{k}))$.
The components $p_{i}$ of $P$ are called ``mixing weights'', the
$\theta_{i}$ ``component
parameters''. For fixed $k$, let ${\cal P}_{k}$ be the set of all
vectors $P$ of this type, with $\theta_{i} \in \Theta$ and nonnegative
mixing weights summing up to one. Then ${\cal P}_{k}$
parameterizes all mixtures with not more than $k$ components.
If all mixing weights are positive and component densities are different,
then $k$ is the exact number of components.
% (dimension kd + k - 1, if d = dimension of \Theta)
The set of all simple finite mixtures is parameterized by
${\cal P}_{\mbox{fin}}$, the union of all ${\cal P}_{k}$.
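For concreteness, the density (\ref{mixture density}) can be evaluated directly once a component model is chosen. The following sketch takes univariate normal components; all function names are illustrative, not part of any standard library.

```python
import math

def normal_pdf(x, mu, sigma):
    """Component density f(x, theta) with theta = (mu, sigma): N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, weights, thetas):
    """Simple finite mixture density f(x, P) = p_1 f(x, theta_1) + ... + p_k f(x, theta_k)."""
    return sum(p * normal_pdf(x, mu, sigma)
               for p, (mu, sigma) in zip(weights, thetas))

# Two-component example: P = ((0.3, 0.7), ((-1, 1), (2, 0.5))).
weights = [0.3, 0.7]
thetas = [(-1.0, 1.0), (2.0, 0.5)]
value_at_zero = mixture_pdf(0.0, weights, thetas)  # f(0, P)
```

Being a convex combination of densities, $f(x, \, P)$ is again a density, which a numerical integration over a fine grid readily confirms.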
This model can be extended in various ways. For example, all component
densities may contain additional common parameters (variance parameters,
say), they may depend on covariables (mixtures of regression models),
and also the mixing weights may depend on covariables. Mixtures of
time series models are also considered.
Here I shall concentrate on simple mixtures, as all relevant concepts
can be explained very easily in this setting. These need not be finite
convex combinations; there is an alternative and more general definition
of simple mixtures: Observe that the parameter $P$ can
be considered as a discrete probability distribution on $\Theta$ which
assigns probability mass $p_{i}$ to the parameter $\theta_{i}$. Then
equation (\ref{mixture density})
is an integral with respect to this distribution, and if $\xi$ is
an arbitrary probability distribution on $\Theta$, a mixture can be defined by
%
\begin{equation}
\label{general mixture}
f(x, \, \xi) = \int_{\Theta} f(x, \, \theta) \, d \xi(\theta) \; .
\end{equation}
%
It can be considered as the distribution of a two-stage experiment:
First, choose a parameter $\theta$ according to the distribution $\xi$,
then choose $x$ according to $f(x, \, \theta)$. Here, $\xi$ is called a
``mixing distribution'', and mixture models of this type can be
parameterized over every set $\Xi$ of probability
distributions on $\Theta$.
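The two-stage interpretation translates directly into a sampling scheme. Below is a sketch for a discrete mixing distribution with normal components; the function name and interface are illustrative.

```python
import random

def sample_mixture(weights, thetas, n, seed=0):
    """Two-stage experiment: first draw a component parameter theta_j
    according to the mixing distribution, then draw x from f(x, theta_j)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        # Stage 1: choose component j with probability p_j (inverse CDF on weights).
        u, acc, j = rng.random(), 0.0, 0
        for idx, p in enumerate(weights):
            acc += p
            if u <= acc:
                j = idx
                break
        # Stage 2: draw x from the chosen (here: normal) component density.
        mu, sigma = thetas[j]
        draws.append(rng.gauss(mu, sigma))
    return draws

xs = sample_mixture([0.5, 0.5], [(-2.0, 1.0), (2.0, 1.0)], 1000)
```

With equal weights and symmetric components the sample mean is close to zero, even though neither component is centered there.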
In statistical applications of mixture models, a nontrivial key
issue is identifiability, meaning that different parameters describe
different mixtures. In a trivial sense, models parameterized over
vectors $P$ are never identifiable: All vectors that correspond to
the same probability distribution on $\Theta$ describe the same
mixture model. For example, any permutation of the sequence of
components leaves the mixing distribution unchanged, or components
may be added with zero mixing weights. Therefore identifiability
can only mean that parameters that correspond to different mixing
distributions describe different mixture models. However, also in
this sense identifiability is often violated. For example, the mixture
of two uniform distributions with supports $[0, \, 0.5]$ and $[0.5, \, 1]$
and equal mixing weights is the uniform distribution with support
$[0, \, 1]$. On the other hand, finite mixtures
of many standard families (normal, Poisson, ...) are identifiable,
see for example Titterington \cite{Titterington}. Identifiability of
mixtures of regression models has been treated among others by
Hennig \cite{Hennig 1}. A standard general reference for finite
mixture models is McLachlan and Peel \cite{McLachlan}.
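The uniform example above can be checked numerically: away from the boundary point $0.5$, the equal-weight mixture density coincides with the density of the uniform distribution on $[0, \, 1]$, so the mixing distribution cannot be recovered from the mixture. A small sketch:

```python
def uniform_pdf(x, a, b):
    """Density of the uniform distribution on [a, b]."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def two_uniform_mixture(x):
    """Equal-weight mixture of U[0, 0.5] and U[0.5, 1]."""
    return 0.5 * uniform_pdf(x, 0.0, 0.5) + 0.5 * uniform_pdf(x, 0.5, 1.0)

# Except at the single point 0.5, the mixture density equals the
# U[0, 1] density, illustrating the lack of identifiability.
for x in [0.1, 0.3, 0.7, 0.9]:
    assert two_uniform_mixture(x) == uniform_pdf(x, 0.0, 1.0) == 1.0
```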
\section{Statistical Problems}
Consider a mixture model with parameter $\eta$ (vector or probability
measure). In the simplest case, one has i.i.d. data $x_{1}, \dots, x_{n}$
from $f(x, \, \eta)$,
from which one wants to gain information about $\eta$. Typical questions
are estimation of (parameters of) $\eta$, or mixture diagnostics:
Is there strong evidence for a mixture (in contrast to homogeneity in
the sense that $\eta$ is concentrated at some single parameter $\theta$)?
What is the (minimum) number of mixture components?
A variety of techniques has been developed. The data provide at least
implicitly an estimate of the mixture, and equations (\ref{mixture density})
and (\ref{general mixture}) show that mixture and mixing distribution are
related by a linear (integral) equation. Approximate solution techniques
have been applied for obtaining estimators, and moment estimators have been
developed on the basis of this structure. Distance estimators exhibit nice
properties. Traditionally, mixture diagnostics
has been handled by graphical methods. More recent approaches for estimation
and diagnostics are based on Bayesian or likelihood techniques;
likelihood methods will be addressed below. Although Bayesian
methods have some advantages over likelihood methods, they are
not straightforward: usually no ``natural'' conjugate priors are
available, so posteriors are simulated using MCMC; the choice of
``noninformative'' priors is not obvious, as improper priors usually
lead to improper posteriors; and nonidentifiability of ${\cal P}_{k}$
causes the problem of ``label switching''. A nice reference for
Bayesian methods is
Fr\"{u}hwirth-Schnatter \cite{Fruehwirth}.
Let me close this section with a short discussion of robustness. Robustness
with respect to outliers is treated by Hennig \cite{Hennig 2}.
Another problem is that mixture models are extremely nonrobust with
respect to misspecification of the component model. Estimating the
component model in a fully nonparametric way is of course not possible,
but manageable alternatives are for example mixtures of log-concave
distributions. Let me point out, however, that issues like
nonrobustness and nonidentifiability only cause problems if the
task is to interpret the model parameters somehow. If the aim
is only to obtain a better data fit, one need not worry about them.
\section{Likelihood Methods}
In the above setting, $l(\eta) = \log (f(x_{1}, \, \eta)) + \dots +
\log (f(x_{n}, \, \eta))$ is the log likelihood function. It may
have some undesirable properties: First, the
log likelihood is often unbounded. For example, consider
mixtures of normals. If the expectation of one component is fixed
at some data point and the variance goes to zero, the likelihood
goes to infinity. Singularities usually occur at the boundary of the
parameter space. Second, the likelihood function is usually not
unimodal, although this depends on the parameterization. For
example, if the parameter is a probability distribution as
in equation~\ref{general mixture} and if the parameter space
$\Xi$ is a convex set (with respect to the usual linear
combination of measures), the log likelihood function is concave.
If it is bounded, there is a nice theory of ``nonparametric
likelihood estimation'' (Lindsay \cite{Lindsay}), and ``the''
nonparametric maximum likelihood estimator
is in some sense uniquely defined and can be calculated numerically
(B\"{o}hning~\cite{Boehning}, Schlattmann \cite{Schlattmann}).
Nonparametric methods, however, work only for low-dimensional
component models, whereas ``parametric'' estimation techniques like
the Expectation-Maximization (EM) method work for nearly any dimension.
The latter is a local
maximizer for mixture likelihoods in ${\cal P}_{k}$. Here the mixture
likelihood is
usually multimodal; moreover, it can be very flat. Analytic expressions
for likelihood maxima usually do not exist, they have to be calculated
numerically. On the other hand, even for unbounded likelihoods, it is
known from asymptotic theory that the simple heuristic of searching
for a large local maximum in the interior of the
parameter space may lead to reasonable estimators. However, one must be
aware that there exist ``spurious'' large local maxima that are statistically
meaningless. Moreover, except in simple cases, there is no manageable
asymptotics for the likelihood ratio.
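The unboundedness described above is easy to reproduce numerically for normal mixtures: fix one component mean at a data point and let its standard deviation shrink. A sketch with hypothetical names:

```python
import math

def log_lik(data, weights, thetas):
    """Log likelihood l(P) = sum_i log f(x_i, P) for a normal mixture."""
    ll = 0.0
    for x in data:
        fx = sum(p * math.exp(-0.5 * ((x - mu) / s) ** 2)
                 / (s * math.sqrt(2.0 * math.pi))
                 for p, (mu, s) in zip(weights, thetas))
        ll += math.log(fx)
    return ll

data = [0.0, 1.0, 2.0, 3.0]
# Fix the first component's mean at the data point x_1 = 0; as its
# standard deviation tends to zero, l(P) diverges to +infinity.
lls = [log_lik(data, [0.5, 0.5], [(0.0, s), (1.5, 1.0)])
       for s in (0.1, 0.01, 0.001)]
```

Each tenfold shrinkage of the standard deviation raises the log likelihood by roughly $\log 10$, with no upper bound.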
Some of the problems of pure likelihood approaches can be overcome by
considering penalized likelihoods. However, here one has the problem
of choosing a penalization parameter.
Moreover, the EM algorithm is a basic tool for a number of estimation
problems, and it has a very simple structure for simple finite
mixtures. Therefore it will be outlined in the next section.
\section{EM Algorithm}
The EM algorithm is a local maximization technique for the
log likelihood in ${\cal P}_{k}$. It starts from the complete-data
log-likelihood. Suppose that for observation $x_{i}$ the
(unobserved) component membership is known. It is encoded by a vector
$z_{i} \in \mathbb{R}^{k}$ with
$z_{ij} = 1$ if $x_{i}$ belongs to the $j$-th component, and zero otherwise.
As a random variable,
$Z_{i}$ has a multinomial distribution with one trial and cell
probabilities $p_{1}, \dots, p_{k}$. Then
the complete data likelihood and log likelihood of $P$, respectively, are
$L_{c}(P) = \prod_{i=1}^{n} \prod_{j=1}^{k} (p_{j} \,
f(x_{i}, \, \theta_{j}))^{z_{ij}}$
and
$l_{c}(P) = \log(L_{c}(P)) = \sum_{i=1}^{n} \sum_{j=1}^{k} z_{ij} \, \log p_{j} +
\sum_{i=1}^{n} \sum_{j=1}^{k} z_{ij} \, \log f(x_{i}, \, \theta_{j})$.
The EM needs a starting value $P_{0}$, and then proceeds as an iteration
between an ``E-step'' and an ``M-step'' until ``convergence''. The first
E-step consists in calculating the conditional expectation
$E_{P_{0}}(l_{c}(P) \, | \, x_{1}, \dots, x_{n})$
of $l_{c}(P)$ for arbitrary $P$, given the data, under $P_{0}$. However,
as the only randomness is in
the $z_{ij}$, we obtain
$$E_{P_{0}}(l_{c}(P) \, | \, x_{1}, \dots, x_{n}) =
\sum_{i=1}^{n} \sum_{j=1}^{k} \tau_{j}(x_{i}|P_{0}) \log p_{j} +
\sum_{i=1}^{n} \sum_{j=1}^{k} \tau_{j}(x_{i}|P_{0})
\log f(x_{i}, \, \theta_{j}) \, , $$
where
$$\tau_{j}(x_{i}|P_{0}) =
\mbox{Pr}_{P_{0}}(Z_{ij}=1 \, | \, x_{i}) =
\frac{p_{j} f(x_{i}, \, \theta_{j})}{f(x_{i}, \, P_{0})}$$
is the conditional probability that the $i$-th observation belongs to
component~$j$, given the data, with respect to $P_{0}$.
In the following M-step, $E_{P_{0}}(l_{c}(P) \, | \, x_{1}, \dots, x_{n})$
is maximized with respect to $P$.
As it is the sum of two terms, one depending only on the mixing weights
and one only on the component parameters,
both parts can be maximized separately. It is easily
shown that the
maximum in the $p_{j}$ is attained at
$p_{j}^{(1)} = (1/n) \sum_{i=1}^{n}\tau_{j}(x_{i}|P_{0})$, $j = 1, \dots, k$.
For component densities from exponential families, similar simple
solutions exist for the $\theta_{j}$,
therefore both the E-step and the M-step can be carried out here
analytically. It can be shown that (i) the log likelihood is not
decreasing during the iteration
of the EM, and (ii) that under some regularity conditions it converges to a
stationary point of the likelihood function. However, this may also be a
saddle point.
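The two steps can be sketched for normal component densities, where the M-step also has a closed form. This is a minimal illustration, not production code; the variance floor is an ad-hoc guard against the singularities discussed above.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def em_normal_mixture(data, weights, thetas, n_iter=200):
    """EM for a k-component normal mixture, starting from P_0 = (weights, thetas)."""
    k, n = len(weights), len(data)
    for _ in range(n_iter):
        # E-step: responsibilities tau_j(x_i | P) = p_j f(x_i, theta_j) / f(x_i, P).
        tau = []
        for x in data:
            comp = [p * normal_pdf(x, mu, s)
                    for p, (mu, s) in zip(weights, thetas)]
            total = sum(comp)
            tau.append([c / total for c in comp])
        # M-step: closed-form updates (normal densities form an exponential family).
        weights = [sum(tau[i][j] for i in range(n)) / n for j in range(k)]
        new_thetas = []
        for j in range(k):
            wsum = sum(tau[i][j] for i in range(n))
            mu = sum(tau[i][j] * data[i] for i in range(n)) / wsum
            var = sum(tau[i][j] * (data[i] - mu) ** 2 for i in range(n)) / wsum
            new_thetas.append((mu, max(math.sqrt(var), 1e-6)))  # floor guards singularities
        thetas = new_thetas
    return weights, thetas

# Well-separated toy data; the starting value P_0 must be supplied by the user.
data = [-2.2, -2.0, -1.9, -1.8, 2.0, 2.1, 2.2, 2.4]
w, th = em_normal_mixture(data, [0.5, 0.5], [(-1.0, 1.0), (1.0, 1.0)])
```

On such well-separated data the iteration settles quickly near the cluster means with roughly equal mixing weights; on harder data the choice of $P_{0}$ matters greatly, as discussed next.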
It remains to define the stopping rule and the starting point(s).
Both are crucial, and the reader is referred to the literature.
There are also techniques that prevent convergence to
singularities or spurious maxima. A final nice feature of the
EM is that it yields a simple tool for classification of data
points: If $\hat{P}$ is an estimator, then
$\tau_{j}(x_{i}|\hat{P})$ is the posterior probability that $x_{i}$
belongs to class $j$ with respect to the ``prior'' $\hat{P}$. The Bayesian
classification rule assigns observation $i$ to the class $j$ that
maximizes $\tau_{j}(x_{i}|\hat{P})$, and the $\tau_{j}(x_{i}|\hat{P})$
measure the plausibility of such a clustering.
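In code, the classification rule reads as follows; the sketch again assumes normal components, and the values standing in for $\hat{P}$ are purely illustrative, not the result of an actual fit.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def classify(x, weights, thetas):
    """Assign x to the class j maximizing the posterior tau_j(x | P_hat)."""
    joint = [p * normal_pdf(x, mu, s) for p, (mu, s) in zip(weights, thetas)]
    total = sum(joint)
    tau = [c / total for c in joint]  # posterior class probabilities
    return max(range(len(tau)), key=lambda j: tau[j]), tau

# Stand-in for a fitted parameter P_hat (illustrative values).
weights, thetas = [0.5, 0.5], [(-2.0, 1.0), (2.0, 1.0)]
label, tau = classify(-1.5, weights, thetas)
```

The vector `tau` reports how plausible the assignment is: a value near one indicates a clear-cut classification, values near $1/k$ an ambiguous one.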
Let me finally remark that the Bayesian analysis of a mixture model via the
Gibbs sampler takes advantage of exactly the same structure by simulating
both the parameters and the missing data (see, e.g., Fr\"{u}hwirth-Schnatter,
\cite{Fruehwirth}).
\section{Number of Components, Testing and Asymptotics}
Even if one has an estimator in each ${\cal P}_{k}$ from the EM,
the question is how to assess the number of components (i.e. how
to choose $k$). Usually information criteria like AIC
and BIC are recommended. An alternative is to perform a sequence of tests
of $k$ against $k+1$ components, for $k = 1, 2, \dots$.
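Both criteria are simple functions of the maximized log likelihood. In the sketch below, the parameter count assumes a univariate normal component model, and the log likelihood values are purely illustrative placeholders for the output of an EM fit.

```python
import math

def aic(loglik, n_params):
    """Akaike information criterion: smaller is better."""
    return -2.0 * loglik + 2.0 * n_params

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: smaller is better."""
    return -2.0 * loglik + n_params * math.log(n_obs)

def n_params_normal_mixture(k):
    # A k-component univariate normal mixture in P_k has k - 1 free mixing
    # weights plus 2k component parameters (mu_j, sigma_j): 3k - 1 in total.
    return 3 * k - 1

# Hypothetical maximized log likelihoods for k = 1, 2, 3 on n = 100 points.
logliks = {1: -250.0, 2: -220.0, 3: -219.0}
best_k = min(logliks,
             key=lambda k: bic(logliks[k], n_params_normal_mixture(k), 100))
```

Here the large likelihood gain from $k=1$ to $k=2$ outweighs the BIC penalty, while the tiny gain from $k=2$ to $k=3$ does not, so the criterion selects $k=2$.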
There are several tests for homogeneity, i.e.\ for the ``component
model'', for example goodness-of-fit or dispersion
score tests. For testing $k_{0}$ against $k_{1}$ components, a likelihood
ratio test may be performed. However, the usual $\chi^{2}$-asymptotics
fails, so critical values have to be simulated. Moreover, the distribution
of the test statistic
usually depends on the specific parameter under the null hypothesis.
Therefore some sort of bootstrap is needed, and as
estimators have to be calculated numerically,
likelihood ratio tests are computationally intensive.
Let me close with some remarks on asymptotics. Whereas asymptotic
normality of estimators is guaranteed
under some conditions, the usual asymptotics for the likelihood ratio
test fails. The reason is that under the null hypothesis, the
parameter $P_{0}$ lies on the boundary of the parameter space,
it is not identifiable, and the Fisher information matrix at $P_{0}$
is singular. There is an asymptotic theory under certain restrictive
assumptions, but it is usually hard to calculate critical values from it.
%
\begin{thebibliography}{99}
\bibitem{Boehning} B\"{o}hning, D. (2000), {\em Finite Mixture
Models}, Chapman \& Hall, Boca Raton.
\bibitem{Fruehwirth} Fr\"{u}hwirth-Schnatter, S. (2006),
{\em Finite Mixture and Markov Switching Models}, Springer, New York.
\bibitem{Hennig 1} Hennig, C. (2000), {\em Identifiability of Models
for Clusterwise Linear Regression}, Journal of Classification 17,
273-296.
\bibitem{Hennig 2} Hennig, C. (2004), {\em Breakdown points for
ML estimators of location-scale mixtures}, Annals of Statistics
32, 1313-1340.
\bibitem{Lindsay} Lindsay, B. G. (1995), {\em Mixture Models:
Theory, Geometry and Applications}, NSF-CBMS Regional
Conference Series in Probability and Statistics, Volume 5.
\bibitem{McLachlan} McLachlan, G. J., Peel, D. (2000),
{\em Finite Mixture Models}, Wiley, New York.
\bibitem{Schlattmann} Schlattmann, P. (2009), {\em Medical
Applications of Finite Mixture Models}, Springer, Berlin,
Heidelberg.
\bibitem{Titterington} Titterington, D. M., Smith, A. F. M.,
Makov, U. E. (1985), {\em Statistical Analysis of Finite Mixture
Distributions}, Wiley, New York.
\end{thebibliography}
%
\end{document}