Dirichlet process
The Dirichlet process provides one means of placing a probability distribution on the space of distribution functions, as is done in Bayesian statistical analysis (cf. also Bayesian approach). The support of the Dirichlet process is large: For each distribution function there is a set of distributions nearby that receives positive probability. This contrasts with a typical probability distribution on the space of distribution functions where, for example, one might place a probability distribution on the mean and variance of a normal distribution. The support in this example would be contained in the collection of normal distributions. The large support of the Dirichlet process accounts for its use in non-parametric Bayesian analysis. General references are [a4], [a5].
The Dirichlet process is indexed by its parameter, a non-null, finite measure $ \alpha $. Formally, consider a space $ {\mathcal X} $ with a collection of Borel sets $ {\mathcal B} $ on $ {\mathcal X} $. The random probability distribution $ P $ has a Dirichlet process prior distribution with parameter $ \alpha $, denoted by $ {\mathcal D} _ \alpha $, if for every measurable partition $ \{ A _ {1} , \dots, A _ {m} \} $ of $ {\mathcal X} $ the random vector $ ( P ( A _ {1} ) , \dots, P ( A _ {m} ) ) $ has the Dirichlet distribution with parameter vector $ ( \alpha ( A _ {1} ) , \dots, \alpha ( A _ {m} ) ) $.
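For a fixed partition, the definition reduces to an ordinary finite-dimensional Dirichlet draw, which is straightforward to simulate. A minimal Python sketch (the partition, the total mass $ \alpha ( {\mathcal X} ) = 5 $, and all variable names are illustrative assumptions, not from the article):

```python
# For a fixed measurable partition {A_1, ..., A_m}, the vector
# (P(A_1), ..., P(A_m)) under a Dirichlet process prior is a
# finite-dimensional Dirichlet draw with parameters (alpha(A_1), ..., alpha(A_m)).
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical choice: X = [0, 1) split into m = 4 equal intervals, with
# alpha = 5 * Uniform[0, 1), so alpha(A_j) = 5/4 for each cell.
alpha_masses = np.array([1.25, 1.25, 1.25, 1.25])

# One realization of (P(A_1), ..., P(A_4)); the coordinates sum to 1.
p = rng.dirichlet(alpha_masses)
print(p, p.sum())
```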
When a prior distribution is put on the space of distribution functions over $ {\mathcal X} $, then for every measurable subset $ A $ of $ {\mathcal X} $ the quantity $ P ( A ) $ is a random variable. The normalized measure $ \alpha _ {0} = {\alpha / {\alpha ( {\mathcal X} ) } } $ is a probability measure on $ {\mathcal X} $. From the definition one sees that if $ P \sim {\mathcal D} _ \alpha $, then $ {\mathsf E} P ( A ) = \alpha _ {0} ( A ) $.
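The expectation formula is immediate from the definition applied to the two-set partition $ \{ A , {\mathcal X} \setminus A \} $: the marginal of the corresponding Dirichlet distribution gives $ P ( A ) \sim { \mathop{\rm Beta} } ( \alpha ( A ) , \alpha ( {\mathcal X} \setminus A ) ) $, so that

$$
{\mathsf E} P ( A ) = \frac{\alpha ( A ) }{\alpha ( A ) + \alpha ( {\mathcal X} \setminus A ) } = \frac{\alpha ( A ) }{\alpha ( {\mathcal X} ) } = \alpha _ {0} ( A ) .
$$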
An alternative representation of the Dirichlet process is given in [a6]: Let $ B _ {1} , B _ {2} , \dots $ be independent and identically distributed $ { \mathop{\rm Beta} } ( 1, \alpha ( {\mathcal X} ) ) $ random variables, and let $ V _ {1} , V _ {2} , \dots $ be a sequence of independent and identically distributed random variables with distribution $ \alpha _ {0} $, independent of the $ B _ {i} $. Define $ B _ {0} = 0 $ and $ P _ {i} = B _ {i} \prod _ {j = 0 } ^ {i - 1 } ( 1 - B _ {j} ) $. The random distribution $ \sum _ {i = 1 } ^ \infty P _ {i} \delta _ {V _ {i} } $ has the distribution $ {\mathcal D} _ \alpha $. Here, $ \delta _ {a} $ denotes the point mass at $ a $. This representation makes clear that the Dirichlet process assigns probability one to the set of discrete distributions, and emphasizes the role of the total mass of the measure $ \alpha $. For example, as $ \alpha ( {\mathcal X} ) \rightarrow \infty $, $ {\mathcal D} _ \alpha $ converges to the point mass at $ \alpha _ {0} $ (in the weak topology induced by $ {\mathcal B} $); and as $ \alpha ( {\mathcal X} ) \rightarrow 0 $, $ {\mathcal D} _ \alpha $ converges to the random distribution which is degenerate at a single point $ V $, whose location has distribution $ \alpha _ {0} $.
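This construction translates directly into a sampling algorithm once the infinite sum is truncated. A minimal Python sketch of the stick-breaking representation (the choice of $ \alpha _ {0} $ as standard normal, the total mass $ \alpha ( {\mathcal X} ) = 2 $, and the truncation level are illustrative assumptions):

```python
# Sethuraman's stick-breaking representation [a6] of a draw P ~ D_alpha,
# truncated after n_atoms sticks.
import numpy as np

rng = np.random.default_rng(1)

def stick_breaking_draw(total_mass, base_sampler, n_atoms=500):
    """Return atoms V_i and weights P_i of a truncated draw from D_alpha."""
    b = rng.beta(1.0, total_mass, size=n_atoms)          # B_i ~ Beta(1, alpha(X))
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - b)[:-1]))
    weights = b * remaining                              # P_i = B_i * prod_{j<i}(1 - B_j)
    atoms = base_sampler(n_atoms)                        # V_i iid from alpha_0
    return atoms, weights

atoms, weights = stick_breaking_draw(2.0, lambda n: rng.standard_normal(n))
print(weights.sum())   # close to 1; the shortfall is the truncated tail mass
```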
The Dirichlet process is conjugate, in that if $ P \sim {\mathcal D} _ \alpha $, and data points $ X _ {1} , \dots, X _ {n} $ drawn independently and identically from $ P $ are observed, then the conditional distribution of $ P $ given $ X _ {1} , \dots, X _ {n} $ is $ {\mathcal D} _ {\alpha + \sum _ {i = 1 } ^ {n} \delta _ {X _ {i} } } $. This conjugacy property is an extension of the conjugacy of the Dirichlet distribution for multinomial data. It ensures the existence of analytical results with a simple form for many problems. The combination of simplicity and usefulness has given the Dirichlet process its reputation as the standard non-parametric model for a probability distribution on the space of distribution functions.
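Because the posterior is again a Dirichlet process, the predictive distribution of a new observation is simply the normalized posterior parameter measure. A minimal sketch (the data values, the base measure $ \alpha _ {0} $ taken as standard normal, and the mass $ M = \alpha ( {\mathcal X} ) = 2 $ are illustrative assumptions):

```python
# Conjugate update: given x_1, ..., x_n iid from P, the posterior is
# D_{alpha + sum_i delta_{x_i}}. A posterior-predictive draw therefore
# picks a fresh point from alpha_0 with probability alpha(X) / (alpha(X) + n),
# and otherwise repeats one of the observed points, chosen uniformly.
import numpy as np

rng = np.random.default_rng(2)
M = 2.0
data = np.array([0.3, -1.2, 0.3, 2.5])           # hypothetical observations

def posterior_predictive_draw(data, total_mass, base_sampler):
    n = len(data)
    if rng.random() < total_mass / (total_mass + n):
        return base_sampler()                    # new draw from alpha_0
    return rng.choice(data)                      # repeat an old observation

print(posterior_predictive_draw(data, M, rng.standard_normal))
```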
An important extension of the class of Dirichlet processes is the class of mixtures of Dirichlet processes. A mixture of Dirichlet processes is a Dirichlet process in which the parameter measure is itself random. In applications, the parameter measure ranges over a finite-dimensional parametric family. Formally, one considers a parametric family of probability distributions $ \{ {\alpha _ {\theta,0 } } : {\theta \in \Theta } \} $. Suppose that for every $ \theta \in \Theta $, $ \alpha _ \theta ( {\mathcal X} ) $ is a positive constant, and let $ \alpha _ \theta = \alpha _ \theta ( {\mathcal X} ) \cdot \alpha _ {\theta,0 } $. If $ \nu $ is a probability distribution on $ \Theta $, and if, first, $ \theta $ is chosen from $ \nu $, and then $ P $ is chosen from $ {\mathcal D} _ {\alpha _ \theta } $, one says that the prior on $ P $ is a mixture of Dirichlet processes (with parameter $ ( \{ \alpha _ \theta \} _ {\theta \in \Theta } , \nu ) $). A reference for this is [a1]. Often, $ \alpha _ \theta ( {\mathcal X} ) \equiv M $, i.e., the constants $ \alpha _ \theta ( {\mathcal X} ) $ do not depend on $ \theta $. In this case, large values of $ M $ indicate that the prior on $ P $ is "concentrated around the parametric family $ \{ {\alpha _ {\theta,0 } } : {\theta \in \Theta } \} $" . More precisely, as $ M \rightarrow \infty $, the distribution of $ P $ converges to $ \int {\alpha _ {\theta,0 } } {\nu ( d \theta ) } $, the standard Bayesian model for the parametric family $ \{ {\alpha _ {\theta,0 } } : {\theta \in \Theta } \} $ in which $ \theta $ has prior $ \nu $.
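A draw from a mixture of Dirichlet processes is obtained by composing the two steps. A minimal sketch, assuming purely for illustration that $ \alpha _ {\theta, 0 } $ is the normal distribution with mean $ \theta $ and unit variance, that $ \nu $ is a normal distribution, and that $ \alpha _ \theta ( {\mathcal X} ) \equiv M = 2 $, with the stick-breaking sum truncated as above:

```python
# Mixture of Dirichlet processes: first theta ~ nu, then P ~ D_{alpha_theta}.
import numpy as np

rng = np.random.default_rng(3)
M, n_atoms = 2.0, 500

theta = rng.normal(0.0, 10.0)                 # theta ~ nu (illustrative choice)
b = rng.beta(1.0, M, size=n_atoms)            # stick-breaking proportions
weights = b * np.concatenate(([1.0], np.cumprod(1.0 - b)[:-1]))
atoms = rng.normal(theta, 1.0, size=n_atoms)  # V_i iid from alpha_{theta,0}
# (atoms, weights) represent one truncated draw of P given theta.
```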
The Dirichlet process has been used in many applications. A particularly interesting one is the Bayesian hierarchical model, which is the Bayesian version of the random effects model. A typical example is as follows. Suppose one is studying the success of a certain type of operation for patients from different hospitals. Suppose one has $ n _ {i} $ patients in hospital $ i $, $ i = 1 , \dots, I $. One might model the number of failures $ X _ {i} $ in hospital $ i $ as a binomial distribution, with success probability depending on the hospital, and one might wish to view the $ I $ binomial parameters as independent and identically distributed draws from a common distribution. The typical hierarchical model is then written as
$$ \tag{a1 } \textrm{ given } \theta _ {i} , X _ {i} \sim { \mathop{\rm Bin} } ( n _ {i} , \theta _ {i} ) , $$
$$ \theta _ {i} \sim { \mathop{\rm Beta} } ( a, b ) \textrm{ iid } , $$
$$ ( a, b ) \sim G ( \cdot, \cdot ) . $$
Here, the $ \theta _ {i} $ are unobserved, or latent, variables. If the distribution $ G $ were degenerate, then the $ \theta _ {i} $ would be independent, so that data from one hospital would not give any information on the success rate of any other hospital. On the other hand, when $ G $ is not degenerate, data coming from the other hospitals provide some information on the success rate of hospital $ i $.
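A forward simulation makes the structure of (a1) concrete. In the following sketch the hyperprior $ G $, which is left generic above, is replaced by an illustrative choice (independent Gamma distributions on $ a $ and $ b $), with hypothetical hospital sizes:

```python
# Forward simulation of the hierarchical model (a1).
import numpy as np

rng = np.random.default_rng(4)
I, n = 10, np.full(10, 50)             # 10 hospitals, 50 patients each

a, b = rng.gamma(2.0, 1.0, size=2)     # (a, b) ~ G  (illustrative choice of G)
theta = rng.beta(a, b, size=I)         # theta_i ~ Beta(a, b) iid
x = rng.binomial(n, theta)             # X_i ~ Bin(n_i, theta_i)
print(x)
```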
Consider now the problem of prediction of the number of successes for a new hospital, indexed $ I + 1 $. A disadvantage of the model (a1) is that if the $ \theta _ {i} $ are drawn independently from a distribution which is not a Beta distribution, then even as $ I \rightarrow \infty $, the predictive distribution of $ X _ {I + 1 } $ based on the (incorrect) model (a1) need not converge to the actual predictive distribution of $ X _ {I + 1 } $. An alternative model, using a mixture of Dirichlet processes as the prior, would be written as
$$ \tag{a2 } \textrm{ given } \theta _ {i} , X _ {i} \sim { \mathop{\rm Bin} } ( n _ {i} , \theta _ {i} ) , $$
$$ \theta _ {i} \sim P \textrm{ iid } , $$
$$ P \sim {\mathcal D} _ {M \cdot { \mathop{\rm Beta} } ( a,b ) } , $$
$$ ( a, b ) \sim G ( \cdot, \cdot ) . $$
The model (a2) does not have the defect suffered by (a1), because the support of the distribution on $ P $ is the set of all distributions concentrated in the interval $ [0,1] $.
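Forward simulation from (a2) combines the stick-breaking construction with the binomial sampling step; since $ P $ is almost surely discrete, ties among the $ \theta _ {i} $ occur with positive probability. A minimal sketch under the same illustrative assumptions as before (hypothetical $ G $, $ M = 2 $, truncated sticks):

```python
# Forward simulation of model (a2): P ~ D_{M * Beta(a, b)} via stick breaking.
import numpy as np

rng = np.random.default_rng(5)
I, n, M, n_atoms = 10, np.full(10, 50), 2.0, 500

a, b = rng.gamma(2.0, 1.0, size=2)                  # (a, b) ~ G (illustrative)
sticks = rng.beta(1.0, M, size=n_atoms)
weights = sticks * np.concatenate(([1.0], np.cumprod(1.0 - sticks)[:-1]))
atoms = rng.beta(a, b, size=n_atoms)                # atoms iid from Beta(a, b)
theta = rng.choice(atoms, size=I, p=weights / weights.sum())  # theta_i ~ P iid
x = rng.binomial(n, theta)                          # X_i ~ Bin(n_i, theta_i)
print(np.unique(theta).size, "distinct success probabilities among", I)
```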
It is not possible to obtain closed-form expressions for the posterior distributions in (a2). Computational schemes to obtain these have been developed by M. Escobar and M. West [a3] and C.A. Bush and S.N. MacEachern [a2].
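The following Python sketch shows one sweep of a marginal ("Pólya-urn") Gibbs sampler of the kind these schemes build on; it is a simplified illustration, not the algorithms of [a2] or [a3] themselves: here $ ( a, b ) $ and $ M $ are held fixed, and the cluster-remixing steps of the published algorithms are omitted.

```python
# One sweep of a collapsed Gibbs sampler for model (a2) with (a, b), M fixed.
# Each theta_i is resampled from its conditional given theta_{-i} and X_i:
# it joins an existing theta_j with weight Bin(x_i | n_i, theta_j), or takes
# a fresh value with weight M * m(x_i), where m is the beta-binomial marginal.
import numpy as np
from scipy.stats import binom
from scipy.special import betaln, comb

rng = np.random.default_rng(6)

def beta_binom_marginal(x, n, a, b):
    """m(x) = integral of Bin(x | n, theta) against the Beta(a, b) density."""
    return comb(n, x) * np.exp(betaln(a + x, b + n - x) - betaln(a, b))

def gibbs_sweep(theta, x, n, a, b, M):
    for i in range(len(theta)):
        others = np.delete(theta, i)
        w_old = binom.pmf(x[i], n[i], others)               # join existing value
        w_new = M * beta_binom_marginal(x[i], n[i], a, b)   # open a new cluster
        probs = np.append(w_old, w_new)                     # normalizer cancels
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(others):                                # fresh draw from the
            theta[i] = rng.beta(a + x[i], b + n[i] - x[i])  # posterior Beta
        else:
            theta[i] = others[k]
    return theta

# Hypothetical data: two apparent groups of failure counts.
n = np.full(10, 50)
x = np.array([2, 3, 1, 4, 2, 20, 22, 19, 3, 21])
theta = rng.beta(1.0, 1.0, size=10)        # initial values
for _ in range(100):
    theta = gibbs_sweep(theta, x, n, a=1.0, b=1.0, M=2.0)
print(np.round(theta, 3))                  # ties reveal the inferred clusters
```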
The parameter $ M $ plays an interesting role. When $ M $ is small, then, with high probability, the $ \theta _ {i} $ are all equal, so that, in effect, one is working with the model in which the $ X _ {i} $ are independent binomial samples with the same success probability. On the other hand, when $ M $ is large, the model (a2) is very close to (a1).
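This behaviour can be quantified. Given $ ( a, b ) $, under the Pólya-urn scheme implied by the conjugacy property, $ \theta _ {k + 1 } $ repeats the common value of $ \theta _ {1} , \dots, \theta _ {k} $ with probability $ {k / {( M + k ) } } $, so that

$$
{\mathsf P} ( \theta _ {1} = \dots = \theta _ {I} ) = \prod _ {k = 1 } ^ {I - 1 } \frac{k}{M + k } ,
$$

which tends to $ 1 $ as $ M \rightarrow 0 $ and to $ 0 $ as $ M \rightarrow \infty $.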
It is interesting to note that when $ M $ is large and the distribution $ G $ is degenerate, the measure on $ P $ is essentially degenerate, so that one is treating the data from the hospitals as independent. Thus, when the distribution $ G $ is degenerate, the parameter $ M $ determines the extent to which data from the other hospitals are used when making an inference about hospital $ i $, and in that sense plays the role of a tuning parameter in the bias-variance tradeoff of frequentist analysis.
References
[a1] C. Antoniak, "Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems", Ann. Statist., 2 (1974), pp. 1152–1174
[a2] C.A. Bush, S.N. MacEachern, "A semi-parametric Bayesian model for randomized block designs", Biometrika, 83 (1996), pp. 275–285
[a3] M. Escobar, M. West, "Bayesian density estimation and inference using mixtures", J. Amer. Statist. Assoc., 90 (1995), pp. 577–588
[a4] T.S. Ferguson, "A Bayesian analysis of some nonparametric problems", Ann. Statist., 1 (1973), pp. 209–230
[a5] T.S. Ferguson, "Prior distributions on spaces of probability measures", Ann. Statist., 2 (1974), pp. 615–629
[a6] J. Sethuraman, "A constructive definition of Dirichlet priors", Statistica Sinica, 4 (1994), pp. 639–650
Dirichlet process. Encyclopedia of Mathematics. URL: http://encyclopediaofmath.org/index.php?title=Dirichlet_process&oldid=13886