# Marginal Probability. Its Use in Bayesian Statistics as the Evidence of Models and Bayes Factors

This article, ''Marginal Probability: Its Use in Bayesian Statistics as the Evidence of Models and Bayes Factors'', was adapted from an original article by Luis Raúl Pericchi, which appeared in ''StatProb: The Encyclopedia Sponsored by Statistics and Probability Societies'' ([http://statprob.com/encyclopedia/MarginalProbabilityItsUseInBayesianStatisticsAsTheEvidenceOfModelsAndBayesFactors3.html StatProb Source]). The original article is copyrighted by the author(s); it has been donated to the Encyclopedia of Mathematics, and its further issues are under the Creative Commons Attribution Share-Alike License. All pages from StatProb are contained in the Category StatProb.

2010 Mathematics Subject Classification: Primary: 62F03 Secondary: 62F15 [MSN][ZBL]


Luis Raúl Pericchi, Department of Mathematics and Biostatistics and Bioinformatics Center, University of Puerto Rico, Río Piedras, San Juan, Puerto Rico.

**Keywords:** Bayes Factors, Evidence of Models, Intrinsic Bayes Factors, Intrinsic Priors, Posterior Model Probabilities

## Definition

Suppose that we have vectors of random variables $[\mathbf{v},\mathbf{w}]=[v_1,v_2,\ldots,v_I,w_1,\ldots,w_J]$ in $\Re^{(I+J)}$. Denote the **joint** density function by $f_{\mathbf{v},\mathbf{w}}$, which obeys $f_{\mathbf{v},\mathbf{w}}(v,w) \ge 0$ and $\int^{\infty}_{-\infty}\ldots\int^{\infty}_{-\infty} f_{\mathbf{v},\mathbf{w}}(v,w)\, dv_1\ldots dv_I\, dw_1\ldots dw_J=1$. Then the probability of the set $[A_v,B_w]$ is given by $P(A_v,B_w)=\int \ldots \int_{A_v,B_w} f_{\mathbf{v},\mathbf{w}}(v,w)\, \mathbf{dv}\, \mathbf{dw}$. The **marginal density** of $\mathbf{v}$ is then obtained by integrating out $\mathbf{w}$: $f_{\mathbf{v}}(v)=\int^{\infty}_{-\infty}\ldots \int^{\infty}_{-\infty}f_{\mathbf{v},\mathbf{w}}(v,w)\, dw_1\ldots dw_J$. The **marginal probability** of a set $A_v$ is then obtained as $P(A_v)=\int \ldots \int_{A_v} f_{\mathbf{v}}(v)\, dv$. We have assumed that the random variables are continuous; when they are discrete, the integrals are replaced by sums. We now present an important application of marginal densities: constructing the ''Evidence of the Model'' and marginal probabilities for measuring the ''Bayesian Probability of a Model''.

## Measuring the Evidence in Favor of a Model

In Statistics, a parametric model is denoted by $f(x_1,\ldots,x_n|\theta_1,\ldots,\theta_k)$, where $\mathbf{x}=(x_1,\ldots, x_n)$ is the vector of $n$ observations and $\boldsymbol{\theta}=(\theta_1,\ldots,\theta_k)$ is the vector of $k$ parameters. For instance, we may have $n$ normally distributed observations with parameter vector $(\theta_1,\theta_2)$, the location and scale respectively, denoted by $f_{Normal}(\mathbf{x}|\boldsymbol{\theta})=\prod_{i=1}^n \frac{1}{\sqrt{2 \pi} \theta_2} \exp(-\frac{1}{2 \theta^2_2} (x_i-\theta_1)^2)$. Assume now that there is reason to suspect that the location is zero. As a second example, it may be suspected that the sampling model, usually assumed to be Normal, is instead a Cauchy, $f_{Cauchy}(\mathbf{x}|\boldsymbol{\theta})=\prod_{i=1}^n \frac{1}{\pi \theta_2}\frac{1}{1+(\frac{x_i-\theta_1}{\theta_2})^{2}}$.
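The marginalization integrals above can be checked numerically in a toy case. The sketch below is illustrative only: it assumes a joint density of two independent standard normal variables, so the true marginal of $v$ is known in closed form, and integrates out $w$ with `scipy.integrate.quad`.

```python
import numpy as np
from scipy import integrate, stats

# Toy joint density f_{v,w}(v, w): two independent standard normals,
# so the marginal of v is itself a standard normal (known in closed form).
def joint(v, w):
    return stats.norm.pdf(v) * stats.norm.pdf(w)

# Marginal density f_v(v) = integral over w of f_{v,w}(v, w).
def marginal_v(v):
    val, _ = integrate.quad(lambda w: joint(v, w), -np.inf, np.inf)
    return val

# The numerically marginalized density matches the exact one.
print(abs(marginal_v(0.7) - stats.norm.pdf(0.7)) < 1e-8)  # True
```

The same pattern extends to marginal probabilities: integrating `marginal_v` over a set $A_v$ yields $P(A_v)$.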
The first problem is a ''hypothesis test'', denoted by $H_0: \theta_1=0 \mbox{ vs } H_1: \theta_1 \neq 0$, and the second problem is a ''model selection'' problem: $M_0: f_{Normal} \mbox{ vs } M_1: f_{Cauchy}$. How do we measure the evidence in favor of $H_0$ or $M_0$? Instead of maximizing likelihoods, as is done in traditional significance testing, in Bayesian statistics the central concept is the evidence or marginal probability density $m_j(\mathbf{x})=\int f_j(\mathbf{x}|\boldsymbol{\theta}_j)\, \pi(\boldsymbol{\theta}_j)\, d\boldsymbol{\theta}_j,$ where $j$ denotes model or hypothesis $j$ and $\pi(\boldsymbol{\theta}_j)$ denotes the prior for the parameters under model or hypothesis $j$. Marginal probabilities embody the likelihood of a model or hypothesis in great generality, and it can be claimed that they are the natural probabilistic quantity for comparing models.

## Marginal Probability of a Model

Once the marginal densities $m_j(\mathbf{x})$ of the models $j=1,\ldots,J$ have been calculated, and assuming prior model probabilities $P(M_j), j=1,\ldots, J$ with $\sum_{j=1}^J P(M_j)=1$, then, using Bayes' Theorem, the marginal probability of a model, $P(M_j|\mathbf{x})$, can be calculated as $P(M_j|\mathbf{x})=\frac{m_j(\mathbf{x}) \cdot P(M_j)}{\sum_{i=1}^J m_i(\mathbf{x}) \cdot P(M_i)}.$ We then have the following formula for any two models or hypotheses: $\frac{P(M_j|\mathbf{x})}{P(M_i|\mathbf{x})}= \frac{P(M_j)}{P(M_i)} \times \frac{m_j(\mathbf{x})}{m_i(\mathbf{x})},$ or in words: Posterior Odds equals Prior Odds times Bayes Factor, where the Bayes Factor of $M_j$ over $M_i$ is $B_{j,i}=\frac{m_j(\mathbf{x})}{m_i(\mathbf{x})},$ Jeffreys (1961). In contrast to p-values, whose interpretation depends heavily on the sample size $n$ and whose definition does not match the scientific question, posterior probabilities and Bayes Factors address the scientific question directly: "how probable is model or hypothesis $j$ as compared with model or hypothesis $i$?", and the interpretation is the same for any sample size, Berger and Pericchi (1996a, 2001).
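As a numerical illustration of the evidence $m_j(\mathbf{x})$ and the resulting Bayes factor, the sketch below tests $H_0:\theta_1=0$ against $H_1:\theta_1\neq 0$ for normal data with known scale. The simulated data, the known scale, and the normal prior under $H_1$ are illustrative assumptions of this sketch, not prescriptions from the article.

```python
import numpy as np
from scipy import integrate, stats

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=20)   # illustrative data, simulated under H0
sigma = 1.0                          # known scale, assumed for this sketch
tau = np.sqrt(2.0)                   # illustrative prior sd for theta under H1

def likelihood(theta):
    """f(x | theta) for the normal model with known scale sigma."""
    return np.prod(stats.norm.pdf(x, loc=theta, scale=sigma))

# Evidence of H0: no free parameter, so m_0(x) is the likelihood at theta = 0.
m0 = likelihood(0.0)

# Evidence of H1: marginalize the likelihood over the prior pi(theta).
m1, _ = integrate.quad(lambda t: likelihood(t) * stats.norm.pdf(t, 0.0, tau),
                       -10.0, 10.0)

B01 = m0 / m1                 # Bayes factor of H0 over H1
p0 = B01 / (B01 + 1.0)        # posterior P(H0 | x) under equal prior odds
print(round(B01, 3), round(p0, 3))
```

Unlike a p-value, `p0` answers the scientific question directly: it is the posterior probability of $H_0$ given the data and the stated priors.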
Bayes Factors and marginal posterior model probabilities have several advantages, such as large-sample consistency: as the sample size grows, the posterior model probability of the true sampling model tends to one. Furthermore, if the goal is to predict future observations $y_f$, it is **not** necessary to select one model as the predicting model, since we may predict by so-called Bayesian Model Averaging; under quadratic loss, the optimal predictor takes the form $E[Y_f|\mathbf{x}]= \sum_{j=1}^J E[Y_f|\mathbf{x}, M_j] \times P(M_j|\mathbf{x}),$ where $E[Y_f|\mathbf{x},M_j]$ is the expected value of a future observation under the model or hypothesis $M_j$.

## Intrinsic Priors for Model Selection and Hypothesis Testing

Having stated some of the advantages of the marginal probabilities of models, the question arises: how should the conditional priors $\pi(\boldsymbol{\theta}_j)$ be assigned? In the two examples above, which priors are sensible to use? The problem is **not** a simple one, since the usual Uniform priors cannot be used: with them the Bayes Factors are undetermined. To solve this problem with some generality, Berger and Pericchi (1996a,b) introduced the concepts of Intrinsic Bayes Factors and Intrinsic Priors. Start by splitting the sample into two sub-samples, $\mathbf{x}=[\mathbf{x}(l),\mathbf{x}(-l)]$, where the training sample $\mathbf{x}(l)$ is as small as possible such that for $j=1,\ldots,J: 0<m_j(\mathbf{x}(l))<\infty$. Thus, starting with an improper prior $\pi^N(\boldsymbol{\theta}_j)$ which does not integrate to one (for example the Uniform), and using the minimal training sample $\mathbf{x}(l)$, all the conditional prior densities $\pi(\boldsymbol{\theta}_j|\mathbf{x}(l))$ **become** proper. So we may form the Bayes Factor using the training sample $\mathbf{x}(l)$ as $B_{ji}(\mathbf{x}(l))=\frac{m_j(\mathbf{x}(-l)|\mathbf{x}(l))}{m_i(\mathbf{x}(-l)|\mathbf{x}(l))}.$ This, however, depends on the particular training sample $\mathbf{x}(l)$, so some sort of average of the Bayes Factors is necessary.
In Berger and Pericchi (1996a) it is shown that the average should be the arithmetic average. A theoretical prior is also found there that, as the sample size grows, approximates the procedure just described; it is called an Intrinsic Prior. In the examples above:

**Example 1**: in the normal case, assuming first that the variance is known, $\theta^2_2=\theta^2_{2,0}$, it turns out that the Intrinsic Prior is Normal, centered at the null hypothesis $\theta_1=0$ and with variance $2 \cdot \theta^2_{2,0}$. More generally, when the variance is unknown, $\pi^I(\theta_1|\theta_2)=\frac{1-\exp(-\theta_1^2/\theta_2^2)}{2 \sqrt{\pi}\cdot (\theta_1^2/\theta_2)}, \mbox{ and } \pi^I(\theta_2)=\frac{1}{\theta_2}.$ It turns out that $\pi^I(\theta_1|\theta_2)$ is a proper density, Berger and Pericchi (1996a,b), Pericchi (2005).

**Example 2**: in the Normal vs Cauchy example, it turns out that the improper prior $\pi^I(\theta_1,\theta_2)=1/\theta_2$ is the appropriate prior for comparing the models, Pericchi (2005). For other examples of Intrinsic Priors see, for instance, Berger and Pericchi (1996a, 1996b, 2001), Moreno, Bertolino and Racugno (1998), Pericchi (2005) and Casella and Moreno (2009), among others.
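The training-sample construction can be sketched numerically in the simplest setting. The code below is a sketch under illustrative assumptions: known scale $\theta_2=1$ and starting improper uniform prior $\pi^N(\theta_1)=1$, so that a minimal training sample is a single observation $x_l$ and the conditional prior $\pi(\theta_1|x_l)$ is a proper $N(x_l,1)$ density. It computes the partial Bayes factors $B_{01}(\mathbf{x}(l))$ for every minimal training sample and then takes their arithmetic average, in the spirit of the Intrinsic Bayes Factor.

```python
import numpy as np
from scipy import integrate, stats

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=10)   # illustrative data, known scale = 1

def likelihood(data, theta):
    """f(data | theta) for the normal model with known unit scale."""
    return np.prod(stats.norm.pdf(data, loc=theta, scale=1.0))

def partial_bayes_factor(l):
    """B_{01}(x(l)): evidence ratio from x(-l), using the prior
    pi(theta | x[l]) = N(x[l], 1) made proper by the training sample."""
    rest = np.delete(x, l)
    m0 = likelihood(rest, 0.0)
    m1, _ = integrate.quad(
        lambda t: likelihood(rest, t) * stats.norm.pdf(t, loc=x[l], scale=1.0),
        -10.0, 10.0)
    return m0 / m1

# Arithmetic average of the partial Bayes factors over all
# minimal training samples.
aibf_01 = np.mean([partial_bayes_factor(l) for l in range(len(x))])
print(round(aibf_01, 3))
```

Averaging removes the dependence on the particular training sample $\mathbf{x}(l)$; as the sample size grows, this procedure is approximated by a Bayes factor computed directly under the Intrinsic Prior.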