11 Bayesian model comparison

11.1 Marginal likelihood as model likelihood

11.1.1 Simple and composite models

In the introduction of the Bayesian learning we already encountered the marginal likelihood $p(D | M)$ of a model class $M$ in the denominator of Bayes’ rule: \[ p(\boldsymbol \theta| D, M) = \frac{p(\boldsymbol \theta| M) p(D | \boldsymbol \theta, M) }{p(D | M)} \] Computing this marginal likelihood is different for simple and composite models.

A model is called “simple” if it directly corresponds to a specific distribution, say, a normal distribution with fixed mean and variance, or a binomial distribution with a given probability for the two classes. Thus, a simple model is a point in the model space described by the parameters of a distribution family (e.g. $\mu$ and $\sigma^2$ for the normal family $N(\mu, \sigma^2$). For a simple model $M$ the density $p(D | M)$ corresponds to standard likelihood of $M$ and there are no free parameters.

On the other hand, a model is “composite” if it is composed of simple models. This can be a finite set, or it can be comprised of infinite number of simpple models. Thus a composite model represent a model class. For example, a normal distribution with a given mean but unspecified variance, or a binomial model with unspecified class probability, is a composite model.

If $M$ is a composite model, with the underlying simple models indexed by a parameter $\boldsymbol \theta$, the likelihood of the model is obtained by marginalisation over $\boldsymbol \theta$: \[ \begin{split} p(D | M) &= \int_{\boldsymbol \theta} p(D | \boldsymbol \theta, M) p(\boldsymbol \theta| M) d\boldsymbol \theta\\ &= \int_{\boldsymbol \theta} p(D , \boldsymbol \theta| M) d\boldsymbol \theta\\ \end{split} \] i.e. we integrate over all parameter values $\boldsymbol \theta$.

If the distribution over the parameter $\boldsymbol \theta$ of a model is strongly concentrated around a specific value $\boldsymbol \theta_0$ then the composite model degenerates to a simple point model, and the marginal likelihood becomes the likelihood of the parameter $\boldsymbol \theta_0$ under that model.

Example 11.1 Beta-binomial distribution:

Assume that likelihood is binomial with mean parameter $\theta$. If $\theta$ follows a Beta distribution then the marginal likelihood with $\theta$ integrated out is the beta-binomial distribution (see also Worksheet B3). This is an example of a compound probability distribution.

11.1.2 Log-marginal likelihood as penalised maximum log-likelihood

By rearranging Bayes’ rule we see that \[ \log p(D | M) = \log p(D | \boldsymbol \theta, M) - \log \frac{ p(\boldsymbol \theta| D, M) }{p(\boldsymbol \theta| M) } \] The above is valid for all $\boldsymbol \theta$.

Assuming concentration of the posterior around the MLE $\hat{\boldsymbol \theta}_{\text{ML}}$ we will have $p(\hat{\boldsymbol \theta}_{\text{ML}} | D, M)> p(\hat{\boldsymbol \theta}_{\text{ML}}| M)$ and thus \[ \log p(D | M) = \underbrace{\log p(D | \hat{\boldsymbol \theta}_{\text{ML}}, M)}_{\text{maximum log-likelihood}} - \underbrace{ \log \frac{ p( \hat{\boldsymbol \theta}_{\text{ML}} | D, M) }{p( \hat{\boldsymbol \theta}_{\text{ML}}| M) } }_{\text{penalty > 0}} \] Therefore, the log-marginal likelihood is essentially a penalised version of the maximum log-likelihood, and the penalty depends on the concentration of the posterior around the MLE

11.1.3 Model complexity and Occams razor

Intriguingly, the penality implicit in the log-marginal likelihood is linked to the complexity of the model, in particular to the number of parameters of $M$. We will see this directly in the Schwarz approximation of the log-marginal likelihood discussed below.

Thus, the averaging over $\boldsymbol \theta$ in the marginal likelihood has the effect of automatically penalising complex models. Therefore, when comparing models using the marginal likelihood a complex model may be ranked below simpler models. In contrast, when selecting a model by comparing maximum likelihood directly the model with the highest number of parameters always wins over simpler models. Hence, the penalisation implicit in the marginal likelihood prevents overfitting that occurs with maximum likelihood.

The principle of preferring a less complex model is called Occam’s razor or the law of parsimony.

When choosing models a simpler model is often preferable over a more complex model, because the simpler model is typically better suited to both explaining the currently observed data as well as future data, whereas a complex model will typically only excel in fitting the current data but will perform poorly in prediction.

11.2 The Bayes factor for comparing two models

11.2.1 Definition of the Bayes factor

The Bayes factor is the ratio of the likelihoods of the two models: \[ B_{12} = \frac{p(D | M_1)}{p(D | M_2)} \]

The log-Bayes factor $\log B_{12}$ is also called the weight of evidence for $M_1$ over $M_2$.

11.2.2 Bayes theorem in terms of the Bayes factor

We would like to compare two models $M_1$ and $M_2$. Before seeing data $D$ we can check their Prior odds (= ratio of prior probabilities of the models $M_1$ and $M_2$):
\[\frac{\text{Pr}(M_1)}{\text{Pr}(M_2)}\]

After seeing data $D = \{x_1, \ldots, x_n\}$ we arrive at the Posterior odds (= ratio of posterior probabilities): \[\frac{\text{Pr}(M_1 | D)}{\text{Pr}(M_2 | D)}\]

Using Bayes Theorem $\text{Pr}(M_i | D) = \text{Pr}(M_i) \frac{p(D | M_i) }{p(D)}$ we can rewrite the posterior odds as \[ \underbrace{\frac{\text{Pr}(M_1 | D)}{\text{Pr}(M_2 | D)}}_{\text{posterior odds}} = \underbrace{\frac{p(D | M_1)}{p(D | M_2)}}_{\text{Bayes factor $B_{12}$}} \, \underbrace{\frac{\text{Pr}(M_1)}{\text{Pr}(M_2)}}_{\text{prior odds}} \]

The Bayes factor is the multiplicative factor that updates the prior odds to the posterior odds.

On the log scale we see that

\[ \text{log-posterior odds = weight of evidence + log-prior odds} \]

11.2.3 Scale for the Bayes factor

Following Harold Jeffreys (1961) ¹⁴ one may interpret the strength of the Bayes factor as follows:

$B_{12}$	$\log B_{12}$	evidence in favour of $M_1$ versus $M_2$
> 100	> 4.6	decisive
10 to 100	2.3 to 4.6	strong
3.2 to 10	1.16 to 2.3	substantial
1 to 3.2	0 to 1.16	not worth more than a bare mention

More recently, Kass and Raftery (1995) ¹⁵ proposed to use the following slightly modified scale:

$B_{12}$	$\log B_{12}$	evidence in favour of $M_1$ versus $M_2$
> 150	> 5	very strong
20 to 150	3 to 5	strong
3 to 20	1 to 3	positive
1 to 3	0 to 1	not worth more than a bare mention

11.2.4 Bayes factor versus likelihood ratio

If both $M_1$ and $M_2$ are simple models then the Bayes factor is identical to the likelihood ratio of the two models.

However, if one of the two models is composite then the Bayes factor and the generalised likelihood ratio differ: In the Bayes factor the representative of a composite model is the model average of the simple models indexed by $\boldsymbol \theta$, with weights taken from the prior distribution over the simple models contained in $M$. In contrast, in the generalised likelihood ratio statistic the representative of a composite model is chosen by maximisation.

Thus, for composite models, the Bayes factor does not equal the corresponding generalised likelihood ratio statistic. In fact, the key difference is that the Bayes factor is a penalised version of the likelihood ratio, with the penality depending on the difference in complexity (number of parameters) of the two models

11.3 Approximate computations

The marginal likelihood and the Bayes factor can be difficult to compute in practise. Therefore, a number of approximations have been developed. The most important is the so-called Schwarz (1978) approximation of the log-marginal likelihood. It is used to approximate the log-Bayes factor and also yields the BIC (Bayesian information criterion) which can be interpreted as penalised maximum likelihood.

11.3.1 Schwarz (1978) approximation of log-marginal likelihood

The logarithm of the marginal likelihood of a model can be approximated following Schwarz (1978) ¹⁶ as follow: \[ \log p(D | M) \approx l_n^M(\hat{\boldsymbol \theta}_{ML}^{M}) - \frac{1}{2} d_M \log n \] where $d_M$ is the dimension of the model $M$ (number of parameters in $\boldsymbol \theta$ belonging to $M$) and $n$ is the sample size and $\hat{\boldsymbol \theta}_{ML}^{M}$ is the MLE. For a simple model $d_M=0$ so then there is no approximation as in this case the marginal likelihood equals the likelihood.

The above formula can be obtained by quadratic approximation of the likelihood assuming large $n$ and assuming that the prior is locally uniform around the MLE. The Schwarz (1978) approximation is therefore a special case of a Laplace approximation.

Note that the approximation is the maximum log-likelihood minus a penalty that depends on the model complexity (as measured by dimension $d$), hence this is an example of penalised ML! Also note that the distribution over the parameter $\boldsymbol \theta$ is not required in the approximation.

11.3.2 Bayesian information criterion (BIC)

The BIC (Bayesian information criterion) of the model $M$ is the approximated log-marginal likelihood times the factor -2:

\[ BIC(M) = -2 l_n^M(\hat{\boldsymbol \theta}_{ML}^{M}) + d_M \log n \]

Thus, when comparing models one aimes to maximise the marginal likelihood or, as approximation, minimise the BIC.

The reason for the factor “-2” is simply to have a quantity that is on the same scale as the Wilks log likelihood ratio. Some people / software packages also use the factor “2”.

11.3.3 Approximating the weight of evidence (log-Bayes factor) with BIC

Using BIC (twice) the log-Bayes factor can be approximated as \[ \begin{split} 2 \log B_{12} &\approx -BIC(M_1) + BIC(M_2) \\ &=2 \left( l_n^{M_{1}}(\hat{\boldsymbol \theta}_{ML}^{M_{1}}) - l_n^{M_{2}}(\hat{\boldsymbol \theta}_{ML}^{M_{2}}) \right) - \log(n) (d_{M_{1}}-d_{M_{2}}) \\ \end{split} \] i.e. it is the penalised log-likelihood ratio of model $M_1$ vs. $M_2$.

11.4 Bayesian testing using false discovery rates

We introduce False Discovery Rates (FDR) as a Bayesian method to distinguish a null model from an alternative model. This is closely linked with classical frequentist multiple testing procedures.

11.4.1 Setup for testing a null model $H_0$ versus an alternative model $H_A$

We consider two models:

$H_0:$ null model, with density $f_0(x)$ and distribution $F_0(x)$

$H_A:$ alternative model, with density $f_A(x)$ and distribution $F_A(x)$

Aim: given observations $x_1, \ldots, x_n$ we would like to decide for each $x_i$ whether it belongs to $H_0$ or $H_A$.

This is done by a critical decision threshold $x_c$: if $x_i > x_c$ then $x_i$ is called “significant” and otherwise called “not significant”.

In classical statistics one of the the most widely used approach to find the decision threshold is by computing $p$-values from the $x_i$ (this uses only the null model but not the alternative model), and then thresholding the $p$-values a a certain level (say 5%). If $n$ is large then often the test is modified by adjusting the $p$-values or the threshold (e.g. if Bonferroni correction).

Note that this procedure ignores any information we may have about the alternative model!

11.4.2 Test errors

11.4.2.1 True and false positives and negatives

For any decision threshold $x_c$ we can distinguish the following errors:

False positives (FP), “false alarm”, type I error: $x_i$ belongs to null but is called “significant”
False negative (FN), “miss”, type II error: $x_i$ belongs to alternative, but is called “not significant”

In addition we have:

True positives (TP), “hits”: belongs to alternative and is called “significant”
True negatives (TN), “correct rejections”: belongs to null and is called “not significant”

11.4.2.2 Specificity and Sensitivity

From counts of TP, TN, FN, FP we can derive further quantities:

True Negative Rate TNR, specificity: $TNR= \frac{TN}{TN+FP} = 1- FPR$ with FPR=False Positive Rate = $1-\alpha_I$
True Positive Rate TPR, sensitivity, power, recall: $TPR= \frac{TP}{TP+FN} = 1- FNR$ with FNR=False negative rate = $1-\alpha_{II}$
Accuracy: $ACC = \frac{TP+TN}{TP+TN+FP+FN}$

Another common way to choose the decision threshold $x_d$ in classical statistics is to balance sensitivity/power vs. specificity (maximising both power and specificity, or equivalently, minimising both false positive and false negative rates). ROC curves plot TPR/sensitivity vs. FPR = 1-specificity.

11.4.2.3 FDR and FNDR

It is possible to link the above with the observed counts of TP, FP, TN, FN:

False Discovery Rate (FDR): $FDR = \frac{FP}{FP+TP}$
False Nondiscovery Rate (FNDR): $FNDR = \frac{FN}{TN+FN}$
Positive predictive value (PPV), True Discovery Rate (TDR), precision: $PPV = \frac{TP}{FP+TP} = 1-FDR$
Negative predictive value (NPV): $NPV = \frac{TN}{TN+FN} = 1-FNDR$

In order to choose the decision threshold it is natural to balance FDR and FDNR (or PPV and NPV), by minimising both FDR and FNDR or maximising both PPV and NPV.

In machine learning it is common to use “precision-recall plots” that plot precision (=PPV, TDR) vs. recall (=power, sensitivity).

11.4.3 Bayesian perspective

11.4.3.1 Two component mixture model

In the Bayesian perspective the problem of choosing the decision threshold is related to computing the posterior probability \[\text{Pr}(H_0 | x_i) , \] i.e. probability of the null model given the observation $x_i$, or equivalently computing \[\text{Pr}(H_A | x_i) = 1- \text{Pr}(H_0 | x_i)\] the probability of the alternative model given the observation $x_i$.

This is done by assuming a mixture model \[ f(x) = \pi_0 f_0(x) + (1-\pi_0) f_A(x) \] where $\pi_0 = \text{Pr}(H_0)$ is the prior probability of $H_0$ and. $\pi_A = 1- \pi_0 = \text{Pr}(H_A)$ the prior probabiltiy of $H_A$.

Note that the weights $\pi_0$ can in fact be estimated from the observations by fitting the mixture distribution to the observations $x_1, \ldots, x_n$ (so it is effectively an empirical Bayes method where the prior is informed by the data).

11.4.3.2 Local FDR

The posterior probability of the null model given a data point is then given by \[\text{Pr}(H_0 | x_i) = \frac{\pi_0 f_0(x_i)}{f(x_i)} = LFDR(x_i)\] This quantity is also known as the local FDR or local False Discovery Rate.

In the given one-sided setup the local FDR is large (close to 1) for small $x$, and will become close to 0 for large $x$. A common decision rule is given by thresholding local false discovery rates: if $LFDR(x_i) < 0.1$ the $x_i$ is called significant.

11.4.3.3 q-values

In correspondence to $p$-values one can also define tail-area based false discovery rates: \[ Fdr(x_i) = \text{Pr}(H_0 | X > x_i) = \frac{\pi_0 F_0(x_i)}{F(x_i)} \]

These are called q-values, or simply False Discovery Rates (FDR). Intriguingly, these also have a frequentist interpretation as adjusted p-values (using a Benjamini-Hochberg adjustment procedure).

11.4.4 Software

There are a number of R packages to compute (local) FDR values:

For example:

and many more.

Using FDR values for screening is especially useful in high-dimensional settings (e.g. when analysing genomic and other high-throughput data).

FDR values have both a Bayesian as well as frequentist interpretation, providing further evidence that good classical statistical methods do have a Bayesian interpretation.

10 Bayesian learning in practise

12 Choosing priors in Bayesian analysis

Statistical Methods: Likelihood, Bayes and Regression