12 Choosing priors in Bayesian analysis

12.1 Choosing a prior

12.1.1 Prior as part of the model

It is essential in a Bayesian analysis to specify your prior uncertainty about the model parameters. Note that this is simply part of the modelling process! Thus in a Bayesian approach the data analyst needs to be more explicit about all modelling assumptions.

Typically, when choosing a suitable prior distribution we consider the overall form (shape and domain) of the distribution as well as its key characteristics such as the mean and variance. As we have learned, the precision (inverse variance) of the prior can often be viewed as an implied prior sample size.
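For instance, recalling the Beta-Binomial model: with prior \(\text{Beta}(a, b)\) the prior mean is \(\theta_0 = \frac{a}{a+b}\) and the concentration \(k_0 = a+b\) acts like a prior sample size, since after observing \(x\) successes in \(n\) trials the posterior is \[ \theta | x \sim \text{Beta}(a + x, \, b + n - x) \] with concentration \(k_1 = k_0 + n\), i.e. the prior contributes like \(k_0\) pseudo-observations added to the \(n\) actual observations.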

For large sample size \(n\) the posterior mean converges to the maximum likelihood estimate (and the posterior distribution to a normal distribution centered around the MLE), so for large \(n\) the particular choice of prior has little influence on the results.

However, for small \(n\) it is essential that a prior is specified. In non-Bayesian approaches this prior is still present, but it is either implicit (maximum likelihood estimation) or specified via a penalty (penalised maximum likelihood estimation).

12.1.2 Some guidelines

So the question remains: what are good ways to choose a prior? Two useful approaches are:

  1. Use a weakly informative prior. This means that you do have an idea (even if only a vague one) about the suitable values of the parameter of interest, and you use a corresponding prior (for example one with moderate variance) to model this uncertainty. This acknowledges that there are no uninformative priors, while also aiming to ensure that the prior does not dominate the likelihood (i.e. the data). The result is a weakly regularised estimator. Note that it is often desirable that the prior adds information (if only a little) so that it can act as a regulariser.

  2. Empirical Bayes methods can often be used to determine one or all of the hyperparameters (i.e. the parameters in the prior) from the observed data. There are several ways to do this; one of them is to tune the shrinkage parameter \(\lambda\) to achieve minimum MSE. We discuss this further below.

Furthermore, there also exist many proposals advocating so-called “uninformative priors” or “objective priors”. However, there are no actually uninformative priors, since a prior distribution that looks uninformative (i.e. “flat”) in one coordinate system can be informative in another; this is a simple consequence of the rule for transformation of probability densities. As a result, the suggested objective priors are often in fact improper, i.e. they are not actually probability distributions!
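As a quick numerical illustration of this transformation argument (a minimal sketch in Python; the variable names are ours), a prior that is flat for a probability \(p\) is no longer flat for the log-odds \(\phi = \log\frac{p}{1-p}\):

```python
import numpy as np

# A prior that is flat for a probability p is not flat after a change of
# coordinates: transforming draws from Uniform(0, 1) to the log-odds scale
# phi = log(p / (1 - p)) gives the (clearly non-flat) standard logistic
# distribution, whose density peaks at phi = 0.
rng = np.random.default_rng(42)
p = rng.uniform(1e-12, 1 - 1e-12, size=100_000)  # "flat" prior on p
phi = np.log(p / (1 - p))                        # induced prior on the log-odds

hist, _ = np.histogram(phi, bins=50, range=(-5, 5), density=True)
print(hist.max(), hist.min())  # very different values, i.e. far from flat
```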

12.2 Default priors or uninformative priors

Objective or default priors are attempts 1) to automatise the specification of a prior and 2) to find uninformative priors.

12.2.1 Jeffreys prior

The most well-known non-informative prior was proposed by Harold Jeffreys (1891–1989) in 1946 17.

Specifically, this prior is constructed from the expected Fisher information and thus promises automatic construction of objective uninformative priors using the likelihood: \[ p(\boldsymbol \theta) \propto \sqrt{\det \boldsymbol I^{\text{Fisher}}(\boldsymbol \theta)} \]

The reasoning underlying this prior is invariance against transformation of the coordinate system of the parameters.

For the Beta-Binomial model the Jeffreys prior corresponds to \(\text{Beta}(\frac{1}{2}, \frac{1}{2})\). Note this is not the uniform distribution but a U-shaped prior.
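This follows directly from the definition above: for the binomial likelihood with \(n\) trials and success probability \(\theta\) the expected Fisher information is \(I^{\text{Fisher}}(\theta) = \frac{n}{\theta (1-\theta)}\), so that \[ p(\theta) \propto \sqrt{I^{\text{Fisher}}(\theta)} \propto \theta^{-1/2} (1-\theta)^{-1/2} , \] which is (up to normalisation) the density of the \(\text{Beta}(\frac{1}{2}, \frac{1}{2})\) distribution.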

For the normal-normal model it corresponds to the flat improper prior \(p(\mu) =1\).

For the IG-normal model the Jeffreys prior is the improper prior \(p(\sigma^2) = \frac{1}{\sigma^2}\).

This already illustrates the main problem with this type of prior, namely that it is often improper, i.e. the prior distribution is not actually a probability distribution (its density does not integrate to 1).

Another issue is that Jeffreys priors are usually not conjugate, which complicates the update from the prior to the posterior.

Furthermore, if there are multiple parameters (i.e. \(\boldsymbol \theta\) is a vector) then the Jeffreys approach does not usually lead to sensible priors.

12.2.2 Reference priors

An alternative to Jeffreys priors are the so-called reference priors developed by Bernardo (1979) 18. This type of prior aims to choose the prior such that there is maximal “correlation” between the data and the parameter. More precisely, the mutual information between \(\theta\) and \(x\) is maximised (i.e. the expected KL divergence between the posterior and the prior distribution). The underlying motivation is that the data and the parameters should be maximally linked (thereby minimising the influence of the prior).
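In our notation, writing \(x\) for the data, the quantity being maximised is \[ \text{MI}(\theta, x) = \int p(x) \int p(\theta | x) \log \frac{p(\theta | x)}{p(\theta)} \, d\theta \, dx , \] i.e. the KL divergence between posterior and prior, averaged over the marginal distribution \(p(x)\) of the data.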

In univariate settings reference priors are identical to Jeffreys priors. However, reference priors also provide reasonable priors in multivariate settings.

In both the Jeffreys and the reference prior approach the prior is chosen by taking an expectation over the data, i.e. not for the specific data set at hand (this can be seen both as a positive and as a negative!).

12.3 Empirical Bayes

In empirical Bayes the data analyst specifies a family of prior distributions (say a Beta distribution with free parameters), and then the data at hand are used to find an optimal choice for the hyper-parameters (hence the name “empirical”). Thus the hyper-parameters are not specified but are themselves estimated.

12.3.1 Type II maximum likelihood

In particular, assuming data \(D\), a likelihood \(p(D|\boldsymbol \theta)\) for some model with parameters \(\boldsymbol \theta\) as well as a prior \(p(\boldsymbol \theta| \lambda)\) for \(\boldsymbol \theta\) with hyper-parameter \(\lambda\), the marginal likelihood now depends on \(\lambda\): \[ p(D | \lambda) = \int_{\boldsymbol \theta} p(D|\boldsymbol \theta) p(\boldsymbol \theta| \lambda) d\boldsymbol \theta \] We can therefore use maximum (marginal) likelihood to find optimal values of \(\lambda\) given the data.

Since maximum likelihood is applied at a second level (to the hyper-parameters), this type of empirical Bayes is also often called “type II maximum likelihood”.
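As a concrete sketch of type II maximum likelihood (a minimal example assuming a Beta-Binomial model; the toy data and the function below are purely illustrative), the hyper-parameters \(a\) and \(b\) of a \(\text{Beta}(a, b)\) prior can be estimated by maximising the log marginal likelihood of several binomial counts:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln, gammaln

# Type II maximum likelihood for a Beta-Binomial model:
# x_i successes out of n_i trials, prior theta ~ Beta(a, b).
# The marginal likelihood integrates theta out analytically:
#   p(x_i | a, b) = C(n_i, x_i) * B(x_i + a, n_i - x_i + b) / B(a, b)

def neg_log_marginal(log_ab, x, n):
    a, b = np.exp(log_ab)  # optimise on the log scale so that a, b > 0
    ll = (gammaln(n + 1) - gammaln(x + 1) - gammaln(n - x + 1)
          + betaln(x + a, n - x + b) - betaln(a, b))
    return -np.sum(ll)

# toy data: number of successes out of 20 trials for several units
x = np.array([3, 5, 2, 8, 4, 6])
n = np.full_like(x, 20)

res = minimize(neg_log_marginal, x0=np.log([1.0, 1.0]), args=(x, n))
a_hat, b_hat = np.exp(res.x)
print("estimated hyper-parameters:", a_hat, b_hat)
```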

12.3.2 Shrinkage estimation using empirical risk minimisation

An alternative (but related) way to estimate hyper-parameters is by minimising the empirical risk.

In the examples for Bayesian estimation that we have considered so far the posterior mean of the parameter of interest was obtained by linear shrinkage \[ \hat\theta_{\text{shrink}} = \text{E}( \theta | D) = \lambda \theta_0 + (1-\lambda) \hat\theta_{\text{ML}} \] of the MLE \(\hat\theta_{\text{ML}}\) towards the prior mean \(\theta_0\), with shrinkage intensity \(\lambda=\frac{k_0}{k_1}\) determined by the ratio of the prior and posterior concentration parameters \(k_0\) and \(k_1\).

The resulting point estimate \(\hat\theta_{\text{shrink}}\) is called shrinkage estimate and is a convex combination of \(\theta_0\) and \(\hat\theta_{\text{ML}}\). The prior mean \(\theta_0\) is also called the “target”.

The hyperparameter in this setting is \(k_0\) (linked to the precision of the prior) or, equivalently, the shrinkage intensity \(\lambda\).

An optimal value for \(\lambda\) can be obtained by minimising the mean squared error of the estimator \(\hat\theta_{\text{shrink}}\).

In particular, by construction, the target \(\theta_0\) has low or even zero variance but non-vanishing and potentially large bias, whereas the MLE \(\hat\theta_{\text{ML}}\) will have low or zero bias but a substantial variance. By combining these two estimators with opposite properties the aim is to achieve a bias-variance tradeoff so that the resulting estimator \(\hat\theta_{\text{shrink}}\) has lower MSE than either \(\theta_0\) or \(\hat\theta_{\text{ML}}\).

Specifically, the aim is to find \[ \lambda^{\star} = \underset{\lambda}{\arg \min \ } \text{E}\left( ( \theta - \hat\theta_{\text{shrink}} )^2\right) \]

It turns out that this can be minimised without knowing the actual true value of \(\theta\), and the result for an unbiased \(\hat\theta_{\text{ML}}\) is \[ \lambda^{\star} = \frac{\text{Var}(\hat\theta_{\text{ML}})}{\text{E}( (\hat\theta_{\text{ML}} - \theta_0)^2 )} \] Hence, the shrinkage intensity will be small if the variance of the MLE is small and/or if the target and the MLE differ substantially. Conversely, if the variance of the MLE is large and/or the target is close to the MLE, the shrinkage intensity will be large.
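To see where this formula comes from, note that for an unbiased \(\hat\theta_{\text{ML}}\) and a fixed target \(\theta_0\) expanding the squared error gives \[ \text{E}\left( ( \theta - \hat\theta_{\text{shrink}} )^2\right) = (1-\lambda)^2 \, \text{Var}(\hat\theta_{\text{ML}}) + \lambda^2 (\theta - \theta_0)^2 , \] and setting the derivative with respect to \(\lambda\) to zero yields \[ \lambda^{\star} = \frac{\text{Var}(\hat\theta_{\text{ML}})}{\text{Var}(\hat\theta_{\text{ML}}) + (\theta - \theta_0)^2} = \frac{\text{Var}(\hat\theta_{\text{ML}})}{\text{E}( (\hat\theta_{\text{ML}} - \theta_0)^2 )} , \] where the last step uses \(\text{E}( (\hat\theta_{\text{ML}} - \theta_0)^2 ) = \text{Var}(\hat\theta_{\text{ML}}) + (\theta - \theta_0)^2\) for an unbiased \(\hat\theta_{\text{ML}}\). In practice both the numerator and the denominator can be estimated from the data, which is why the true value of \(\theta\) is not needed.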

Choosing the shrinkage parameter by minimising the risk (here the mean squared error) is also a form of empirical Bayes.

Example 12.1 James-Stein estimator:

Applying empirical risk minimisation to estimate the shrinkage parameter in the normal-normal model with a single observation yields the James-Stein estimator (1955).

Specifically, James and Stein proposed the following estimate of the multivariate mean \(\boldsymbol \mu\) based on a single sample \(\boldsymbol x\) drawn from the multivariate normal \(N_d(\boldsymbol \mu, \boldsymbol I)\): \[ \hat{\boldsymbol \mu}_{JS} = \left(1-\frac{d-2}{||\boldsymbol x||^2}\right) \boldsymbol x \] Here, we recognise \(\hat{\boldsymbol \mu}_{ML} = \boldsymbol x\), the target \(\boldsymbol \mu_0=0\) and the shrinkage intensity \(\lambda^{\star}=\frac{d-2}{||\boldsymbol x||^2}\).
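A small simulation (a sketch only; the true mean vector, dimension and seed below are arbitrary choices) illustrates the resulting reduction in total squared error compared to the MLE for \(d \geq 3\):

```python
import numpy as np

# Simulation comparing the James-Stein estimator with the MLE
# (the observation itself) for a single draw x ~ N_d(mu, I), d >= 3.
rng = np.random.default_rng(0)
d, n_rep = 10, 50_000
mu = rng.normal(0.0, 1.0, size=d)            # arbitrary fixed true mean vector

se_ml, se_js = 0.0, 0.0
for _ in range(n_rep):
    x = mu + rng.normal(0.0, 1.0, size=d)    # single observation from N_d(mu, I)
    shrink = 1.0 - (d - 2) / np.sum(x ** 2)  # James-Stein shrinkage factor
    mu_js = shrink * x                       # shrink towards the target mu_0 = 0
    se_ml += np.sum((x - mu) ** 2)
    se_js += np.sum((mu_js - mu) ** 2)

print("total squared error, MLE:        ", se_ml / n_rep)
print("total squared error, James-Stein:", se_js / n_rep)
```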

Efron and Morris (1972) and Lindley and Smith (1972) later generalised the James-Stein estimator to the case of multiple observations \(\boldsymbol x_1, \ldots, \boldsymbol x_n\) and target \(\boldsymbol \mu_0\), yielding an empirical Bayes estimate of \(\boldsymbol \mu\) based on the normal-normal model.