10 Bayesian learning in practice

In this chapter we discuss three basic problems, namely how to estimate a proportion, a mean and a variance in a Bayesian framework.

10.1 Estimating a proportion using the beta-binomial model

10.1.1 Binomial likelihood

In order to apply Bayes’ theorem we first need to find a suitable likelihood. We use the Bernoulli model as in Example 3.1:

Repeated Bernoulli experiment (binomial model):

Bernoulli data generating process: \[ x \sim \text{Ber}(\theta) \]

$x \in \{0, 1\}$ (e.g. “success” vs. “failure”)
The “success” is indicated by outcome $x=1$ and the “failure” by $x=0$
Parameter: $\theta$ is the probability of “success”
probability mass function (PMF): $\text{Pr}(x=1) = \theta$, $\text{Pr}(x=0) = 1-\theta$
Mean: $\text{E}(x) = \theta$
Variance $\text{Var}(x) = \theta (1-\theta)$

Binomial model $\text{Bin}(n,\theta)$ (sum of $n$ Bernoulli experiments):

$y = \sum_{i=1}^n x_i$ with $y \in \{0, 1, \ldots, n\}$
Mean: $\text{E}(y) = n \theta$
Variance: $\text{Var}(y) = n \theta (1-\theta)$
Mean of standardised $y$: $\text{E}(y/n) = \theta$
Variance of standardised $y$: $\text{Var}(y/n) = \frac{\theta (1-\theta)}{n}$

Maximum likelihood estimate of $\theta$:

We conduct $n$ Bernoulli trials and observe data $D = \{x_1, \ldots, x_n\}$ with average $\bar{x}$ and $n_1$ successes and $n_2 = n-n_1$ failures.
Binomial likelihood: \[ L(\theta|D) = \begin{pmatrix} n \\ n_1 \end{pmatrix} \theta^{n_1} (1-\theta)^{n_2} \] Note that the binomial coefficient arises because the ordering of the $x_i$ is irrelevant, but it may be discarded as it does not contain the parameter $\theta$.
From Example 3.1 we know that the maximum likelihood estimate of the proportion $\theta$ is the frequency \[\hat{\theta}_{ML} = \frac{n_1}{n} = \bar{x}\] Thus, the MLE $\hat{\theta}_{ML}$ can be expressed as an average (of the individual data points). This seemingly trivial fact is important for Bayesian estimation of $\theta$ using linear shrinkage, as will become evident below.
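As a small illustration of the MLE being an average, the following sketch computes $n_1$, $n_2$ and $\hat{\theta}_{ML}$ for a made-up data set. The data values and variable names are assumptions chosen only for this example.

```python
# Hypothetical Bernoulli data: 1 = "success", 0 = "failure"
x = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]

n = len(x)
n1 = sum(x)   # number of successes
n2 = n - n1   # number of failures

theta_ml = n1 / n    # MLE: relative frequency of successes
x_bar = sum(x) / n   # sample average of the individual data points

print(f"n1={n1}, n2={n2}, theta_ml={theta_ml}, x_bar={x_bar}")
```

The two quantities `theta_ml` and `x_bar` coincide by construction, which is the property used later for linear shrinkage.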

10.1.2 Beta prior distribution

In Bayesian statistics we need not only to specify the data generating process but also a prior distribution over the parameters of the likelihood function.

Therefore, we need to explicitly specify our prior uncertainty about $\theta$.

The parameter $\theta$ has support $[0,1]$. Therefore we may use a beta distribution $\text{Beta}(\alpha_1, \alpha_2)$ as prior for $\theta$ (see the Appendix for properties of this distribution). We will see below that the beta distribution is a natural choice as a prior in conjunction with a binomial likelihood.

The parameters of a prior (here $\alpha_1 \geq 0$ and $\alpha_2 \geq 0$) are also known as the hyperparameters of the model to distinguish them from the parameters of the likelihood function (here $\theta$).

We write for the prior distribution \[ \theta \sim \text{Beta}(\alpha_1, \alpha_2) \] with density \[ p(\theta) = \frac{1}{B(\alpha_1, \alpha_2)} \theta^{\alpha_1-1} (1-\theta)^{\alpha_2-1} \]

In terms of mean parameterisation $\text{Beta}(\mu_0, k_0)$ this corresponds to:

The prior concentration parameter is set to $k_0 = \alpha_1 + \alpha_2$
The prior mean parameter is set to $\mu_0 = \alpha_1 / k_0$.

The prior mean is therefore \[ \text{E}(\theta) = \mu_0 \] and the prior variance \[ \text{Var}(\theta) = \frac{\mu_0 (1-\mu_0)}{k_0 + 1} \]
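The conversion between the two parameterisations can be sketched in a few lines. The function names and the hyperparameter values below are hypothetical; the cross-check uses the standard formula for the variance of a beta distribution in terms of $\alpha_1$ and $\alpha_2$.

```python
def beta_to_mean_param(alpha1, alpha2):
    """Convert Beta(alpha1, alpha2) to the mean parameterisation (mu0, k0)."""
    k0 = alpha1 + alpha2   # concentration parameter
    mu0 = alpha1 / k0      # mean parameter
    return mu0, k0


def beta_mean_var(mu, k):
    """Mean and variance of a beta distribution in the mean parameterisation."""
    return mu, mu * (1 - mu) / (k + 1)


# Hypothetical hyperparameters, chosen only for illustration
alpha1, alpha2 = 2.0, 6.0
mu0, k0 = beta_to_mean_param(alpha1, alpha2)
prior_mean, prior_var = beta_mean_var(mu0, k0)

# Cross-check against the variance written in terms of alpha1 and alpha2
var_check = alpha1 * alpha2 / ((alpha1 + alpha2) ** 2 * (alpha1 + alpha2 + 1))

print(mu0, k0, prior_var, var_check)
```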

It is important to note that this does not actually mean that $\theta$ is random. It only means that we model the uncertainty about $\theta$ using a beta-distributed random variable. The flexibility of the beta distribution allows us to accommodate a large variety of possible scenarios for our prior knowledge using just two parameters.

Note the mean and variance of the beta prior and the mean and variance of the standardised binomial variable $y/n$ have the same form. This is further indication that the binomial likelihood and the beta prior are well matched — see the discussion below about “conjugate priors”.

10.1.3 Computing the posterior distribution

After observing data $D = \{x_1, \ldots, x_n\}$ with $n_1$ “successes” and $n_2 = n-n_1$ “failures” we can compute the posterior density over $\theta$ using Bayes’ theorem: \[ p(\theta| D) = \frac{p(\theta) L(\theta | D) }{p(D)} \]

Applying Bayes’ theorem results in the posterior distribution: \[ \theta| D \sim \text{Beta}(\alpha_1+n_1, \alpha_2+n_2) \] with density \[ p(\theta| D) = \frac{1}{B(\alpha_1+n_1, \alpha_2+n_2)} \theta^{\alpha_1+n_1-1} (1-\theta)^{\alpha_2+n_2-1} \] (For a proof see Worksheet B1.)

In the corresponding mean parameterisation $\text{Beta}(\mu_1, k_1)$ this results in the following updates:

The concentration parameter is updated to $k_1 = k_0+n$
The mean parameter is updated to \[ \mu_1 = \frac{\alpha_1 + n_1}{k_1} \] This can be written as \[ \begin{split} \mu_1 & = \frac{\alpha_1}{k_1} + \frac{n_1}{k_1}\\ & = \frac{k_0}{k_1} \frac{\alpha_1}{k_0} + \frac{n}{k_1} \frac{n_1}{n}\\ & = \lambda \mu_0 + (1-\lambda) \hat{\theta}_{ML}\\ \end{split} \] with $\lambda = \frac{k_0}{k_1}$. Hence, $\mu_1$ is a convex combination of the prior mean and the MLE.

Therefore, the posterior mean is \[ \text{E}(\theta | D) = \mu_1 \] and the posterior variance is \[ \text{Var}(\theta | D) = \frac{\mu_1 (1-\mu_1)}{k_1+1 } \]
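The update from prior to posterior can be sketched as follows. The function name, the prior $\text{Beta}(2, 6)$ and the counts are assumptions made for illustration; the sketch also checks that the shrinkage form of $\mu_1$ agrees with the direct formula $(\alpha_1 + n_1)/k_1$.

```python
def beta_binomial_update(alpha1, alpha2, n1, n2):
    """Posterior mean parameter, concentration and shrinkage intensity."""
    k0 = alpha1 + alpha2
    mu0 = alpha1 / k0
    n = n1 + n2
    k1 = k0 + n            # updated concentration parameter
    lam = k0 / k1          # shrinkage intensity
    theta_ml = n1 / n      # maximum likelihood estimate
    mu1 = lam * mu0 + (1 - lam) * theta_ml
    return mu1, k1, lam


# Hypothetical prior Beta(2, 6) and data with 7 successes and 3 failures
alpha1, alpha2, n1, n2 = 2.0, 6.0, 7, 3
mu1, k1, lam = beta_binomial_update(alpha1, alpha2, n1, n2)
post_var = mu1 * (1 - mu1) / (k1 + 1)

# Direct formula for the posterior mean, for comparison
mu1_direct = (alpha1 + n1) / (alpha1 + alpha2 + n1 + n2)

print(mu1, mu1_direct, k1, lam, post_var)
```

In this example the prior mean is 0.25 and the MLE is 0.7, so the posterior mean lies between the two, closer to the MLE because $n > k_0$.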

10.2 Properties of Bayesian learning

The beta-binomial model, even though it is one of the simplest possible models, already allows us to observe a number of important features and properties of Bayesian learning. Many of these also apply to other models, as we will see later.

10.2.1 Prior acting as pseudodata

In the expression for the mean and variance you can see that the concentration parameter $k_0=\alpha_1 + \alpha_2$ behaves like an implicit sample size connected with the prior information about $\theta$.

Specifically, $\alpha_1$ and $\alpha_2$ act as pseudocounts that influence both the posterior mean and the posterior variance, exactly in the same way as conventional observations.

For example, the larger $k_0$ (and thus the larger $\alpha_1$ and $\alpha_2$) the smaller is the posterior variance, with variance decreasing proportional to the inverse of $k_0$. If the prior is highly concentrated, i.e. if it has low variance and large precision (=inverse variance) then the implicit data size $k_0$ is large. Conversely, if the prior has large variance, then the prior is vague and the implicit data size $k_0$ is small.

Hence, a prior has the same effect as adding data, but without actually adding data! This is precisely why a prior acts as a regulariser and prevents overfitting: it increases the effective sample size.

Another interpretation is that a prior summarises data that may have been available previously as observations.
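The pseudodata view can be illustrated with a small sketch: updating in two steps, with the first posterior serving as the prior for the second batch, gives the same result as updating once with the pooled counts. The prior and the two batches below are hypothetical.

```python
def update(alpha1, alpha2, n1, n2):
    """Conjugate beta-binomial update of the pseudocounts."""
    return alpha1 + n1, alpha2 + n2


prior = (1.0, 1.0)   # hypothetical prior pseudocounts
batch_a = (3, 5)     # (successes, failures) in a first batch
batch_b = (4, 2)     # (successes, failures) in a second batch

# Sequential learning: the posterior after batch A is the prior for batch B
after_a = update(*prior, *batch_a)
sequential = update(*after_a, *batch_b)

# Pooled learning: all observations are used at once
pooled = update(*prior, batch_a[0] + batch_b[0], batch_a[1] + batch_b[1])

print(sequential, pooled)
```

From the point of view of the second update, the counts of batch A are indistinguishable from prior pseudocounts.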

10.2.2 Linear shrinkage of mean

In the beta-binomial model the posterior mean is a convex combination (i.e. the weighted average) of the ML estimate and the prior mean as can be seen from the update formula \[ \mu_1 = \lambda \mu_0 + (1-\lambda) \hat{\theta}_{ML} \] with weight $\lambda \in [0,1]$ \[ \lambda = \frac{k_0}{k_1} \,. \] Thus, the posterior mean $\mu_1$ is a linearly adjusted $\hat{\theta}_{ML}$. The factor $\lambda$ is called the shrinkage intensity — note that this is the ratio of the “prior sample size” ($k_0$) and the “effective total sample size” ($k_1$).

This adjustment of the MLE is called shrinkage, because the $\hat{\theta}_{ML}$ is “shrunk” towards the prior mean $\mu_0$ (which is often called the “target”, and sometimes the target is zero, and then the terminology “shrinking” makes most sense).
If the shrinkage intensity is zero ($\lambda = 0$) then the ML point estimator is recovered. This happens when $\alpha_1=0$ and $\alpha_2=0$ or for $n \rightarrow \infty$.

Remark: using maximum likelihood to estimate $\theta$ (for moderate or small $n$) is the same as Bayesian posterior mean estimation using the beta-binomial model with prior $\alpha_1=0$ and $\alpha_2=0$. This prior is extremely “u-shaped” and is the implicit prior underlying ML estimation. Would you use such a prior intentionally?
If the shrinkage intensity is large ($\lambda \rightarrow 1$) then the posterior mean corresponds to the prior mean. This happens if $n=0$ or if $k_0$ is very large (implying that the prior is sharply concentrated around the prior mean).
Since the ML estimate $\hat{\theta}_{ML}$ is unbiased the Bayesian point estimate is biased (for finite $n$!). And the bias is induced by the prior mean deviating from the true mean. This is also true more generally as Bayesian learning typically produces biased estimators (but asymptotically they will be unbiased like in ML).
The fact that the posterior mean is a linear combination of the MLE and the prior mean is not a coincidence. In fact, this is true for all distributions that are exponential families, see e.g. Diaconis and Ylvisaker (1979)¹¹. Crucially, exponential families can always be parameterised such that the corresponding MLEs are expressed as averages of functions of the data (more technically: the MLE of the mean parameter in an EF is the average of the canonical statistic). In conjunction with a particular type of prior (conjugate priors, which always exist for exponential families, see below) this allows us to write the update from the prior to the posterior mean as a linear adjustment of the MLE.
Furthermore, it is possible (and indeed quite useful for computational reasons!) to formulate Bayes learning assuming only first and second moments (i.e. without full distributions) and in terms of linear shrinkage, see e.g. Hartigan (1969)¹². The resulting theory is called “Bayes linear statistics” (Goldstein and Wooff, 2007)¹³.
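
To make the bias statement concrete, here is a short worked calculation. It assumes the data are generated with a fixed true proportion, written here as $\theta_{\text{true}}$ (a symbol introduced only for this sketch), and takes the expectation over repeated data sets of size $n$ with the hyperparameters held fixed: \[ \text{E}(\mu_1) = \lambda \mu_0 + (1-\lambda) \text{E}(\hat{\theta}_{ML}) = \lambda \mu_0 + (1-\lambda) \theta_{\text{true}} \] so that \[ \text{Bias}(\mu_1) = \text{E}(\mu_1) - \theta_{\text{true}} = \lambda (\mu_0 - \theta_{\text{true}}) = \frac{k_0}{k_0+n} (\mu_0 - \theta_{\text{true}}) \] The bias vanishes if the prior mean equals the true proportion, and otherwise shrinks to zero as $n \rightarrow \infty$.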

10.2.3 Conjugacy of prior and posterior distribution

In the beta-binomial model for estimating the proportion $\theta$ the choice of the beta distribution as prior distribution along with the binomial likelihood resulted in having the beta distribution as posterior distribution as well.

If the prior and posterior belong to the same distributional family the prior is called a conjugate prior. This will be the case if the prior has the same functional form as the likelihood. Therefore one also says that the prior is conjugate for the likelihood.

It can be shown that conjugate priors exist for all likelihood functions that are based on data generating models that are exponential families.

In the beta-binomial model the likelihood is based on the binomial distribution and has the following form (only terms depending on the parameter $\theta$ are shown): \[ \theta^{n_1} (1-\theta)^{n_2} \] The form of the beta prior is (again, only showing terms depending on $\theta$): \[ \theta^{\alpha_1-1} (1-\theta)^{\alpha_2-1} \] Since the posterior is proportional to the product of prior and likelihood the posterior will have exactly the same form as the prior: \[ \theta^{\alpha_1+n_1-1} (1-\theta)^{\alpha_2+n_2-1} \] Choosing the prior distribution from a family conjugate for the likelihood greatly simplifies Bayesian analysis since the Bayes formula can then be written in form of an update formula for the parameters of the beta distribution: \[ \alpha_1 \rightarrow \alpha_1 + n_1 = \alpha_1 + n \hat{\theta}_{ML} \] \[ \alpha_2 \rightarrow \alpha_2 + n_2 = \alpha_2 + n (1-\hat{\theta}_{ML}) \]
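As a rough numerical check of conjugacy, one can multiply prior and likelihood on a grid, normalise numerically, and compare the resulting posterior mean with the analytic one from the update formula. This is only a sketch: the grid size, the midpoint rule and the hyperparameter values are choices made for this example.

```python
# Hypothetical prior Beta(2, 6) and data with 7 successes, 3 failures
alpha1, alpha2 = 2.0, 6.0
n1, n2 = 7, 3

# Midpoint grid on (0, 1)
m = 20000
grid = [(i + 0.5) / m for i in range(m)]

# Unnormalised posterior = prior kernel times likelihood kernel
unnorm = [
    t ** (alpha1 - 1) * (1 - t) ** (alpha2 - 1) * t ** n1 * (1 - t) ** n2
    for t in grid
]

# Numerical normalising constant and posterior mean (midpoint rule)
z = sum(unnorm) / m
grid_mean = sum(t * u for t, u in zip(grid, unnorm)) / m / z

# Analytic posterior mean from the conjugate update
analytic_mean = (alpha1 + n1) / (alpha1 + alpha2 + n1 + n2)

print(grid_mean, analytic_mean)
```

With a conjugate prior the grid is unnecessary, which is exactly the simplification described above; for a non-conjugate prior a numerical approach of this kind is one possible fallback.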

Thus, conjugate prior distributions are very convenient choices. However, in their application it must be ensured that the prior distribution is flexible enough to encapsulate all prior information that may be available. Where this is not the case, alternative priors should be used (and most likely this will then require computing the posterior distribution numerically rather than analytically).

10.2.4 Large sample limits of mean and variance

If $n$ is large and $n \gg \alpha_1, \alpha_2$ then $\lambda \rightarrow 0$ and hence the posterior mean and variance become asymptotically

\[ \text{E}(\theta| D) \overset{a}{=} \frac{n_1 }{n} = \hat{\theta}_{ML} \] and \[ \text{Var}(\theta| D) \overset{a}{=} \frac{\hat{\theta}_{ML} (1-\hat{\theta}_{ML})}{n} \]

Thus, if the sample size is large then the Bayes estimator turns into the ML estimator! Specifically, the posterior mean becomes the ML point estimate, and the posterior variance is equal to the asymptotic variance computed via the observed Fisher information.

Thus, for large $n$ the data dominate and any details about the prior (such as the settings of the hyperparameters $\alpha_1$ and $\alpha_2$) become irrelevant!
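The large sample behaviour can be explored numerically. The sketch below holds the observed frequency fixed at a hypothetical value of 0.7, increases $n$, and records the posterior mean together with the ratio of the posterior variance to the asymptotic variance $\hat{\theta}_{ML}(1-\hat{\theta}_{ML})/n$. Prior and sample sizes are assumptions for illustration.

```python
alpha1, alpha2 = 2.0, 6.0   # hypothetical prior pseudocounts
theta_ml = 0.7              # observed frequency, held fixed as n grows


def posterior_summary(n):
    """Posterior mean and variance for n observations with frequency theta_ml."""
    n1 = theta_ml * n
    k1 = alpha1 + alpha2 + n
    mu1 = (alpha1 + n1) / k1
    return mu1, mu1 * (1 - mu1) / (k1 + 1)


rows = []
for n in (10, 100, 10_000, 1_000_000):
    mean, var = posterior_summary(n)
    asym_var = theta_ml * (1 - theta_ml) / n
    rows.append((n, mean, var / asym_var))
    print(f"n={n:>8}  posterior mean={mean:.6f}  variance ratio={var / asym_var:.6f}")
```

As $n$ grows the posterior mean approaches 0.7 and the variance ratio approaches one, regardless of the particular prior chosen here.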

10.2.5 Asymptotic normality of the posterior distribution

Also known as Bayesian Central Limit Theorem (CLT).

Under some regularity conditions (such as regular likelihood and positive prior probability for all parameter values, finite number of parameters, etc.) for large sample size the Bayesian posterior distribution converges to a normal distribution centred around the MLE and with the variance of the MLE:

\[ \text{for large $n$: } p(\boldsymbol \theta| D) \to N(\hat{\boldsymbol \theta}_{ML}, \text{Var}(\hat{\boldsymbol \theta}_{ML}) ) \]

So not only are the posterior mean and variance converging to the MLE and the variance of the MLE for large sample size, but also the posterior distribution itself converges to the sampling distribution!

This holds generally in many regular cases, not just in the simple case above.

The Bayesian CLT is generally known as the Bernstein-von Mises theorem, after Bernstein and von Mises, who discovered it around 1920–30, but special cases were already known to Laplace.

In Worksheet B1 the asymptotic convergence of the posterior distribution to a normal distribution is demonstrated graphically.
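A numerical counterpart can be sketched for the beta-binomial case. The code below compares the beta posterior density with the normal density $N(\hat{\theta}_{ML}, \hat{\theta}_{ML}(1-\hat{\theta}_{ML})/n)$ on a grid and reports the largest absolute difference relative to the peak of the normal density. The discrepancy measure, the grid and all numerical values are assumptions made for this illustration and are not taken from the worksheet.

```python
import math


def beta_pdf(t, a, b):
    """Density of Beta(a, b), computed on the log scale for stability."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(t) + (b - 1) * math.log(1 - t) - log_norm)


def normal_pdf(t, mean, var):
    """Density of N(mean, var)."""
    return math.exp(-0.5 * (t - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def max_relative_gap(n, alpha1=2.0, alpha2=6.0, theta_ml=0.7, m=2000):
    """Largest |posterior - normal approximation| relative to the normal peak."""
    n1 = theta_ml * n
    a, b = alpha1 + n1, alpha2 + (n - n1)   # posterior beta parameters
    var = theta_ml * (1 - theta_ml) / n     # asymptotic variance of the MLE
    grid = [(i + 0.5) / m for i in range(m)]
    peak = normal_pdf(theta_ml, theta_ml, var)
    gap = max(abs(beta_pdf(t, a, b) - normal_pdf(t, theta_ml, var)) for t in grid)
    return gap / peak


gaps = {n: max_relative_gap(n) for n in (10, 100, 10_000)}
for n, g in gaps.items():
    print(f"n={n:>6}  relative gap={g:.4f}")
```

The gap shrinks as $n$ increases, in line with the asymptotic normality of the posterior.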

10.2.6 Posterior variance for finite $n$

From the Bayesian posterior we can obtain a Bayesian point estimate for the proportion $\theta$ by computing the posterior mean \[ \text{E}(\theta | D) = \frac{\alpha_1+n_1}{k_1} = \hat{\theta}_{\text{Bayes}} \] along with the posterior variance \[ \text{Var}(\theta | D) = \frac{\hat{\theta}_{\text{Bayes}} (1-\hat{\theta}_{\text{Bayes}})}{k_1+1} \]

Asymptotically for large $n$ the posterior mean becomes the maximum likelihood estimate (MLE), and the posterior variance becomes the asymptotic variance of the MLE. Thus, for large $n$ the Bayesian point estimate will be indistinguishable from the MLE and shares its favourable properties.

In addition, for finite sample size the posterior variance will typically be smaller than both the asymptotic posterior variance (for large $n$) and the prior variance, showing that combining the information available in the prior and in the data leads to a more efficient estimate.

10.3 Estimating the mean using the normal-normal model

10.3.1 Normal likelihood

As in Example 3.2, where we estimated the mean parameter by maximum likelihood, we assume as data-generating model the normal distribution with unknown mean $\mu$ and known variance $\sigma^2$: \[ x \sim N(\mu, \sigma^2) \] We observe $n$ samples $D = \{x_1, \ldots, x_n\}$. Maximum likelihood then yields the estimate $\hat{\mu}_{ML} = \bar{x}$.

We note that the MLE $\hat\mu_{ML}$ is expressed as an average of the data points, which is what enables the linear shrinkage seen below.

10.3.2 Normal prior distribution

The normal distribution is the conjugate distribution for the mean parameter of a normal likelihood, so if we use a normal prior then the posterior for $\mu$ is normal as well.

To model the uncertainty about $\mu$ we use the normal distribution in the form $N(\mu, \sigma^2/k)$ with a mean parameter $\mu$ and a concentration parameter $k > 0$ (remember that $\sigma^2$ is given and is also used in the likelihood).

Specifically, we use as normal prior distribution for the mean \[ \mu \sim N\left(\mu_0, \frac{\sigma^2}{k_0}\right) \]

The prior concentration parameter is set to $k_0$
The prior mean parameter is set to $\mu_0$

Hence the prior mean is \[ \text{E}(\mu) = \mu_0 \] and the prior variance \[ \text{Var}(\mu) = \frac{\sigma^2}{k_0} \] where the concentration parameter $k_0$ corresponds to the implied sample size of the prior. Note that $k_0$ does not need to be an integer value.

10.3.3 Normal posterior distribution

After observing data $D$ the posterior distribution is also normal with updated parameters $\mu_1$ and $k_1$: \[ \mu | D \sim N\left(\mu_1, \frac{\sigma^2}{k_1}\right) \]

The posterior concentration parameter is updated to $k_1 = k_0 +n$
The posterior mean parameter is updated to \[ \mu_1 = \lambda \mu_0 + (1-\lambda) \hat\mu_{ML} \] with $\lambda = \frac{k_0}{k_1}$. This can be seen as linear shrinkage of $\hat\mu_{ML}$ towards the prior mean $\mu_0$.

(For a proof see Worksheet B2.)

The posterior mean is \[ \text{E}(\mu | D) = \mu_1 \] and the posterior variance is \[ \text{Var}(\mu | D) = \frac{\sigma^2}{k_1} \]
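The normal-normal update can be sketched in the same way as for the beta-binomial model. The function name, the data and the hyperparameters are hypothetical; the cross-check uses the familiar precision-weighted form of the posterior mean.

```python
def normal_normal_update(mu0, k0, x, sigma2):
    """Posterior mean parameter, concentration and variance for known sigma2."""
    n = len(x)
    mu_ml = sum(x) / n     # MLE of the mean
    k1 = k0 + n            # updated concentration parameter
    lam = k0 / k1          # shrinkage intensity
    mu1 = lam * mu0 + (1 - lam) * mu_ml
    return mu1, k1, sigma2 / k1


# Hypothetical data and hyperparameters
x = [4.8, 5.6, 5.1, 4.5, 5.0]
sigma2 = 4.0
mu0, k0 = 0.0, 5.0

mu1, k1, post_var = normal_normal_update(mu0, k0, x, sigma2)

# Cross-check: weight prior mean and MLE by their precisions
prior_prec = k0 / sigma2
data_prec = len(x) / sigma2
mu1_check = (prior_prec * mu0 + data_prec * sum(x) / len(x)) / (prior_prec + data_prec)

prior_var = sigma2 / k0
mle_var = sigma2 / len(x)
print(mu1, mu1_check, post_var, prior_var, mle_var)
```

Here the prior carries the same weight as the five observations, so the posterior mean sits halfway between the prior mean and the sample average.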

10.3.4 Large sample asymptotics

For $n$ large and $n \gg k_0$ the shrinkage intensity $\lambda \rightarrow 0$ and $k_1 \rightarrow n$. As a result \[ \text{E}(\mu | D) \overset{a}{=} \hat\mu_{ML} \] \[ \text{Var}(\mu | D) \overset{a}{=} \frac{\sigma^2}{n} \] i.e. we recover the MLE and its asymptotic variance!

Note that for finite $n$ the posterior variance $\frac{\sigma^2}{n+k_0}$ is smaller than both the asymptotic variance $\frac{\sigma^2}{n}$ of the MLE and the prior variance $\frac{\sigma^2}{k_0}$.

10.4 Estimating the variance using the inverse-gamma-normal model

10.4.1 Normal likelihood

As data generating model we use the normal distribution \[ x \sim N(\mu, \sigma^2) \] with unknown variance $\sigma^2$ and known mean $\mu$. This yields as maximum likelihood estimate for the variance \[ \widehat{\sigma^2}_{ML}= \frac{1}{n}\sum_{i=1}^n (x_i-\mu)^2 \]

Note that, again, the MLE is an average (of a quadratic function of the individual data points). This enables linear shrinkage of the MLE as seen below.

10.4.2 IG prior distribution

To model the uncertainty about the variance we use the inverse-gamma (IG) distribution, also known as inverse Wishart (IW) distribution (see Appendix for details of this distribution). The IG distribution is conjugate for the variance parameter in the normal likelihood, hence both the prior and the posterior distribution are IG.
As we use the Wishart parameterisation we may equally well call this an inverse Wishart (IW) prior, and the whole model the IW-normal model.

Specifically, as prior distribution for $\sigma^2$ we assume, in the mean parameterisation with mean parameter $\sigma^2_0$ and concentration parameter $\kappa_0$: \[ \sigma^2 \sim W^{-1}_1(\psi=\kappa_0 \sigma^2_0, \nu=\kappa_0+2) \]

The prior concentration parameter is set to $\kappa_0$
The prior mean parameter is set to $\sigma^2_0$

The corresponding prior mean is \[ \text{E}(\sigma^2) = \sigma^2_0 \] and the prior variance is \[ \text{Var}(\sigma^2) = \frac{2 \sigma_0^4}{\kappa_0-2} \] (note that $\kappa_0 > 2$ for the variance to exist)

10.4.3 IG posterior distribution

After observing $D = \{x_1, \ldots, x_n\}$ the posterior distribution is also IG with updated parameters: \[ \sigma^2| D \sim W^{-1}_1(\psi=\kappa_1 \sigma^2_1, \nu=\kappa_1+2) \]

The posterior concentration parameter is updated to $\kappa_1 = \kappa_0+n$
The posterior mean parameter update follows the standard linear shrinkage rule: \[ \sigma^2_1 = \lambda \sigma^2_0 + (1-\lambda) \widehat{\sigma^2}_{ML} \] with $\lambda=\frac{\kappa_0}{\kappa_1}$.

The posterior mean is \[ \text{E}(\sigma^2 | D) = \sigma^2_1 \] and the posterior variance \[ \text{Var}(\sigma^2 | D) = \frac{ 2 \sigma^4_1}{\kappa_1-2} \]
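The corresponding update for the variance can be sketched as follows. The function name, data and hyperparameters are hypothetical, and the known mean is taken to be zero for simplicity.

```python
def ig_normal_update(sigma2_0, kappa0, x, mu):
    """Posterior mean parameter, concentration and variance for known mean mu."""
    n = len(x)
    sigma2_ml = sum((xi - mu) ** 2 for xi in x) / n   # MLE of the variance
    kappa1 = kappa0 + n                               # updated concentration
    lam = kappa0 / kappa1                             # shrinkage intensity
    sigma2_1 = lam * sigma2_0 + (1 - lam) * sigma2_ml
    post_var = 2 * sigma2_1 ** 2 / (kappa1 - 2)       # requires kappa1 > 2
    return sigma2_1, kappa1, post_var, sigma2_ml


# Hypothetical data with known mean mu = 0 and prior with sigma2_0 = 1, kappa0 = 4
x = [1.0, -1.0, 2.0, -2.0]
sigma2_1, kappa1, post_var, sigma2_ml = ig_normal_update(1.0, 4.0, x, mu=0.0)

print(sigma2_ml, sigma2_1, kappa1, post_var)
```

As before, the posterior mean lies between the prior mean and the MLE, here exactly halfway because $\kappa_0 = n$.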

10.4.4 Large sample asymptotics

For large sample size $n$ with $n \gg \kappa_0$ the shrinkage intensity vanishes ($\lambda \rightarrow 0$) and therefore $\sigma^2_1 \rightarrow \widehat{\sigma^2}_{ML}$. We also find that $\kappa_1-2 \rightarrow n$.

This results in the asymptotic posterior mean \[ \text{E}(\sigma^2 | D) \overset{a}{=} \widehat{\sigma^2}_{ML} \] and the asymptotic posterior variance \[ \text{Var}(\sigma^2 | D) \overset{a}{=} \frac{2 (\widehat{\sigma^2}_{ML})^2}{n} \] Thus we recover the MLE of $\sigma^2$ and its asymptotic variance.

10.4.5 Other equivalent update rules

Above, the update rule from the prior to the posterior inverse gamma distribution is stated in the mean parameterisation:

$\kappa_0 \rightarrow \kappa_1 = \kappa_0+n$
$\sigma^2_0 \rightarrow \sigma^2_1 = \lambda \sigma^2_0 + (1-\lambda) \widehat{\sigma^2}_{ML}$ with $\lambda=\frac{\kappa_0}{\kappa_1}$

This has the advantage that the mean of the inverse gamma distribution is updated directly, and that the prior and posterior variances are also straightforward to compute.

The same update rule can also be expressed in terms of the other parameterisations. In terms of the conventional parameters $\alpha$ and $\beta$ of the inverse gamma distribution the update rule is

$\alpha_0 \rightarrow \alpha_1 = \alpha_0 +\frac{n}{2}$
$\beta_0 \rightarrow \beta_1 = \beta_0 + \frac{n}{2} \widehat{\sigma^2}_{ML} = \beta_0 + \frac{1}{2} \sum_{i=1}^n (x_i-\mu)^2$

For the parameters $\psi$ and $\nu$ of the univariate inverse Wishart distribution the update rule is

$\nu_0 \rightarrow \nu_1 = \nu_0 +n$
$\psi_0 \rightarrow \psi_1 = \psi_0 + n \widehat{\sigma^2}_{ML} = \psi_0 + \sum_{i=1}^n (x_i-\mu)^2$

For the parameters $\tau^2$ and $\nu$ of the scaled inverse chi-squared distribution the update rule is

$\nu_0 \rightarrow \nu_1 = \nu_0 +n$
$\tau^2_0 \rightarrow \tau^2_1 = \frac{\nu_0}{\nu_1} \tau^2_0 + \frac{n}{\nu_1} \widehat{\sigma^2}_{ML}$

(See Worksheet B3 for proof of equivalence of all above update rules.)
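
The equivalence can also be checked numerically for a single example. The sketch below converts the mean parameterisation into the other three and applies each update rule separately. The relations $\psi = \kappa \sigma^2$ and $\nu = \kappa + 2$ are those used above; the mappings $\alpha = \nu/2$, $\beta = \psi/2$ and $\tau^2 = \psi/\nu$ are assumed here as the usual correspondences between these distributions, and all numerical values are hypothetical.

```python
def from_mean_param(sigma2, kappa):
    """Convert (sigma2, kappa) into the other parameterisations."""
    psi, nu = kappa * sigma2, kappa + 2   # univariate inverse Wishart
    alpha, beta = nu / 2, psi / 2         # inverse gamma (assumed mapping)
    tau2 = psi / nu                       # scaled inverse chi-squared (assumed mapping)
    return {"psi": psi, "nu": nu, "alpha": alpha, "beta": beta, "tau2": tau2}


# Hypothetical prior and data summary
sigma2_0, kappa0 = 1.0, 4.0
n, sigma2_ml = 4, 2.5

# Update in the mean parameterisation, then convert
kappa1 = kappa0 + n
lam = kappa0 / kappa1
sigma2_1 = lam * sigma2_0 + (1 - lam) * sigma2_ml
via_mean = from_mean_param(sigma2_1, kappa1)

# Update each of the other parameterisations directly
p0 = from_mean_param(sigma2_0, kappa0)
nu1 = p0["nu"] + n
direct = {
    "alpha": p0["alpha"] + n / 2,
    "beta": p0["beta"] + n / 2 * sigma2_ml,
    "nu": nu1,
    "psi": p0["psi"] + n * sigma2_ml,
    "tau2": p0["nu"] / nu1 * p0["tau2"] + n / nu1 * sigma2_ml,
}

for name in direct:
    print(name, via_mean[name], direct[name])
```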

10.5 Estimating the precision using the gamma-normal model

10.5.1 MLE of the precision

Instead of estimating the variance $\sigma^2$ we may wish to estimate the precision $w = 1/\sigma^2$, i.e. the inverse of the variance.

As above, the data generating model is a normal distribution \[ x \sim N(\mu, 1/w) \] with unknown precision $w$ and known mean $\mu$. This yields as maximum likelihood estimate (easily derived thanks to the invariance principle) \[ \hat{w}_{ML} = \frac{ 1}{\widehat{\sigma^2}_{ML} } = \frac{1}{\frac{1}{n}\sum_{i=1}^n (x_i-\mu)^2} \] Crucially, the MLE of the precision $w$ is not an average itself (instead, it is a function of an average). As a consequence, as seen below, the posterior mean of $w$ cannot be written as a linear adjustment of the MLE.

10.5.2 Gamma (Wishart) prior

For modelling the variance we have used an inverse gamma (inverse Wishart) distribution for the prior and posterior distributions. To model the precision we therefore now use a gamma (Wishart) distribution.

Specifically, we use the Wishart distribution in the mean parameterisation (see Appendix): \[ w \sim W_1(s^2 = w_0/k_0, k=k_0) \]

The prior concentration parameter is set to $k_0$
The prior mean parameter is set to $w_0$

The corresponding prior mean is \[ \text{E}(w) = w_0 \] and the prior variance is \[ \text{Var}(w) = 2 w_0^2/ k_0 \]

10.5.3 Gamma / Wishart posterior

After observing $D = \{x_1, \ldots, x_n\}$ the posterior distribution is also a gamma (Wishart) distribution with updated parameters:

\[ w | D \sim W_1(s^2 = w_1/k_1, k=k_1) \]

The posterior concentration parameter is updated to $k_1 = k_0+n$
The posterior mean parameter is updated according to \[ \frac{1}{w_1} = \lambda \frac{1}{w_0} + (1-\lambda) \frac{1}{\hat{w}_{ML}} \] with $\lambda = \frac{k_0}{k_1}$. Crucially, the linear update is applied to the inverse of the precision but not to the precision itself. This is because the MLE of the precision parameter cannot be expressed as an average.
Equivalent update rules are, for the inverse of the scale parameter $s^2$, \[ \frac{1}{s^2_1} = \frac{1}{s^2_0} + n \widehat{\sigma^2}_{ML} \] and, for the rate parameter $\beta = 1/(2 s^2)$ of the gamma distribution, \[ \beta_1 = \beta_0 + \frac{n}{2} \widehat{\sigma^2}_{ML} \] The latter is the form you will find most often in textbooks.

The posterior mean is \[ \text{E}(w | D) = w_1 \] and the posterior variance \[ \text{Var}(w | D) = 2 w_1^2/ k_1 \]
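
The precision update can be sketched as below. The function name and all numbers are hypothetical. The sketch also computes the posterior mean via the rate parameter, using shape $k_1/2$ and rate $\beta_1$ (with $\beta = 1/(2 s^2)$ and $s^2 = w/k$ as above), and contrasts the result with a naive linear shrinkage of $\hat{w}_{ML}$, which gives a different value.

```python
def gamma_normal_update(w0, k0, x, mu):
    """Posterior mean parameter, concentration and variance for the precision."""
    n = len(x)
    sigma2_ml = sum((xi - mu) ** 2 for xi in x) / n   # equals 1 / w_ml
    k1 = k0 + n
    lam = k0 / k1
    # Linear shrinkage is applied to the inverse of the precision
    w1 = 1.0 / (lam / w0 + (1 - lam) * sigma2_ml)
    return w1, k1, 2 * w1 ** 2 / k1, sigma2_ml


# Hypothetical data with known mean mu = 0 and prior with w0 = 1, k0 = 4
x = [1.0, -1.0, 2.0, -2.0]
w0, k0 = 1.0, 4.0
w1, k1, post_var, sigma2_ml = gamma_normal_update(w0, k0, x, mu=0.0)

# Cross-check via the rate parameter of the gamma distribution
beta0 = k0 / (2 * w0)                     # beta = 1 / (2 s^2) with s^2 = w0 / k0
beta1 = beta0 + len(x) / 2 * sigma2_ml
w1_check = (k1 / 2) / beta1               # gamma mean = shape / rate

# Naive linear shrinkage of the precision MLE, shown only for contrast
lam = k0 / k1
w1_naive = lam * w0 + (1 - lam) / sigma2_ml

print(w1, w1_check, w1_naive, post_var)
```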

Statistical Methods: Likelihood, Bayes and Regression