13 Optimality properties and summary

13.1 Bayesian statistics in a nutshell

  • Bayesian statistics explicitly models the uncertainty about the parameters of interest by probability distributions.
  • In the light of new evidence (observed data) the uncertainty is updated, i.e. the prior distribution is combined via Bayes’ rule with the likelihood to form the posterior distribution.
  • If the posterior distribution is in the same family as the prior \(\rightarrow\) conjugate prior.
  • In an exponential family the Bayesian update of the mean is always expressible as linear shrinkage of the MLE towards the prior mean (illustrated in the sketch after this list).
  • For large sample size the posterior mean approaches the maximum likelihood estimator and the prior plays no role.
  • Conversely, for small sample size (or if no data are available at all) the posterior stays close to the prior.
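
As a minimal sketch of the conjugate update and its shrinkage form, consider the Beta-Binomial model. The prior hyperparameters and the data below are chosen purely for illustration.

```python
import numpy as np

# Conjugate Beta-Binomial update (illustrative numbers only).
a, b = 3.0, 3.0          # Beta(a, b) prior, implicit prior sample size k0 = a + b
n, x = 20, 14            # data: x successes in n Bernoulli trials

# Conjugacy: the posterior is again a Beta distribution, Beta(a + x, b + n - x).
post_a, post_b = a + x, b + n - x

mle = x / n                              # maximum likelihood estimate
prior_mean = a / (a + b)
post_mean = post_a / (post_a + post_b)

# The posterior mean is a linear shrinkage of the MLE towards the prior mean;
# the shrinkage weight vanishes as n grows, so the prior washes out asymptotically.
lam = (a + b) / (a + b + n)
assert np.isclose(post_mean, lam * prior_mean + (1 - lam) * mle)

print(f"MLE = {mle:.3f}, prior mean = {prior_mean:.3f}, posterior mean = {post_mean:.3f}")
```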

13.1.1 Advantages

  • Adding prior information has regularisation properties. This is very important in more complex models with many parameters, e.g. in the estimation of a covariance matrix (to avoid singularity; see the sketch after this list).
  • Improves small-sample accuracy (e.g. in terms of MSE).
  • Bayesian estimators tend to perform better than MLEs, which is not surprising as they use the observed data plus the extra information available in the prior.
  • Bayesian credible intervals are conceptually much simpler than frequentist confidence intervals.
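
To illustrate the regularisation point for covariance estimation, here is a small sketch (the dimensions and the fixed shrinkage weight are arbitrary choices, not a recommended estimator): linearly shrinking the singular sample covariance towards a diagonal target yields an invertible estimate even when there are more variables than observations.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 50, 20                        # more variables (p) than observations (n)
X = rng.normal(size=(n, p))

S = np.cov(X, rowvar=False)          # sample covariance: rank <= n - 1 < p, hence singular
T = np.diag(np.diag(S))              # diagonal shrinkage target

lam = 0.3                            # shrinkage weight (in practice a tuned hyperparameter)
S_shrunk = (1 - lam) * S + lam * T   # positive definite: T is, and S is positive semi-definite

print("rank of S       :", np.linalg.matrix_rank(S), "of", p)
print("rank of S_shrunk:", np.linalg.matrix_rank(S_shrunk), "of", p)
```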

13.1.2 Frequentist properties of Bayesian estimators

A Bayesian point estimator (e.g. the posterior mean) can also be assessed by its frequentist properties.

  • First, by construction, introducing a prior means that the Bayesian estimator will in general be biased for finite \(n\) even if the MLE is unbiased.
  • Second, intriguingly it turns out that the sampling variance of the Bayes point estimator (not to be confused with the posterior variance!) can be smaller than the variance of the MLE. This depends on the choice of the shrinkage parameter \(\lambda\) that also determines the posterior variance.

As a result, Bayesian estimators may have smaller MSE (=squared bias + variance) than the ML estimator for finite \(n\).
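
A quick simulation sketch illustrates this for the binomial proportion (the true proportion, prior and sample size below are arbitrary choices): the posterior mean is biased but has a smaller sampling variance than the MLE, and in this setting also a smaller MSE.

```python
import numpy as np

rng = np.random.default_rng(1)

theta = 0.3                    # true success probability (illustrative)
n = 10                         # small sample size
a, b = 2.0, 2.0                # Beta(2, 2) prior; posterior mean = (x + a) / (n + a + b)
reps = 100_000                 # number of simulated data sets

x = rng.binomial(n, theta, size=reps)
mle = x / n                    # unbiased, but larger sampling variance
bayes = (x + a) / (n + a + b)  # biased shrinkage estimator, smaller sampling variance

for name, est in [("MLE", mle), ("Bayes", bayes)]:
    bias = est.mean() - theta
    var = est.var()
    mse = np.mean((est - theta) ** 2)
    print(f"{name:5s}  bias = {bias:+.4f}  variance = {var:.4f}  MSE = {mse:.4f}")
```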

In statistical decision theory this is known as the complete class theorem (closely linked to the admissibility of Bayes rules). It states that under mild conditions every admissible estimation rule (i.e. one that is not dominated by any other estimator with regard to some expected loss, such as the MSE) is in fact a Bayes estimator, or a limit of Bayes estimators, with some prior.

Unfortunately, this theorem does not tell us which prior is needed to achieve optimality; however, a well-performing estimator can often be found by tuning the hyperparameters of the prior.

13.1.3 Specifying the prior — problem or advantage?

In Bayesian statistics the data analyst needs to be very explicit about the modelling assumptions:

Model = data generating process (likelihood) + prior uncertainty (prior distribution)

Note that alternative statistical methods can often be interpreted as Bayesian methods assuming a specific implicit prior!

For example, maximum likelihood estimation for the binomial model is equivalent to Bayes estimation using the Beta-Binomial model with a \(\text{Beta}(0,0)\) prior (=Haldane prior).
However, when choosing a prior explicitly for this model, interestingly most analysts would rather use a flat prior \(\text{Beta}(1,1)\) (=Laplace prior) with implicit sample size \(k_0=2\), or a transformation-invariant prior \(\text{Beta}(1/2, 1/2)\) (=Jeffreys prior) with implicit sample size \(k_0=1\), than the Haldane prior!
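
As a small numerical illustration (the data below are made up), the posterior mean under a \(\text{Beta}(a, b)\) prior is \((x+a)/(n+a+b)\), so the three priors differ only in the pseudo-counts they add to the observed data.

```python
# Posterior means for the binomial model under the three priors discussed above.
# Data (x successes in n trials) chosen purely for illustration.
n, x = 10, 7

priors = {
    "Haldane  Beta(0, 0)":     (0.0, 0.0),   # posterior mean equals the MLE x/n (for 0 < x < n)
    "Jeffreys Beta(1/2, 1/2)": (0.5, 0.5),
    "Laplace  Beta(1, 1)":     (1.0, 1.0),
}

for name, (a, b) in priors.items():
    k0 = a + b                                # implicit prior sample size
    post_mean = (x + a) / (n + a + b)
    print(f"{name}:  k0 = {k0:.1f},  posterior mean = {post_mean:.4f}")
```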

\(\rightarrow\) be aware of the implicit priors!

It is better to acknowledge that a prior is being used (even if only implicitly!).
Being specific about all your assumptions is enforced by the Bayesian approach.

Specifying a prior is thus best understood as an intrinsic part of model specification. It helps to improve inference and may only be ignored if there is a lot of data.

13.2 Optimality of Bayesian inference

The optimality of the Bayesian approach, which makes use of the full model specification (likelihood plus prior), can be shown from a number of different perspectives. Correspondingly, there are many theorems that prove (or at least indicate) this optimality:

  1. Richard Cox’s theorem: generalising classical TRUE/FALSE logic to degrees of plausibility invariably leads to probability theory and Bayesian inference.

  2. de Finetti’s representation theorem: the joint distribution of exchangeable observations can always be expressed as a mixture of i.i.d. models, weighted by a prior distribution over the model parameter (written out after this list). This implies the existence of the prior distribution and the need for a Bayesian approach.

  3. Frequentist decision theory: all admissible decision rules are (limits of) Bayes rules!

  4. Entropy perspective: The posterior density (a function!) is obtained as a result of optimising an entropy criterion. Bayesian updating may thus be viewed as a variational optimisation problem. Specifically, Bayes theorem is the minimal update when new information arrives in form of observations (see below).
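
For reference, in the classic case of an infinite exchangeable sequence of binary observations de Finetti’s representation reads

\[
P(x_1, \ldots, x_n) = \int_0^1 \prod_{i=1}^n \theta^{x_i} (1-\theta)^{1-x_i} \, dF(\theta) ,
\]

where the mixing distribution \(F\) over the parameter \(\theta\) plays the role of the prior.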

Remark: there are a number of further (often somewhat esoteric) proposals for propagating uncertainty, such as “fuzzy logic”, imprecise probabilities, etc. These contradict Bayesian learning and are thus in direct violation of the above theorems.

13.3 Connection with entropy learning

The Bayesian update rule is a very general form of learning when the new information arrives in the form of data. But actually there is an even more general principle of which the Bayesian update rule is just a special case: the principle of minimal information update (e.g. Jaynes 1959, 2003), or the principle of minimum discrimination information (MDI) (Kullback 1959).

It can be summarised as follows: Change your beliefs only as much as necessary to be coherent with new evidence!

Under this principle of “inertia of beliefs”, when new information arrives the uncertainty about a parameter is adjusted only as much as needed to account for the new information. To implement this principle the KL divergence is a natural measure to quantify the change of the underlying beliefs. This is known as entropy learning.

Bayes’ rule emerges as a special case of entropy learning:

  • The KL divergence between the joint posterior \(Q_{x,\boldsymbol \theta}\) and the joint prior distribution \(P_{x,\boldsymbol \theta}\) is considered, with the posterior distribution \(Q_{\boldsymbol \theta|x}\) as a free parameter.
  • The conditional distribution \(Q_{\boldsymbol \theta|x}\) is found by minimising the KL divergence \(D_{\text{KL}}(Q_{x,\boldsymbol \theta}, P_{x,\boldsymbol \theta})\).
  • The optimal solution to this variational optimisation problem is given by Bayes’ rule!
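
In more detail, the chain rule of the KL divergence gives

\[
D_{\text{KL}}(Q_{x,\boldsymbol \theta}, P_{x,\boldsymbol \theta})
= D_{\text{KL}}(Q_{x}, P_{x})
+ \text{E}_{Q_x}\!\left[ D_{\text{KL}}(Q_{\boldsymbol \theta|x}, P_{\boldsymbol \theta|x}) \right] .
\]

With the data marginal \(Q_x\) fixed by the observations, the first term does not involve \(Q_{\boldsymbol \theta|x}\), and the second term is non-negative and vanishes exactly for \(Q_{\boldsymbol \theta|x} = P_{\boldsymbol \theta|x}\), i.e. for the conditional distribution given by Bayes’ theorem.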

This application of the KL divergence is an example of reverse KL optimisation (aka \(I\)-projection, see Part I of the notes). Intriguingly, this explains the zero-forcing property of Bayes’ rule (as this is a general property of an \(I\)-projection).

Applying entropy learning therefore includes Bayesian learning as a special case:

  1. If information arrives in the form of data \(\rightarrow\) update the prior by Bayes’ theorem (Bayesian learning).

Interestingly, entropy learning will lead to other update rules for other types of information:

  1. If information arrives in the form of another distribution \(\rightarrow\) update using R. Jeffrey’s rule of conditioning (1965).

  2. If the information is presented in the form of constraints \(\rightarrow\) Kullback’s principle of minimum discrimination information (MDI) (1959), E. T. Jaynes’ maximum entropy (MaxEnt) principle (1957).
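
For example, Jeffrey’s rule of conditioning (written here for a discrete partition of events \(E_1, \ldots, E_m\) whose probabilities are shifted by the new evidence from \(P(E_i)\) to values \(q_i\)) updates the belief about \(\boldsymbol \theta\) as

\[
Q(\boldsymbol \theta) = \sum_{i=1}^m P(\boldsymbol \theta \mid E_i) \, q_i ,
\]

which reduces to standard Bayesian conditioning when one of the \(q_i\) equals 1.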

This shows (again) how fundamentally important KL divergence is in statistics. It not only leads to likelihood inference (via forward KL) but also to Bayesian learning, as well as to other forms of information updating (via reverse KL).

Furthermore, in Bayesian statistics relative entropy is useful to choose priors (e.g. reference priors) and it also helps in (Bayesian) experimental design to quantify the information provided by an experiment.

13.4 Conclusion

Bayesian statistics offers a coherent framework for statistical learning from data, with methods for

  • estimation
  • testing
  • model building

There are a number of theorems that show that “optimal” estimators (defined in various ways) are all Bayesian.

It is conceptually very simple — but can be computationally very involved!

It provides a coherent generalisation of classical TRUE/FALSE logic (and therefore does not suffer from some of the inconsistencies prevalent in frequentist statistics).

Bayesian statistics is a non-asymptotic theory: it works for any sample size. Asymptotically (large \(n\)) it is consistent and converges to the true model (like ML!). But Bayesian reasoning can also be applied to events that take place only once, since no assumption of hypothetical infinitely many repetitions (as in frequentist statistics) is needed.

Moreover, many classical (frequentist) procedures may be viewed as approximations to Bayesian methods and estimators, so using classical approaches in the correct application domain is perfectly in line with the Bayesian framework.

Bayesian estimation and inference also automatically regularises (via the prior) which is important for complex models and when there is the problem of overfitting.