2  Distributions

Choosing appropriate distributions for statistical modelling is a crucial aspect of probabilistic data analysis. This chapter explores various factors to consider when selecting suitable distributions and also lists frequently used families.

2.1 Characteristic features

Distributions can be differentiated by a number of characteristics.

Firstly, by the type of random variable:

  • discrete versus continuous
  • univariate versus multivariate

Secondly, by the support of the random variable, with typical ranges such as:

  • finite discrete support, e.g. \(\{1, 2, \ldots, n\}\)
  • infinite discrete support, e.g. \(\{1, 2, \ldots\}\)
  • \([0,1]\)
  • \((-\infty, \infty)\)
  • \([0, \infty)\)

The choice of support will depend on the intended use of the random variable in the model. Common applications include:

  • proportion
  • location
  • scale
  • mean
  • variance
  • spread
  • concentration
  • shape
  • rate
  • (squared) correlation

These interpretations apply not only to the random variable itself but also to the parameters of a distribution family. For instance, we might select a distribution whose outcomes can be interpreted as proportions (e.g. the beta distribution). Alternatively, we might choose a distribution family in which a parameter itself represents a proportion (e.g. the Bernoulli distribution).

A third consideration may be the general shape of the distribution:

  • symmetric or asymmetric
  • left or right skewed
  • short tails or long tails
  • unimodal or multimodal

A further characteristic of a distribution family is the number of parameters, with choices such as:

  • single parameter
  • multiple parameters
  • multiple types of parameters (e.g. location+scale)
  • nonparametric model (i.e. in effect a highly parametrised model)

A distribution family consists of a finite or infinite set of distributions that correspond to specific instances of parameter values.

Models with many parameters or nonparametric models are employed when there is an abundance of data and they typically make weak assumptions about the data-generating process. Conversely, simpler parametric models are generally preferred for smaller sample sizes, and typically make stronger assumptions about how the data were generated.

In data analysis, we aim to build models that are complex enough to capture the essential features of the data, but simple enough to avoid overfitting. Consequently, when two models have similar explanatory and predictive power, the one with fewer parameters is generally preferred.

2.2 Commonly used distribution families

In this module we will often make use of the following common univariate distributions:

  1. Binomial distribution \(\operatorname{Bin}(n, \theta)\), with support \(\{0, 1, \ldots, n\}\).

    A special case (\(n=1\)) is:

    • Bernoulli distribution \(\operatorname{Ber}(\theta)\), with support \(\{0, 1\}\).
  2. Beta distribution \(\operatorname{Beta}(\alpha, \beta)\), with support \([0, 1]\).

  3. Normal distribution \(N(\mu, \sigma^2)\), with support \((-\infty, \infty)\).

  4. Gamma distribution \(\operatorname{Gam}(\alpha, \theta)\), with support \([0, \infty)\). It is also known as the univariate Wishart distribution \(\operatorname{Wis}\left(s^2, k \right)\).

    Special cases of the gamma/Wishart distribution are:

    • scaled chi-squared distribution \(s^2 \chi^2_{k}\) (integer \(k\))
    • chi-squared distribution \(\chi^2_{k}\) (integer \(k\), \(s^2=1\))
    • exponential distribution \(\operatorname{Exp}(\theta)\) (\(\alpha=1\))
  5. Inverse gamma distribution \(\operatorname{IG}(\alpha, \beta)\), with support \([0, \infty)\). Also known as the univariate inverse Wishart distribution \(\operatorname{IW}(\psi, k)\).

All of the distributions above belong to the class of exponential families (see Section 2.4). These families have many properties that are highly convenient for statistical analysis.

  6. Location-scale \(t\)-distribution \(t_{\nu}(\mu, \tau^2)\), with support \((-\infty, \infty)\).

    Special cases of the location-scale \(t\)-distribution are:

    • Student’s \(t\)-distribution \(t_\nu\)
    • Cauchy distribution \(\operatorname{Cau}(\mu, \tau)\)

    The location-scale \(t\)-distribution is a generalisation of the normal distribution but with more probability mass in the tails. Depending on the choice of the degrees of freedom \(\nu\), not all moments of the distribution may exist. Furthermore, it is not an exponential family.

For all of the above univariate distributions there exist corresponding multivariate variants. In this module we will make use of the following multivariate distributions:

  1. Multinomial distribution \(\operatorname{Mult}(n, \boldsymbol \theta)\), generalising the binomial distribution.

    A special case (\(n=1\)) is:

    • Categorical distribution \(\operatorname{Cat}(\boldsymbol \theta)\), generalising the Bernoulli distribution.
  2. Multivariate normal distribution \(N_d(\boldsymbol \mu, \boldsymbol \Sigma)\), generalising the univariate normal distribution.
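To make these families concrete, here is a minimal sketch that draws samples from several of them using scipy.stats. Note that the argument names and default parametrisations (e.g. shape and scale for the gamma) are scipy conventions and may differ from the notation used in this module.

```python
# Sampling from several of the families listed above (scipy conventions).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

x_bin  = stats.binom.rvs(n=10, p=0.3, size=5, random_state=rng)        # {0,...,10}
x_beta = stats.beta.rvs(a=2.0, b=5.0, size=5, random_state=rng)        # [0, 1]
x_norm = stats.norm.rvs(loc=0.0, scale=1.0, size=5, random_state=rng)  # real line
x_gam  = stats.gamma.rvs(a=2.0, scale=1.0, size=5, random_state=rng)   # [0, inf)
x_t    = stats.t.rvs(df=3, loc=0.0, scale=1.0, size=5, random_state=rng)  # heavy tails

# multivariate normal N_d(mu, Sigma)
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
x_mvn = stats.multivariate_normal.rvs(mean=mu, cov=Sigma, size=5, random_state=rng)
```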

A distribution family can be parametrised in multiple equivalent ways. Typically, there is a standard parametrisation, and also a mean parametrisation, where one of the parameters can be interpreted as the mean of the data. Sometimes, the same distribution is referred to by different names and there are various default parametrisations.

Importantly, any parametrisation is a matter of choice and simply provides an alternative means to index the elementary distributions within the family. However, certain parametrisations may be more interpretable or offer computational advantages.
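As a small illustration of equivalent parametrisations, the sketch below treats \(\theta\) in \(\operatorname{Gam}(\alpha, \theta)\) as a rate (an assumption made for this example); scipy.stats.gamma itself uses a shape/scale convention, so the rate must be inverted.

```python
# One gamma distribution, indexed in two equivalent ways:
# (shape alpha, rate theta) versus (shape alpha, scale 1/theta).
import numpy as np
from scipy import stats

alpha, rate = 2.0, 4.0
scale = 1.0 / rate            # same distribution, different parametrisation

x = np.linspace(0.01, 2.0, 5)
pdf = stats.gamma.pdf(x, a=alpha, scale=scale)

# the mean is alpha/rate in one parametrisation, alpha*scale in the other
print(alpha / rate, alpha * scale, stats.gamma.mean(a=alpha, scale=scale))
```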

2.3 Model building

Choosing the right distribution

When choosing a distribution we typically aim to align the characteristics of the distribution with those of the observations. For instance, if the data exhibit long tails, we will need to use a long-tailed model. Additionally, there may be a mechanistic rationale, such as a physical law, suggesting that the underlying process follows a particular model.

In many cases, the central limit theorem justifies using a normal distribution.
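A minimal simulation sketch of this idea: standardised sums of i.i.d. draws from a skewed distribution (here the exponential) are already close to a standard normal for moderate sample sizes.

```python
# Central limit theorem in action: standardised sums of Exp(1) draws.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 100, 10_000

sums = rng.exponential(scale=1.0, size=(reps, n)).sum(axis=1)
z = (sums - n) / np.sqrt(n)      # Exp(1) has mean 1 and variance 1

print(np.mean(z), np.var(z))               # approx 0 and 1
print(stats.kstest(z, "norm").statistic)   # small distance to N(0,1)
```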

Another approach to selecting a distribution family is to fix certain properties of the distribution, such as its mean and variance, and then select the family that maximises the spread of the probability mass. This method is closely linked to the principle of maximum entropy, which will be discussed in more detail in Chapter 6. It also helps to explain why exponential families are often preferred in statistical modelling.

Complex models

Statistical analysis often uses models that are composed of many random variables. These models can be complex, with hierarchical or network-like structures that connect observed and latent variables, and potentially nonlinear functional relationships. Even so, the most sophisticated statistical models are constructed from simpler, more fundamental components.

Specifically, the large class of graphical models provides a principled means to form complex joint distributions for observed and unobserved random variables from more elementary components. These include regression models, mixture models and compound models (continuous versions of mixture models) as well as more general network-like and hierarchically structured models.

In these complex models some of the underlying elementary distributions will serve to model the observed output while others represent internal variables or account for the uncertainty regarding a parameter (in a Bayesian context).

In statistical course units in year 3 and year 4 you will discuss and learn about many types of advanced models, related for instance to

  • multivariate statistics and machine learning
  • temporal and spatial modelling, and
  • generalised linear and nonparametric models.

Iterative refinement

Models and distributions are best considered as approximations of the true unknown data-generating process. The aim of data analysis is therefore to find models that capture the essential properties at an appropriate level of detail¹.

This is typically done in an iterative fashion, either starting from a simple model and increasing complexity as needed, or alternatively, starting from a highly parametrised model and simplifying it. In either approach, systematic methods are required to compare different models, to assess their fit to the data, and to evaluate how well they predict future observations. Statistics provides principled tools for all these tasks.

2.4 \(\color{Red} \blacktriangleright\) Exponential families

Overview

Many commonly used distributions in statistics are exponential families, including core examples such as the Bernoulli distribution and the normal distribution.

Exponential families are central in probability and statistics. They support effective statistical learning using likelihood and Bayesian approaches, enable data reduction via minimal sufficiency, and provide the basis for generalised linear models. Furthermore, exponential families often make it possible to generalise results established for specific cases, such as the normal distribution, to a broader domain.

Definition

An exponential family \(P(\boldsymbol \eta)\) arises by exponential tilting of a base distribution \(B\) with (typically unnormalised) base function \(h(x)\) toward the linear combination \(\boldsymbol \eta^T \boldsymbol t(x)\) of the canonical statistics \(\boldsymbol t(x)\) and the canonical parameters \(\boldsymbol \eta\). This yields a pdmf of the form \[ p(x|\boldsymbol \eta) = \underbrace{e^{ \langle \boldsymbol \eta, \boldsymbol t(x) \rangle }}_{\text{exponential tilt}}\, h(x) \, /\, z(\boldsymbol \eta) \]

The base pdmf is obtained at \(\boldsymbol \eta=0\) yielding \(b(x) = p(x | \boldsymbol \eta=0) = h(x) / z(0)\). If \(h(x)\) is already a normalised pdmf then \(z(0)=1\) and \(b(x)=h(x)\).
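The following numerical sketch illustrates the tilting construction on a (truncated) discrete support. The base function \(h(x) = 1/x!\) with canonical statistic \(t(x) = x\) is chosen purely for illustration; tilting it yields the Poisson family with rate \(\lambda = e^\eta\) and partition function \(z(\eta) = e^{e^\eta}\).

```python
# Exponential tilting of the base function h(x) = 1/x! with t(x) = x.
import numpy as np
from scipy.special import factorial

x = np.arange(0, 50)            # truncated support; tail mass is negligible
h = 1.0 / factorial(x)

def tilted_pmf(eta):
    w = np.exp(eta * x) * h     # exponential tilt of the base function
    z = w.sum()                 # partition function z(eta)
    return w / z, z

pmf, z = tilted_pmf(0.7)
print(z, np.exp(np.exp(0.7)))   # numeric z(eta) vs closed form exp(e^eta)

# base pdmf at eta = 0: b(x) = h(x)/z(0) = e^{-1}/x!, i.e. Poisson(1)
pmf0, z0 = tilted_pmf(0.0)
print(z0, np.e)                 # z(0) = e, so h alone is unnormalised
```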

The above presentation of exponential families assumes a univariate random variable (scalar \(x\)) but also applies to multivariate random variables (vector \(\boldsymbol x\) or matrix \(\boldsymbol X\)).

Likewise, canonical statistics and parameters are written as vectors but these may also be scalars or matrices (or a combination of both). The use of inner product notation \(\langle \cdot, \cdot \rangle\) includes all these cases, vectorising matrices as required, recalling that \(\langle \boldsymbol A, \boldsymbol B\rangle = \operatorname{Tr}( \boldsymbol A^T \boldsymbol B) = \operatorname{Vec}(\boldsymbol A)^T \operatorname{Vec}(\boldsymbol B)\).
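The trace/vectorisation identity is easy to verify numerically:

```python
# Check <A, B> = Tr(A^T B) = Vec(A)^T Vec(B) for random matrices.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(3, 4))

trace_form = np.trace(A.T @ B)
vec_form = A.ravel() @ B.ravel()   # flatten both the same way, then dot
print(np.isclose(trace_form, vec_form))   # True
```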

Canonical statistics and canonical parameters

The canonical statistics \(\boldsymbol t(x)\) are transformations of \(x\), usually simple functions such as the identity (\(x\)), the square (\(x^2\)), the inverse (\(1/x\)) or the logarithm (\(\log x\)). Typically, the dimension of \(\boldsymbol t(x)\) is small.

The canonical statistics \(\boldsymbol t(x)\) may be affinely dependent. If this is the case, there is a nonzero vector \(\boldsymbol \eta_0\) for which
\[ \langle \boldsymbol \eta_0, \boldsymbol t(x) \rangle = \text{const.} \] If the elements in \(\boldsymbol t(x)\) are affinely independent the representation of the exponential family is minimal or complete, otherwise the representation is non-minimal or overcomplete.

For each canonical statistic there is a corresponding canonical parameter so the dimensions and shape of \(\boldsymbol t(x)\) and \(\boldsymbol \eta\) match.

In a minimal representation the canonical parameters of the exponential family are identifiable and hence distinct parameter settings for \(\boldsymbol \eta\) yield distinct distributions. Conversely, in a non-minimal or overcomplete representation there are redundant elements in the canonical parameters \(\boldsymbol \eta\) and the distributions within the exponential family are not identifiable. Specifically, there will be multiple \(\boldsymbol \eta\) yielding the same underlying distribution.

The canonical parameters \(\boldsymbol \eta\) are typically some transformation of the conventional parameters \(\boldsymbol \theta\).

Partition function

The normaliser or partition function \(z(\boldsymbol \eta)\) ensures that \(p(x|\boldsymbol \eta)\) integrates to one, with \[ z(\boldsymbol \eta) = \int_x \, e^{ \langle \boldsymbol \eta, \boldsymbol t(x) \rangle}\, h(x) \, dx \] For discrete \(x\), replace the integral by a sum.

The set of values of \(\boldsymbol \eta\) for which \(z(\boldsymbol \eta) < \infty\), and hence for which \(p(x|\boldsymbol \eta)\) is well defined, comprises the parameter space of the exponential family. Some choices of \(h(x)\) and \(\boldsymbol t(x)\) do not yield a finite normalising factor for any \(\boldsymbol \eta\) and hence these cannot be used to form an exponential family.
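As a sketch of how the partition function can be obtained numerically, the code below integrates \(e^{\eta_1 x + \eta_2 x^2}\) with \(h(x) = 1\), i.e. the normal family whose closed form appears in Example 2.2. The integral is finite only for \(\eta_2 < 0\), which is exactly the parameter space constraint discussed above.

```python
# Numerical partition function for h(x) = 1 and t(x) = (x, x^2).
import numpy as np
from scipy.integrate import quad

def z(eta1, eta2):
    # z(eta) = integral over the real line of exp(eta1*x + eta2*x^2)
    val, _ = quad(lambda x: np.exp(eta1 * x + eta2 * x**2), -np.inf, np.inf)
    return val

eta1, eta2 = 1.0, -0.5
closed_form = np.sqrt(-np.pi / eta2) * np.exp(-eta1**2 / (4 * eta2))
print(z(eta1, eta2), closed_form)   # agree; for eta2 >= 0 the integral diverges
```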

The log-normaliser or log-partition function \[ a(\boldsymbol \eta) = \log z(\boldsymbol \eta) \] allows us to compute the cumulants of the canonical statistics. In particular, its gradient yields the mean \[ \begin{split} \operatorname{E}( \boldsymbol t(x) ) = \boldsymbol \mu_{\boldsymbol t} & = \nabla a(\boldsymbol \eta)\\ &= \frac{\nabla z(\boldsymbol \eta)}{z(\boldsymbol \eta)} \end{split} \] and its Hessian matrix yields the variance \[ \begin{split} \operatorname{Var}( \boldsymbol t(x) ) = \boldsymbol \Sigma_{\boldsymbol t} & = \nabla \nabla^T a(\boldsymbol \eta)\\ &= \frac{\nabla \nabla^T z(\boldsymbol \eta)}{z(\boldsymbol \eta)} - \left(\frac{\nabla z(\boldsymbol \eta)}{z(\boldsymbol \eta)}\right) \left(\frac{\nabla z(\boldsymbol \eta)}{z(\boldsymbol \eta)}\right)^T \end{split} \]
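These identities can be checked by finite differences. The sketch below reuses the Poisson-type family from the tilting example above, where \(a(\eta) = e^\eta\), so the mean and variance of \(t(x) = x\) both equal \(\lambda = e^\eta\).

```python
# Mean and variance of t(x) from numerical derivatives of a(eta) = log z(eta).
import numpy as np
from scipy.special import factorial

x = np.arange(0, 60)
h = 1.0 / factorial(x)

def a(eta):
    return np.log(np.sum(np.exp(eta * x) * h))   # log-partition function

eta, eps = 0.5, 1e-4
mean_fd = (a(eta + eps) - a(eta - eps)) / (2 * eps)           # a'(eta)
var_fd = (a(eta + eps) - 2 * a(eta) + a(eta - eps)) / eps**2  # a''(eta)
print(mean_fd, var_fd, np.exp(eta))   # all approx lambda = e^eta
```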

For a minimal exponential family \(\boldsymbol \Sigma_{\boldsymbol t}\) is a positive definite matrix and invertible, whereas for non-minimal representations the covariance matrix is positive semi-definite and not invertible.
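A small example of the non-minimal case: taking \(\boldsymbol t(x) = (x, 1-x)^T\) for the Bernoulli distribution is overcomplete, since the two components always sum to one. The resulting covariance matrix (worked out by hand below) has rank one and is not invertible.

```python
# Overcomplete canonical statistics t(x) = (x, 1-x) for Ber(theta):
# the components are affinely dependent, so Sigma_t is singular.
import numpy as np

theta = 0.3
v = theta * (1 - theta)        # Var(x) = Var(1-x); Cov(x, 1-x) = -v
Sigma_t = np.array([[ v, -v],
                    [-v,  v]])

print(np.linalg.matrix_rank(Sigma_t))   # 1, not 2
print(np.linalg.det(Sigma_t))           # 0: not invertible
```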

The means \(\boldsymbol \mu_{\boldsymbol t}\) of the canonical statistics \(\boldsymbol t(x)\) provide a further parametrisation of exponential families, in addition to the canonical parameters \(\boldsymbol \eta\) and the conventional parameters \(\boldsymbol \theta\), and are called the expectation parameters.

Examples

Example 2.1 \(\color{Red} \blacktriangleright\) Bernoulli distribution \(\operatorname{Ber}(\theta)\) as exponential family:

The Bernoulli distribution \(\operatorname{Ber}(\theta)\) can be specified in exponential family form as follows:

  • \(x \in \{0, 1\}\)
  • canonical statistic \(t(x) = x\)
  • base function \(h(x) = 1\)
  • canonical parameter \(\eta\)

This results in the partition function \[ z(\eta) = \sum_{x \in \{0,1\}} e^{\eta x} = 1+e^\eta \] and the log-partition function \(a(\eta) = \log z(\eta)\), both of which are defined for \(\eta \in \mathbb{R}\).

The first and second derivatives of the partition function \(z(\eta)\) are \(z'(\eta) = e^\eta\) and \(z''(\eta) = e^\eta\).

The mean \(\mu_t\) of the canonical statistic \(t(x)=x\) is given by \[ \begin{split} \mu_t = a'(\eta) &= \frac{z'(\eta)}{z(\eta)} = \frac{ e^{\eta}}{1+e^{\eta}} \\ & = \operatorname{logit}^{-1}(\eta) \\ &= \theta \end{split} \] Since \(t(x)=x\) for the Bernoulli distribution, the expectation parameter \(\mu_t\) corresponds to the conventional mean parameter \(\operatorname{E}(x) = \theta \in [0, 1]\). In the above, \(\mu_t\) and \(\theta\) are obtained from \(\eta\) by the inverse logit function (also known as the logistic function).

Conversely, the canonical parameter \(\eta\) can be computed from \(\theta\) (or \(\mu_t\)) by the logit function \[ \eta = \operatorname{logit}\theta = \log\left(\frac{\theta}{1-\theta}\right) \]

The variance \(\sigma^2_t\) of the canonical statistic \(t(x)=x\) is \[ \begin{split} \sigma^2_t =a''(\eta) &= \frac{z''(\eta)}{z(\eta)} - \left(\frac{z'(\eta)}{z(\eta)}\right)^2 \\ &= \frac{ e^{\eta}}{(1+e^{\eta})^2} \\ &= \theta (1-\theta) \end{split} \]

The pmf of the Bernoulli distribution is \[ p(x | \eta) = \frac{ e^{\eta x} }{ z(\eta) } = \frac{ e^{\eta x} }{ 1+e^\eta } \] which in terms of the conventional parameter \(\theta\) takes on the familiar form \[ \begin{split} p(x | \theta) &= \theta^x \, (1-\theta)^{1-x} \\ &= \begin{cases} \theta &\text{if } x=1\\ 1-\theta &\text{if } x=0 \end{cases} \end{split} \]
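The relations in this example are easy to verify numerically; the short sketch below checks the logit link, the mean and variance of the canonical statistic, and the agreement of the two forms of the pmf.

```python
# Numerical check of the Bernoulli exponential family relations.
import numpy as np

theta = 0.3
eta = np.log(theta / (1 - theta))    # canonical parameter via the logit

z = 1 + np.exp(eta)                  # partition function
mu_t = np.exp(eta) / z               # a'(eta), the inverse logit
sigma2_t = np.exp(eta) / z**2        # a''(eta)
print(mu_t, theta)                   # equal
print(sigma2_t, theta * (1 - theta)) # equal

for x in (0, 1):
    p_canonical = np.exp(eta * x) / z
    p_conventional = theta**x * (1 - theta)**(1 - x)
    print(x, p_canonical, p_conventional)   # agree
```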

Example 2.2 \(\color{Red} \blacktriangleright\) Normal distribution \(N(\mu, \sigma^2)\) as exponential family:

The two-parameter normal distribution \(N(\mu, \sigma^2)\) can be written in exponential family form as follows:

  • \(x \in \mathbb{R}\)
  • canonical statistics \(\boldsymbol t(x) = (x, x^2)^T\)
  • base function \(h(x) = 1\)
  • canonical parameters \(\boldsymbol \eta= (\eta_1, \eta_2)^T\)

This results in the partition function \[ z(\boldsymbol \eta) = \left( -\frac{\pi}{\eta_2} \right)^{1/2} \, \exp\left(-\frac{\eta_1^2}{4\eta_2}\right) \] and the log-partition function \[ a(\boldsymbol \eta) =-\frac{\eta_1^2}{4 \eta_2} +\frac{1}{2} \log\left(-\frac{\pi}{\eta_2} \right) \] which are defined for \(\eta_1 \in \mathbb{R}\) and \(\eta_2 \in \mathbb{R}^{-}\).

The mean \(\boldsymbol \mu_{\boldsymbol t}\) of the canonical statistics \(\boldsymbol t(x) = (x, x^2)^T\) is given by \[ \begin{split} \boldsymbol \mu_{\boldsymbol t} &= \nabla a(\boldsymbol \eta) \\ &= \begin{pmatrix} -\frac{\eta_1}{2 \eta_2} \\ \frac{\eta_1^2}{4 \eta_2^2} - \frac{1}{2\eta_2}\\ \end{pmatrix}\\ &=\begin{pmatrix}\mu \\ \mu^2+ \sigma^2 \end{pmatrix} \end{split} \] The conventional parameters \(\mu = \operatorname{E}(x)\) and \(\sigma^2=\operatorname{Var}(x)\) are thus directly linked to the expectation parameters \(\boldsymbol \mu_{\boldsymbol t}\) and can be obtained from the canonical parameters by \(\mu = -\frac{\eta_1}{2\eta_2}\) and \(\sigma^2 = -\frac{1}{2\eta_2}\). Conversely, we have \[ \boldsymbol \eta= \begin{pmatrix} \eta_1 \\ \eta_2 \end{pmatrix} = \begin{pmatrix} \frac{\mu}{\sigma^2} \\ - \frac{1}{2 \sigma^2} \end{pmatrix} \]

The covariance matrix of the canonical statistics \(\boldsymbol t(x)\) is \[ \begin{split} \boldsymbol \Sigma_{\boldsymbol t} &= \begin{pmatrix} \operatorname{Var}(x) & \operatorname{Cov}(x, x^2) \\ \operatorname{Cov}(x^2, x) & \operatorname{Var}(x^2) \\ \end{pmatrix}\\ &= \nabla \nabla^T a(\boldsymbol \eta) \\ &= \begin{pmatrix} -\frac{1}{2\eta_2} & \frac{\eta_1}{2 \eta_2^2} \\ \frac{\eta_1}{2 \eta_2^2} & \frac{ \eta_2 -\eta_1^2 }{2 \eta_2^3} \\ \end{pmatrix}\\ &= \begin{pmatrix} \sigma^2 & 2 \mu \sigma^2 \\ 2 \mu \sigma^2 & 2 \sigma^4 +4 \mu^2 \sigma^2 \\ \end{pmatrix}\\ \end{split} \]

The pdf in terms of \(\boldsymbol \eta\) is \[ p(x | \boldsymbol \eta) = \left( -\frac{\pi}{\eta_2} \right)^{-1/2} \, \exp\left( \eta_1 x + \eta_2 x^2 +\frac{\eta_1^2}{4\eta_2}\right) \] which in terms of the standard parameters \(\mu\) and \(\sigma^2\) takes on the familiar form \[ p(x | \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left( - \frac{(x- \mu)^2}{2\sigma^2 } \right) \]
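A corresponding numerical check for the normal case confirms the parameter transformations and the agreement of the two forms of the pdf:

```python
# Numerical check of the normal exponential family relations.
import numpy as np
from scipy import stats

mu, sigma2 = 1.5, 2.0
eta1, eta2 = mu / sigma2, -1.0 / (2 * sigma2)   # canonical parameters

# back-transform recovers the conventional parameters
print(-eta1 / (2 * eta2), -1.0 / (2 * eta2))    # mu, sigma2

x = 0.7
p_canonical = (-np.pi / eta2) ** (-0.5) * np.exp(
    eta1 * x + eta2 * x**2 + eta1**2 / (4 * eta2))
p_conventional = stats.norm.pdf(x, loc=mu, scale=np.sqrt(sigma2))
print(p_canonical, p_conventional)              # agree
```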

2.5 Further reading

For details of the distributions listed above and additional background on exponential families see the supplementary Probability and Distribution Refresher notes.

A recent textbook on exponential families is Efron (2022).

Tip: A bit of history

The concept of exponential families in statistics was independently developed in the 1930s by Georges Darmois (1888–1960), Edwin J. G. Pitman (1897–1993), and Bernard Koopman (1900–1981). However, exponential families were introduced earlier in statistical mechanics in the 1870s by Josiah W. Gibbs (1839–1903) and Ludwig Boltzmann (1844–1906).

In statistics, exponential families were motivated by identifying distributions allowing minimal sufficient statistics whereas in physics the objective was to find distributions maximising entropy.


  1. In fact, processes at one length or time scale can often be modelled independently of those at other scales. This general phenomenon is known as decoupling of scales.↩︎