2  Distributions for statistical models

Choosing appropriate distributions for statistical modelling is a crucial aspect of probabilistic data analysis. This chapter explores various factors to consider when selecting suitable distributions and also lists frequently used families.

2.1 Characteristic features

Distributions can be differentiated by a number of characteristics.

Firstly, by the type of random variable:

  • discrete versus continuous
  • univariate versus multivariate

Secondly, by the support of the random variable, with typical ranges such as:

  • finite discrete support, e.g. \(\{1, 2, \ldots, n\}\)
  • infinite discrete support, e.g. \(\{1, 2, \ldots\}\)
  • \([0,1]\)
  • \((-\infty, \infty)\)
  • \([0, \infty)\)

The choice of support will depend on the intended use of the random variable in the model. Common applications include:

  • proportion
  • location
  • scale
  • mean
  • variance
  • spread
  • concentration
  • shape
  • rate
  • (squared) correlation

These interpretations apply both to the random variable itself and to the parameters of a distribution family. For instance, we might select a distribution whose outcomes can be interpreted as proportions (e.g. the beta distribution). Alternatively, we might choose a distribution family in which a parameter itself represents a proportion (e.g. the Bernoulli distribution).
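The two roles of a proportion can be illustrated with a minimal Python sketch that uses only the standard library. The parameter values (\(\alpha=2\), \(\beta=5\), \(\theta=0.3\)), the seed and the variable names are arbitrary choices made for this illustration.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Beta(2, 5): every outcome lies in [0, 1] and can itself be read as a proportion.
beta_draws = [rng.betavariate(2.0, 5.0) for _ in range(5)]

# Ber(0.3): the outcomes are 0 or 1; here the parameter theta is the proportion,
# i.e. the long-run fraction of ones.
theta = 0.3
bernoulli_draws = [1 if rng.random() < theta else 0 for _ in range(10_000)]
fraction_of_ones = sum(bernoulli_draws) / len(bernoulli_draws)

print(beta_draws)
print(fraction_of_ones)  # close to theta for a large sample
```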

A third consideration may be the general shape of the distribution:

  • symmetric or asymmetric
  • left or right skewed
  • short tails or long tails
  • unimodal or multimodal

A further characteristic of a distribution family is the number of parameters, with choices such as:

  • single parameter
  • multiple parameters
  • multiple types of parameters (e.g. location+scale)

A distribution family consists of a finite or infinite set of distributions that correspond to specific instances of parameter values.

In data analysis, we aim to build models that are complex enough to capture the essential features of the data, but simple enough to avoid overfitting. Consequently, when two models have similar explanatory and predictive power, the one with fewer parameters is generally preferred.

Lastly, it is important to take into account the general structure of the distribution:

  • parametric versus nonparametric models
  • exponential family versus non-exponential family
  • special exponential families, e.g. Gibbs family, natural exponential family (NEF)

Nonparametric models with fewer assumptions about the data generating process are typically employed when there is an abundance of data. Conversely, simpler parametric models are generally preferred for smaller sample sizes.

2.2 Commonly used distribution families

In this module we will often make use of the following common univariate distributions:

  1. Binomial distribution \(\text{Bin}(n, \theta)\), with support \(\{0, 1, \ldots, n\}\).

    A special case (\(n=1\)) is:

    • Bernoulli distribution \(\text{Ber}(\theta)\), with support \(\{0, 1\}\).
  2. Beta distribution \(\text{Beta}(\alpha, \beta)\), with support \([0, 1]\).

  3. Normal distribution \(N(\mu, \sigma^2)\), with support \((-\infty, \infty)\).

  4. Gamma distribution \(\text{Gam}(\alpha, \theta)\), with support \([0, \infty)\). It is also known as the univariate Wishart distribution \(\text{Wis}\left(s^2, k \right)\).

    Special cases of the gamma/Wishart distribution are:

    • scaled chi-squared distribution \(s^2 \chi^2_{k}\) (integer \(k\))
    • chi-squared distribution \(\chi^2_{k}\) (integer \(k\), \(s^2=1\))
    • exponential distribution \(\text{Exp}(\theta)\) (\(\alpha=1\))
  5. Inverse gamma distribution \(\text{IG}(\alpha, \beta)\), with support \([0, \infty)\). Also known as the univariate inverse Wishart distribution \(\text{IW}(\psi, k)\).
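The special cases of the gamma distribution listed above can be checked numerically. The following is a minimal sketch under stated assumptions: \(\theta\) is taken to be a scale parameter (so \(\text{Exp}(\theta)\) has mean \(\theta\)), and \(\text{Wis}(s^2, k)\) is taken to correspond to \(\text{Gam}(\alpha = k/2, \theta = 2 s^2)\). The refresher notes remain the authoritative source for the parametrisations, and the function names are hypothetical.

```python
import math

def gamma_pdf(x, alpha, theta):
    """Density of Gam(alpha, theta) at x > 0; shape alpha, scale theta (assumed)."""
    log_pdf = ((alpha - 1.0) * math.log(x) - x / theta
               - math.lgamma(alpha) - alpha * math.log(theta))
    return math.exp(log_pdf)

def exponential_pdf(x, theta):
    """Density of Exp(theta) at x >= 0; theta is the scale, i.e. the mean (assumed)."""
    return math.exp(-x / theta) / theta

def scaled_chi_squared_pdf(x, s2, k):
    """Density of s2 * chi^2_k at x > 0, written out directly."""
    y = x / s2  # value on the unscaled chi-squared axis
    log_pdf = ((k / 2.0 - 1.0) * math.log(y) - y / 2.0
               - (k / 2.0) * math.log(2.0) - math.lgamma(k / 2.0))
    return math.exp(log_pdf) / s2  # Jacobian of the rescaling x = s2 * y

x = 1.7
# alpha = 1 recovers the exponential distribution.
print(gamma_pdf(x, 1.0, 2.5), exponential_pdf(x, 2.5))
# alpha = k/2 and theta = 2 * s2 recovers the scaled chi-squared distribution.
print(gamma_pdf(x, 3 / 2.0, 2.0 * 0.8), scaled_chi_squared_pdf(x, 0.8, 3))
```

Each printed pair should agree up to floating-point rounding.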

All the above distributions are so-called exponential families. As such they can be written in a particular structural form. Exponential families have many useful properties that facilitate statistical analysis.
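As an illustration of what such a structural form looks like, the probability mass function of the Bernoulli distribution can be rearranged as follows. This is a sketch in generic notation, which may differ from the notation used in the refresher notes:

\[
p(x \mid \theta) = \theta^x (1-\theta)^{1-x} = \exp\left( x \log\frac{\theta}{1-\theta} + \log(1-\theta) \right), \quad x \in \{0, 1\}.
\]

The exponent is linear in \(x\), with \(\log\frac{\theta}{1-\theta}\) playing the role of the natural parameter.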

  1. Location-scale \(t\)-distribution \(t_{\nu}(\mu, \tau^2)\), with support \((-\infty, \infty)\).

    Special cases of the location-scale \(t\)-distribution are:

    • Student’s \(t\)-distribution \(t_\nu\)
    • Cauchy distribution \(\text{Cau}(\mu, \tau)\)

    The location-scale \(t\)-distribution is a generalisation of the normal distribution with more probability mass in the tails. Depending on the choice of the degrees of freedom \(\nu\), not all moments of the distribution may exist. Furthermore, it is not an exponential family.
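The relationship between the location-scale \(t\)-distribution, the Cauchy distribution and the normal distribution can be explored numerically. The sketch below assumes that \(\tau\) is a scale parameter and that \(\nu = 1\) yields the Cauchy case; the function names are hypothetical.

```python
import math

def t_pdf(x, nu, mu=0.0, tau=1.0):
    """Density of the location-scale t-distribution with nu degrees of freedom."""
    z = (x - mu) / tau
    log_norm = (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
                - 0.5 * math.log(nu * math.pi) - math.log(tau))
    return math.exp(log_norm - (nu + 1.0) / 2.0 * math.log1p(z * z / nu))

def cauchy_pdf(x, mu=0.0, tau=1.0):
    """Density of the Cauchy distribution with location mu and scale tau."""
    z = (x - mu) / tau
    return 1.0 / (math.pi * tau * (1.0 + z * z))

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# nu = 1 reproduces the Cauchy density.
print(t_pdf(2.0, 1, mu=0.5, tau=2.0), cauchy_pdf(2.0, mu=0.5, tau=2.0))

# Far from the centre the t-density exceeds the normal density (heavier tails),
# and the gap shrinks as nu grows.
for nu in (1, 3, 30, 1000):
    print(nu, t_pdf(5.0, nu), normal_pdf(5.0))
```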

For all of the above univariate distributions there exist corresponding multivariate variants. In this module we will make use of the following multivariate distributions:

  1. Multinomial distribution \(\text{Mult}(n, \boldsymbol \pi)\), generalising the binomial distribution.

    Special case (\(n=1\)):

    • Categorical distribution \(\text{Cat}(\boldsymbol \pi)\), generalising the Bernoulli distribution.
  2. Multivariate normal distribution \(N_d(\boldsymbol \mu, \boldsymbol \Sigma)\), generalising the univariate normal distribution.
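How the multinomial distribution generalises the binomial, and the categorical the Bernoulli, can be sketched by simulation. The code below builds a multinomial count vector as the sum of \(n\) categorical draws; the probability vector, the seed and the helper names are hypothetical choices for illustration.

```python
import random
from collections import Counter

def sample_categorical(pi, rng):
    """Draw one category index from Cat(pi)."""
    return rng.choices(range(len(pi)), weights=pi, k=1)[0]

def sample_multinomial(n, pi, rng):
    """Draw a count vector from Mult(n, pi) as the sum of n categorical draws."""
    counts = Counter(sample_categorical(pi, rng) for _ in range(n))
    return [counts[j] for j in range(len(pi))]

rng = random.Random(7)
pi = [0.2, 0.5, 0.3]
counts = sample_multinomial(100, pi, rng)
print(counts, sum(counts))  # the counts always sum to n

# With only two categories, the count of the second category
# is a draw from Bin(n, theta).
theta = 0.4
binomial_like = sample_multinomial(100, [1 - theta, theta], rng)[1]
print(binomial_like)
```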

A distribution family can be parametrised in multiple equivalent ways. Typically, there is a standard parametrisation, and also a mean parametrisation, where one of the parameters can be interpreted as the mean. Sometimes, the same distribution is referred to by different names and there are various default parametrisations.

Importantly, any parametrisation is a matter of choice and simply provides an alternative means to index the elementary distributions within the family. However, certain parametrisations may be more interpretable or offer computational advantages.
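As a small illustration of switching between parametrisations, the beta distribution \(\text{Beta}(\alpha, \beta)\) can also be indexed by its mean \(\mu = \alpha/(\alpha+\beta)\) together with \(k = \alpha + \beta\). The sketch below converts between the two; calling \(k\) a concentration parameter and the function names are assumptions of this example and may differ from the conventions in the refresher notes.

```python
def beta_standard_to_mean(alpha, beta):
    """Map the standard parameters (alpha, beta) to (mean mu, concentration k)."""
    k = alpha + beta
    return alpha / k, k

def beta_mean_to_standard(mu, k):
    """Map (mean mu, concentration k) back to the standard parameters (alpha, beta)."""
    return mu * k, (1.0 - mu) * k

mu, k = beta_standard_to_mean(2.0, 6.0)
print(mu, k)                          # 0.25 8.0
print(beta_mean_to_standard(mu, k))   # (2.0, 6.0)
```

Both parameter pairs index exactly the same distribution; only the labels change.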

2.3 Model building

Choosing the right distribution

When choosing a distribution we typically aim to align the characteristics of the distribution with those of the observations. For instance, if the data exhibit long tails, we will need to use a long-tailed model. Additionally, there may be a mechanistic rationale, such as a physical law, suggesting that the underlying process follows a particular model.

In many cases, the central limit theorem justifies using a normal distribution, for example when an observation arises as the sum or average of many independent contributions.
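The effect of the central limit theorem can be seen in a short simulation. The sketch below averages uniform random numbers, which are themselves far from normal, and checks that the averages behave approximately like a normal variable. The sample size, number of replicates and seed are arbitrary choices for illustration.

```python
import math
import random
import statistics

rng = random.Random(42)
n = 50             # observations averaged in each replicate
replicates = 5000  # number of averages generated

sample_means = [statistics.fmean(rng.random() for _ in range(n))
                for _ in range(replicates)]

# Theory for the mean of n Uniform(0, 1) variables:
# expectation 1/2 and variance 1 / (12 n).
expected_mean = 0.5
expected_sd = math.sqrt(1.0 / (12.0 * n))

observed_mean = statistics.fmean(sample_means)
observed_sd = statistics.stdev(sample_means)

# For a normal distribution about 95% of the mass lies within 1.96 standard deviations.
coverage = sum(abs(m - expected_mean) <= 1.96 * expected_sd
               for m in sample_means) / replicates

print(observed_mean, expected_mean)
print(observed_sd, expected_sd)
print(coverage)  # close to 0.95 if the normal approximation is adequate
```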

Another approach to selecting a distribution family is to fix certain properties of the distribution, such as its mean and variance, and then select the family that maximises the spread of the probability mass. This method is closely linked to the principle of maximum entropy, which will be discussed in more detail in Chapter 6. It also helps to explain why exponential families are often preferred in statistical modelling.
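To make the idea of maximising spread concrete, the sketch below measures spread by differential entropy and compares three continuous distributions that share the same variance. Using the Laplace and uniform distributions as comparison candidates is a choice made for this illustration only; the closed-form entropy expressions are standard results. Among these, the normal distribution has the largest entropy, in line with the fact that it maximises entropy for a given mean and variance.

```python
import math

def entropy_normal(var):
    """Differential entropy of a normal distribution with variance var."""
    return 0.5 * math.log(2.0 * math.pi * math.e * var)

def entropy_laplace(var):
    """Differential entropy of a Laplace distribution with variance var."""
    b = math.sqrt(var / 2.0)  # a Laplace distribution with scale b has variance 2 b^2
    return 1.0 + math.log(2.0 * b)

def entropy_uniform(var):
    """Differential entropy of a uniform distribution with variance var."""
    width = math.sqrt(12.0 * var)  # an interval of this width has variance width^2 / 12
    return math.log(width)

var = 1.0
for name, h in [("normal", entropy_normal(var)),
                ("Laplace", entropy_laplace(var)),
                ("uniform", entropy_uniform(var))]:
    print(name, h)
```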

Complex statistical models

Statistical analysis often uses models that are composed of many random variables. These models can be complex, with hierarchical or network-like structures that connect observed and latent variables, and potentially nonlinear functional relationships. Even so, the most sophisticated statistical models are constructed from simpler, more fundamental components.

Specifically, the large class of graphical models provides a principled means to form complex joint distributions for observed and unobserved random variables from more elementary components. This includes regression models, mixture models and compound models (the continuous version of mixture models) as well as more general network-like and hierarchically structured models.

In these complex models some of the underlying elementary distributions will serve to model the observed output while others represent internal variables or account for the uncertainty regarding a parameter (in a Bayesian context).
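A two-component normal mixture is perhaps the simplest example of such a construction. In the sketch below a latent Bernoulli variable selects the component and a normal distribution then generates the observed value. All parameter values and names are hypothetical and chosen only for illustration.

```python
import random
import statistics

def sample_mixture(n, weight, mu0, mu1, sigma, rng):
    """Draw n values from a two-component normal mixture via a latent label."""
    observed = []
    for _ in range(n):
        z = 1 if rng.random() < weight else 0    # latent variable: z ~ Ber(weight)
        mean = mu1 if z == 1 else mu0
        observed.append(rng.gauss(mean, sigma))  # observed: x | z ~ N(mean, sigma^2)
    return observed

rng = random.Random(3)
x = sample_mixture(20_000, weight=0.3, mu0=0.0, mu1=4.0, sigma=1.0, rng=rng)

# The mixture mean is weight * mu1 + (1 - weight) * mu0 = 1.2.
print(statistics.fmean(x))
```

Only `x` would be available in a real data set; the labels `z` are internal to the model and are discarded here.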

In statistical course units in year 3 and year 4 you will discuss and learn about many types of advanced models, related for instance to

  • multivariate statistics and machine learning
  • temporal and spatial modelling, and
  • generalised linear and nonparametric models.

Iterative refinement

Much of statistics is concerned with methods to quantify how well a model fits the data and how well it predicts future observations. This allows us to build successive models and compare them in a systematic and principled fashion.
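One way such a comparison might look in practice is sketched below. It uses the maximised log-likelihood as the measure of fit, which is an assumption of this example rather than a method prescribed here, and it compares a one-parameter exponential model with a two-parameter normal model on simulated data. The data, seed and function names are hypothetical.

```python
import math
import random
import statistics

rng = random.Random(11)
# Simulated positive, right-skewed data with mean 2 (expovariate takes the rate).
data = [rng.expovariate(1.0 / 2.0) for _ in range(500)]

def loglik_exponential(data):
    """Maximised log-likelihood of an exponential model with scale theta."""
    theta = statistics.fmean(data)  # maximum likelihood estimate of the scale
    return sum(-math.log(theta) - x / theta for x in data)

def loglik_normal(data):
    """Maximised log-likelihood of a normal model."""
    mu = statistics.fmean(data)
    var = statistics.pvariance(data, mu)  # maximum likelihood estimate (divisor n)
    return sum(-0.5 * math.log(2.0 * math.pi * var) - (x - mu) ** 2 / (2.0 * var)
               for x in data)

print(loglik_exponential(data))
print(loglik_normal(data))
```

For these data the exponential model attains the higher log-likelihood even though it has fewer parameters, so it would be preferred on both counts.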

Finally, it is worth recalling that all distributions (and models in general) are best considered as approximations of the true unknown data-generating process. Hence, the focus of any data analysis will be to find the model that captures the essential properties at an appropriate level of detail [1].

2.4 Further reading

For details about the above-mentioned distributions, and their parametrisations, see the supplementary Probability and Distribution Refresher notes.

There you can also find a definition of exponential families.


  1. The fact that it’s possible to model the world at one length scale independently from what’s happening at other length scales is a general phenomenon in nature known in physics as decoupling of scales.