2  Distributions for statistical models

Choosing the appropriate distributions for statistical modeling is a crucial aspect of probabilistic data analysis. This chapter explores various factors to consider when selecting suitable distributions and also reviews the key distributions covered in this module.

2.1 Common characteristics of distributions

Distribution families can be differentiated by a number of characteristics.

Firstly, by the type of random variable:

  • discrete versus continuous
  • univariate versus multivariate

Secondly, by the support of the random variable, with typical ranges such as:

  • finite or countably infinite discrete support
  • \([0,1]\)
  • \((-\infty, \infty)\)
  • \([0, \infty)\)

The choice of support will depend on the intended use of the random variable in the model, with common interpretations including

  • proportion
  • location
  • scale
  • mean
  • variance
  • spread
  • concentration
  • shape
  • rate
  • (squared) correlation

These interpretations pertain not only to the random variable itself but also to the parameter of a distribution. For instance, we might select a distribution that allows the samples to be interpreted as proportions (such as the Beta distribution). Alternatively, we could choose a distribution family in which a parameter represents a proportion (like the Bernoulli distribution).
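As a minimal illustration of this distinction (a sketch using Python and scipy.stats, with arbitrarily chosen parameter values that are not taken from these notes), samples from a Beta distribution are themselves proportions, whereas in a Bernoulli distribution the proportion enters as the parameter:

```python
from scipy import stats

# Case 1: the samples themselves are proportions in [0, 1]
# (Beta distribution with illustrative shape parameters alpha = 2, beta = 5)
beta_samples = stats.beta(a=2, b=5).rvs(size=5, random_state=42)
print(beta_samples)        # five values between 0 and 1

# Case 2: a proportion theta appears as a parameter of the distribution
# (Bernoulli distribution with success probability theta = 0.3)
bernoulli_samples = stats.bernoulli(p=0.3).rvs(size=10, random_state=42)
print(bernoulli_samples)   # ten values in {0, 1}
```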

A third consideration may be the general shape of the distribution:

  • symmetric or asymmetric
  • left or right skewed
  • short tails or long tails
  • unimodal or multimodal

A further characteristic of a distribution family is the number of parameters, with choices such as

  • no parameter
  • single parameter
  • multiple parameters
  • multiple types of parameters (e.g. location+scale)

In data analysis our goal is to use a model that is complex enough to capture the essential features of the observations while also preventing overfitting to the data.

Lastly, it is important to take into account the general structure of the distribution:

  • parametric versus nonparametric models
  • exponential family versus non-exponential family
  • special exponential families, e.g. Gibbs family, natural exponential family (NEF)

Models with simpler structure can be preferable when the sample size is small; conversely, nonparametric approaches, which make fewer assumptions about the data-generating process, may be more appropriate when data are abundant.

2.2 Commonly used elementary distributions

In this module we will often make use of the following common univariate distributions:

  1. Binomial distribution \(\text{Bin}(n, \theta)\), with support \(\{0, 1, \ldots, n\}\).

    A special case (\(n=1\)) is:

    • Bernoulli distribution \(\text{Ber}(\theta)\), with support \(\{0, 1\}\).
  2. Beta distribution \(\text{Beta}(\alpha, \beta)\), with support \([0, 1]\).

  3. Normal distribution \(N(\mu, \sigma^2)\), with support \((-\infty, \infty)\).

  4. Gamma distribution \(\text{Gam}(\alpha, \theta)\), with support \([0, \infty)\). It is also known as the univariate Wishart distribution \(\text{Wis}\left(s^2, k \right)\).

    Special cases of the gamma/Wishart distribution are:

    • scaled chi-squared distribution \(s^2 \chi^2_{k}\) (integer \(k\))
    • chi-squared distribution \(\chi^2_{k}\) (integer \(k\), \(s^2=1\))
    • exponential distribution \(\text{Exp}(\theta)\) (\(\alpha=1\))
  5. Inverse gamma distribution \(\text{IG}(\alpha, \beta)\), with support \([0, \infty)\). It is also known as the univariate inverse Wishart distribution \(\text{IW}(\psi, k)\).

All the above distributions are so-called exponential families. As such they can be written in a particular structural form. Exponential families have many useful properties that facilitate statistical analysis.
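For reference, in one common notation the density or probability mass function of an exponential family can be written in the structural form

\[
p(x \mid \boldsymbol \theta) = h(x) \exp\!\big( \boldsymbol \eta(\boldsymbol \theta)^T \boldsymbol T(x) - A(\boldsymbol \theta) \big) \,,
\]

where \(\boldsymbol T(x)\) is the sufficient statistic, \(\boldsymbol \eta(\boldsymbol \theta)\) the natural (canonical) parameter, \(A(\boldsymbol \theta)\) the log-normalising constant and \(h(x)\) the base measure. For example, the Bernoulli distribution fits this form with \(T(x) = x\), \(\eta(\theta) = \log \frac{\theta}{1-\theta}\), \(A(\theta) = -\log(1-\theta)\) and \(h(x) = 1\).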

  6. Location-scale \(t\)-distribution \(t_{\nu}(\mu, \tau^2)\), with support \((-\infty, \infty)\).

    Special cases of the location-scale \(t\)-distribution are:

    • Student’s \(t\)-distribution \(t_\nu\) (\(\mu=0\), \(\tau^2=1\))
    • Cauchy distribution \(\text{Cau}(\mu, \tau)\) (\(\nu=1\))

    The location-scale \(t\)-distribution is a generalisation of the normal distribution but with more probability mass in the tails. Depending on the choice of the degrees of freedom \(\nu\), not all moments of the distribution may exist (only moments of order smaller than \(\nu\) are finite). Furthermore, it is not an exponential family.

For all of the above univariate distributions there exist corresponding multivariate variants. In this module we will make use of the following multivariate distributions:

  1. Multinomial distribution \(\text{Mult}(n, \boldsymbol \pi)\), generalising the binomial distribution.

    Special case (\(n=1\)):

    • Categorical distribution \(\text{Cat}(\boldsymbol \pi)\), generalising the Bernoulli distribution.
  2. Multivariate normal distribution \(N_d(\boldsymbol \mu, \boldsymbol \Sigma)\), generalising the univariate normal distribution.
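As a brief sketch of these generalisations (using numpy, with arbitrary illustrative parameter values), both can be sampled directly:

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Multinomial: n trials distributed over K categories with probabilities pi
pi = np.array([0.2, 0.5, 0.3])
counts = rng.multinomial(n=10, pvals=pi)     # vector of counts summing to 10
one_hot = rng.multinomial(n=1, pvals=pi)     # n = 1 gives a categorical (one-hot) draw

# Multivariate normal with mean vector mu and covariance matrix Sigma
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
x = rng.multivariate_normal(mean=mu, cov=Sigma, size=5)

print(counts, one_hot)
print(x.shape)   # (5, 2)
```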

A distribution family can be parameterised in multiple equivalent ways. Typically, there is a standard parameterisation, and also a mean parameterisation, where one of the parameters can be interpreted as the mean. Sometimes, the same distribution is referred to by different names and there are various default parameterisations.

Importantly, any parameterisation is a matter of choice and simply serves as an alternative method to index the distributions within the family. However, certain parameterisations may be more interpretable or offer computational advantages.
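As an example of such alternative parameterisations (a sketch using scipy.stats; the mapping between the notation in these notes and the scipy argument names is my assumption, since scipy always uses a shape/scale convention), consider the normal and gamma distributions:

```python
from scipy import stats

# Normal distribution N(mu, sigma^2): scipy uses (loc, scale), where scale is
# the standard deviation sigma rather than the variance sigma^2.
mu, sigma2 = 1.0, 4.0
normal = stats.norm(loc=mu, scale=sigma2 ** 0.5)
print(normal.mean(), normal.var())           # 1.0, 4.0

# Gamma distribution with shape alpha and rate theta (one common convention):
# scipy.stats.gamma expects the shape a and a scale equal to 1/theta.
alpha, theta = 3.0, 2.0
gamma_rate = stats.gamma(a=alpha, scale=1.0 / theta)
print(gamma_rate.mean(), gamma_rate.var())   # alpha/theta = 1.5, alpha/theta^2 = 0.75

# Under a shape/scale convention the same family would instead be indexed as
gamma_scale = stats.gamma(a=alpha, scale=theta)
# Both parameterisations index exactly the same two-parameter family; only
# the labelling of the second parameter differs.
```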

2.3 Choosing the right distribution

When choosing a distribution for a data model we typically aim to align the characteristics of the distribution with those of the observations. For instance, if the data exhibit long tails, we will need to use a long-tailed model. Additionally, there may be a mechanistic rationale, for example one derived from physics, for why the underlying process adheres to a particular distribution.

In many cases, the central limit theorem justifies using a normal distribution, as sums or averages of many independent contributions are approximately normally distributed.
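A minimal numerical illustration (using numpy, with an arbitrarily chosen non-normal underlying distribution): standardised averages of many independent draws behave approximately like a standard normal variable even though the individual draws do not.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Average n independent draws from a clearly non-normal distribution
# (exponential with mean 1 and standard deviation 1) and standardise.
n, repetitions = 1000, 10_000
samples = rng.exponential(scale=1.0, size=(repetitions, n))
standardised = (samples.mean(axis=1) - 1.0) / (1.0 / np.sqrt(n))

print(standardised.mean(), standardised.std())   # close to 0 and 1
print(np.mean(np.abs(standardised) < 1.96))      # close to 0.95, as for N(0, 1)
```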

Another approach to selecting a distribution family involves constraining specific properties of the distribution, such as its mean and variance, and then selecting a model family that maximises the spread of the probability mass. This method is closely related to the maximum entropy principle, which will be discussed in more detail later, and is also one of the reasons why exponential families are favoured in statistical modelling.
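For instance, among all distributions on the real line with a given mean and variance, the normal distribution has the largest differential entropy. A small numerical check of this fact (a sketch using scipy.stats; the rescaling of the \(t\)-distribution to unit variance is spelled out in the comments):

```python
import numpy as np
from scipy import stats

# Differential entropy of the standard normal (mean 0, variance 1)
print(stats.norm(loc=0, scale=1).entropy())   # 0.5 * log(2*pi*e), about 1.42

# A t-distribution with nu = 5 degrees of freedom has variance nu / (nu - 2),
# so rescaling by sqrt((nu - 2) / nu) also gives mean 0 and variance 1.
nu = 5
scale = np.sqrt((nu - 2) / nu)
print(stats.t(df=nu, loc=0, scale=scale).entropy())   # about 1.37, below the normal
```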

2.4 Building complex statistical models

Statistical analysis often utilises models that consist of numerous random variables. In practice, these can be quite intricate, featuring hierarchical or network-like structures that connect observed and latent variables, and may also display nonlinear functional relationships. Despite their complexity, even the most sophisticated statistical models are constructed from more fundamental components.

Specifically, the large class of graphical models provides a principled means to form complex joint distributions for observed and unobserved random variables built from simple components. This class includes regression models, mixture models and compound models (the continuous version of mixture models), as well as more general network-like and hierarchically structured models.
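As a minimal sketch of this building-block idea (using numpy, with arbitrarily chosen component parameters), a two-component normal mixture can be simulated by first drawing a latent component label from a categorical distribution and then drawing the observation from the corresponding normal component:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Mixture weights and component parameters (illustrative values only)
weights = np.array([0.3, 0.7])   # categorical distribution over components
means = np.array([-2.0, 3.0])
sds = np.array([1.0, 0.5])

n = 1000
z = rng.choice(len(weights), size=n, p=weights)   # latent component labels
x = rng.normal(loc=means[z], scale=sds[z])        # observations given the labels

# The joint model p(x, z) = p(z) p(x | z) is built entirely from elementary
# distributions (categorical and normal).
print(np.bincount(z) / n)   # approximately the mixture weights
print(x[:5])
```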

Some of the underlying elementary distributions in these models will serve to model the observed output while others represent internal variables or account for the uncertainty regarding a parameter (in a Bayesian context).
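In the same spirit, a compound (continuous mixture) model arises when a parameter itself is drawn from a distribution, as with a Bayesian prior. A sketch with arbitrarily chosen hyper-parameters: drawing the success probability of a Bernoulli variable from a Beta distribution yields the beta-Bernoulli compound model.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

n = 100_000
theta = rng.beta(a=2.0, b=5.0, size=n)   # latent success probability ~ Beta(2, 5)
y = rng.binomial(n=1, p=theta)           # observation ~ Bernoulli(theta)

# Marginally, P(y = 1) equals the prior mean E[theta] = 2 / (2 + 5)
print(y.mean(), 2 / 7)
```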

In the statistics course units in years 3 and 4 you will learn about many types of advanced models, relating for instance to

  • multivariate statistics and machine learning
  • temporal and spatial modelling, and
  • generalised linear and nonparametric models.

Much of statistics is concerned with methods to quantify how well a model fits the data and how well it predicts future observations; this allows successive models to be built and compared in a systematic and principled fashion.

Finally, it is worth recalling that all distributions (and models in general) are best considered as approximations of the true unknown data generating process. Hence, the focus of any data analysis will be to find the model that captures the essential properties at an appropriate level of detail¹.

2.5 Further reading

For details about the above-mentioned distributions, and their parameterisations, see the supplementary Probability and Distribution Refresher notes.


  1. The fact that it’s possible to model the world at one length scale independently from what’s happening at other length scales is a general phenomenon in nature known in physics as decoupling of scales.