2  Distributions

Choosing appropriate distributions for statistical modelling is a crucial aspect of probabilistic data analysis. This chapter explores various factors to consider when selecting suitable distributions and also lists frequently used families.

2.1 Characteristic features

Distributions can be differentiated by a number of characteristics.

Firstly, by the type of random variable:

  • discrete versus continuous
  • univariate versus multivariate

Secondly, by the support of the random variable, with typical ranges such as:

  • finite discrete support, e.g. \(\{1, 2, \ldots, n\}\)
  • infinite discrete support, e.g. \(\{1, 2, \ldots\}\)
  • \([0,1]\)
  • \((-\infty, \infty)\)
  • \([0, \infty)\)

The choice of support will depend on the intended use of the random variable in the model. Common applications include:

  • proportion
  • location
  • scale
  • mean
  • variance
  • spread
  • concentration
  • shape
  • rate
  • (squared) correlation

These interpretations apply not only to the random variable itself but also to the parameters of a distribution family. For instance, we might select a distribution whose outcomes can be interpreted as proportions (e.g. the beta distribution). Alternatively, we might choose a distribution family in which a parameter itself represents a proportion (e.g. the Bernoulli distribution).

A third consideration may be the general shape of the distribution:

  • symmetric or asymmetric
  • left or right skewed
  • short tails or long tails
  • unimodal or multimodal

A further characteristic of a distribution family is the number of parameters, with choices such as:

  • single parameter
  • multiple parameters
  • multiple types of parameters (e.g. location+scale)
  • nonparametric model (i.e. in effect a highly parametrised model)

A distribution family consists of a finite or infinite set of distributions that correspond to specific instances of parameter values.

Models with many parameters or nonparametric models are employed when there is an abundance of data and they typically make weak assumptions about the data-generating process. Conversely, simpler parametric models are generally preferred for smaller sample sizes, and typically make stronger assumptions about how the data were generated.

In data analysis, we aim to build models that are complex enough to capture the essential features of the data, but simple enough to avoid overfitting. Consequently, when two models have similar explanatory and predictive power, the one with fewer parameters is generally preferred.

2.2 Commonly used distribution families

In this module we will often make use of the following common univariate distributions:

  1. Binomial distribution \(\operatorname{Bin}(n, \theta)\), with support \(\{0, 1, \ldots, n\}\).

    A special case (\(n=1\)) is:

    • Bernoulli distribution \(\operatorname{Ber}(\theta)\), with support \(\{0, 1\}\).
  2. Beta distribution \(\operatorname{Beta}(\alpha, \beta)\), with support \([0, 1]\).

  3. Normal distribution \(N(\mu, \sigma^2)\), with support \((-\infty, \infty)\).

  4. Gamma distribution \(\operatorname{Gam}(\alpha, \theta)\), with support \([0, \infty)\). It is also known as the univariate Wishart distribution \(\operatorname{Wis}\left(s^2, k \right)\).

    Special cases of the gamma/Wishart distribution are:

    • scaled chi-squared distribution \(s^2 \chi^2_{k}\) (integer \(k\))
    • chi-squared distribution \(\chi^2_{k}\) (integer \(k\), \(s^2=1\))
    • exponential distribution \(\operatorname{Exp}(\theta)\) (\(\alpha=1\))
  5. Inverse gamma distribution \(\operatorname{IG}(\alpha, \beta)\), with support \([0, \infty)\). Also known as the univariate inverse Wishart distribution \(\operatorname{IW}(\psi, k)\).

All of the distributions above belong to the class of exponential families (see Section 2.4). These families have many properties that are highly convenient for statistical analysis.

  6. Location-scale \(t\)-distribution \(t_{\nu}(\mu, \tau^2)\), with support \((-\infty, \infty)\).

    Special cases of the location-scale \(t\)-distribution are:

    • Student’s \(t\)-distribution \(t_\nu\)
    • Cauchy distribution \(\operatorname{Cau}(\mu, \tau)\)

    The location-scale \(t\)-distribution is a generalisation of the normal distribution but with more probability mass in the tails. Depending on the choice of the degrees of freedom \(\nu\), not all moments of the distribution may exist. Furthermore, it is not an exponential family.

For all of the above univariate distributions there exist corresponding multivariate variants. In this module we will make use of the following multivariate distributions:

  1. Multinomial distribution \(\operatorname{Mult}(n, \boldsymbol \theta)\), generalising the binomial distribution.

    A special case (\(n=1\)) is:

    • Categorical distribution \(\operatorname{Cat}(\boldsymbol \theta)\), generalising the Bernoulli distribution.
  2. Multivariate normal distribution \(N_d(\boldsymbol \mu, \boldsymbol \Sigma)\), generalising the univariate normal distribution.
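To make these families concrete, here is a minimal sketch that draws samples from several of them using scipy.stats. Note that the argument names and default parametrisations (e.g. shape and scale for the gamma) are scipy conventions and may differ from the notation used in this module.

```python
# Sampling from several of the families listed above (scipy conventions).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

x_bin  = stats.binom.rvs(n=10, p=0.3, size=5, random_state=rng)        # {0,...,10}
x_beta = stats.beta.rvs(a=2.0, b=5.0, size=5, random_state=rng)        # [0, 1]
x_norm = stats.norm.rvs(loc=0.0, scale=1.0, size=5, random_state=rng)  # real line
x_gam  = stats.gamma.rvs(a=2.0, scale=1.0, size=5, random_state=rng)   # [0, inf)
x_t    = stats.t.rvs(df=3, loc=0.0, scale=1.0, size=5, random_state=rng)  # heavy tails

# multivariate normal N_d(mu, Sigma)
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
x_mvn = stats.multivariate_normal.rvs(mean=mu, cov=Sigma, size=5, random_state=rng)
```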

A distribution family can be parametrised in multiple equivalent ways. Typically, there is a standard parametrisation, and also a mean parametrisation, where one of the parameters can be interpreted as the mean of the data. Sometimes, the same distribution is referred to by different names and there are various default parametrisations.

Importantly, any parametrisation is a matter of choice and simply provides an alternative means to index the elementary distributions within the family. However, certain parametrisations may be more interpretable or offer computational advantages.
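As a small illustration of equivalent parametrisations, the sketch below treats \(\theta\) in \(\operatorname{Gam}(\alpha, \theta)\) as a rate (an assumption made for this example); scipy.stats.gamma itself uses a shape/scale convention, so the rate must be inverted.

```python
# One gamma distribution, indexed in two equivalent ways:
# (shape alpha, rate theta) versus (shape alpha, scale 1/theta).
import numpy as np
from scipy import stats

alpha, rate = 2.0, 4.0
scale = 1.0 / rate            # same distribution, different parametrisation

x = np.linspace(0.01, 2.0, 5)
pdf = stats.gamma.pdf(x, a=alpha, scale=scale)

# the mean is alpha/rate in one parametrisation, alpha*scale in the other
print(alpha / rate, alpha * scale, stats.gamma.mean(a=alpha, scale=scale))
```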

2.3 Model building

Choosing the right distribution

When choosing a distribution we typically aim to align the characteristics of the distribution with those of the observations. For instance, if the data exhibit long tails, we will need to use a long-tailed model. Additionally, there may be a mechanistic rationale, such as a physical law, suggesting that the underlying process follows a particular model.

In many cases, the central limit theorem justifies using a normal distribution.
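A minimal simulation sketch of this idea: standardised sums of i.i.d. draws from a skewed distribution (here the exponential) are already close to a standard normal for moderate sample sizes.

```python
# Central limit theorem in action: standardised sums of Exp(1) draws.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 100, 10_000

sums = rng.exponential(scale=1.0, size=(reps, n)).sum(axis=1)
z = (sums - n) / np.sqrt(n)      # Exp(1) has mean 1 and variance 1

print(np.mean(z), np.var(z))               # approx 0 and 1
print(stats.kstest(z, "norm").statistic)   # small distance to N(0,1)
```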

Another approach to selecting a distribution family is to fix certain properties of the distribution, such as its mean and variance, and then select the family that maximises the spread of the probability mass. This method is closely linked to the principle of maximum entropy, which will be discussed in more detail in Chapter 6. It also helps to explain why exponential families are often preferred in statistical modelling.

Complex models

Statistical analysis often uses models that are composed of many random variables. These models can be complex, with hierarchical or network-like structures that connect observed and latent variables, and potentially nonlinear functional relationships. Even so, the most sophisticated statistical models are constructed from simpler, more fundamental components.

Specifically, the large class of graphical models provides a principled means to form complex joint distributions for observed and unobserved random variables from more elementary components. These include regression models, mixture models and compound models (continuous versions of mixture models) as well as more general network-like and hierarchically structured models.

In these complex models some of the underlying elementary distributions will serve to model the observed output while others represent internal variables or account for the uncertainty regarding a parameter (in a Bayesian context).

In statistical course units in year 3 and year 4 you will discuss and learn about many types of advanced models, related for instance to

  • multivariate statistics and machine learning
  • temporal and spatial modelling, and
  • generalised linear and nonparametric models.

Iterative refinement

Models and distributions are best considered as approximations of the true unknown data-generating process. The aim of data analysis is therefore to find models that capture the essential properties at an appropriate level of detail¹.

This is typically done in an iterative fashion, either starting from a simple model and increasing complexity as needed, or alternatively, starting from a highly parametrised model and simplifying it. In either approach, systematic methods are required to compare different models, to assess their fit to the data, and to evaluate how well they predict future observations. Statistics provides principled tools for all these tasks.

2.4 \(\color{Red} \blacktriangleright\) Exponential families

Overview

Many commonly used distributions in statistics are exponential families, including core examples such as the Bernoulli distribution and the normal distribution.

Exponential families are central in probability and statistics. They support effective statistical learning using likelihood and Bayesian approaches, enable data reduction via minimal sufficiency, and provide the basis for generalised linear models. Furthermore, exponential families often make it possible to generalise results established for specific cases, such as the normal distribution, to a broader domain.

Definition

An exponential family \(P(\boldsymbol \eta)\) arises by exponential tilting of a base distribution \(B\) with (typically unnormalised) base function \(h(x)\) toward the linear combination \(\boldsymbol \eta^T \boldsymbol t(x)\) of the canonical statistics \(\boldsymbol t(x)\) and the canonical parameters \(\boldsymbol \eta\). This yields a pdmf of the form \[ p(x|\boldsymbol \eta) = \underbrace{e^{ \langle \boldsymbol \eta, \boldsymbol t(x) \rangle }}_{\text{exponential tilt}}\, h(x) \, /\, z(\boldsymbol \eta) \]

The base pdmf is obtained at \(\boldsymbol \eta=0\) yielding \(b(x) = p(x | \boldsymbol \eta=0) = h(x) / z(0)\). If \(h(x)\) is already a normalised pdmf then \(z(0)=1\) and \(b(x)=h(x)\).
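The following numerical sketch illustrates the tilting construction on a (truncated) discrete support. The base function \(h(x) = 1/x!\) with canonical statistic \(t(x) = x\) is chosen purely for illustration; tilting it yields the Poisson family with rate \(\lambda = e^\eta\) and partition function \(z(\eta) = e^{e^\eta}\).

```python
# Exponential tilting of the base function h(x) = 1/x! with t(x) = x.
import numpy as np
from scipy.special import factorial

x = np.arange(0, 50)            # truncated support; tail mass is negligible
h = 1.0 / factorial(x)

def tilted_pmf(eta):
    w = np.exp(eta * x) * h     # exponential tilt of the base function
    z = w.sum()                 # partition function z(eta)
    return w / z, z

pmf, z = tilted_pmf(0.7)
print(z, np.exp(np.exp(0.7)))   # numeric z(eta) vs closed form exp(e^eta)

# base pdmf at eta = 0: b(x) = h(x)/z(0) = e^{-1}/x!, i.e. Poisson(1)
pmf0, z0 = tilted_pmf(0.0)
print(z0, np.e)                 # z(0) = e, so h alone is unnormalised
```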

The above presentation of exponential families assumes a univariate random variable (scalar \(x\)) but also applies to multivariate random variables (vector \(\boldsymbol x\) or matrix \(\boldsymbol X\)).

Likewise, canonical statistics and parameters are written as vectors but these may also be scalars or matrices (or a combination of both). The use of inner product notation \(\langle \cdot, \cdot \rangle\) includes all these cases, vectorising matrices as required, recalling that \(\langle \boldsymbol A, \boldsymbol B\rangle = \operatorname{Tr}( \boldsymbol A^T \boldsymbol B) = \operatorname{Vec}(\boldsymbol A)^T \operatorname{Vec}(\boldsymbol B)\).
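The trace/vectorisation identity is easy to verify numerically:

```python
# Check <A, B> = Tr(A^T B) = Vec(A)^T Vec(B) for random matrices.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(3, 4))

trace_form = np.trace(A.T @ B)
vec_form = A.ravel() @ B.ravel()   # flatten both the same way, then dot
print(np.isclose(trace_form, vec_form))   # True
```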

Canonical statistics and canonical parameters

The canonical statistics \(\boldsymbol t(x)\) are transformations of \(x\), usually simple functions such as the identity (\(x\)), the square (\(x^2\)), the inverse (\(1/x\)) or the logarithm (\(\log x\)). Typically, the dimension of \(\boldsymbol t(x)\) is small.

The canonical statistics \(\boldsymbol t(x)\) may be affinely dependent. If this is the case, there is a nonzero vector \(\boldsymbol \eta_0\) for which
\[ \langle \boldsymbol \eta_0, \boldsymbol t(x) \rangle = \text{const.} \] If the elements in \(\boldsymbol t(x)\) are affinely independent the representation of the exponential family is minimal or complete, otherwise the representation is non-minimal or overcomplete.

For each canonical statistic there is a corresponding canonical parameter so the dimensions and shape of \(\boldsymbol t(x)\) and \(\boldsymbol \eta\) match.

In a minimal representation the canonical parameters of the exponential family are identifiable and hence distinct parameter settings for \(\boldsymbol \eta\) yield distinct distributions. Conversely, in a non-minimal or overcomplete representation there are redundant elements in the canonical parameters \(\boldsymbol \eta\) and the distributions within the exponential family are not identifiable. Specifically, there will be multiple \(\boldsymbol \eta\) yielding the same underlying distribution.

The canonical parameters \(\boldsymbol \eta\) are typically some transformation of the conventional parameters \(\boldsymbol \theta\).

Partition function

The normaliser or partition function \(z(\boldsymbol \eta)\) ensures that \(p(x|\boldsymbol \eta)\) integrates to one, with \[ z(\boldsymbol \eta) = \int_x \, e^{ \langle \boldsymbol \eta, \boldsymbol t(x) \rangle}\, h(x) \, dx \] For discrete \(x\), replace the integral by a sum.

The set of values of \(\boldsymbol \eta\) for which \(z(\boldsymbol \eta) < \infty\), and hence for which \(p(x|\boldsymbol \eta)\) is well defined, comprises the parameter space of the exponential family. Some choices of \(h(x)\) and \(\boldsymbol t(x)\) do not yield a finite normalising factor for any \(\boldsymbol \eta\) and hence these cannot be used to form an exponential family.
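As a sketch of how the partition function can be obtained numerically, the code below integrates \(e^{\eta_1 x + \eta_2 x^2}\) with \(h(x) = 1\), i.e. the normal family whose closed form appears in Example 2.2. The integral is finite only for \(\eta_2 < 0\), which is exactly the parameter space constraint discussed above.

```python
# Numerical partition function for h(x) = 1 and t(x) = (x, x^2).
import numpy as np
from scipy.integrate import quad

def z(eta1, eta2):
    # z(eta) = integral over the real line of exp(eta1*x + eta2*x^2)
    val, _ = quad(lambda x: np.exp(eta1 * x + eta2 * x**2), -np.inf, np.inf)
    return val

eta1, eta2 = 1.0, -0.5
closed_form = np.sqrt(-np.pi / eta2) * np.exp(-eta1**2 / (4 * eta2))
print(z(eta1, eta2), closed_form)   # agree; for eta2 >= 0 the integral diverges
```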

The log-normaliser or log-partition function \[ a(\boldsymbol \eta) = \log z(\boldsymbol \eta) \] allows us to compute the cumulants of the canonical statistics. In particular, its gradient yields the mean \[ \begin{split} \operatorname{E}( \boldsymbol t(x) ) = \boldsymbol \mu_{\boldsymbol t} & = \nabla a(\boldsymbol \eta)\\ &= \frac{\nabla z(\boldsymbol \eta)}{z(\boldsymbol \eta)} \end{split} \] and its Hessian matrix yields the variance \[ \begin{split} \operatorname{Var}( \boldsymbol t(x) ) = \boldsymbol \Sigma_{\boldsymbol t} & = \nabla \nabla^T a(\boldsymbol \eta)\\ &= \frac{\nabla \nabla^T z(\boldsymbol \eta)}{z(\boldsymbol \eta)} - \left(\frac{\nabla z(\boldsymbol \eta)}{z(\boldsymbol \eta)}\right) \left(\frac{\nabla z(\boldsymbol \eta)}{z(\boldsymbol \eta)}\right)^T \end{split} \]
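These identities can be checked by finite differences. The sketch below reuses the Poisson-type family from the tilting example above, where \(a(\eta) = e^\eta\), so the mean and variance of \(t(x) = x\) both equal \(\lambda = e^\eta\).

```python
# Mean and variance of t(x) from numerical derivatives of a(eta) = log z(eta).
import numpy as np
from scipy.special import factorial

x = np.arange(0, 60)
h = 1.0 / factorial(x)

def a(eta):
    return np.log(np.sum(np.exp(eta * x) * h))   # log-partition function

eta, eps = 0.5, 1e-4
mean_fd = (a(eta + eps) - a(eta - eps)) / (2 * eps)           # a'(eta)
var_fd = (a(eta + eps) - 2 * a(eta) + a(eta - eps)) / eps**2  # a''(eta)
print(mean_fd, var_fd, np.exp(eta))   # all approx lambda = e^eta
```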

For a minimal exponential family \(\boldsymbol \Sigma_{\boldsymbol t}\) is a positive definite matrix and invertible, whereas for non-minimal representations the covariance matrix is positive semi-definite and not invertible.
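A small example of the non-minimal case: taking \(\boldsymbol t(x) = (x, 1-x)^T\) for the Bernoulli distribution is overcomplete, since the two components always sum to one. The resulting covariance matrix (worked out by hand below) has rank one and is not invertible.

```python
# Overcomplete canonical statistics t(x) = (x, 1-x) for Ber(theta):
# the components are affinely dependent, so Sigma_t is singular.
import numpy as np

theta = 0.3
v = theta * (1 - theta)        # Var(x) = Var(1-x); Cov(x, 1-x) = -v
Sigma_t = np.array([[ v, -v],
                    [-v,  v]])

print(np.linalg.matrix_rank(Sigma_t))   # 1, not 2
print(np.linalg.det(Sigma_t))           # 0: not invertible
```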

The means \(\boldsymbol \mu_{\boldsymbol t}\) of the canonical statistics \(\boldsymbol t(x)\) provide a further parametrisation of exponential families, in addition to the canonical parameters \(\boldsymbol \eta\) and the conventional parameters \(\boldsymbol \theta\), and are called the expectation parameters.

Examples

Example 2.1 \(\color{Red} \blacktriangleright\) Bernoulli distribution \(\operatorname{Ber}(\theta)\) as exponential family:

The Bernoulli distribution \(\operatorname{Ber}(\theta)\) can be specified in exponential family form as follows:

  • \(x \in \{0, 1\}\)
  • canonical statistic \(t(x) = x\)
  • base function \(h(x) = 1\)
  • canonical parameter \(\eta\)

This results in the partition function \[ z(\eta) = \sum_{x \in \{0,1\}} e^{\eta x} = 1+e^\eta \] and the log-partition function \(a(\eta) = \log z(\eta)\), both of which are defined for \(\eta \in \mathbb{R}\).

The first and second derivatives of the partition function \(z(\eta)\) are \(z'(\eta) = e^\eta\) and \(z''(\eta) = e^\eta\).

The mean \(\mu_t\) of the canonical statistic \(t(x)=x\) is given by \[ \begin{split} \mu_t = a'(\eta) &= \frac{z'(\eta)}{z(\eta)} = \frac{ e^{\eta}}{1+e^{\eta}} \\ & = \operatorname{logit}^{-1}(\eta) \\ &= \theta \end{split} \] Since \(t(x)=x\) for the Bernoulli distribution, the expectation parameter \(\mu_t\) corresponds to the conventional mean parameter \(\operatorname{E}(x) = \theta \in [0, 1]\). In the above, \(\mu_t\) and \(\theta\) are obtained from \(\eta\) by the inverse logit function (also known as the logistic function).

Conversely, the canonical parameter \(\eta\) can be computed from \(\theta\) (or \(\mu_t\)) by the logit function \[ \eta = \operatorname{logit}\theta = \log\left(\frac{\theta}{1-\theta}\right) \]

The variance \(\sigma^2_t\) of the canonical statistic \(t(x)=x\) is \[ \begin{split} \sigma^2_t =a''(\eta) &= \frac{z''(\eta)}{z(\eta)} - \left(\frac{z'(\eta)}{z(\eta)}\right)^2 \\ &= \frac{ e^{\eta}}{(1+e^{\eta})^2} \\ &= \theta (1-\theta) \end{split} \]

The pmf of the Bernoulli distribution is \[ p(x | \eta) = \frac{ e^{\eta x} }{ z(\eta) } = \frac{ e^{\eta x} }{ 1+e^\eta } \] which in terms of the conventional parameter \(\theta\) takes on the familiar form \[ \begin{split} p(x | \theta) &= \theta^x \, (1-\theta)^{1-x} \\ &= \begin{cases} \theta &\text{if } x=1\\ 1-\theta &\text{if } x=0 \end{cases} \end{split} \]
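The relations in this example are easy to verify numerically; the short sketch below checks the logit link, the mean and variance of the canonical statistic, and the agreement of the two forms of the pmf.

```python
# Numerical check of the Bernoulli exponential family relations.
import numpy as np

theta = 0.3
eta = np.log(theta / (1 - theta))    # canonical parameter via the logit

z = 1 + np.exp(eta)                  # partition function
mu_t = np.exp(eta) / z               # a'(eta), the inverse logit
sigma2_t = np.exp(eta) / z**2        # a''(eta)
print(mu_t, theta)                   # equal
print(sigma2_t, theta * (1 - theta)) # equal

for x in (0, 1):
    p_canonical = np.exp(eta * x) / z
    p_conventional = theta**x * (1 - theta)**(1 - x)
    print(x, p_canonical, p_conventional)   # agree
```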

Example 2.2 \(\color{Red} \blacktriangleright\) Normal distribution \(N(\mu, \sigma^2)\) as exponential family:

The two-parameter normal distribution \(N(\mu, \sigma^2)\) can be written in exponential family form as follows:

  • \(x \in \mathbb{R}\)
  • canonical statistics \(\boldsymbol t(x) = (x, x^2)^T\)
  • base function \(h(x) = 1\)
  • canonical parameters \(\boldsymbol \eta= (\eta_1, \eta_2)^T\)

This results in the partition function \[ z(\boldsymbol \eta) = \left( -\frac{\pi}{\eta_2} \right)^{1/2} \, \exp\left(-\frac{\eta_1^2}{4\eta_2}\right) \] and the log-partition function \[ a(\boldsymbol \eta) =-\frac{\eta_1^2}{4 \eta_2} +\frac{1}{2} \log\left(-\frac{\pi}{\eta_2} \right) \] which are defined for \(\eta_1 \in \mathbb{R}\) and \(\eta_2 \in \mathbb{R}^{-}\).

The mean \(\boldsymbol \mu_{\boldsymbol t}\) of the canonical statistics \(\boldsymbol t(x) = (x, x^2)^T\) is given by \[ \begin{split} \boldsymbol \mu_{\boldsymbol t} &= \nabla a(\boldsymbol \eta) \\ &= \begin{pmatrix} -\frac{\eta_1}{2 \eta_2} \\ \frac{\eta_1^2}{4 \eta_2^2} - \frac{1}{2\eta_2}\\ \end{pmatrix}\\ &=\begin{pmatrix}\mu \\ \mu^2+ \sigma^2 \end{pmatrix} \end{split} \] The conventional parameters \(\mu = \operatorname{E}(x)\) and \(\sigma^2=\operatorname{Var}(x)\) are thus directly linked to the expectation parameters \(\boldsymbol \mu_{\boldsymbol t}\) and can be obtained from the canonical parameters by \(\mu = -\frac{\eta_1}{2\eta_2}\) and \(\sigma^2 = -\frac{1}{2\eta_2}\). Conversely, we have \[ \boldsymbol \eta= \begin{pmatrix} \eta_1 \\ \eta_2 \end{pmatrix} = \begin{pmatrix} \frac{\mu}{\sigma^2} \\ - \frac{1}{2 \sigma^2} \end{pmatrix} \]

The covariance matrix of the canonical statistics \(\boldsymbol t(x)\) is \[ \begin{split} \boldsymbol \Sigma_{\boldsymbol t} &= \begin{pmatrix} \operatorname{Var}(x) & \operatorname{Cov}(x, x^2) \\ \operatorname{Cov}(x^2, x) & \operatorname{Var}(x^2) \\ \end{pmatrix}\\ &= \nabla \nabla^T a(\boldsymbol \eta) \\ &= \begin{pmatrix} -\frac{1}{2\eta_2} & \frac{\eta_1}{2 \eta_2^2} \\ \frac{\eta_1}{2 \eta_2^2} & \frac{ \eta_2 -\eta_1^2 }{2 \eta_2^3} \\ \end{pmatrix}\\ &= \begin{pmatrix} \sigma^2 & 2 \mu \sigma^2 \\ 2 \mu \sigma^2 & 2 \sigma^4 +4 \mu^2 \sigma^2 \\ \end{pmatrix}\\ \end{split} \]

The pdf in terms of \(\boldsymbol \eta\) is \[ p(x | \boldsymbol \eta) = \left( -\frac{\pi}{\eta_2} \right)^{-1/2} \, \exp\left( \eta_1 x + \eta_2 x^2 +\frac{\eta_1^2}{4\eta_2}\right) \] which in terms of the standard parameters \(\mu\) and \(\sigma^2\) takes on the familiar form \[ p(x | \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left( - \frac{(x- \mu)^2}{2\sigma^2 } \right) \]
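A corresponding numerical check for the normal case confirms the parameter transformations and the agreement of the two forms of the pdf:

```python
# Numerical check of the normal exponential family relations.
import numpy as np
from scipy import stats

mu, sigma2 = 1.5, 2.0
eta1, eta2 = mu / sigma2, -1.0 / (2 * sigma2)   # canonical parameters

# back-transform recovers the conventional parameters
print(-eta1 / (2 * eta2), -1.0 / (2 * eta2))    # mu, sigma2

x = 0.7
p_canonical = (-np.pi / eta2) ** (-0.5) * np.exp(
    eta1 * x + eta2 * x**2 + eta1**2 / (4 * eta2))
p_conventional = stats.norm.pdf(x, loc=mu, scale=np.sqrt(sigma2))
print(p_canonical, p_conventional)              # agree
```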

2.5 Further reading

For details of the distributions listed above and additional background on exponential families see the supplementary Probability and Distribution Refresher notes.

A recent textbook on exponential families is Efron (2022).

Tip: A bit of history

The concept of exponential families in statistics was independently developed in the 1930s by Georges Darmois (1888–1960), Edwin J. G. Pitman (1897–1993), and Bernard Koopman (1900–1981). However, exponential families were introduced earlier in statistical mechanics in the 1870s by Josiah W. Gibbs (1839–1903) and Ludwig Boltzmann (1844–1906).

In statistics, exponential families were motivated by identifying distributions allowing minimal sufficient statistics whereas in physics the objective was to find distributions maximising entropy.


  1. In fact, processes at one length or time scale can often be modelled independently of those at other scales. This general phenomenon is known as decoupling of scales.↩︎