2 Probability
2.1 Random variables
A random variable describes a random experiment. The set of all possible outcomes is the sample space of the random variable and is denoted by \(\Omega\). If \(\Omega\) is countable then the random variable is discrete, otherwise it is continuous. For a discrete random variable the sample space \(\Omega = \{\omega_1, \omega_2, \ldots\}\) is composed of a finite or infinite number of elementary outcomes \(\omega_i\).
An event \(A \subseteq \Omega\) is a subset of \(\Omega\). This includes as special cases the complete set \(\Omega\) (“certain event”) and the empty set \(\emptyset\) (“impossible event”). The set of all possible events is denoted by \(\mathcal{F}\). The complementary event \(A^C = \Omega \setminus A\) is the complement of the set \(A\) in the sample space \(\Omega\). Two events \(A_1\) and \(A_2\) are mutually exclusive if the sets are disjoint with \(A_1 \cap A_2 = \emptyset\).
For a discrete random variable, the elementary outcomes \(\omega_i\) are referred to as elementary events, and they are all mutually exclusive. An event \(A\) consists of a number of elementary events \(\omega_i \in A\) and the complementary event is given by \(A^C = \{\omega_i \in \Omega: \omega_i \notin A\}\).
The probability of an event \(A\) is denoted by \(\text{Pr}(A)\). Broadly, \(\text{Pr}(A)\) provides a measure of the size of the set \(A\) relative to the set \(\Omega\). The probability measure \(\text{Pr}(A)\) satisfies the three axioms of probability:
- \(\text{Pr}(A) \geq 0\), probabilities are non-negative,
- \(\text{Pr}(\Omega) = 1\), the certain event has probability 1, and
- \(\text{Pr}(A_1 \cup A_2 \cup \ldots) = \sum_i \text{Pr}(A_i)\), the probability of a countable union of mutually exclusive events \(A_i\) is the sum of the individual probabilities.
This implies
- \(\text{Pr}(A) \leq 1\), probability values lie within the range \([0,1]\),
- \(\text{Pr}(A^C) = 1 - \text{Pr}(A)\), the probability of the complement, and
- \(\text{Pr}(\emptyset) = 0\), the impossible event has probability 0.
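As a short illustration, the following Python sketch verifies the axioms and their consequences on a finite sample space, using a fair six-sided die (an arbitrarily chosen example) and exact fractions:

```python
# Probability axioms on a finite sample space: a fair six-sided die.
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}                  # sample space
def pr(A):                                  # uniform probability measure
    return Fraction(len(A), len(omega))

A = {2, 4, 6}                               # event "even number"
B = {1}                                     # disjoint from A

assert pr(A) >= 0                           # non-negativity
assert pr(omega) == 1                       # certain event
assert pr(A | B) == pr(A) + pr(B)           # additivity (disjoint events)

assert pr(A) <= 1                           # implied: bounded by 1
assert pr(omega - A) == 1 - pr(A)           # complement rule
assert pr(set()) == 0                       # impossible event
```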
From the above it is evident that probability is closely linked to set theory, in particular to measure theory, which provides the theoretical foundation of probability and its generalisations. For instance, if \(\text{Pr}(\emptyset) = 0\) is assumed instead of \(\text{Pr}(\Omega) = 1\), this leads to the axioms for a positive measure (of which probability is a special case).
2.2 Conditional probability
Consider two events \(A\) and \(B\), which need not be mutually exclusive. The probability of the event “\(A\) and \(B\)” is given by the probability of the set intersection \(\text{Pr}(A \cap B)\). The probability of the event “\(A\) or \(B\)” is given by the probability of the set union \[ \text{Pr}(A \cup B) = \text{Pr}(A) + \text{Pr}(B) - \text{Pr}(A \cap B)\,. \] This identity follows from the axioms.
The conditional probability of event \(A\) given that event \(B\) has occurred (with \(\text{Pr}(B) > 0\)) is \[ \text{Pr}(A | B) = {\text{Pr}( A \cap B) \over \text{Pr}(B)} \,. \] Essentially, \(B\) now acts as the new sample space relative to which \(A\) is measured, replacing \(\Omega\). Note that \(\text{Pr}(A | B)\) is generally not the same as \(\text{Pr}(B | A)\), see Bayes’ theorem below.
Importantly, it can be seen that any probability may be viewed as conditional, namely relative to \(\Omega\) as \(\text{Pr}(A) = \text{Pr}(A| \Omega)\).
From the definition of conditional probability we derive the product rule \[ \begin{split} \text{Pr}( A \cap B) &= \text{Pr}(A | B)\, \text{Pr}(B) \\ &= \text{Pr}(B | A)\, \text{Pr}(A) \end{split} \] which in turn yields Bayes’ theorem \[ \text{Pr}(A | B ) = \text{Pr}(B | A) { \text{Pr}(A) \over \text{Pr}(B)} \] This theorem is useful for changing the order of conditioning and it plays a key role in Bayesian statistics.
If \(\text{Pr}( A \cap B) = \text{Pr}(A) \, \text{Pr}(B)\) then the two events \(A\) and \(B\) are independent with \(\text{Pr}(A | B ) = \text{Pr}(A)\) and \(\text{Pr}(B | A ) = \text{Pr}(B)\).
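For illustration, the following Python sketch continues the dice example; the events \(A\) (“even”) and \(B\) (“greater than 3”) are arbitrary example choices:

```python
# Conditional probability, the product rule and Bayes' theorem
# on a fair six-sided die.
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}
def pr(E):
    return Fraction(len(E), len(omega))

A = {2, 4, 6}                         # "even"
B = {4, 5, 6}                         # "greater than 3"

pr_A_given_B = pr(A & B) / pr(B)      # conditional probability
pr_B_given_A = pr(A & B) / pr(A)

# product rule: Pr(A and B) written in both ways
assert pr(A & B) == pr_A_given_B * pr(B) == pr_B_given_A * pr(A)

# Bayes' theorem changes the order of conditioning
assert pr_A_given_B == pr_B_given_A * pr(A) / pr(B)

# here A and B are not independent: Pr(A and B) = 1/3,
# whereas Pr(A) Pr(B) = 1/4
print(pr(A & B), pr(A) * pr(B))
```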
2.3 Probability mass and density function
To describe a random variable \(x\) with associated sample space \(\Omega\) and event space \(\mathcal{F}\) we use probability mass and density functions as a means to effectively work with the corresponding probabilities.
We use the same symbol to denote the random variable and the corresponding observations.¹ This notation facilitates working with multivariate random variables and is standard in many research papers in statistical machine learning and in multivariate statistics, see for instance the classic textbook by Mardia, Kent, and Bibby (1979).
For a discrete random variable we define the event \(A = \{x: x=a\} = \{a\}\) and get the probability \[ \text{Pr}(A) = \text{Pr}(x=a) = f(a) \] directly from the probability mass function (pmf), here denoted by lower case \(f\) (but we frequently also use \(p\) or \(q\)). The pmf has the property that \(\sum_{x \in \Omega} f(x) = 1\) and that \(f(x) \in [0,1]\).
For continuous random variables we need to use a probability density function (pdf) instead. We define the event \(A = \{x: a < x \leq a + da\}\) as an infinitesimal interval and then assign the probability \[ \text{Pr}(A) = \text{Pr}( a < x \leq a + da) = f(a) da \,. \] The pdf has the property that \(\int_{x \in \Omega} f(x) dx = 1\) but in contrast to a pmf the density \(f(x)\geq 0\) may take on values larger than 1.
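The difference between a pmf and a pdf can be checked numerically. The sketch below (using scipy, with arbitrarily chosen binomial and normal examples) confirms that a pmf sums to 1, while a pdf integrates to 1 yet may exceed 1 pointwise:

```python
import numpy as np
from scipy import stats, integrate

# pmf of a discrete random variable: values in [0, 1], summing to 1
d = stats.binom(n=10, p=0.3)
x = np.arange(0, 11)
print(d.pmf(x).sum())                  # 1.0

# pdf of a continuous random variable: integrates to 1 ...
norm = stats.norm(loc=0, scale=0.1)
val, _ = integrate.quad(norm.pdf, -1, 1)   # covers the effective support
print(val)                             # approximately 1

# ... but the density itself may exceed 1
print(norm.pdf(0))                     # about 3.99
```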
The set of all \(x\) for which \(f(x)\) is positive is called the support of the pmf or pdf.
It is sometimes convenient to refer to a pdf or pmf as a probability density mass function (pdmf) without specifying whether \(x\) is a continuous or discrete random variable.

Figure 2.1 (first row) illustrates the pdmf for a continuous and discrete random variable.
2.4 Distribution function
As an alternative to the pdmf we may use a distribution function to describe the random variable. This assumes that an ordering exists so that we can define the event \(A = \{x: x \leq a \}\) and compute its probability as \[ F(a) = \text{Pr}(A) = \text{Pr}( x \leq a ) = \begin{cases} \sum_{x \in A} f(x) & \text{discrete case} \\ \int_{x \in A} f(x) dx & \text{continuous case} \\ \end{cases} \] This is known as the cumulative distribution function (cdf) and is denoted by upper case \(F\) (or \(P\) and \(Q\)). By construction the distribution function is monotonically non-decreasing and its value ranges from 0 to 1. For a discrete random variable the distribution function \(F(a)\) is a step function with jumps of size \(f(\omega_i)\) at the elementary outcomes \(\omega_i\).
If the random variable \(x\) has distribution function \(F\) we write \(x \sim F\).
With its help we can compute the probability of an interval as \[ \text{Pr}( a_1 < x \leq a_2 ) = F(a_2)-F(a_1) \,. \]
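In Python the cdf of many standard distributions is directly available. The following sketch (scipy, with a standard normal and a binomial as arbitrary examples) computes an interval probability as \(F(a_2)-F(a_1)\):

```python
from scipy import stats

F = stats.norm(loc=0, scale=1).cdf     # standard normal cdf

a1, a2 = -1.0, 1.0
print(F(a2) - F(a1))                   # Pr(-1 < x <= 1), about 0.6827

# discrete case: the cdf is a step function
G = stats.binom(n=10, p=0.3).cdf
print(G(3))                            # Pr(x <= 3)
```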
Figure 2.1 (second row) illustrates the distribution function for a continuous and discrete random variable.
2.5 Quantile function and quantiles
The quantile function is defined as \(q_F(b) = \min\{ x: F(x) \geq b \}\). For a continuous random variable with strictly increasing distribution function the quantile function simplifies to \(q_F(b) = F^{-1}(b)\), i.e. it is the ordinary inverse of the distribution function.
Figure 2.1 (third row) illustrates the quantile function for a continuous and discrete random variable.
The quantile \(x\) of order \(b\) of the distribution \(F\) is often denoted by \(x_b= q_F(b)\).
- The 25% quantile \(x_{0.25} = x_{25\%} = q_F(1/4)\) is called the first quartile or lower quartile.
- The 50% quantile \(x_{0.5} = x_{50\%} = q_F(1/2)\) is called the second quartile or median.
- The 75% quantile \(x_{0.75} = x_{75\%} = q_F(3/4)\) is called the third quartile or upper quartile.
The interquartile range is the difference between the upper and lower quartiles and equals \(\text{IQR}(F) = q_F(3/4) - q_F(1/4)\).
The quantile function is also useful for generating general random variates from uniform random variates. If \(y\sim \text{Unif}(0,1)\) then \(x=q_F(y) \sim F\).
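This construction, known as inverse transform sampling, is illustrated in the sketch below; in scipy the quantile function is called ppf, and the exponential distribution is an arbitrary example choice:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
y = rng.uniform(0, 1, size=100_000)    # y ~ Unif(0, 1)

# applying the quantile function (scipy: ppf) yields draws from F
x = stats.expon(scale=2.0).ppf(y)      # x ~ exponential with mean 2

print(x.mean())                        # close to 2
```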
2.6 Families of distributions
A distribution \(F_{\theta}\) with a parameter \(\theta\) constitutes a distribution family collecting all the distributions corresponding to particular instances of the parameter. The parameter \(\theta\) therefore acts as an index of the distributions contained in the family.
The corresponding pdmf is written either as \(f_{\theta}(x)\), \(f(x; \theta)\) or \(f(x | \theta)\). The latter form is the most general as it suggests that the parameter \(\theta\) may potentially also have its own distribution, with a joint density formed by \(f(x, \theta) = f(x | \theta) f(\theta)\).
Note that any parametrisation is generally not unique, as a one-to-one transformation of \(\theta\) will yield another equivalent index to the same distribution family. Typically, for most commonly used distribution families there exist several standard parametrisations. Often we prefer to use those parametrisations where the parameters can be interpreted easily (e.g. in terms of moments) or that simplify calculations.
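As a small illustration of equivalent parametrisations, the sketch below (scipy, with the exponential family as an example) compares the rate parametrisation with the scale parametrisation of the same distribution:

```python
import numpy as np
from scipy import stats

rate = 0.5                              # rate parametrisation
d = stats.expon(scale=1 / rate)         # scipy uses scale = 1/rate

x = np.linspace(0, 10, 5)
f_rate = rate * np.exp(-rate * x)       # density written with the rate
print(np.allclose(d.pdf(x), f_rate))    # True: the same distribution
```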
If for any pair of different parameter values \(\theta_1 \neq \theta_2\) we get distinct distributions with \(F_{\theta_1} \neq F_{\theta_2}\) then the distribution family \(F_{\theta}\) is said to be identifiable by the parameter \(\theta\).
2.7 Expectation or mean
The expected value \(\text{E}(x)\) of a random variable \(x\) is defined as the weighted average over all possible outcomes, with the weight given by the pdmf \(f(x)\): \[ \text{E}(x) = \begin{cases} \sum_{x \in \Omega} f(x) \, x & \text{discrete case} \\ \int_{x \in \Omega} f(x) \, x \, dx & \text{continuous case} \\ \end{cases} \] We may also write \(\text{E}_{F}(x)\) as a reminder that the expectation is taken with regard to the distribution \(F\). Usually, the subscript \(F\) is left out if there are no ambiguities. A further variant is to write the expectation as \(\text{E}(F)\) to indicate that we are computing the mean of the distribution \(F\).
Because the sum or integral may diverge, not all distributions have a finite mean, so the mean does not always exist (in contrast to the median, or quantiles in general). For example, the location-scale \(t\)-distribution \(t_{\nu}(\mu, \tau^2)\) does not have a mean for degrees of freedom in the range \(0 < \nu \leq 1\) (see Section 4.6).
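Both cases of the definition, and the possible non-existence of the mean, can be illustrated numerically. The sketch below uses arbitrarily chosen example distributions; the simulation for \(\nu = 1\) only suggests, rather than proves, divergence:

```python
import numpy as np
from scipy import stats, integrate

# discrete case: weighted sum over the sample space
d = stats.binom(n=10, p=0.3)
x = np.arange(0, 11)
print(np.sum(d.pmf(x) * x))            # 3.0 = n p

# continuous case: weighted integral
f = stats.norm(loc=1.5, scale=2).pdf
mean, _ = integrate.quad(lambda t: f(t) * t, -np.inf, np.inf)
print(mean)                            # 1.5

# t-distribution with nu = 1 (Cauchy): the mean does not exist, and
# sample averages do not stabilise as the sample size grows
rng = np.random.default_rng(1)
draws = stats.t(df=1).rvs(size=100_000, random_state=rng)
print(draws.mean())                    # erratic from run to run
```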
2.8 Expectation of a transformed random variable
Often, one needs to find the mean of a transformed random variable. If \(x\sim F_x\) and \(y= h(x)\) with \(y \sim F_y\) then one can directly apply the above definition to obtain \(\text{E}(y) = \text{E}(F_y)\). However, this requires knowledge of the transformed pdmf \(f_y(y)\) (see Chapter 3 for more details about variable transformations).
As an alternative, the “law of the unconscious statistician” (LOTUS) provides a convenient shortcut to compute the mean of the transformed random variable \(y=h(x)\) using only the pdmf of the original variable \(x\): \[ \text{E}(h(x)) = \begin{cases} \sum_{x \in \Omega} f(x) \, h(x) & \text{discrete case} \\ \int_{x \in \Omega} f(x) \, h(x) \, dx & \text{continuous case} \\ \end{cases} \] Note this is not an approximation but is equivalent to obtaining the mean using the transformed pdmf.
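A numerical check of LOTUS (a sketch with \(x \sim N(0,1)\) and \(h(x) = e^x\) as an arbitrary example; the transformed variable is then log-normal with known mean \(e^{1/2}\)):

```python
import numpy as np
from scipy import stats, integrate

f = stats.norm(0, 1).pdf               # pdf of the original variable x

# LOTUS: integrate h(x) weighted by the pdf of x
lotus, _ = integrate.quad(lambda t: f(t) * np.exp(t), -np.inf, np.inf)

# mean of the transformed variable y = exp(x), known in closed form
print(lotus, np.exp(0.5))              # both about 1.6487
```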
2.9 Variance
The variance of a random variable \(x\) is the expected value of the squared deviation around the mean: \[
\text{Var}(x) = \text{E}\left( (x - \text{E}(x))^2 \right)
\] By construction, \(\text{Var}(x) \geq 0\). Expanding the square and using the linearity of expectation yields the equivalent formula \[
\text{Var}(x) = \text{E}(x^2)-\text{E}(x)^2
\]
Occasionally we write \(\text{Var}_F(x)\) to express that the expectation is taken with regard to the distribution \(F\). The alternative notation \(\text{Var}(F)\) highlights that we are computing the variance of the distribution \(F\).
Like the mean, the variance may diverge and hence does not necessarily exist for all distributions. For example, the location-scale \(t\)-distribution \(t_{\nu}(\mu, \tau^2)\) does not have a variance for degrees of freedom in the range \(0 < \nu \leq 2\) (see Section 4.6).
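The two variance formulas can be compared numerically, as in this sketch (an exponential distribution with mean 2 serves as an arbitrary example):

```python
import numpy as np
from scipy import stats, integrate

f = stats.expon(scale=2.0).pdf         # exponential, mean 2, variance 4

E_x = integrate.quad(lambda t: f(t) * t, 0, np.inf)[0]
E_x2 = integrate.quad(lambda t: f(t) * t**2, 0, np.inf)[0]

var_def = integrate.quad(lambda t: f(t) * (t - E_x) ** 2, 0, np.inf)[0]
var_alt = E_x2 - E_x**2                # E(x^2) - E(x)^2

print(var_def, var_alt)                # both 4.0
```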
2.10 Moments of a distribution
The \(n\)-th moment of a distribution \(F\) for a random variable \(x\) is defined as follows: \[ \mu_n(F) = \text{E}(x^n) \]
Important special cases are the
- Zeroth moment: \(\mu_0(F) = \text{E}(x^0) = 1\) (since the pdmf integrates to one)
- First moment: \(\mu_1(F) = \text{E}(x^1) = \text{E}(x) = \mu\) (=the mean)
- Second moment: \(\mu_2(F) = \text{E}(x^2)\)
The \(n\)-th central moment centred around the mean \(\text{E}(x) = \mu\) is given by \[ m_n(F) = \text{E}((x-\mu)^n) \]
The first few central moments are the
- Zeroth central moment: \(m_0(F) = \text{E}((x-\mu)^0) = 1\)
- First central moment: \(m_1(F) = \text{E}((x-\mu)^1) = 0\)
- Second central moment: \(m_2(F) = \text{E}\left( (x - \mu)^2 \right)\) (=the variance)
The moments of a distribution are not necessarily all finite, i.e. some moments may not exist. For example, the location-scale \(t\)-distribution \(t_{\nu}(\mu, \tau^2)\) only has finite moments of order smaller than the degrees of freedom \(\nu\) (see Section 4.6).
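For many distributions the moments are available in closed form. The sketch below (scipy, with a normal distribution as an arbitrary example) relates raw and central moments; note that scipy’s moment() returns raw, i.e. non-central, moments:

```python
from scipy import stats

d = stats.norm(loc=2, scale=3)

mu1 = d.moment(1)          # first raw moment: the mean, 2.0
mu2 = d.moment(2)          # second raw moment: 13.0 = 3^2 + 2^2

m2 = mu2 - mu1**2          # second central moment: the variance, 9.0
print(mu1, mu2, m2)
```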
2.11 Jensen’s inequality for the expectation
If \(h(\boldsymbol x)\) is a convex function then the following inequality holds:
\[ \text{E}(h(\boldsymbol x)) \geq h(\text{E}(\boldsymbol x)) \]
Recall: a convex function (such as \(x^2\)) has the shape of a “valley”.
An example of Jensen’s inequality is \(\text{E}(x^2)\geq \text{E}(x)^2\).
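A quick Monte Carlo check of this example (a sketch with an arbitrarily chosen normal distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=1_000_000)

print(np.mean(x**2))       # E(x^2), about 5.0 (= 2^2 + 1^2)
print(np.mean(x)**2)       # E(x)^2, about 1.0, which is smaller
```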
2.12 Probability as expectation
Probability itself can also be understood as an expectation. For an event \(A\) we define the corresponding indicator function \([x \in A]\), which equals 1 if the elementary outcome \(x\) is part of \(A\) and 0 otherwise. From the definition of expectation it then follows that \[ \text{E}\left( \left[x \in A\right] \right) = \text{Pr}(A) \,. \] This relation is called the “fundamental bridge” between probability and expectation.
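The fundamental bridge also underlies Monte Carlo estimation of probabilities, as in the following sketch (the standard normal and the event \(\{x > 1.96\}\) are arbitrary example choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)

indicator = (x > 1.96)                 # 0/1 values of [x in A]
print(indicator.mean())                # about 0.025 = Pr(x > 1.96)
```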
Interestingly, one can develop the whole theory of probability from this perspective (e.g., Whittle 2000).
2.13 Random vectors and their mean and variance
In addition to scalar random variables we often make use of random vectors and also random matrices.²
For a random vector \(\boldsymbol x= (x_1, x_2,...,x_d)^T \sim F\) the mean \(\text{E}(\boldsymbol x) = \boldsymbol \mu\) is given by the means of its elements, i.e. \(\boldsymbol \mu= (\mu_1, \ldots, \mu_d)^T\) with \(\mu_i = \text{E}(x_i)\). Thus, the mean of a random vector of dimension \(d\) is a vector of the same length.
The variance of a random vector of length \(d\), however, is not a vector but a matrix of size \(d\times d\). This matrix is called the covariance matrix: \[ \begin{split} \text{Var}(\boldsymbol x) &= \underbrace{\boldsymbol \Sigma}_{d\times d} = (\sigma_{ij}) \\ &= \begin{pmatrix} \sigma_{11} & \dots & \sigma_{1d}\\ \vdots & \ddots & \vdots \\ \sigma_{d1} & \dots & \sigma_{dd} \end{pmatrix} \\ &=\text{E}\left(\underbrace{(\boldsymbol x-\boldsymbol \mu)}_{d\times 1} \underbrace{(\boldsymbol x-\boldsymbol \mu)^T}_{1\times d}\right) \\ & = \text{E}(\boldsymbol x\boldsymbol x^T)-\boldsymbol \mu\boldsymbol \mu^T \\ \end{split} \] The entries of the covariance matrix \(\text{Cov}(x_i, x_j)=\sigma_{ij}\) describe the covariance between the random variables \(x_i\) and \(x_j\). The covariance matrix is symmetric, hence \(\sigma_{ij}=\sigma_{ji}\). The diagonal entries \(\text{Cov}(x_i, x_i)=\sigma_{ii}\) correspond to the variances \(\text{Var}(x_i) = \sigma_i^2\) of the elements of \(\boldsymbol x\). The covariance matrix is by construction positive semi-definite, i.e. the eigenvalues of \(\boldsymbol \Sigma\) are all positive or equal to zero.
However, wherever possible one will aim to use models with non-singular covariance matrices, with all eigenvalues positive, so that the covariance matrix is invertible.
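These properties of the covariance matrix can be verified in simulation (a numpy sketch; the bivariate normal and its parameters are arbitrary example choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])          # a valid covariance matrix

x = rng.multivariate_normal(mu, Sigma, size=200_000)  # rows are draws

print(x.mean(axis=0))                   # close to mu
S = np.cov(x, rowvar=False)             # sample covariance matrix
print(S)                                # symmetric, close to Sigma

print(np.linalg.eigvalsh(S))            # eigenvalues: all non-negative
```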
2.14 Correlation matrix
The correlation matrix \(\boldsymbol P\) (“upper case rho”, not “upper case p”) is the variance standardised version of the covariance matrix \(\boldsymbol \Sigma\).
Specifically, denote by \(\boldsymbol V\) the diagonal matrix containing the variances \[ \boldsymbol V= \begin{pmatrix} \sigma_{11} & \dots & 0\\ \vdots & \ddots & \vdots \\ 0 & \dots & \sigma_{dd} \end{pmatrix} \] then the correlation matrix \(\boldsymbol P\) is given by \[ \boldsymbol P= (\rho_{ij}) = \begin{pmatrix} 1 & \dots & \rho_{1d}\\ \vdots & \ddots & \vdots \\ \rho_{d1} & \dots & 1 \end{pmatrix} = \boldsymbol V^{-1/2} \, \boldsymbol \Sigma\, \boldsymbol V^{-1/2} \] Like the covariance matrix the correlation matrix is symmetric. The elements of the diagonal of \(\boldsymbol P\) are all equal to 1.
Equivalently, in component notation the correlation between \(x_i\) and \(x_j\) is given by \[ \rho_{ij} = \text{Cor}(x_i,x_j) = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}} \]
Using the above, a covariance matrix can be factorised into the product of standard deviations \(\boldsymbol V^{1/2}\) and the correlation matrix as follows: \[ \boldsymbol \Sigma= \boldsymbol V^{1/2}\, \boldsymbol P\,\boldsymbol V^{1/2} \]
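In code the standardisation and the factorisation read as follows (a numpy sketch reusing the example covariance matrix from above):

```python
import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# V^{-1/2}: diagonal matrix of inverse standard deviations
V_inv_sqrt = np.diag(1 / np.sqrt(np.diag(Sigma)))
P = V_inv_sqrt @ Sigma @ V_inv_sqrt
print(P)                                # unit diagonal, rho_12 ~ 0.57

# factorising back: Sigma = V^{1/2} P V^{1/2}
V_sqrt = np.diag(np.sqrt(np.diag(Sigma)))
print(np.allclose(V_sqrt @ P @ V_sqrt, Sigma))   # True
```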
¹ For scalar random variables many texts use upper case to designate the random variable and lower case for its realisations. However, this convention breaks down in multivariate statistics when dealing with random vectors and random matrices. This upper-lower case notation also doesn’t work well in Bayesian statistics where random variables describe the uncertainty of parameters. Hence, we use upper case (bold font) to indicate a matrix quantity. Upper case (plain font) may denote sets and some scalar quantities traditionally written in upper case (e.g. \(R^2\), \(K\)).
² In our notational conventions, a vector \(\boldsymbol x\) is written in lower case bold font, a matrix \(\boldsymbol M\) in upper case bold font. Hence random vectors and matrices as well as their realisations are indicated in bold font, with vectors given in lower case and matrices in upper case. Thus, as for scalar variables, upper versus lower case does not indicate randomness versus realisation.