2  Probability

2.1 Random variables

A random variable describes a random experiment. The set of all possible outcomes is the sample space or state space of the random variable and is denoted by \(\Omega = \{\omega_1, \omega_2, \ldots\}\). The outcomes \(\omega_i\) are the elementary events. The sample space \(\Omega\) can be finite or infinite. Depending on the type of outcomes, the random variable is discrete or continuous.

An event \(A \subseteq \Omega\) is a subset of \(\Omega\) and thus itself a set composed of elementary events: \(A = \{a_1, a_2, \ldots\}\). This includes as special cases the full set \(A = \Omega\), the empty set \(A = \emptyset\), and the elementary events \(A=\{\omega_i\}\). The complementary event \(A^C\) is the complement of the set \(A\) in the set \(\Omega\) so that \(A^C = \Omega \setminus A = \{\omega_i \in \Omega: \omega_i \notin A\}\).

The probability of an event \(A\) is denoted by \(\text{Pr}(A)\). Essentially, to obtain this probability we need to add up the probabilities of the elementary events that make up \(A\). To do this we assume as axioms of probability that

  • \(\text{Pr}(A) \geq 0\), probabilities are non-negative,
  • \(\text{Pr}(\Omega) = 1\), the certain event has probability 1, and
  • \(\text{Pr}(A) = \sum_{a_i \in A} \text{Pr}(a_i)\), the probability of an event equals the sum of the probabilities of its constituent elementary events \(a_i\). This sum is taken over a finite or countably infinite number of elements.

This implies

  • \(\text{Pr}(A) \leq 1\), i.e. probabilities all lie in the interval \([0,1]\),
  • \(\text{Pr}(A^C) = 1 - \text{Pr}(A)\), and
  • \(\text{Pr}(\emptyset) = 0\).

Assume now that we have two events \(A\) and \(B\). The probability of the event “\(A\) and \(B\)” is then given by the probability of the set intersection \(\text{Pr}(A \cap B)\). Likewise the probability of the event “\(A\) or \(B\)” is given by the probability of the set union \(\text{Pr}(A \cup B)\).
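As a small illustration of these rules, the following sketch treats events as Python sets and computes probabilities by summing over elementary events. The fair six-sided die and the names `omega`, `pr`, `A` and `B` are assumptions made for this example only.

```python
from fractions import Fraction

# Sample space of a fair six-sided die (illustrative assumption).
omega = {1, 2, 3, 4, 5, 6}
# Probability of each elementary event.
p = {w: Fraction(1, 6) for w in omega}

def pr(event):
    """Probability of an event as the sum over its elementary events."""
    return sum((p[w] for w in event), Fraction(0))

A = {2, 4, 6}      # event "even number"
B = {4, 5, 6}      # event "larger than three"
A_c = omega - A    # complementary event of A

print(pr(omega))               # 1, the certain event
print(pr(set()))               # 0, the empty event
print(pr(A_c) == 1 - pr(A))    # True
print(pr(A & B))               # 1/3, "A and B" is the intersection {4, 6}
print(pr(A | B))               # 2/3, "A or B" is the union {2, 4, 5, 6}
```

Exact fractions are used here so that the identities hold without rounding error.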

From the above it is clear that the definition and theory of probability are closely linked to set theory, and in particular to measure theory. Indeed, viewing probability as a special type of measure allows for an elegant treatment of both discrete and continuous random variables.

2.2 Probability mass and density function

To describe a random variable \(x\) with state space \(\Omega\) we need a way to effectively store the probabilities of the corresponding elementary outcomes \(x \in \Omega\).

For simplicity of notation we use the same symbol to denote the random variable and its elementary outcomes.1 This convention greatly facilitates working with random vectors and matrices and follows, e.g., the classic multivariate statistics textbook by Mardia, Kent, and Bibby (1979). If a quantity is random we will always specify this explicitly in the context.

For a discrete random variable we define the event \(A = \{x: x=a\} = \{a\}\) and get the probability \[ \text{Pr}(A) = \text{Pr}(x=a) = f(a) \] directly from the probability mass function (pmf), here denoted by lower case \(f\) (but we frequently also use \(p\) or \(q\)). The pmf has the property that \(\sum_{x \in \Omega} f(x) = 1\) and that \(f(x) \in [0,1]\).

For continuous random variables we need to use a probability density function (pdf) instead. We define the event \(A = \{x: a < x \leq a + da\}\) as an infinitesimal interval and then assign the probability \[ \text{Pr}(A) = \text{Pr}( a < x \leq a + da) = f(a) da \,. \] The pdf has the property that \(\int_{x \in \Omega} f(x) dx = 1\) but in contrast to a pmf the density \(f(x)\geq 0\) may take on values larger than 1.
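The difference between a pmf and a pdf can be checked numerically. The sketch below assumes, purely for illustration, a fair die for the discrete case and a uniform distribution on \([0, 0.5]\) for the continuous case; the helper `integrate` is a simple midpoint rule written for this example.

```python
from fractions import Fraction

# pmf of a fair die: every value lies in [0, 1] and the values sum to 1.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}
pmf_total = sum(pmf.values())

def pdf(x):
    """Density of a uniform distribution on [0, 0.5] (illustrative choice)."""
    return 2.0 if 0.0 <= x <= 0.5 else 0.0

def integrate(f, lower, upper, n=10_000):
    """Midpoint-rule approximation of the integral of f over [lower, upper]."""
    width = (upper - lower) / n
    return sum(f(lower + (i + 0.5) * width) for i in range(n)) * width

# The density equals 2 on its support, yet it still integrates to 1.
pdf_total = integrate(pdf, 0.0, 0.5)

# The probability of a small interval (a, a + da] is approximately f(a) * da.
a, da = 0.2, 0.001
small_interval = integrate(pdf, a, a + da)
```

Here `pdf(0.25)` is larger than 1 although `pdf_total` is (numerically) 1, which is exactly the behaviour described above for densities.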

The set of all \(x\) for which \(f(x)\) is positive is called the support of the pmf or pdf.

It is sometimes convenient to refer to a pdf or pmf without specifying whether \(x\) is continuous or discrete. In this case we speak of a probability density mass function (pdmf).

2.3 Distribution function and quantile function

As an alternative to using the pdmf we may use a distribution function to describe the random variable. This assumes that an ordering exists among the elementary events so that we can define the event \(A = \{x: x \leq a \}\) and compute its probability as \[ F(a) = \text{Pr}(A) = \text{Pr}( x \leq a ) = \begin{cases} \sum_{x \in A} f(x) & \text{discrete case} \\ \int_{x \in A} f(x) dx & \text{continuous case} \\ \end{cases} \] This is also known as the cumulative distribution function (cdf) and is denoted by upper case \(F\) (or \(P\) and \(Q\)). By construction the distribution function is monotonically non-decreasing and its value ranges from 0 to 1. With its help we can compute the probability of an interval set such as \[ \text{Pr}( a < x \leq b ) = F(b)-F(a) \,. \]

The inverse of the distribution function \(y=F(x)\) is the quantile function \(x=F^{-1}(y)\). The 50% quantile \(F^{-1}\left(\frac{1}{2}\right)\) is called the median.
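To make the relation between distribution function and quantile function concrete, here is a minimal sketch for an exponential distribution, chosen only because both functions have a closed form. The rate value and the function names are assumptions for this example.

```python
import math

RATE = 2.0  # assumed rate parameter of the exponential distribution

def cdf(x):
    """F(x) = Pr(X <= x) for an exponential distribution with rate RATE."""
    return 1.0 - math.exp(-RATE * x) if x > 0 else 0.0

def quantile(y):
    """Inverse of the distribution function, defined for 0 <= y < 1."""
    return -math.log(1.0 - y) / RATE

a, b = 0.5, 1.5
prob_interval = cdf(b) - cdf(a)   # Pr(a < x <= b) = F(b) - F(a)
median = quantile(0.5)            # the 50% quantile, equal to log(2) / RATE
```

Applying `cdf` to `quantile(y)` returns `y` again (up to rounding), which is what being the inverse means here.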

If the random variable \(x\) has distribution function \(F\) we write \(x \sim F\).

Figure 2.1: Density function and distribution function.

Figure 2.1 illustrates a density function \(f(x)\) and the corresponding distribution function \(F(x)\).

2.4 Families of distributions

A distribution \(F_{\theta}\) with a parameter \(\theta\) constitutes a distribution family collecting all the distributions corresponding to particular instances of the parameter. The parameter \(\theta\) therefore acts as an index of the distributions contained in the family.

The corresponding pdmf is written either as \(f_{\theta}(x)\), \(f(x; \theta)\) or \(f(x | \theta)\). The latter form is the most general as it suggests that the parameter \(\theta\) may potentially also have its own distribution, with a joint density formed by \(f(x, \theta) = f(x | \theta) f(\theta)\).

Note that any parametrisation is generally not unique, as a one-to-one transformation of \(\theta\) will yield another equivalent index to the same distribution family. Typically, for most commonly used distribution families there are several standard parametrisations. Often we use those parametrisations where the parameters can be interpreted easily (e.g. in terms of moments).
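As an example of two equivalent parametrisations, the exponential family can be indexed either by its rate or by its scale, where the scale equals the mean. The sketch below, with function names chosen for this illustration, checks that both index the same densities.

```python
import math

def pdf_rate(x, rate):
    """Exponential density indexed by the rate parameter."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

def pdf_scale(x, scale):
    """The same family indexed by the scale (mean) parameter, scale = 1 / rate."""
    return math.exp(-x / scale) / scale if x >= 0 else 0.0

rate = 4.0
scale = 1.0 / rate  # one-to-one transformation of the parameter
grid = [0.0, 0.1, 0.5, 2.0]

# Both parametrisations give the same density values at every grid point.
same = all(math.isclose(pdf_rate(x, rate), pdf_scale(x, scale)) for x in grid)
```

The scale parametrisation is an instance of a parameter that can be read directly as a moment of the distribution.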

If for any pair of different parameter values \(\theta_1 \neq \theta_2\) we get distinct distributions with \(F_{\theta_1} \neq F_{\theta_2}\) then the distribution family \(F_{\theta}\) is said to be identifiable by the parameter \(\theta\).

2.5 Expectation of a random variable

The expected value \(\text{E}(x)\) of a random variable is defined as the weighted average over all possible outcomes, with the weight given by the pdmf \(f(x)\): \[ \text{E}_{F}(x) = \begin{cases} \sum_{x \in \Omega} x f(x) & \text{discrete case} \\ \int_{x \in \Omega} x f(x) dx & \text{continuous case} \\ \end{cases} \] Note the notation to emphasise that the expectation is taken with regard to the distribution \(F\). The subscript \(F\) is usually left out if there are no ambiguities. Furthermore, because the sum or integral may diverge the expectation is not necessarily always defined (in contrast to quantiles).

The expected value of a function of a random variable \(h(x)\) is obtained similarly: \[ \text{E}_{F}(h(x)) = \begin{cases} \sum_{x \in \Omega} h(x) f(x) & \text{discrete case} \\ \int_{x \in \Omega} h(x) f(x) dx & \text{continuous case} \\ \end{cases} \] This is called the “law of the unconscious statistician”, or LOTUS for short. Again, to highlight that the random variable \(x\) has distribution \(F\) we write \(\text{E}_F(h(x))\).
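For the discrete case both formulas reduce to a weighted sum over the pmf. The following sketch assumes a fair die as the distribution; the helper name `expectation` is chosen for this example.

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}  # fair die, an illustrative choice

def expectation(h, pmf):
    """E(h(x)) computed directly from the pmf of x (LOTUS, discrete case)."""
    return sum(h(x) * fx for x, fx in pmf.items())

mean = expectation(lambda x: x, pmf)               # E(x) = 7/2
second_moment = expectation(lambda x: x**2, pmf)   # E(x^2) = 91/6
```

Note that the distribution of \(h(x)\) itself is never worked out: the pmf of \(x\) is sufficient.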

2.6 Jensen’s inequality for the expectation

If \(h(\boldsymbol x)\) is a convex function then the following inequality holds:

\[ \text{E}(h(\boldsymbol x)) \geq h(\text{E}(\boldsymbol x)) \]

Recall: a convex function (such as \(x^2\)) has the shape of a “valley”.
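Jensen's inequality can be checked numerically for a given distribution. The sketch below uses a small hypothetical discrete distribution and the two convex functions \(x^2\) and \(e^x\); all names and numbers are assumptions for this illustration.

```python
import math

# Hypothetical discrete distribution: outcomes and their probabilities.
outcomes = [-1.0, 0.0, 2.0]
probs = [0.2, 0.5, 0.3]

def expectation(h):
    """E(h(x)) as a probability-weighted sum over the outcomes."""
    return sum(h(x) * p for x, p in zip(outcomes, probs))

mean = expectation(lambda x: x)            # E(x), approximately 0.4

lhs_square = expectation(lambda x: x**2)   # E(x^2), approximately 1.4
rhs_square = mean**2                       # E(x)^2, approximately 0.16

lhs_exp = expectation(math.exp)            # E(exp(x))
rhs_exp = math.exp(mean)                   # exp(E(x))
```

In both cases the expectation of the function (`lhs_*`) is at least as large as the function of the expectation (`rhs_*`).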

2.7 Probability as expectation

Probability itself can also be understood as an expectation. For an event \(A\) we can define a corresponding indicator function \(1_{ x \in A}\) that equals 1 if the elementary outcome \(x\) is part of \(A\) and 0 otherwise. From the above it then follows that \[ \text{E}( 1_{x \in A} ) = \text{Pr}(A) \,. \]
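A quick check of this identity in the discrete case, assuming a fair die and the event “even number” as an example:

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}  # fair die (assumed example)
A = {2, 4, 6}                                   # event "even number"

def indicator(x):
    """1 if the outcome x belongs to A, otherwise 0."""
    return 1 if x in A else 0

# Expectation of the indicator: sum of indicator(x) * f(x) over all outcomes.
expected_indicator = sum(indicator(x) * fx for x, fx in pmf.items())

# Probability of A: sum of f(x) over the outcomes in A.
prob_A = sum(pmf[x] for x in A)
```

Both quantities equal 1/2, since the indicator simply removes all terms outside \(A\) from the sum.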

Interestingly, one can develop the whole theory of probability from this perspective (e.g., Whittle 2000).

2.8 Moments and variance of a random variable

The moments of a random variable are defined as follows:

  • Zeroth moment: \(\text{E}(x^0) = 1\) by construction of a pdmf,
  • First moment: \(\text{E}(x^1) = \text{E}(x) = \mu\) , the mean,
  • Second moment: \(\text{E}(x^2)\)
  • The variance is the second moment centred about the mean \(\mu\): \[\text{Var}(x) = \text{E}\left( (x - \mu)^2 \right) = \sigma^2\]
  • The variance can also be computed by \(\text{Var}(x) = \text{E}(x^2)-\text{E}(x)^2\). This provides an example of Jensen’s inequality, with \(\text{E}(x^2) =\text{E}(x)^2 + \text{Var}(x) \geq \text{E}(x)^2\).
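The moments listed above can be computed directly from a pmf. The sketch below again assumes a fair die and compares the two ways of computing the variance; the helper `moment` is a name chosen for this example.

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}  # fair die (assumed example)

def moment(k):
    """k-th moment E(x^k) of the discrete distribution given by pmf."""
    return sum(x**k * fx for x, fx in pmf.items())

zeroth = moment(0)   # 1
mu = moment(1)       # 7/2, the mean
second = moment(2)   # 91/6

# Variance as the second moment centred about the mean ...
var_centred = sum((x - mu) ** 2 * fx for x, fx in pmf.items())
# ... and via E(x^2) - E(x)^2.
var_shortcut = second - mu**2
```

Both routes give the variance 35/12, and `second >= mu**2` holds as required by Jensen's inequality.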

A distribution does not necessarily need to have any finite first or higher moments. An example is the location-scale \(t\)-distribution (Section 4.7.1) which, depending on the value of the parameter \(\nu\), may not have a mean or variance (or other higher moments).

2.9 Random vectors and their mean and variance

In addition to scalar random variables we often make use of random vectors and also random matrices.2

For a random vector \(\boldsymbol x= (x_1, x_2,...,x_d)^T \sim F\) the mean \(\text{E}(\boldsymbol x) = \boldsymbol \mu\) is given by the means of its components, i.e. \(\boldsymbol \mu= (\mu_1, \ldots, \mu_d)^T\) with \(\mu_i = \text{E}(x_i)\). Thus, the mean of a random vector of dimension \(d\) is a vector of the same length.

The variance of a random vector of length \(d\), however, is not a vector but a matrix of size \(d\times d\). This matrix is called the covariance matrix: \[ \begin{split} \text{Var}(\boldsymbol x) &= \underbrace{\boldsymbol \Sigma}_{d\times d} = (\sigma_{ij}) = \begin{pmatrix} \sigma_{11} & \dots & \sigma_{1d}\\ \vdots & \ddots & \vdots \\ \sigma_{d1} & \dots & \sigma_{dd} \end{pmatrix} \\ &=\text{E}\left(\underbrace{(\boldsymbol x-\boldsymbol \mu)}_{d\times 1} \underbrace{(\boldsymbol x-\boldsymbol \mu)^T}_{1\times d}\right) \\ & = \text{E}(\boldsymbol x\boldsymbol x^T)-\boldsymbol \mu\boldsymbol \mu^T \\ \end{split} \] The entries of the covariance matrix \(\text{Cov}(x_i, x_j)=\sigma_{ij}\) describe the covariance between the random variables \(x_i\) and \(x_j\). The covariance matrix is symmetric, hence \(\sigma_{ij}=\sigma_{ji}\). The diagonal entries \(\text{Cov}(x_i, x_i)=\sigma_{ii}\) correspond to the variances \(\text{Var}(x_i) = \sigma_i^2\) of the components of \(\boldsymbol x\). The covariance matrix is by construction positive semi-definite, i.e. the eigenvalues of \(\boldsymbol \Sigma\) are all positive or equal to zero.

However, wherever possible one will aim to use models with non-singular covariance matrices, with all eigenvalues positive, so that the covariance matrix is invertible.
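The two expressions for the covariance matrix can be verified on a small example. The sketch below assumes a hypothetical joint pmf for a random vector of dimension \(d=2\) and uses plain Python lists for vectors and matrices.

```python
from fractions import Fraction

# Hypothetical joint pmf of a random vector x = (x1, x2)^T.
joint = {
    (0, 0): Fraction(1, 4),
    (1, 0): Fraction(1, 4),
    (1, 1): Fraction(1, 2),
}
d = 2

# Mean vector: component-wise expectations.
mu = [sum(x[i] * p for x, p in joint.items()) for i in range(d)]

# Covariance matrix as E((x - mu)(x - mu)^T).
sigma = [[sum((x[i] - mu[i]) * (x[j] - mu[j]) * p for x, p in joint.items())
          for j in range(d)] for i in range(d)]

# The same matrix as E(x x^T) - mu mu^T.
sigma_alt = [[sum(x[i] * x[j] * p for x, p in joint.items()) - mu[i] * mu[j]
              for j in range(d)] for i in range(d)]
```

For this example the mean is \((3/4, 1/2)^T\) and both computations give the symmetric matrix with diagonal entries 3/16 and 1/4 and off-diagonal entry 1/8.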

2.10 Correlation matrix

The correlation matrix \(\boldsymbol P\) (“upper case rho”, not “upper case p”) is the variance-standardised version of the covariance matrix \(\boldsymbol \Sigma\).

Specifically, denote by \(\boldsymbol V\) the diagonal matrix containing the variances \[ \boldsymbol V= \begin{pmatrix} \sigma_{11} & \dots & 0\\ \vdots & \ddots & \vdots \\ 0 & \dots & \sigma_{dd} \end{pmatrix} \,. \] Then the correlation matrix \(\boldsymbol P\) is given by \[ \boldsymbol P= (\rho_{ij}) = \begin{pmatrix} 1 & \dots & \rho_{1d}\\ \vdots & \ddots & \vdots \\ \rho_{d1} & \dots & 1 \end{pmatrix} = \boldsymbol V^{-1/2} \, \boldsymbol \Sigma\, \boldsymbol V^{-1/2} \,. \] Like the covariance matrix the correlation matrix is symmetric. The elements of the diagonal of \(\boldsymbol P\) are all equal to 1.

Equivalently, in component notation the correlation between \(x_i\) and \(x_j\) is given by \[ \rho_{ij} = \text{Cor}(x_i,x_j) = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}} \]

Using the above, a covariance matrix can be factorised into the product of standard deviations \(\boldsymbol V^{1/2}\) and the correlation matrix as follows: \[ \boldsymbol \Sigma= \boldsymbol V^{1/2}\, \boldsymbol P\,\boldsymbol V^{1/2} \]
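The standardisation and the factorisation can be sketched in a few lines. The covariance matrix below is a hypothetical example, and plain Python lists are used instead of a matrix library.

```python
import math

# Hypothetical 2 x 2 covariance matrix.
sigma = [[4.0, 2.0],
         [2.0, 9.0]]
d = len(sigma)

# Standard deviations, i.e. the diagonal entries of V^{1/2}.
sd = [math.sqrt(sigma[i][i]) for i in range(d)]

# Correlation matrix P = V^{-1/2} Sigma V^{-1/2}, written component-wise.
corr = [[sigma[i][j] / (sd[i] * sd[j]) for j in range(d)] for i in range(d)]

# Reconstruct the covariance matrix as Sigma = V^{1/2} P V^{1/2}.
sigma_back = [[sd[i] * corr[i][j] * sd[j] for j in range(d)] for i in range(d)]
```

Here the standard deviations are 2 and 3, so the off-diagonal correlation is \(2/(2 \cdot 3) = 1/3\), and `sigma_back` recovers `sigma` up to rounding.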


  1. For scalar random variables many texts use upper case to designate the random variable and lower case for its realisations. However, this convention quickly breaks down in multivariate statistics when dealing with random vectors and random matrices. Hence, we use upper case primarily to indicate a matrix quantity (in bold type). Upper case (in plain type) may denote sets and some scalar quantities traditionally written in upper case (e.g. \(R^2\), \(K\)).↩︎

  2. In our notational conventions, a vector \(\boldsymbol x\) is written in lower case in bold type, a matrix \(\boldsymbol M\) in upper case in bold type. Hence random vectors and matrices as well as their realisations are indicated in bold type, with vectors given in lower case and matrices in upper case. Hence, as for scalar variables, upper vs. lower case does not indicate randomness vs. realisation.↩︎