3  Transformations

3.1 Affine or location-scale transformation of random variables

Suppose \(x \sim F_x\) is a scalar random variable. The random variable \[y= a + b x\] is a location-scale transformation or affine transformation of \(x\), where \(a\) plays the role of the location parameter and \(b\) is the scale parameter. For \(a=0\) this is a linear transformation. If \(b\neq 0\) then the transformation is invertible, with back-transformation \[x = (y-a)/b\] Invertible transformations provide a one-to-one map between \(x\) and \(y\).

For a random vector \(\boldsymbol x\sim F_{\boldsymbol x}\) of dimension \(d\) the location-scale transformation is \[ \boldsymbol y= \boldsymbol a+ \boldsymbol B\boldsymbol x \] where \(\boldsymbol a\) (an \(m \times 1\) vector) is the location parameter and \(\boldsymbol B\) (an \(m \times d\) matrix) is the scale parameter. For \(m=d\) (square \(\boldsymbol B\)) and \(\det(\boldsymbol B) \neq 0\) the affine transformation is invertible with back-transformation \[\boldsymbol x= \boldsymbol B^{-1}(\boldsymbol y-\boldsymbol a)\]

If \(x\) is a continuous random variable with density \(f_{x}(x)\) and the transformation is invertible, the density for \(y\) is given by \[ f_{y}(y)=|b|^{-1} f_{x} \left( \frac{y-a}{b}\right) \] where \(|b|\) is the absolute value of \(b\). Likewise, for a continuous random vector \(\boldsymbol x\) with density \(f_{\boldsymbol x}(\boldsymbol x)\) and an invertible transformation, the density for \(\boldsymbol y\) is given by \[ f_{\boldsymbol y}(\boldsymbol y)=|\det(\boldsymbol B)|^{-1} f_{\boldsymbol x} \left( \boldsymbol B^{-1}(\boldsymbol y-\boldsymbol a)\right) \] where \(|\det(\boldsymbol B)|\) is the absolute value of the determinant \(\det(\boldsymbol B)\).
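As a quick numerical sanity check, the sketch below (assuming a standard normal \(x\) and illustrative values of \(a\) and \(b\), which are not from the text) evaluates the univariate density formula and compares it with the known closed-form density of the transformed normal variable.

```python
import numpy as np
from scipy.stats import norm

# Affine transformation y = a + b*x of a standard normal x ~ N(0, 1).
a, b = 2.0, -3.0
y = np.linspace(-10, 14, 7)

# Density of y via the change-of-variables formula f_y(y) = |b|^{-1} f_x((y - a)/b).
f_y_formula = norm.pdf((y - a) / b) / abs(b)

# For a normal x the result is known in closed form: y ~ N(a, b^2), here N(2, 9).
f_y_exact = norm.pdf(y, loc=a, scale=abs(b))

print(np.allclose(f_y_formula, f_y_exact))  # True
```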

The transformed random variable \(y \sim F_y\) has mean \[\text{E}(y) = a + b \mu_x\] and variance \[\text{Var}(y) = b^2 \sigma^2_x\] where \(\text{E}(x) = \mu_x\) and \(\text{Var}(x) = \sigma^2_x\) are the mean and variance of the original variable \(x\).

The mean and variance of the transformed random vector \(\boldsymbol y\sim F_{\boldsymbol y}\) are \[\text{E}(\boldsymbol y)=\boldsymbol a+ \boldsymbol B\,\boldsymbol \mu_{\boldsymbol x}\] and \[\text{Var}(\boldsymbol y)= \boldsymbol B\,\boldsymbol \Sigma_{\boldsymbol x} \,\boldsymbol B^T\] where \(\text{E}(\boldsymbol x)=\boldsymbol \mu_{\boldsymbol x}\) and \(\text{Var}(\boldsymbol x)=\boldsymbol \Sigma_{\boldsymbol x}\) are the mean and variance of the original random vector \(\boldsymbol x\).
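The following sketch checks the multivariate mean and variance formulas by Monte Carlo, using an illustrative choice of \(\boldsymbol a\), \(\boldsymbol B\), \(\boldsymbol \mu_{\boldsymbol x}\) and \(\boldsymbol \Sigma_{\boldsymbol x}\) and taking \(\boldsymbol x\) to be normal (any distribution with this mean and covariance would do).

```python
import numpy as np

rng = np.random.default_rng(0)

# Original random vector x of dimension d = 3 with known mean and covariance.
mu_x = np.array([1.0, -2.0, 0.5])
Sigma_x = np.array([[2.0, 0.3, 0.0],
                    [0.3, 1.0, 0.2],
                    [0.0, 0.2, 0.5]])

# Affine transformation y = a + B x with m = 2 and d = 3.
a = np.array([0.5, -1.0])
B = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])

# Monte Carlo check: sample x, transform, and compare with a + B mu_x and B Sigma_x B^T.
x = rng.multivariate_normal(mu_x, Sigma_x, size=200_000)
y = a + x @ B.T

print(np.round(y.mean(axis=0), 2), np.round(a + B @ mu_x, 2))
print(np.round(np.cov(y, rowvar=False), 2))
print(np.round(B @ Sigma_x @ B.T, 2))
```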

The constants \(\boldsymbol a\) and \(\boldsymbol B\) (or \(a\) and \(b\) in the univariate case) are the parameters of the location-scale family \(F_{\boldsymbol y}\) created from \(F_{\boldsymbol x}\). Many important distributions are location-scale families, such as the normal distribution (cf. Section 5.4) and the location-scale \(t\)-distribution (Section 4.7.1). Furthermore, key procedures in multivariate statistics such as orthogonal transformations (including PCA) or whitening transformations (e.g. the Mahalanobis transformation) are affine transformations.

3.2 General invertible transformation of random variables

As above we assume \(x \sim F_x\) is a scalar random variable and \(\boldsymbol x\sim F_{\boldsymbol x}\) is a random vector.

As a generalisation of invertible affine transformations we now consider general invertible transformations. For a scalar random variable we assume the transformation is specified by \(y(x) = h(x)\) and the back-transformation by \(x(y) = h^{-1}(y)\). For a random vector we assume \(\boldsymbol y(\boldsymbol x) = \boldsymbol h(\boldsymbol x)\) is invertible with back-transformation \(\boldsymbol x(\boldsymbol y) = \boldsymbol h^{-1}(\boldsymbol y)\).

If \(x\) is a continuous random variable with density \(f_{x}(x)\) the density of the transformed variable \(y\) can be computed exactly and is given by \[ f_y(y) =\left| D x(y) \right|\, f_x(x(y)) \] where \(D x(y)\) is the derivative of the inverse transformation \(x(y)\).
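For example, with \(y = \exp(x)\) and \(x \sim N(0,1)\) the inverse is \(x(y) = \log(y)\) with derivative \(D x(y) = 1/y\), and the formula yields the standard log-normal density. A minimal numerical check of this (illustrative, not from the text):

```python
import numpy as np
from scipy.stats import norm, lognorm

# Invertible transformation y = h(x) = exp(x) of x ~ N(0, 1),
# with back-transformation x(y) = log(y) and derivative Dx(y) = 1/y.
y = np.linspace(0.1, 5, 50)
f_y_formula = (1 / y) * norm.pdf(np.log(y))

# The exact result is the standard log-normal density.
f_y_exact = lognorm.pdf(y, s=1)

print(np.allclose(f_y_formula, f_y_exact))  # True
```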

Likewise, for a continuous random vector \(\boldsymbol x\) with density \(f_{\boldsymbol x}(\boldsymbol x)\) the density for \(\boldsymbol y\) is obtained by \[ f_{\boldsymbol y}(\boldsymbol y) = |\det\left( D\boldsymbol x(\boldsymbol y) \right)| \,\, f_{\boldsymbol x}\left( \boldsymbol x(\boldsymbol y) \right) \] where \(D\boldsymbol x(\boldsymbol y)\) is the Jacobian matrix of the inverse transformation \(\boldsymbol x(\boldsymbol y)\).
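A corresponding multivariate sketch, assuming a bivariate standard normal \(\boldsymbol x\) transformed component-wise by \(\exp\); with independent components the exact density factorises into two log-normal densities, which gives a simple reference for the check (all choices here are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal, lognorm

# Component-wise transformation y = (exp(x1), exp(x2)) of a bivariate standard
# normal vector x; the inverse is x(y) = (log y1, log y2) with Jacobian matrix
# Dx(y) = diag(1/y1, 1/y2), so |det Dx(y)| = 1/(y1*y2).
f_x = multivariate_normal(mean=[0, 0], cov=np.eye(2)).pdf

y1, y2 = np.meshgrid(np.linspace(0.1, 4, 40), np.linspace(0.1, 4, 40))
x_of_y = np.dstack([np.log(y1), np.log(y2)])
f_y_formula = f_x(x_of_y) / (y1 * y2)

# With independent components the exact density is a product of log-normal densities.
f_y_exact = lognorm.pdf(y1, s=1) * lognorm.pdf(y2, s=1)
print(np.allclose(f_y_formula, f_y_exact))  # True
```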

The mean and variance of the transformed random variable can typically only be approximated. Assume that \(\text{E}(x) = \mu_x\) and \(\text{Var}(x) = \sigma^2_x\) are the mean and variance of the original random variable \(x\), and that \(\text{E}(\boldsymbol x)=\boldsymbol \mu_{\boldsymbol x}\) and \(\text{Var}(\boldsymbol x)=\boldsymbol \Sigma_{\boldsymbol x}\) are the mean and variance of the original random vector \(\boldsymbol x\). In the delta method the transformation \(y(x)\) (or \(\boldsymbol y(\boldsymbol x)\)) is linearised around the mean \(\mu_x\) (or \(\boldsymbol \mu_{\boldsymbol x}\)), and the mean and variance resulting from the linear transformation are reported.

Specifically, the linear approximation for the scalar-valued function is \[ y(x) \approx y\left(\mu_x\right) + D y\left(\mu_x\right)\, \left(x-\mu_x\right) \] where \(D y(x) = y'(x)\) is the first derivative of the transformation \(y(x)\) and \(D y\left(\mu_x\right)\) is the first derivative evaluated at the mean \(\mu_x\), and for the vector-valued function \[ \boldsymbol y(\boldsymbol x) \approx \boldsymbol y\left(\boldsymbol \mu_{\boldsymbol x}\right) + D \boldsymbol y\left(\boldsymbol \mu_{\boldsymbol x}\right) \, \left(\boldsymbol x-\boldsymbol \mu_{\boldsymbol x}\right) \] where \(D \boldsymbol y(\boldsymbol x)\) is the Jacobian matrix (vector derivative) for the transformation \(\boldsymbol y(\boldsymbol x)\) and \(D \boldsymbol y\left(\boldsymbol \mu_{\boldsymbol x}\right)\) is the Jacobian matrix evaluated at the mean \(\boldsymbol \mu_{\boldsymbol x}\).

In the univariate case the delta method yields as approximation for the mean and variance of the transformed random variable \(y\) \[ \text{E}(y) \approx y\left(\mu_x\right) \] and \[ \text{Var}(y)\approx \left(D y\left(\mu_x\right)\right)^2 \, \sigma^2_x \]
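As an illustration, the sketch below applies the scalar delta method to \(y = \log(x)\) for a gamma-distributed \(x\) (an illustrative choice, not from the text), with \(D y(\mu_x) = 1/\mu_x\), and compares the approximations with Monte Carlo estimates.

```python
import numpy as np

rng = np.random.default_rng(1)

# Delta method for y = log(x) where x is gamma-distributed with
# mean mu_x = k*theta and variance sigma2_x = k*theta^2 (illustrative choice).
k, theta = 10.0, 2.0
mu_x, sigma2_x = k * theta, k * theta**2

# Linearisation around mu_x: Dy(mu_x) = 1/mu_x.
mean_approx = np.log(mu_x)
var_approx = (1 / mu_x) ** 2 * sigma2_x

# Monte Carlo comparison.
x = rng.gamma(shape=k, scale=theta, size=1_000_000)
y = np.log(x)
print(mean_approx, y.mean())   # approx 3.00 vs. about 2.94
print(var_approx, y.var())     # 0.10 vs. about 0.105
```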

For the vector random variable \(\boldsymbol y\) the delta method yields \[\text{E}(\boldsymbol y)\approx\boldsymbol y\left(\boldsymbol \mu_{\boldsymbol x}\right)\] and \[ \text{Var}(\boldsymbol y)\approx D \boldsymbol y\left(\boldsymbol \mu_{\boldsymbol x}\right) \, \boldsymbol \Sigma_{\boldsymbol x} \, D\boldsymbol y\left(\boldsymbol \mu_{\boldsymbol x}\right)^T \]
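A corresponding multivariate sketch, applying the delta method to the ratio \(y = x_1/x_2\) with Jacobian \(D\boldsymbol y(\boldsymbol x) = (1/x_2,\, -x_1/x_2^2)\) and an illustrative mean and covariance, again compared with Monte Carlo estimates.

```python
import numpy as np

rng = np.random.default_rng(2)

# Multivariate delta method for the scalar-valued transformation y = x1/x2,
# with Jacobian Dy(x) = (1/x2, -x1/x2^2)  (illustrative example).
mu_x = np.array([2.0, 4.0])
Sigma_x = np.array([[0.10, 0.02],
                    [0.02, 0.20]])

Dy_mu = np.array([1 / mu_x[1], -mu_x[0] / mu_x[1] ** 2])
mean_approx = mu_x[0] / mu_x[1]
var_approx = Dy_mu @ Sigma_x @ Dy_mu

# Monte Carlo comparison (taking x to be multivariate normal).
x = rng.multivariate_normal(mu_x, Sigma_x, size=1_000_000)
y = x[:, 0] / x[:, 1]
print(mean_approx, y.mean())   # 0.5 vs. a Monte Carlo value close to 0.505
print(var_approx, y.var())     # 0.008125 vs. a slightly larger Monte Carlo value
```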

Assuming \(y(x) = a + b x\), with \(x(y) = (y-a)/b\), \(D y(x) = b\) and \(D x(y) = b^{-1}\), recovers the univariate location-scale transformation. Likewise, assuming \(\boldsymbol y(\boldsymbol x) = \boldsymbol a+ \boldsymbol B\boldsymbol x\), with \(\boldsymbol x(\boldsymbol y) = \boldsymbol B^{-1}(\boldsymbol y-\boldsymbol a)\), \(D\boldsymbol y(\boldsymbol x) = \boldsymbol B\) and \(D\boldsymbol x(\boldsymbol y) = \boldsymbol B^{-1}\), recovers the multivariate location-scale transformation.

3.3 Exponential tilting and exponential families

Another way to change the distribution of a random variable is by exponential tilting.

Suppose there is a vector-valued function \(\boldsymbol u(x)\) where each component is a transformation of \(x\), usually a simple function such as the identity \(x\), the square \(x^2\), the logarithm \(\log(x)\) etc. These are called the canonical statistics. Typically, the dimension of \(\boldsymbol u(x)\) is small.

The exponential tilt of a base distribution \(P_0\) with pdmf \(p_0(x)\) towards the linear combination \(\boldsymbol \eta^T \boldsymbol u(x)\) of the canonical statistics \(\boldsymbol u(x)\) and the canonical parameters \(\boldsymbol \eta\) yields the distribution family \(P_{\boldsymbol \eta}\) with pdmf \[ \begin{split} p(x|\boldsymbol \eta) &= e^{ \boldsymbol \eta^T \boldsymbol u(x)}\, b(x) \, /\, e^{ \psi(\boldsymbol \eta)}\\ &= \underbrace{e^{ \boldsymbol \eta^T \boldsymbol u(x)}}_{\text{exponential tilt}}\, p_0(x) \, /\, e^{ \psi(\boldsymbol \eta)-\psi(0)}\\ \end{split} \] where \(b(x)\) is a positive base function. The normalising factor \(e^{ \psi(\boldsymbol \eta)}\) ensures that \(p(x|\boldsymbol \eta)\) integrates to one. The pdmf of the base distribution is given by \(p_0(x)=b(x) / e^{\psi(0)}\).

The distribution family \(P_{\boldsymbol \eta}\) obtained by exponential tilting is called an exponential family. The corresponding log-pdmf is \[ \log p(x|\boldsymbol \eta) = \boldsymbol \eta^T \boldsymbol u(x) + \log b(x) - \psi(\boldsymbol \eta) \] The log-normaliser or log-partition function \(\psi(\boldsymbol \eta)\) is obtained by computing \[ \psi(\boldsymbol \eta) = \log \int_x \, e^{ \boldsymbol \eta^T \boldsymbol u(x)}\, b(x) \, dx \] The set of values of \(\boldsymbol \eta\) for which the integral is finite and hence for which \(\psi(\boldsymbol \eta) < \infty\) defines the parameter space of the exponential family. Some choices of \(b(x)\) and \(\boldsymbol u(x)\) will not allow for a finite normalising factor for any \(\boldsymbol \eta\) and hence these cannot be used to form an exponential family.
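As a concrete illustration (not from the text), tilting a standard normal base density with the single canonical statistic \(u(x) = x\) gives \(\psi(\eta) = \eta^2/2\), and the tilted density is the \(N(\eta, 1)\) density. The sketch below verifies this numerically.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Exponential tilting of a standard normal base density p0 with canonical
# statistic u(x) = x, taking b(x) = p0(x) so that psi(0) = 0 (illustrative sketch).
eta = 1.5
p0 = norm.pdf

# Log-normaliser psi(eta) = log integral of exp(eta*x) p0(x) dx, computed numerically.
psi, _ = quad(lambda x: np.exp(eta * x) * p0(x), -np.inf, np.inf)
psi = np.log(psi)
print(psi, eta**2 / 2)  # numerically equal: for this base, psi(eta) = eta^2/2

# Tilted density p(x|eta) = exp(eta*x) p0(x) / exp(psi(eta)) equals the N(eta, 1) density.
x = np.linspace(-4, 6, 50)
p_tilted = np.exp(eta * x) * p0(x) / np.exp(psi)
print(np.allclose(p_tilted, norm.pdf(x, loc=eta)))  # True
```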

Many commonly used distribution families are exponential families (most importantly the normal distribution). Exponential families are extremely important in probability and statistics. They provide highly effective models for statistical learning using entropy, likelihood and Bayesian approaches, allow for substantial data reduction via minimal sufficiency, and provide the basis of generalised linear models. Furthermore, exponential families often make it possible to generalise probabilistic results valid for the normal distribution to more general settings.

3.4 Sums of random variables and convolution

Suppose we have a sum of \(n\) independent and identically distributed (iid) random variables \[ y = x_1 + x_2 + \ldots + x_n \] where each \(x_i \sim F_x\) with density or probability mass function \(f_x(x)\). The density or probability mass function for \(y\) is obtained by repeated application of convolution (symbolised by the \(\ast\) operator): \[ f_y(y) = \left(f_{x_1} \ast f_{x_2} \ast \ldots \ast f_{x_n}\right)(y) \]

The convolution of two functions is defined as (continuous case) \[ \left(f_{x_1}\ast f_{x_2}\right)(y)=\int_x f_{x_1}(x)\, f_{x_2}(y-x) dx \] and (discrete case) \[ \left(f_{x_1}\ast f_{x_2}\right)(y)=\sum_x f_{x_1}(x)\, f_{x_2}(y-x) \] Convolution is commutative and associative so it can be applied in any order to compute the convolution of multiple functions. Furthermore, the convolution of probability densities / mass functions yields another probability density / mass function.

Many commonly used random variables can be viewed as the outcome of convolutions. For example, the sum of Bernoulli variables yields a binomial random variable and the sum of normal variables yields another normal random variable.
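For instance, the sketch below convolves a Bernoulli probability mass function with itself \(n\) times and recovers the binomial probability mass function (an illustrative numerical check, with arbitrary choices of \(n\) and \(p\)).

```python
import numpy as np
from scipy.stats import binom

# Repeated convolution of n = 5 Bernoulli(p) probability mass functions
# yields the Binomial(n, p) probability mass function.
p, n = 0.3, 5
f_bernoulli = np.array([1 - p, p])   # pmf on {0, 1}

f_sum = np.array([1.0])              # pmf of the empty sum (point mass at 0)
for _ in range(n):
    f_sum = np.convolve(f_sum, f_bernoulli)

print(np.allclose(f_sum, binom.pmf(np.arange(n + 1), n, p)))  # True
```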

See also: list of convolutions of probability distributions.

The central limit theorem, first postulated by Abraham de Moivre (1667–1754) and later proved by Pierre-Simon Laplace (1749–1827), asserts that, under appropriate conditions, the distribution of the suitably standardised sum of independent and identically distributed random variables converges to a normal distribution (Section 4.4) as \(n\) grows large, even if the individual random variables are not normal. In other words, for large \(n\) the convolution of \(n\) identical distributions is typically well approximated by a normal distribution.
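The sketch below illustrates this numerically by repeatedly convolving a (clearly non-normal) discrete uniform probability mass function and comparing the result with a normal density of matching mean and variance (the choice of distribution and of \(n\) is illustrative).

```python
import numpy as np
from scipy.stats import norm

# Repeated convolution of a non-normal discrete uniform pmf on {0, ..., 9}:
# for growing n the resulting pmf is increasingly well approximated by a
# normal density with matching mean and variance.
f = np.full(10, 0.1)                 # pmf of a single summand
mu, var = 4.5, (10**2 - 1) / 12      # mean and variance of one summand

n = 30
f_sum = np.array([1.0])
for _ in range(n):
    f_sum = np.convolve(f_sum, f)

k = np.arange(f_sum.size)            # support of the sum: 0, ..., 9*n
approx = norm.pdf(k, loc=n * mu, scale=np.sqrt(n * var))
print(np.max(np.abs(f_sum - approx)))  # prints a small number: pmf and normal density nearly coincide
```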