3 Transformations
3.1 Affine or location-scale transformation of random variables
Affine transformation
Suppose \(x \sim F_x\) is a scalar random variable. The random variable \[y= a + b x\] is a location-scale transformation or affine transformation of \(x\), where \(a\) plays the role of the location parameter and \(b\) is the scale parameter. For \(a=0\) this is a linear transformation. If \(b\neq 0\) then the transformation is invertible, with back-transformation \[x = (y-a)/b\] Invertible transformations provide a one-to-one map between \(x\) and \(y\).
For a random vector \(\boldsymbol x\sim F_{\boldsymbol x}\) of dimension \(d\) the location-scale transformation is \[ \boldsymbol y= \boldsymbol a+ \boldsymbol B\boldsymbol x \] where \(\boldsymbol a\) (an \(m \times 1\) vector) is the location parameter and \(\boldsymbol B\) (an \(m \times d\) matrix) is the scale parameter. For \(m=d\) (square \(\boldsymbol B\)) and \(\det(\boldsymbol B) \neq 0\) the affine transformation is invertible with back-transformation \[\boldsymbol x= \boldsymbol B^{-1}(\boldsymbol y-\boldsymbol a)\]
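As a quick illustration, here is a minimal Python sketch (assuming NumPy is available; the specific values of \(\boldsymbol a\) and \(\boldsymbol B\) are arbitrary) that applies an affine transformation to a draw of a random vector and recovers it via the back-transformation:

```python
# Sketch: affine transformation y = a + B x and its back-transformation.
import numpy as np

rng = np.random.default_rng(0)

a = np.array([1.0, -2.0])              # location parameter (m x 1)
B = np.array([[2.0, 0.5],
              [0.0, 1.5]])             # scale parameter (m x d), here square and invertible

x = rng.normal(size=2)                 # a draw of the original random vector
y = a + B @ x                          # affine transformation
x_back = np.linalg.solve(B, y - a)     # back-transformation x = B^{-1}(y - a)

assert np.allclose(x, x_back)          # one-to-one map between x and y
```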
Density
If \(x\) is a continuous random variable with density \(f_{x}(x)\) and assuming an invertible transformation the density for \(y\) is given by \[ f_{y}(y)=|b|^{-1} f_{x} \left( \frac{y-a}{b}\right) \] where \(|b|\) is the absolute value of \(b\). Likewise, assuming an invertible transformation for a continuous random vector \(\boldsymbol x\) with density \(f_{\boldsymbol x}(\boldsymbol x)\) the density for \(\boldsymbol y\) is given by \[ f_{\boldsymbol y}(\boldsymbol y)=|\det(\boldsymbol B)|^{-1} f_{\boldsymbol x} \left( \boldsymbol B^{-1}(\boldsymbol y-\boldsymbol a)\right) \] where \(|\det(\boldsymbol B)|\) is the absolute value of the determinant \(\det(\boldsymbol B)\).
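The scalar formula can be checked numerically. The following minimal sketch (assuming NumPy and SciPy; the choice of a standard normal \(x\) is illustrative) compares the change-of-variables density of \(y = a + b x\) with the known normal density with mean \(a\) and standard deviation \(|b|\):

```python
# Sketch: density of y = a + b x for standard normal x via |b|^{-1} f_x((y-a)/b).
import numpy as np
from scipy import stats

a, b = 1.0, -2.0
y = np.linspace(-6.0, 8.0, 201)

f_y = stats.norm.pdf((y - a) / b) / abs(b)            # change-of-variables formula
f_y_direct = stats.norm.pdf(y, loc=a, scale=abs(b))   # known normal density of y

assert np.allclose(f_y, f_y_direct)
```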
Moments
The transformed random variable \(y \sim F_y\) has mean \[\text{E}(y) = a + b \mu_x\] and variance \[\text{Var}(y) = b^2 \sigma^2_x\] where \(\text{E}(x) = \mu_x\) and \(\text{Var}(x) = \sigma^2_x\) are the mean and variance of the original variable \(x\).
The mean and variance of the transformed random vector \(\boldsymbol y\sim F_{\boldsymbol y}\) is \[\text{E}(\boldsymbol y)=\boldsymbol a+ \boldsymbol B\,\boldsymbol \mu_{\boldsymbol x}\] and \[\text{Var}(\boldsymbol y)= \boldsymbol B\,\boldsymbol \Sigma_{\boldsymbol x} \,\boldsymbol B^T\] where \(\text{E}(\boldsymbol x)=\boldsymbol \mu_{\boldsymbol x}\) and \(\text{Var}(\boldsymbol x)=\boldsymbol \Sigma_{\boldsymbol x}\) are the mean and variance of the original random vector \(\boldsymbol x\).
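These moment identities are easy to verify by simulation. A minimal sketch (assuming NumPy; the multivariate normal draw and the particular \(\boldsymbol a\), \(\boldsymbol B\), \(\boldsymbol \mu_{\boldsymbol x}\), \(\boldsymbol \Sigma_{\boldsymbol x}\) are illustrative choices):

```python
# Sketch: Monte Carlo check of E(y) = a + B mu_x and Var(y) = B Sigma_x B^T.
import numpy as np

rng = np.random.default_rng(1)

mu_x = np.array([0.5, -1.0])
Sigma_x = np.array([[1.0, 0.3],
                    [0.3, 2.0]])
a = np.array([1.0, 2.0])
B = np.array([[1.0, -1.0],
              [0.5,  2.0]])

x = rng.multivariate_normal(mu_x, Sigma_x, size=200_000)
y = a + x @ B.T                                     # affine transformation applied row-wise

print(y.mean(axis=0), a + B @ mu_x)                 # sample mean vs. a + B mu_x
print(np.cov(y, rowvar=False), B @ Sigma_x @ B.T)   # sample covariance vs. B Sigma_x B^T
```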
Importance of affine transformations
The constants \(\boldsymbol a\) and \(\boldsymbol B\) (or \(a\) and \(b\) in the univariate case) are the parameters of the location-scale family \(F_{\boldsymbol y}\) created from \(F_{\boldsymbol x}\). Many important distributions are location-scale families such as the normal distribution (cf. Section 4.3 and Section 5.3) and the location-scale \(t\)-distribution (Section 4.6 and Section 5.6). Furthermore, key procedures in multivariate statistics such as orthogonal transformations (including PCA) or whitening transformations (e.g. the Mahalanobis transformation) are affine transformations.
3.2 General invertible transformation of random variables
General invertible transformation
As above we assume \(x \sim F_x\) is a scalar random variable and \(\boldsymbol x\sim F_{\boldsymbol x}\) is a random vector.
As a generalisation of invertible affine transformations we now consider general invertible transformations. For a scalar random variable we assume the transformation is specified by \(y(x) = h(x)\) and the back-transformation by \(x(y) = h^{-1}(y)\). For a random vector we assume \(\boldsymbol y(\boldsymbol x) = \boldsymbol h(\boldsymbol x)\) is invertible with back-transformation \(\boldsymbol x(\boldsymbol y) = \boldsymbol h^{-1}(\boldsymbol y)\).
Density
If \(x\) is a continuous random variable with density \(f_{x}(x)\) the density of the transformed variable \(y\) can be computed exactly and is given by \[ f_y(y) =\left| D x(y) \right|\, f_x(x(y)) \] where \(D x(y)\) is the derivative of the inverse transformation \(x(y)\).
Likewise, for a continuous random vector \(\boldsymbol x\) with density \(f_{\boldsymbol x}(\boldsymbol x)\) the density for \(\boldsymbol y\) is obtained by \[ f_{\boldsymbol y}(\boldsymbol y) = |\det\left( D\boldsymbol x(\boldsymbol y) \right)| \,\, f_{\boldsymbol x}\left( \boldsymbol x(\boldsymbol y) \right) \] where \(D\boldsymbol x(\boldsymbol y)\) is the Jacobian matrix of the inverse transformation \(\boldsymbol x(\boldsymbol y)\).
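As a concrete scalar example, for a standard normal \(x\) and the invertible transformation \(y = \exp(x)\) the back-transformation is \(x(y) = \log(y)\) with derivative \(D x(y) = 1/y\), so the formula above reproduces the log-normal density. A minimal sketch (assuming NumPy and SciPy):

```python
# Sketch: f_y(y) = |Dx(y)| f_x(x(y)) for x ~ N(0,1) and y = exp(x).
import numpy as np
from scipy import stats

y = np.linspace(0.05, 5.0, 200)

f_y = (1.0 / y) * stats.norm.pdf(np.log(y))   # |Dx(y)| = 1/y, f_x evaluated at x(y) = log(y)
f_y_direct = stats.lognorm.pdf(y, s=1.0)      # standard log-normal density

assert np.allclose(f_y, f_y_direct)
```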
Moments
The mean and variance of the transformed random variable can typically only be approximated. Assume that \(\text{E}(x) = \mu_x\) and \(\text{Var}(x) = \sigma^2_x\) are the mean and variance of the original random variable \(x\) and \(\text{E}(\boldsymbol x)=\boldsymbol \mu_{\boldsymbol x}\) and \(\text{Var}(\boldsymbol x)=\boldsymbol \Sigma_{\boldsymbol x}\) are the mean and variance of the original random vector \(\boldsymbol x\). In the delta method the transformation \(y(x)\) (or \(\boldsymbol y(\boldsymbol x)\)) is linearised around the mean \(\mu_x\) (or \(\boldsymbol \mu_{\boldsymbol x}\)) and the mean and variance resulting from this linear transformation are reported.
Specifically, the linear approximation for the scalar-valued function is \[ y(x) \approx y\left(\mu_x\right) + D y\left(\mu_x\right)\, \left(x-\mu_x\right) \] where \(D y(x) = y'(x)\) is the first derivative of the transformation \(y(x)\) and \(D y\left(\mu_x\right)\) is the first derivative evaluated at the mean \(\mu_x\), and for the vector-valued function \[ \boldsymbol y(\boldsymbol x) \approx \boldsymbol y\left(\boldsymbol \mu_{\boldsymbol x}\right) + D \boldsymbol y\left(\boldsymbol \mu_{\boldsymbol x}\right) \, \left(\boldsymbol x-\boldsymbol \mu_{\boldsymbol x}\right) \] where \(D \boldsymbol y(\boldsymbol x)\) is the Jacobian matrix (vector derivative) for the transformation \(\boldsymbol y(\boldsymbol x)\) and \(D \boldsymbol y\left(\boldsymbol \mu_{\boldsymbol x}\right)\) is the Jacobian matrix evaluated at the mean \(\boldsymbol \mu_{\boldsymbol x}\).
In the univariate case the delta method yields as approximation for the mean and variance of the transformed random variable \(y\) \[ \text{E}(y) \approx y\left(\mu_x\right) \] and \[ \text{Var}(y)\approx \left(D y\left(\mu_x\right)\right)^2 \, \sigma^2_x \]
For the vector random variable \(\boldsymbol y\) the delta method yields \[\text{E}(\boldsymbol y)\approx\boldsymbol y\left(\boldsymbol \mu_{\boldsymbol x}\right)\] and \[ \text{Var}(\boldsymbol y)\approx D \boldsymbol y\left(\boldsymbol \mu_{\boldsymbol x}\right) \, \boldsymbol \Sigma_{\boldsymbol x} \, D\boldsymbol y\left(\boldsymbol \mu_{\boldsymbol x}\right)^T \]
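The quality of the approximation can be assessed by simulation. A minimal sketch (assuming NumPy; the transformation \(y = \exp(x)\) with \(x \sim N(\mu, \sigma^2)\) is an illustrative choice) compares the delta-method mean and variance with Monte Carlo estimates; the agreement improves as \(\sigma^2\) shrinks:

```python
# Sketch: delta method for y = exp(x) with x ~ N(mu, sigma^2).
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 1.0, 0.1

# delta method: E(y) ~ y(mu), Var(y) ~ (Dy(mu))^2 sigma^2 with Dy(x) = exp(x)
mean_delta = np.exp(mu)
var_delta = np.exp(mu) ** 2 * sigma ** 2

y = np.exp(rng.normal(mu, sigma, size=200_000))
print(mean_delta, y.mean())   # approximately equal for small sigma
print(var_delta, y.var())
```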
Assuming \(y(x) = a + b x\), with \(x(y) = (y-a)/b\), \(D y(x) = b\) and \(D x(y) = b^{-1}\), recovers the univariate location-scale transformation. Likewise, assuming \(\boldsymbol y(\boldsymbol x) = \boldsymbol a+ \boldsymbol B\boldsymbol x\), with \(\boldsymbol x(\boldsymbol y) = \boldsymbol B^{-1}(\boldsymbol y-\boldsymbol a)\), \(D\boldsymbol y(\boldsymbol x) = \boldsymbol B\) and \(D\boldsymbol x(\boldsymbol y) = \boldsymbol B^{-1}\), recovers the multivariate location-scale transformation.
3.3 Exponential tilting and exponential families
Another way to change the distribution of a random variable is by exponential tilting.
Suppose there is a vector valued function \(\boldsymbol u(x)\) where each component is a transformation of \(x\), usually a simple function such as the identity \(x\), the square \(x^2\), the logarithm \(\log(x)\) and so on. These are called the canonical statistics. Typically, the dimension of \(\boldsymbol u(x)\) is small.
The exponential tilt of a base distribution \(P_0\) with pdmf \(p_0(x)\) towards the linear combination \(\boldsymbol \eta^T \boldsymbol u(x)\) of the canonical statistics \(\boldsymbol u(x)\) and the canonical parameters \(\boldsymbol \eta\) yields the distribution family \(P_{\boldsymbol \eta}\) with pdmf \[ p(x|\boldsymbol \eta) = \underbrace{e^{ \boldsymbol \eta^T \boldsymbol u(x)}}_{\text{exponential tilt}}\, p_0(x) \, /\, z(\boldsymbol \eta) \] The normalising factor \(z(\boldsymbol \eta)\) ensures that \(p(x|\boldsymbol \eta)\) integrates to one, with \[ z(\boldsymbol \eta) = \int_x \, e^{ \boldsymbol \eta^T \boldsymbol u(x)}\, p_0(x) \, dx \] and \(z(\mathbf 0)=\int_x p_0(x) \, dx=1\).
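As a small worked example, tilting a standard normal base density with the single canonical statistic \(u(x) = x\) gives \(z(\eta) = e^{\eta^2/2}\) and a tilted density that is again normal, now with mean \(\eta\). A minimal sketch (assuming NumPy and SciPy; the base distribution, canonical statistic and value of \(\eta\) are illustrative choices):

```python
# Sketch: exponential tilting of a standard normal base density with u(x) = x.
import numpy as np
from scipy import stats
from scipy.integrate import quad

eta = 0.7

# normalising factor z(eta) by numerical integration
z, _ = quad(lambda x: np.exp(eta * x) * stats.norm.pdf(x), -np.inf, np.inf)
print(z, np.exp(eta ** 2 / 2))        # z(eta) = exp(eta^2 / 2) for this base

x = np.linspace(-3.0, 4.0, 200)
p_tilted = np.exp(eta * x) * stats.norm.pdf(x) / z
assert np.allclose(p_tilted, stats.norm.pdf(x, loc=eta))   # tilted density is N(eta, 1)
```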
A distribution family \(P_{\boldsymbol \eta}\) obtained by exponential tilting is called an exponential family. The set of values of \(\boldsymbol \eta\) for which \(z(\boldsymbol \eta) < \infty\), and hence for which \(p(x|\boldsymbol \eta)\) is well defined, comprises the parameter space of the exponential family. Some choices of \(p_0(x)\) and \(\boldsymbol u(x)\) do not yield a finite normalising factor for any \(\boldsymbol \eta\) and hence these cannot be used to form an exponential family.
Many commonly used distribution families are exponential families, such as the normal distribution and the Bernoulli distribution. Exponential families are extremely important in probability and statistics. They provide highly effective models for statistical learning using entropy, likelihood and Bayesian approaches, allow for substantial data reduction via minimal sufficiency and provide the basis for generalised linear models. Furthermore, exponential families often make it possible to generalise probabilistic results, e.g. those established for the normal distribution, to a broader domain.
See also (Wikipedia): exponential family — table of distributions.
3.4 Sums of random variables and convolution
Moments
Suppose we have a sum of \(n\) independent random variables \[ y = x_1 + x_2 + \ldots + x_n \] where each \(x_i \sim F_{x_i}\) has its own distribution and corresponding probability density or mass function \(f_{x_i}(x)\).
With \(\boldsymbol x= (x_1, \ldots, x_n)^T\) and \(\mathbf 1_n = (1, 1, \ldots, 1)^T\) the relationship between \(y\) and \(\boldsymbol x\) can be written as the affine transformation \(y= \mathbf 1_n^T \boldsymbol x\). Assuming \(\text{E}(x_i) = \mu_i\), \(\text{Var}(x_i) = \sigma^2_i\) and \(\text{Cov}(x_i, x_j)=0\) for \(i\neq j\) the mean and variance of the random variable \(y\) equal (cf. Section 3.1) \[ \text{E}(y) = \mathbf 1_n^T \boldsymbol \mu= \sum_{i=1}^n \mu_i \] and \[ \text{Var}(y) = \mathbf 1_n^T \, \text{Var}(\boldsymbol x) \, \mathbf 1_n = \sum_{i=1}^n \sigma^2_i \]
Thus both the mean and the variance are additive (but note that for the variance this holds only because of the independence assumption).
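A minimal sketch of this additivity (assuming NumPy; the three differently distributed summands are an illustrative choice):

```python
# Sketch: mean and variance of a sum of independent random variables are additive.
import numpy as np

rng = np.random.default_rng(3)
n = 500_000

x1 = rng.normal(1.0, 2.0, n)          # mean 1,   variance 4
x2 = rng.exponential(3.0, n)          # mean 3,   variance 9
x3 = rng.uniform(0.0, 1.0, n)         # mean 0.5, variance 1/12

y = x1 + x2 + x3
print(y.mean(), 1.0 + 3.0 + 0.5)           # additivity of the means
print(y.var(), 4.0 + 9.0 + 1.0 / 12.0)     # additivity of the variances (independence)
```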
Convolution
The pdmf for \(y\) is obtained by repeated application of convolution (symbolised by the \(\ast\) operator): \[ f_y(y) = \left(f_{x_1} \ast f_{x_2} \ast \ldots \ast f_{x_n}\right)(y) \]
The convolution of two functions is defined as (continuous case) \[ \left(f_{x_1}\ast f_{x_2}\right)(y)=\int_x f_{x_1}(x)\, f_{x_2}(y-x) dx \] and (discrete case) \[ \left(f_{x_1}\ast f_{x_2}\right)(y)=\sum_x f_{x_1}(x)\, f_{x_2}(y-x) \] Convolution is commutative and associative, so it can be applied in any order to compute the convolution of multiple functions. Furthermore, the convolution of pdmfs yields another pdmf.
Many commonly used random variables can be viewed as the outcome of convolutions. For example, the sum of Bernoulli variables yields a binomial random variable and the sum of normal variables yields another normal random variable.
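The Bernoulli example can be reproduced directly with discrete convolution. A minimal sketch (assuming NumPy and SciPy): convolving the Bernoulli(\(p\)) pmf with itself \(n\) times yields the Binomial(\(n\), \(p\)) pmf.

```python
# Sketch: repeated discrete convolution of Bernoulli pmfs gives the binomial pmf.
import numpy as np
from scipy import stats

p, n = 0.3, 5
bernoulli_pmf = np.array([1 - p, p])           # pmf on the support {0, 1}

pmf = np.array([1.0])                          # pmf of the empty sum (point mass at 0)
for _ in range(n):
    pmf = np.convolve(pmf, bernoulli_pmf)      # one convolution per summand

assert np.allclose(pmf, stats.binom.pmf(np.arange(n + 1), n, p))
```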
See also (Wikipedia): list of convolutions of probability distributions.
Central limit theorem
The central limit theorem, first postulated by Abraham de Moivre (1667–1754) and later proved by Pierre-Simon Laplace (1749–1827), asserts that the distribution of the suitably standardised sum of \(n\) independent and identically distributed random variables with finite mean and finite variance converges in the limit of large \(n\) to a normal distribution (Section 4.3), even if the individual random variables are not themselves normal. In other words, it asserts that for large \(n\) the convolution of \(n\) identical distributions with finite first two moments is approximately normal.
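A minimal simulation sketch (assuming NumPy and SciPy; the uniform summands and the sample sizes are illustrative choices): the standardised sum of \(n\) i.i.d. uniform random variables has quantiles close to those of the standard normal distribution.

```python
# Sketch: central limit theorem for sums of i.i.d. uniform random variables.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, reps = 50, 100_000

x = rng.uniform(0.0, 1.0, size=(reps, n))      # each term has mean 1/2 and variance 1/12
y = x.sum(axis=1)
z = (y - n * 0.5) / np.sqrt(n / 12.0)          # standardised sum

q = [0.05, 0.25, 0.5, 0.75, 0.95]
print(np.quantile(z, q))                       # close to the standard normal quantiles
print(stats.norm.ppf(q))
```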