3 Transformations and convolution
3.1 Affine or location-scale transformation
Transformation rule
Suppose \(x\) is a scalar. The variable \[ y= a + b x \] is a location-scale transformation or affine transformation of \(x\), where \(a\) plays the role of the location parameter and \(b\) is the scale parameter. For \(a=0\) this is a linear transformation.
If \(b\neq 0\) then the transformation is invertible, with back-transformation \[x = (y-a)/b\] Invertible transformations provide a one-to-one map between \(x\) and \(y\).
For a vector \(\boldsymbol x\) of dimension \(d\) the location-scale transformation is \[ \boldsymbol y= \boldsymbol a+ \boldsymbol B\boldsymbol x \] where \(\boldsymbol a\) (an \(m \times 1\) vector) is the location parameter and \(\boldsymbol B\) (an \(m \times d\) matrix) is the scale parameter, so that \(\boldsymbol y\) has dimension \(m\). For \(\boldsymbol a=\boldsymbol 0\) this is a linear transformation.
For \(m=d\) (square \(\boldsymbol B\)) and \(\det(\boldsymbol B) \neq 0\) the affine transformation is invertible with back-transformation \[\boldsymbol x= \boldsymbol B^{-1}(\boldsymbol y-\boldsymbol a)\]
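For concreteness, here is a minimal NumPy sketch (all parameter values are illustrative) that applies a multivariate affine transformation and recovers \(\boldsymbol x\) via the back-transformation:

```python
import numpy as np

# illustrative location and scale parameters (m = d = 2, B invertible)
a = np.array([1.0, -2.0])
B = np.array([[2.0, 0.5],
              [0.0, 1.0]])
x = np.array([0.3, 0.7])

y = a + B @ x                       # affine transformation
x_back = np.linalg.solve(B, y - a)  # back-transformation B^{-1}(y - a)

print(np.allclose(x, x_back))       # True: the map is one-to-one
```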
Probability mass function
If \(x \sim F_x\) is a discrete scalar random variable with pmf \(f_{x}(x)\), then under an invertible transformation \(y(x)= a + b x\) (i.e. with \(b \neq 0\)) the pmf \(f_{y}(y)\) of the discrete scalar random variable \(y\) is given by \[ f_{y}(y)= f_{x} \left( \frac{y-a}{b}\right) \]
Likewise, if \(\boldsymbol x\sim F_{\boldsymbol x}\) is a discrete random vector with pmf \(f_{\boldsymbol x}(\boldsymbol x)\), then under an invertible transformation \(\boldsymbol y(\boldsymbol x) = \boldsymbol a+ \boldsymbol B\boldsymbol x\) the pmf \(f_{\boldsymbol y}(\boldsymbol y)\) of the discrete random vector \(\boldsymbol y\) is given by \[ f_{\boldsymbol y}(\boldsymbol y)= f_{\boldsymbol x} \left( \boldsymbol B^{-1}(\boldsymbol y-\boldsymbol a)\right) \]
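As a sketch of the discrete case (parameters chosen purely for illustration), the transformation \(y = a + b x\) merely relocates the probability mass, so evaluating \(f_x((y-a)/b)\) on the transformed support reproduces the pmf of \(y\):

```python
import numpy as np
from scipy.stats import binom

# illustrative example: x ~ Binomial(n=5, p=0.4) and y = 1 + 2x
n, p, a, b = 5, 0.4, 1, 2

x_support = np.arange(n + 1)
y_support = a + b * x_support

# pmf of y at its support points: f_y(y) = f_x((y - a)/b)
f_y = binom.pmf((y_support - a) / b, n, p)

print(np.allclose(f_y, binom.pmf(x_support, n, p)))  # True: same masses, shifted support
```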
Density
If \(x \sim F_x\) is a continuous scalar random variable with pdf \(f_{x}(x)\), then under an invertible transformation \(y(x)= a + b x\) the pdf \(f_{y}(y)\) of the continuous scalar random variable \(y\) is given by \[ f_{y}(y)=|b|^{-1} f_{x} \left( \frac{y-a}{b}\right) \] where \(|b|\) is the absolute value of \(b\). The transformation of the corresponding differential element is \[ dy = |b| \, dx \]
Likewise, if \(\boldsymbol x\sim F_{\boldsymbol x}\) is a continuous random vector with pdf \(f_{\boldsymbol x}(\boldsymbol x)\), then under an invertible transformation \(\boldsymbol y(\boldsymbol x) = \boldsymbol a+ \boldsymbol B\boldsymbol x\) the pdf \(f_{\boldsymbol y}(\boldsymbol y)\) of the continuous random vector \(\boldsymbol y\) is given by \[ f_{\boldsymbol y}(\boldsymbol y)=|\det\left(\boldsymbol B\right)|^{-1} f_{\boldsymbol x} \left( \boldsymbol B^{-1}(\boldsymbol y-\boldsymbol a)\right) \] where \(|\det(\boldsymbol B)|\) is the absolute value of the determinant \(\det(\boldsymbol B)\). The transformation of the corresponding infinitesimal volume element is \[ d{\boldsymbol y} = |\det\left(\boldsymbol B\right)|\, d{\boldsymbol x} \]
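The univariate density formula is easy to verify against a known location-scale family; a sketch assuming \(x \sim N(0,1)\), so that \(y = a + b x \sim N(a, b^2)\):

```python
import numpy as np
from scipy.stats import norm

a, b = 2.0, -3.0              # illustrative location and scale (note b < 0 is allowed)
y = np.linspace(-10, 14, 201)

# change-of-variables formula: f_y(y) = |b|^{-1} f_x((y - a)/b)
f_y = norm.pdf((y - a) / b) / abs(b)

# direct density of N(a, b^2); scipy parameterises scale by the standard deviation |b|
print(np.allclose(f_y, norm.pdf(y, loc=a, scale=abs(b))))  # True
```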
Moments
The transformed random variable \(y \sim F_y\) has mean \[\operatorname{E}(y) = a + b \mu_x\] and variance \[\operatorname{Var}(y) = b^2 \sigma^2_x\] where \(\operatorname{E}(x) = \mu_x\) and \(\operatorname{Var}(x) = \sigma^2_x\) are the mean and variance of the original variable \(x\).
The mean and variance of the transformed random vector \(\boldsymbol y\sim F_{\boldsymbol y}\) are \[\operatorname{E}(\boldsymbol y)=\boldsymbol a+ \boldsymbol B\,\boldsymbol \mu_{\boldsymbol x}\] and \[\operatorname{Var}(\boldsymbol y)= \boldsymbol B\,\boldsymbol \Sigma_{\boldsymbol x} \,\boldsymbol B^T\] where \(\operatorname{E}(\boldsymbol x)=\boldsymbol \mu_{\boldsymbol x}\) and \(\operatorname{Var}(\boldsymbol x)=\boldsymbol \Sigma_{\boldsymbol x}\) are the mean and variance of the original random vector \(\boldsymbol x\).
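A quick Monte Carlo check of the vector moment formulas (distribution and parameters are illustrative; the formulas hold for any \(F_{\boldsymbol x}\) with finite second moments):

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
a = np.array([0.5, -1.0])
B = np.array([[1.0, 1.0],
              [0.0, 2.0]])

x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = x @ B.T + a                          # affine transformation, applied row-wise

print(y.mean(axis=0), a + B @ mu)        # empirical vs exact mean
print(np.cov(y.T),    B @ Sigma @ B.T)   # empirical vs exact covariance
```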
Importance of affine transformations
The constants \(\boldsymbol a\) and \(\boldsymbol B\) (or \(a\) and \(b\) in the univariate case) are the parameters of the location-scale family \(F_{\boldsymbol y}(\boldsymbol a, \boldsymbol B)\) created from \(F_{\boldsymbol x}\). Many important distributions are location-scale families, such as the normal distribution (cf. Section 5.3 and Section 6.3) and the location-scale \(t\)-distribution (Section 5.6 and Section 6.6). Furthermore, key procedures in multivariate statistics such as orthogonal transformations (including PCA) or whitening transformations (e.g. the Mahalanobis transformation) are affine transformations.
3.2 General invertible transformation
Transformation rule
As above, we assume \(x\) is a scalar and \(\boldsymbol x\) a vector, and now consider a general invertible transformation.
For a scalar variable the transformation is specified by \(y(x) = h(x)\) and the back-transformation by \(x(y) = h^{-1}(y)\). For a vector this becomes \(\boldsymbol y(\boldsymbol x) = \boldsymbol h(\boldsymbol x)\) with back-transformation \(\boldsymbol x(\boldsymbol y) = \boldsymbol h^{-1}(\boldsymbol y)\). The functions \(h(x)\) and \(\boldsymbol h(\boldsymbol x)\) are assumed to be invertible.
Probability mass function
If \(x \sim F_x\) is a discrete scalar random variable with pmf \(f_{x}(x)\) then the pmf \(f_y(y)\) of the transformed discrete scalar random variable \(y(x)\) is given by \[ f_y(y) = f_x(x(y)) \]
Likewise, for a discrete random vector \(\boldsymbol x\sim F_{\boldsymbol x}\) with pmf \(f_{\boldsymbol x}(\boldsymbol x)\) the pmf \(f_{\boldsymbol y}(\boldsymbol y)\) for the discrete random vector \(\boldsymbol y(\boldsymbol x)\) is obtained by \[ f_{\boldsymbol y}(\boldsymbol y) = f_{\boldsymbol x}\left( \boldsymbol x(\boldsymbol y) \right) \]
Density
If \(x \sim F_x\) is a continuous scalar random variable with pdf \(f_{x}(x)\) the pdf \(f_y(y)\) of the transformed continuous scalar random variable \(y(x)\) is given by \[ f_y(y) =\left| D x(y) \right|\, f_x(x(y)) \] where \(D x(y)\) is the derivative of the inverse transformation \(x(y)\). The transformation of the differential element is \[ dy = \left| D y(x) \right| \, dx \] Note that \(| D x(y)| = | D y(x)|^{-1}\rvert_{x = x(y)}\).
Likewise, for a continuous random vector \(\boldsymbol x\sim F_{\boldsymbol x}\) with pdf \(f_{\boldsymbol x}(\boldsymbol x)\) the pdf \(f_{\boldsymbol y}(\boldsymbol y)\) for the continuous random vector \(\boldsymbol y(\boldsymbol x)\) is obtained by \[ f_{\boldsymbol y}(\boldsymbol y) = |\det\left( D\boldsymbol x(\boldsymbol y) \right)| \,\, f_{\boldsymbol x}\left( \boldsymbol x(\boldsymbol y) \right) \] where \(D\boldsymbol x(\boldsymbol y)\) is the Jacobian matrix of the inverse transformation \(\boldsymbol x(\boldsymbol y)\). The transformation of the infinitesimal volume element is \[ d{\boldsymbol y} = |\det\left( D\boldsymbol y(\boldsymbol x) \right)|\, d{\boldsymbol x} \] Note that \(|\det\left( D\boldsymbol x(\boldsymbol y) \right)| = |\det\left( D\boldsymbol y(\boldsymbol x) \right)|^{-1} \rvert_{\boldsymbol x= \boldsymbol x(\boldsymbol y)}\).
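A standard worked example: for \(x \sim N(\mu, \sigma^2)\) and \(y(x) = \exp(x)\), the inverse is \(x(y) = \log y\) with \(D x(y) = 1/y\), so \(f_y(y) = \frac{1}{y} f_x(\log y)\), which is the log-normal density. A sketch comparing this with scipy's log-normal density (parameter values illustrative):

```python
import numpy as np
from scipy.stats import norm, lognorm

mu, sigma = 0.5, 0.8           # illustrative parameters of x ~ N(mu, sigma^2)
y = np.linspace(0.01, 10, 500)

# change of variables: x(y) = log(y), |D x(y)| = 1/y
f_y = norm.pdf(np.log(y), loc=mu, scale=sigma) / y

# scipy's log-normal uses shape s = sigma and scale = exp(mu)
print(np.allclose(f_y, lognorm.pdf(y, s=sigma, scale=np.exp(mu))))  # True
```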
Moments
The mean and variance of the transformed random variable can typically only be approximated. Assume that \(\operatorname{E}(x) = \mu_x\) and \(\operatorname{Var}(x) = \sigma^2_x\) are the mean and variance of the original random variable \(x\), and that \(\operatorname{E}(\boldsymbol x)=\boldsymbol \mu_{\boldsymbol x}\) and \(\operatorname{Var}(\boldsymbol x)=\boldsymbol \Sigma_{\boldsymbol x}\) are the mean and variance of the original random vector \(\boldsymbol x\). In the delta method the transformation \(y(x)\), respectively \(\boldsymbol y(\boldsymbol x)\), is linearised around the mean \(\mu_x\), respectively \(\boldsymbol \mu_{\boldsymbol x}\), and the mean and variance resulting from this linear transformation are reported.
Specifically, the linear approximation for the scalar-valued function is \[ y(x) \approx y\left(\mu_x\right) + D y\left(\mu_x\right)\, \left(x-\mu_x\right) \] where \(D y(x) = y'(x)\) is the first derivative of the transformation \(y(x)\) and \(D y\left(\mu_x\right)\) is the first derivative evaluated at the mean \(\mu_x\), and for the vector-valued function \[ \boldsymbol y(\boldsymbol x) \approx \boldsymbol y\left(\boldsymbol \mu_{\boldsymbol x}\right) + D \boldsymbol y\left(\boldsymbol \mu_{\boldsymbol x}\right) \, \left(\boldsymbol x-\boldsymbol \mu_{\boldsymbol x}\right) \] where \(D \boldsymbol y(\boldsymbol x)\) is the Jacobian matrix (vector derivative) for the transformation \(\boldsymbol y(\boldsymbol x)\) and \(D \boldsymbol y\left(\boldsymbol \mu_{\boldsymbol x}\right)\) is the Jacobian matrix evaluated at the mean \(\boldsymbol \mu_{\boldsymbol x}\).
In the univariate case the delta method yields as approximation for the mean and variance of the transformed random variable \(y\) \[ \operatorname{E}(y) \approx y\left(\mu_x\right) \] and \[ \operatorname{Var}(y)\approx \left(D y\left(\mu_x\right)\right)^2 \, \sigma^2_x \]
For the vector random variable \(\boldsymbol y\) the delta method yields \[\operatorname{E}(\boldsymbol y)\approx\boldsymbol y\left(\boldsymbol \mu_{\boldsymbol x}\right)\] and \[ \operatorname{Var}(\boldsymbol y)\approx D \boldsymbol y\left(\boldsymbol \mu_{\boldsymbol x}\right) \, \boldsymbol \Sigma_{\boldsymbol x} \, D\boldsymbol y\left(\boldsymbol \mu_{\boldsymbol x}\right)^T \]
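Continuing the \(y = \exp(x)\) example, the delta method gives \(\operatorname{E}(y) \approx e^{\mu_x}\) and \(\operatorname{Var}(y) \approx e^{2\mu_x} \sigma^2_x\) since \(D y(x) = e^x\). A sketch comparing these with Monte Carlo estimates (the approximation is only accurate when \(\sigma_x\) is small):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 1.0, 0.1           # small sigma: the linearisation is accurate

x = rng.normal(mu, sigma, size=1_000_000)
y = np.exp(x)

# delta method: E(y) ~ y(mu), Var(y) ~ (Dy(mu))^2 sigma^2 with Dy(x) = exp(x)
print(y.mean(), np.exp(mu))                 # Monte Carlo vs delta approximation
print(y.var(),  np.exp(2 * mu) * sigma**2)  # Monte Carlo vs delta approximation
```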
Invertible affine transformation as special case
The invertible affine transformation (Section 3.1) is a special case of the general invertible transformation.
Assuming \(y(x) = a + b x\), with \(x(y) = (y-a)/b\), \(D y(x) = b\) and \(D x(y) = b^{-1}\), recovers the univariate location-scale transformation.
Likewise, assuming \(\boldsymbol y(\boldsymbol x) = \boldsymbol a+ \boldsymbol B\boldsymbol x\), with \(\boldsymbol x(\boldsymbol y) = \boldsymbol B^{-1}(\boldsymbol y-\boldsymbol a)\), \(D\boldsymbol y(\boldsymbol x) = \boldsymbol B\) and \(D\boldsymbol x(\boldsymbol y) = \boldsymbol B^{-1}\), recovers the multivariate location-scale transformation.
3.3 Convolution of random variables
Sum of independent random variables
Suppose we have a sum of \(n\) independent scalar random variables \[ y = x_1 + x_2 + \ldots + x_n \] where each \(x_i \sim F_{x_i}\) has its own distribution and corresponding pdmf (probability density or mass function) \(f_{x_i}(x)\). The corresponding means are \(\operatorname{E}(x_i) = \mu_i\) and the variances are \(\operatorname{Var}(x_i) = \sigma^2_i\). As the \(x_i\) are independent, and therefore uncorrelated, the covariances \(\operatorname{Cov}(x_i, x_j)=0\) vanish for \(i \neq j\).
With \(\boldsymbol x= (x_1, \ldots, x_n)^T\) and \(\mathbf 1_n = (1, 1, \ldots, 1)^T\) the relationship between \(y\) and \(\boldsymbol x\) can be written as the linear transformation \[ y= \mathbf 1_n^T \boldsymbol x \] As \(y\) is a scalar and \(\boldsymbol x\) a vector the transformation from \(\boldsymbol x\) to \(y\) is not invertible.
Moments
With \(\operatorname{E}(\boldsymbol x) = \boldsymbol \mu\) and \(\operatorname{Var}(\boldsymbol x) = \operatorname{Diag}(\sigma^2_1, \ldots, \sigma^2_n)\) the mean of the random variable \(y\) equals \[ \operatorname{E}(y) = \mathbf 1_n^T \boldsymbol \mu= \sum_{i=1}^n \mu_i \] and the variance of \(y\) is \[ \operatorname{Var}(y) = \mathbf 1_n^T \, \operatorname{Var}(\boldsymbol x) \, \mathbf 1_n = \sum_{i=1}^n \sigma^2_i \]
(cf. Section 3.1). Thus both the mean and variance of \(y\) are simply the sums of the individual means and variances (note that for the variance this only holds because the individual variables are uncorrelated).
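A quick numerical check, summing independent variables with different distributions (all choices illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 500_000

x1 = rng.normal(1.0, 2.0, N)       # mean 1,   variance 4
x2 = rng.exponential(3.0, N)       # mean 3,   variance 9
x3 = rng.uniform(0.0, 1.0, N)      # mean 0.5, variance 1/12

y = x1 + x2 + x3
print(y.mean(), 1.0 + 3.0 + 0.5)   # means add
print(y.var(),  4.0 + 9.0 + 1/12)  # variances add (requires uncorrelatedness)
```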
Convolution
The pdmf \(f_y(y)\) for \(y\) is obtained by repeatedly convolving (denoted by the asterisk \(\ast\) operator) the pdmfs of the \(x_i\): \[ f_y(y) = \left(f_{x_1} \ast f_{x_2} \ast \ldots f_{x_n}\right)(y) \]
The convolution of two functions is defined as (continuous case) \[ \left(f_{x_1}\ast f_{x_2}\right)(y)=\int_x f_{x_1}(x)\, f_{x_2}(y-x) \, dx \] and (discrete case) \[ \left(f_{x_1}\ast f_{x_2}\right)(y)=\sum_x f_{x_1}(x)\, f_{x_2}(y-x) \] Convolution is commutative and associative, so you may convolve multiple pdmfs in any order or grouping. Furthermore, the convolution of pdmfs yields another pdmf, i.e. the resulting function integrates (or sums) to one.
Many commonly used random variables can be viewed as the outcome of convolutions. For example, the sum of Bernoulli variables yields a binomial random variable and the sum of normal variables yields another normal random variable.
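The Bernoulli example is easy to verify numerically: repeatedly convolving the Bernoulli pmf with itself (a sketch using np.convolve; parameters illustrative) reproduces the binomial pmf:

```python
import numpy as np
from scipy.stats import binom

p, n = 0.3, 6                      # illustrative parameters
bern = np.array([1 - p, p])        # pmf of Bernoulli(p) on {0, 1}

f_y = np.array([1.0])              # pmf of the empty sum (point mass at 0)
for _ in range(n):
    f_y = np.convolve(f_y, bern)   # convolve n Bernoulli pmfs

print(np.allclose(f_y, binom.pmf(np.arange(n + 1), n, p)))  # True: Binomial(n, p)
```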
See also the Wikipedia list of convolutions of probability distributions.
Central limit theorem
The central limit theorem, first postulated by Abraham de Moivre (1667–1754) and later proved by Pierre-Simon Laplace (1749–1827), asserts that the distribution of the suitably standardised sum of \(n\) independent and identically distributed random variables with finite mean and finite variance converges, as \(n\) grows large, to a normal distribution (Section 5.3), even if the individual random variables are not themselves normal. In other words, for large \(n\) the convolution of \(n\) identical distributions with finite first two moments is, after standardisation, approximately normal.
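A short simulation illustrates the theorem (the uniform distribution is an arbitrary illustrative choice): standardised sums of uniform variables already match the standard normal closely for moderate \(n\).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n, reps = 50, 100_000

# x_i ~ Uniform(0, 1): mean 1/2, variance 1/12
x = rng.uniform(0, 1, size=(reps, n))
z = (x.sum(axis=1) - n * 0.5) / np.sqrt(n / 12)   # standardised sum

# empirical quantiles of z vs standard normal quantiles
for q in (0.05, 0.5, 0.95):
    print(round(np.quantile(z, q), 3), round(norm.ppf(q), 3))
```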