3 Transformations
3.1 Affine or location-scale transformation
Transformation rule
Suppose \(x \sim F_x\) is a scalar random variable. The random variable \[y= a + b x\] is a location-scale transformation or affine transformation of \(x\), where \(a\) plays the role of the location parameter and \(b\) is the scale parameter. For \(a=0\) this is a linear transformation. If \(b\neq 0\) then the transformation is invertible, with back-transformation \[x = (y-a)/b\] Invertible transformations provide a one-to-one map between \(x\) and \(y\).
For a random vector \(\boldsymbol x\sim F_{\boldsymbol x}\) of dimension \(d\) the location-scale transformation is \[ \boldsymbol y= \boldsymbol a+ \boldsymbol B\boldsymbol x \] where \(\boldsymbol a\) (an \(m \times 1\) vector) is the location parameter and \(\boldsymbol B\) (an \(m \times d\) matrix) is the scale parameter. For \(m=d\) (square \(\boldsymbol B\)) and \(\det(\boldsymbol B) \neq 0\) the affine transformation is invertible with back-transformation \[\boldsymbol x= \boldsymbol B^{-1}(\boldsymbol y-\boldsymbol a)\]
Density
If \(x\) is a continuous random variable with density \(f_{x}(x)\) and assuming an invertible transformation the density for \(y\) is given by \[ f_{y}(y)=|b|^{-1} f_{x} \left( \frac{y-a}{b}\right) \] where \(|b|\) is the absolute value of \(b\). Likewise, assuming an invertible transformation for a continuous random vector \(\boldsymbol x\) with density \(f_{\boldsymbol x}(\boldsymbol x)\) the density for \(\boldsymbol y\) is given by \[ f_{\boldsymbol y}(\boldsymbol y)=|\det(\boldsymbol B)|^{-1} f_{\boldsymbol x} \left( \boldsymbol B^{-1}(\boldsymbol y-\boldsymbol a)\right) \] where \(|\det(\boldsymbol B)|\) is the absolute value of the determinant \(\det(\boldsymbol B)\).
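As a quick numerical check (a minimal sketch in Python, assuming NumPy and SciPy are available; the particular values of \(a\) and \(b\) are arbitrary choices), for a standard normal \(x\) the transformed density must coincide with that of a normal with mean \(a\) and standard deviation \(|b|\):

```python
import numpy as np
from scipy import stats

a, b = 2.0, -3.0                  # arbitrary location and (negative) scale
y = np.linspace(-10.0, 14.0, 25)  # a few evaluation points

# change-of-variables formula: f_y(y) = |b|^{-1} f_x((y - a)/b)
f_y = stats.norm.pdf((y - a) / b) / abs(b)

# for a normal base distribution this equals N(a, b^2) directly
print(np.allclose(f_y, stats.norm.pdf(y, loc=a, scale=abs(b))))  # True
```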
Moments
The transformed random variable \(y \sim F_y\) has mean \[\text{E}(y) = a + b \mu_x\] and variance \[\text{Var}(y) = b^2 \sigma^2_x\] where \(\text{E}(x) = \mu_x\) and \(\text{Var}(x) = \sigma^2_x\) are the mean and variance of the original variable \(x\).
The mean and variance of the transformed random vector \(\boldsymbol y\sim F_{\boldsymbol y}\) are \[\text{E}(\boldsymbol y)=\boldsymbol a+ \boldsymbol B\,\boldsymbol \mu_{\boldsymbol x}\] and \[\text{Var}(\boldsymbol y)= \boldsymbol B\,\boldsymbol \Sigma_{\boldsymbol x} \,\boldsymbol B^T\] where \(\text{E}(\boldsymbol x)=\boldsymbol \mu_{\boldsymbol x}\) and \(\text{Var}(\boldsymbol x)=\boldsymbol \Sigma_{\boldsymbol x}\) are the mean and variance of the original random vector \(\boldsymbol x\).
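These moment identities can be verified by simulation; below is a minimal sketch with NumPy, where the particular \(\boldsymbol \mu_{\boldsymbol x}\), \(\boldsymbol \Sigma_{\boldsymbol x}\), \(\boldsymbol a\) and \(\boldsymbol B\) are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_x = np.array([1.0, -2.0])
Sigma_x = np.array([[2.0, 0.5],
                    [0.5, 1.0]])
a = np.array([3.0, 0.0, -1.0])
B = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [-1.0, 1.0]])                    # m = 3, d = 2

x = rng.multivariate_normal(mu_x, Sigma_x, size=200_000)
y = a + x @ B.T                                # affine transformation, row-wise

print(np.allclose(y.mean(axis=0), a + B @ mu_x, atol=0.02))    # E(y) = a + B mu_x
print(np.allclose(np.cov(y.T), B @ Sigma_x @ B.T, atol=0.05))  # Var(y) = B Sigma B^T
```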
Importance of affine transformations
The constants \(\boldsymbol a\) and \(\boldsymbol B\) (or \(a\) and \(b\) in the univariate case) are the parameters of the location-scale family \(F_{\boldsymbol y}\) created from \(F_{\boldsymbol x}\). Many important distributions are location-scale families such as the normal distribution (cf. Section 4.3 and Section 5.3) and the location-scale \(t\)-distribution (Section 4.6 and Section 5.6). Furthermore, key procedures in multivariate statistics such as orthogonal transformations (including PCA) or whitening transformations (e.g. the Mahalanobis transformation) are affine transformations.
3.2 General invertible transformation
Transformation rule
As above we assume \(x \sim F_x\) is a scalar random variable and \(\boldsymbol x\sim F_{\boldsymbol x}\) is a random vector.
As a generalisation of invertible affine transformations we now consider general invertible transformations. For a scalar random variable we assume the transformation is specified by \(y(x) = h(x)\) and the back-transformation by \(x(y) = h^{-1}(y)\). For a random vector we assume \(\boldsymbol y(\boldsymbol x) = \boldsymbol h(\boldsymbol x)\) is invertible with back-transformation \(\boldsymbol x(\boldsymbol y) = \boldsymbol h^{-1}(\boldsymbol y)\).
Density
If \(x\) is a continuous random variable with density \(f_{x}(x)\) the density of the transformed variable \(y\) can be computed exactly and is given by \[ f_y(y) =\left| D x(y) \right|\, f_x(x(y)) \] where \(D x(y)\) is the derivative of the inverse transformation \(x(y)\).
Likewise, for a continuous random vector \(\boldsymbol x\) with density \(f_{\boldsymbol x}(\boldsymbol x)\) the density for \(\boldsymbol y\) is obtained by \[ f_{\boldsymbol y}(\boldsymbol y) = |\det\left( D\boldsymbol x(\boldsymbol y) \right)| \,\, f_{\boldsymbol x}\left( \boldsymbol x(\boldsymbol y) \right) \] where \(D\boldsymbol x(\boldsymbol y)\) is the Jacobian matrix of the inverse transformation \(\boldsymbol x(\boldsymbol y)\).
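As an example (a sketch assuming \(y = e^x\) with \(x\) standard normal, so that \(x(y) = \log y\) and \(Dx(y) = 1/y\)), the formula reproduces the log-normal density:

```python
import numpy as np
from scipy import stats

y = np.linspace(0.1, 5.0, 50)

# f_y(y) = |Dx(y)| f_x(x(y)) with x(y) = log(y) and Dx(y) = 1/y
f_y = stats.norm.pdf(np.log(y)) / y

# SciPy's lognorm with shape s=1 is exactly the distribution of exp(standard normal)
print(np.allclose(f_y, stats.lognorm.pdf(y, s=1.0)))  # True
```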
Moments
The mean and variance of the transformed random variable can typically only be approximated. Assume that \(\text{E}(x) = \mu_x\) and \(\text{Var}(x) = \sigma^2_x\) are the mean and variance of the original random variable \(x\), and that \(\text{E}(\boldsymbol x)=\boldsymbol \mu_{\boldsymbol x}\) and \(\text{Var}(\boldsymbol x)=\boldsymbol \Sigma_{\boldsymbol x}\) are the mean and variance of the original random vector \(\boldsymbol x\). In the delta method the transformation \(y(x)\) (respectively \(\boldsymbol y(\boldsymbol x)\)) is linearised around the mean \(\mu_x\) (respectively \(\boldsymbol \mu_{\boldsymbol x}\)), and the mean and variance resulting from the linearised transformation are reported.
Specifically, the linear approximation for the scalar-valued function is \[ y(x) \approx y\left(\mu_x\right) + D y\left(\mu_x\right)\, \left(x-\mu_x\right) \] where \(D y(x) = y'(x)\) is the first derivative of the transformation \(y(x)\) and \(D y\left(\mu_x\right)\) is the first derivative evaluated at the mean \(\mu_x\), and for the vector-valued function \[ \boldsymbol y(\boldsymbol x) \approx \boldsymbol y\left(\boldsymbol \mu_{\boldsymbol x}\right) + D \boldsymbol y\left(\boldsymbol \mu_{\boldsymbol x}\right) \, \left(\boldsymbol x-\boldsymbol \mu_{\boldsymbol x}\right) \] where \(D \boldsymbol y(\boldsymbol x)\) is the Jacobian matrix (vector derivative) for the transformation \(\boldsymbol y(\boldsymbol x)\) and \(D \boldsymbol y\left(\boldsymbol \mu_{\boldsymbol x}\right)\) is the Jacobian matrix evaluated at the mean \(\boldsymbol \mu_{\boldsymbol x}\).
In the univariate case the delta method yields as approximation for the mean and variance of the transformed random variable \(y\) \[ \text{E}(y) \approx y\left(\mu_x\right) \] and \[ \text{Var}(y)\approx \left(D y\left(\mu_x\right)\right)^2 \, \sigma^2_x \]
For the vector random variable \(\boldsymbol y\) the delta method yields \[\text{E}(\boldsymbol y)\approx\boldsymbol y\left(\boldsymbol \mu_{\boldsymbol x}\right)\] and \[ \text{Var}(\boldsymbol y)\approx D \boldsymbol y\left(\boldsymbol \mu_{\boldsymbol x}\right) \, \boldsymbol \Sigma_{\boldsymbol x} \, D\boldsymbol y\left(\boldsymbol \mu_{\boldsymbol x}\right)^T \]
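The quality of the approximation depends on how close the transformation is to linear near the mean. A minimal sketch (again assuming \(y = e^x\), with \(Dy(x) = e^x\), and a normal \(x\) with small variance, arbitrary choices for illustration) compares the delta method with Monte Carlo estimates:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_x, sigma_x = 1.0, 0.1             # small variance, where linearisation works well

# delta method: E(y) ~ y(mu), Var(y) ~ (Dy(mu))^2 sigma^2 with Dy(x) = exp(x)
mean_delta = np.exp(mu_x)
var_delta = np.exp(mu_x) ** 2 * sigma_x**2

# Monte Carlo reference
y = np.exp(rng.normal(mu_x, sigma_x, size=1_000_000))
print(mean_delta, y.mean())          # close for small sigma_x
print(var_delta, y.var())
```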
Assuming \(y(x) = a + b x\), with \(x(y) = (y-a)/b\), \(D y(x) = b\) and \(D x(y) = b^{-1}\), recovers the univariate location-scale transformation. Likewise, assuming \(\boldsymbol y(\boldsymbol x) = \boldsymbol a+ \boldsymbol B\boldsymbol x\), with \(\boldsymbol x(\boldsymbol y) = \boldsymbol B^{-1}(\boldsymbol y-\boldsymbol a)\), \(D\boldsymbol y(\boldsymbol x) = \boldsymbol B\) and \(D\boldsymbol x(\boldsymbol y) = \boldsymbol B^{-1}\), recovers the multivariate location-scale transformation.
3.3 Exponential tilting and exponential families
Another way to change the distribution of a random variable is by exponential tilting.
Suppose there is a vector-valued function \(\boldsymbol u(x)\) where each component is a transformation of \(x\), usually a simple function such as the identity \(x\), the square \(x^2\), the logarithm \(\log(x)\) etc. These are called the canonical statistics. Typically, the dimension of \(\boldsymbol u(x)\) is small.
The exponential tilt of a base distribution \(P_0\) with pdmf \(p_0(x)\) towards the linear combination \(\boldsymbol \eta^T \boldsymbol u(x)\) of the canonical statistics \(\boldsymbol u(x)\) and the canonical parameters \(\boldsymbol \eta\) yields the distribution family \(P_{\boldsymbol \eta}\) with pdmf \[ p(x|\boldsymbol \eta) = \underbrace{e^{ \boldsymbol \eta^T \boldsymbol u(x)}}_{\text{exponential tilt}}\, p_0(x) \, /\, z(\boldsymbol \eta) \] The normalising factor \(z(\boldsymbol \eta)\) ensures that \(p(x|\boldsymbol \eta)\) integrates to one, with \[ z(\boldsymbol \eta) = \int_x \, e^{ \boldsymbol \eta^T \boldsymbol u(x)}\, p_0(x) \, dx \] and \(z(\mathbf 0)=\int_x p_0(x) \, dx=1\).
A distribution family \(P_{\boldsymbol \eta}\) obtained by exponential tilting is called an exponential family. The set of values of \(\boldsymbol \eta\) for which \(z(\boldsymbol \eta) < \infty\), and hence for which \(p(x|\boldsymbol \eta)\) is well defined, comprises the parameter space of the exponential family. Some choices of \(p_0(x)\) and \(\boldsymbol u(x)\) do not yield a finite normalising factor for any \(\boldsymbol \eta\) and hence cannot be used to form an exponential family.
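To make the construction concrete, here is a small discrete sketch (assuming a geometric base pmf truncated to a finite support, an arbitrary choice): with canonical statistic \(u(x) = x\), tilting reweights the base pmf by \(e^{\eta x}\) and renormalises.

```python
import numpy as np

x = np.arange(0, 50)              # truncated support, for illustration only
p0 = 0.5 ** (x + 1)               # geometric(1/2) base pmf
p0 = p0 / p0.sum()                # renormalise on the truncated support

def tilt(eta):
    """Exponential tilt of p0 with canonical statistic u(x) = x."""
    w = np.exp(eta * x) * p0      # unnormalised tilted pmf
    return w / w.sum()            # division by z(eta)

p = tilt(-0.5)
print(p.sum())                        # 1.0: the tilted pmf is normalised
print((x * p0).sum(), (x * p).sum())  # tilting shifts the mean of the distribution
```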
Many commonly used distribution families are exponential families, such as the normal distribution and the Bernoulli distribution. Exponential families are extremely important in probability and statistics. They provide highly effective models for statistical learning using entropy, likelihood and Bayesian approaches, allow for substantial data reduction via minimal sufficiency and provide the basis for generalised linear models. Furthermore, exponential families often make it possible to generalise probabilistic results, e.g. those established for the normal distribution, to a broader domain.
See also (Wikipedia): exponential family — table of distributions.
3.4 Sums of random variables and convolution
Moments
Suppose we have a sum of \(n\) independent random variables \[ y = x_1 + x_2 + \ldots + x_n \] where each \(x_i \sim F_{x_i}\) has its own distribution and corresponding probability density or mass function (pdmf) \(f_{x_i}(x)\).
With \(\boldsymbol x= (x_1, \ldots, x_n)^T\) and \(\mathbf 1_n = (1, 1, \ldots, 1)^T\) the relationship between \(y\) and \(\boldsymbol x\) can be written as the affine transformation \(y= \mathbf 1_n^T \boldsymbol x\). Assuming \(\text{E}(x_i) = \mu_i\), \(\text{Var}(x_i) = \sigma^2_i\) and \(\text{Cov}(x_i, x_j)=0\) for \(i\neq j\) the mean and variance of the random variable \(y\) are (cf. Section 3.1) \[
\text{E}(y) = \mathbf 1_n^T \boldsymbol \mu= \sum_{i=1}^n \mu_i
\] and \[
\text{Var}(y) = \mathbf 1_n^T \, \text{Var}(\boldsymbol x) \, \mathbf 1_n = \sum_{i=1}^n \sigma^2_i
\]
Thus both the mean and the variance are additive (but note that for the variance this is only true because of the independence assumption).
Convolution
The pdmf for \(y\) is obtained by repeatedly convolving (denoted by the asterisk \(\ast\) operator) the pdmfs of the \(x_i\): \[ f_y(y) = \left(f_{x_1} \ast f_{x_2} \ast \ldots \ast f_{x_n}\right)(y) \]
The convolution of two functions is defined as (continuous case) \[ \left(f_{x_1}\ast f_{x_2}\right)(y)=\int_x f_{x_1}(x)\, f_{x_2}(y-x) dx \] and (discrete case) \[ \left(f_{x_1}\ast f_{x_2}\right)(y)=\sum_x f_{x_1}(x)\, f_{x_2}(y-x) \] Convolution is commutative and associative, so you may convolve multiple pdmfs in any order or grouping. Furthermore, the convolution of pdmfs yields another pdmf, i.e. the resulting function integrates (or sums) to one.
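In the discrete case the convolution is a finite sum and can be computed directly, e.g. with np.convolve. A sketch (assuming two independent binomial variables with a common success probability, whose sum is again binomial):

```python
import numpy as np
from scipy import stats

# pmfs of x1 ~ Bin(3, 0.4) and x2 ~ Bin(5, 0.4) on their full supports
f1 = stats.binom.pmf(np.arange(4), 3, 0.4)
f2 = stats.binom.pmf(np.arange(6), 5, 0.4)

f_y = np.convolve(f1, f2)         # pmf of y = x1 + x2 by discrete convolution

print(np.allclose(f_y, stats.binom.pmf(np.arange(9), 8, 0.4)))  # True: Bin(8, 0.4)
print(f_y.sum())                  # 1.0: convolution of pmfs is again a pmf
```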
Many commonly used random variables can be viewed as the outcome of convolutions. For example, the sum of Bernoulli variables yields a binomial random variable and the sum of normal variables yields another normal random variable.
See also (Wikipedia): list of convolutions of probability distributions.
Central limit theorem
The central limit theorem, first postulated by Abraham de Moivre (1667–1754) and later proved by Pierre-Simon Laplace (1749–1827), asserts that the distribution of the sum of \(n\) independent and identically distributed random variables with finite mean and finite variance converges in the limit of large \(n\) to a normal distribution (Section 4.3), even if the individual random variables are not themselves normal. In other words, for large \(n\) the convolution of \(n\) identical distributions with finite first two moments converges to the normal distribution.
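A simulation sketch (using standardised sums of uniform random variables, an arbitrary non-normal choice) illustrates the convergence:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 100                                   # number of summands
x = rng.uniform(size=(50_000, n))         # uniform(0,1): mean 1/2, variance 1/12

# standardised sums: (sum - n mu) / sqrt(n sigma^2)
z = (x.sum(axis=1) - n / 2) / np.sqrt(n / 12)

# distance to the standard normal, e.g. via the Kolmogorov-Smirnov statistic
print(stats.kstest(z, "norm").statistic)  # small, and shrinks as n grows
```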
3.5 Loss and risk functions
Loss function
A loss or cost function \(L(x, a)\) evaluates a prediction \(a\) (for example a parameter or a probability distribution) on the basis of an observed outcome \(x\), and returns a numerical score.
A loss function measures, informally, the error between \(x\) and \(a\). During optimisation the prediction \(a\) is varied and the aim is minimisation of the error (hence a loss function has negative orientation, smaller is better).
Adding a constant to the loss function, or multiplying it by a positive factor, does not change the location of its minimum, so such loss functions are considered equivalent.
A utility or reward function is a loss function with a reversed sign (hence it has positive orientation, larger is better).
Risk function
The risk is defined as the expected loss \[ R(a) = \text{E}(L(x, a)) \,. \] A risk function \(R(a)\) is thus constructed from a random variable \(x\) and an associated loss function. The expectation is taken with regard to the distribution of \(x\).
The empirical risk \[ \hat{R}(a) = \frac{1}{n} \sum_{i=1}^{n} L(x_i, a) \] is obtained by replacing the expectation with a sample average.
Minimising \(R(a)\) finds optimal predictions
\[
a^{\ast} = \underset{a}{\arg \min}\, R(a)
\] Depending on the choice of the underlying loss \(L(x, a)\), minimising the risk provides a very general optimisation-based way to identify features of the distribution of \(x\) and to obtain parameter estimates.
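A minimal sketch of empirical risk minimisation (assuming SciPy's scalar optimiser; the squared loss is used here because its known minimiser, the sample mean, provides a check — see also the squared loss further below):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
x = rng.normal(5.0, 2.0, size=1_000)      # observed sample

def empirical_risk(a, loss):
    """Average loss over the sample for prediction a."""
    return np.mean(loss(x, a))

# squared loss: the minimiser should be the sample mean
res = minimize_scalar(empirical_risk, args=(lambda xi, a: (xi - a) ** 2,))
print(res.x, x.mean())                    # essentially identical
```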
Scoring rules
A scoring rule \(S(x, P)\) is a special type of loss function that assesses a probabilistic forecast \(P\) by assigning a numerical score based on \(P\) and the observed outcome \(x\).
Being loss functions, scoring rules are negatively oriented. However, some authors consider them as utility functions with positive orientation.
The associated risk is \[ R(P) = \text{E}_{Q}\left(S(x, P)\right) \] where \(Q\) is the data-generating model, \(x \sim Q\).
The risk \(R(P)\) of a proper scoring rule is minimised if the forecast \(P\) equals the true distribution \(Q\). The scoring rule is strictly proper if this minimum is unique. As a result, for a proper scoring rule the risk satisfies \[ R(P) \geq R(Q) \] with minimal risk \(R(Q)\). The difference \[ D(Q, P) = R(P) - R(Q) \geq 0 \] is called the divergence between the two distributions \(Q\) and \(P\). By construction, the divergence \(D(Q, P)\) induced by a proper scoring rule is always non-negative and equals zero only if \(P=Q\).
Proper scoring rules are very useful as they make it possible to identify the underlying distribution and its parameters by minimising the risk or the associated divergence.
Proper scoring rules also have a number of further useful properties. For example, various decompositions exist for their risk, and the divergence satisfies a generalised Pythagorean theorem. Furthermore, there is a correspondence of proper scoring rules and their associated divergences with Bregman divergences.
Squared loss function
The squared loss or squared error is one of the most commonly used loss functions: \[ L(x,a) = (x-a)^2 \]
The corresponding risk \[ R(a) = \text{E}((x-a)^2) \] is the mean squared loss or mean squared error (MSE).
From \(R(a) = \text{E}((x-a)^2) = \text{E}(x^2) - 2 a \text{E}(x) + a^2\) it follows \(dR(a)/da = - 2 \text{E}(x) + 2 a\) and thus that the MSE is minimised at the mean \(a^{\ast} = \text{E}(x)\). The achieved minimum risk \(R(a^{\ast}) = \text{Var}(x)\) is the variance.
Other loss functions
The 0-1 loss function can be written as \[ L(x, a) = \begin{cases} -[x = a] & \text{discrete case} \\ -\delta(x-a) & \text{continuous case} \\ \end{cases} \] employing the indicator function and Dirac delta function, respectively. The corresponding risk assuming \(x \sim Q\) and pdmf \(q(x)\) is \[ R(a) = -q(a) \] which is minimised at the mode of the pdmf.
The asymmetric loss can be defined as \[ L(x, a; \tau) = \begin{cases} 2 \tau (x-a) & \text{for $x\geq a$} \\ 2 (1-\tau) (a-x) & \text{for $x < a$} \\ \end{cases} \] and the corresponding risk is minimised at the quantile \(x_{\tau}\).
For \(\tau=1/2\) it reduces to the absolute loss \[ L(x, a) = | x - a| \] whose corresponding risk is minimised at the median \(x_{1/2}\).
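Continuing the empirical risk minimisation sketch from above, numerically minimising the empirical risk of the asymmetric loss recovers the corresponding sample quantile (assuming SciPy; the value \(\tau = 0.9\) is an arbitrary choice):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
x = rng.normal(5.0, 2.0, size=10_000)
tau = 0.9

def asymmetric_loss(xi, a):
    # pinball loss: 2*tau*(x - a) for x >= a, 2*(1 - tau)*(a - x) otherwise
    return np.where(xi >= a, 2 * tau * (xi - a), 2 * (1 - tau) * (a - xi))

res = minimize_scalar(lambda a: np.mean(asymmetric_loss(x, a)))
print(res.x, np.quantile(x, tau))   # both close to the 0.9 quantile
```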
Logarithmic scoring rule
The most important example of a strictly proper scoring rule is the logarithmic scoring rule \(S(x, P) = - \log p(x)\), also called log-loss. It is the only local strictly proper scoring rule, with the score solely depending on the value \(p(x)\), i.e. only on the value of the pdmf at the observed outcome \(x\), and not on any other features of the distribution \(P\).
The risk associated with the log-loss is the cross-entropy \[ R(P) = - \text{E}_{Q} \log p(x) = H(Q, P) \] and the achieved minimum at \(P=Q\) is the entropy of \(Q\) \[ R(Q) = -\text{E}_{Q} \log q(x) = H(Q) \] The fact that cross-entropy is bounded below by entropy, \(R(P)\geq R(Q)\) or \(H(Q, P) \geq H(Q)\), is known as Gibbs’ inequality.
The corresponding empirical risk based on the empirical distribution \(\hat{Q}_n\) and a distribution family \(P_{\theta}\) is proportional to the log-likelihood function \[ \hat{R}(\theta) = H(\hat{Q}_n, P_{\theta}) = - \frac{1}{n} \sum_{i=1}^n \log p(x_i; \theta) = - \frac{1}{n} \ell_n(\theta) \] hence minimising the empirical risk \(\hat{R}(\theta)\) is equivalent to maximising the log-likelihood function \(\ell_n(\theta)\).
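A sketch of this equivalence (assuming a normal model with unknown mean and known unit standard deviation, an arbitrary choice): minimising the empirical log-loss risk yields the maximum likelihood estimate, here the sample mean.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(5)
x = rng.normal(3.0, 1.0, size=1_000)

def empirical_log_loss(mu):
    # empirical risk of S(x, P) = -log p(x) under the model N(mu, 1)
    return -np.mean(stats.norm.logpdf(x, loc=mu, scale=1.0))

res = minimize_scalar(empirical_log_loss)
print(res.x, x.mean())   # the MLE of mu equals the sample mean
```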
The divergence resulting from the log-loss is the KL divergence \[ D_{\text{KL}}(Q,P) = H(Q, P) -H(Q) = \text{E}_{Q} \log\left(\frac{q(x)}{p(x)}\right) \]
The KL divergence obeys the data processing inequality, i.e. applying a transformation to the underlying random variables cannot increase the KL divergence \(D_{\text{KL}}(Q,P)\) between \(Q\) and \(P\). This property also holds for all \(f\)-divergences (of which the KL divergence is a principal example), but is notably not satisfied by divergences of other proper scoring rules (and thus other Bregman divergences).
Furthermore, the KL divergence is the only divergence induced by proper scoring rules (and thus the only Bregman divergence), as well as the only \(f\)-divergence, that is invariant against general coordinate transformations. Coordinate transformations can be viewed as a special case of data processing, and for \(D_{\text{KL}}(Q,P)\) the data-processing inequality under general invertible transformations becomes an identity.
Other proper scoring rules
The Brier scoring rule, also known as quadratic scoring rule, evaluates a probabilistic categorical forecast \(P\) with corresponding class probabilities \(p_1, \ldots, p_K\). It can be written as \[ \begin{split} S(\boldsymbol x, P) &= \sum_{y=1}^K \left(x_y -p_y\right)^2 \\ &= 1 - 2 \sum_{y=1}^K x_y p_y + \sum_{y=1}^K p_y^2\\ &= 1 - 2 p_k + \sum_{y=1}^K p_y^2\\ \end{split} \] where \(\boldsymbol x\sim Q\) is a realisation from the categorical distribution \(Q\) with class probabilities \(q_1, \ldots, q_K\). The indicator vector \(\boldsymbol x= (x_1, \ldots, x_K)^T = (0, 0, \ldots, 1, \ldots, 0)^T\) contains zeros everywhere except for a single element \(x_k=1\). Unlike the log-score, the Brier score is not local as the pmf for \(P\) is evaluated across all \(K\) classes, not just at the realised class \(k\).
The corresponding risk is \[ R(P) = \text{E}_Q(S(\boldsymbol x, P)) = 1 -2 \sum_{y=1}^K q_y p_y +\sum_{y=1}^K p_y^2 \] which is uniquely minimised for \(P=Q\). Thus, the Brier score is strictly proper. The minimally achieved risk is \[ R(Q) = 1 - \sum_{y=1}^K q_y^2 \] and hence the divergence induced by the Brier score is \[ D(Q, P) = R(P) - R(Q) = \sum_{y=1}^K \left(q_y - p_y\right)^2 \] i.e. the squared Euclidean distance between the two pmfs.
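A short numerical check (with arbitrary class probabilities \(q\) and \(p\)) confirms the risk and divergence formulas:

```python
import numpy as np

q = np.array([0.5, 0.3, 0.2])   # true class probabilities (Q)
p = np.array([0.4, 0.4, 0.2])   # forecast probabilities (P)

risk_p = 1 - 2 * (q * p).sum() + (p**2).sum()   # R(P)
risk_q = 1 - (q**2).sum()                       # R(Q), the attainable minimum

div = risk_p - risk_q
print(np.isclose(div, ((q - p) ** 2).sum()))    # True: squared Euclidean distance
print(div >= 0)                                 # True: the Brier score is proper
```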
There are many other useful proper scoring rules, including the continuous ranked probability score, the energy score, and the Hyvärinen scoring rule.
See also (Wikipedia): loss function and scoring rule.