3 Transformations
3.1 Affine or location-scale transformation
Transformation rule
Suppose \(x \sim F_x\) is a scalar random variable. The random variable \[y= a + b x\] is a location-scale transformation or affine transformation of \(x\), where \(a\) plays the role of the location parameter and \(b\) is the scale parameter. For \(a=0\) this is a linear transformation. If \(b\neq 0\) then the transformation is invertible, with back-transformation \[x = (y-a)/b\] Invertible transformations provide a one-to-one map between \(x\) and \(y\).
For a random vector \(\boldsymbol x\sim F_{\boldsymbol x}\) of dimension \(d\) the location-scale transformation is \[ \boldsymbol y= \boldsymbol a+ \boldsymbol B\boldsymbol x \] where \(\boldsymbol a\) (an \(m \times 1\) vector) is the location parameter and \(\boldsymbol B\) (an \(m \times d\) matrix) the scale parameter. For \(m=d\) (square \(\boldsymbol B\)) and \(\det(\boldsymbol B) \neq 0\) the affine transformation is invertible with back-transformation \[\boldsymbol x= \boldsymbol B^{-1}(\boldsymbol y-\boldsymbol a)\]
Density
If \(x\) is a continuous random variable with density \(f_{x}(x)\) and assuming an invertible transformation the density for \(y\) is given by \[ f_{y}(y)=|b|^{-1} f_{x} \left( \frac{y-a}{b}\right) \] where \(|b|\) is the absolute value of \(b\). Likewise, assuming an invertible transformation for a continuous random vector \(\boldsymbol x\) with density \(f_{\boldsymbol x}(\boldsymbol x)\) the density for \(\boldsymbol y\) is given by \[ f_{\boldsymbol y}(\boldsymbol y)=|\det(\boldsymbol B)|^{-1} f_{\boldsymbol x} \left( \boldsymbol B^{-1}(\boldsymbol y-\boldsymbol a)\right) \] where \(|\det(\boldsymbol B)|\) is the absolute value of the determinant \(\det(\boldsymbol B)\).
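As a quick numerical check of the univariate density formula, here is a minimal Python sketch (assuming numpy and scipy are available; the parameter values are illustrative): for a standard normal \(x\) the transformed density coincides with the directly parameterised normal density.

```python
import numpy as np
from scipy.stats import norm

a, b = 1.0, 2.0                      # location and scale parameters
y = np.linspace(-5.0, 7.0, 25)

# change-of-variables density of y = a + b*x for x ~ N(0, 1)
f_y = norm.pdf((y - a) / b) / abs(b)

# for a normal base distribution this matches the N(a, b^2) density
assert np.allclose(f_y, norm.pdf(y, loc=a, scale=abs(b)))
```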
Moments
The transformed random variable \(y \sim F_y\) has mean \[\text{E}(y) = a + b \mu_x\] and variance \[\text{Var}(y) = b^2 \sigma^2_x\] where \(\text{E}(x) = \mu_x\) and \(\text{Var}(x) = \sigma^2_x\) are the mean and variance of the original variable \(x\).
The mean and variance of the transformed random vector \(\boldsymbol y\sim F_{\boldsymbol y}\) are \[\text{E}(\boldsymbol y)=\boldsymbol a+ \boldsymbol B\,\boldsymbol \mu_{\boldsymbol x}\] and \[\text{Var}(\boldsymbol y)= \boldsymbol B\,\boldsymbol \Sigma_{\boldsymbol x} \,\boldsymbol B^T\] where \(\text{E}(\boldsymbol x)=\boldsymbol \mu_{\boldsymbol x}\) and \(\text{Var}(\boldsymbol x)=\boldsymbol \Sigma_{\boldsymbol x}\) are the mean and variance of the original random vector \(\boldsymbol x\).
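The moment formulas can be verified by simulation. The following Python sketch (numpy assumed; the values of \(\boldsymbol a\), \(\boldsymbol B\), \(\boldsymbol \mu_{\boldsymbol x}\) and \(\boldsymbol \Sigma_{\boldsymbol x}\) are illustrative) draws from a bivariate normal and applies a non-square \(\boldsymbol B\):

```python
import numpy as np

rng = np.random.default_rng(42)
mu = np.array([1.0, -1.0])                  # mean of x (d = 2)
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])              # variance of x
a = np.array([0.5, 0.0, -0.5])              # location (m = 3)
B = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [0.0, 3.0]])                  # scale (m x d, non-square)

x = rng.multivariate_normal(mu, Sigma, size=100_000)
y = a + x @ B.T                             # affine transformation, row-wise

print(np.mean(y, axis=0), a + B @ mu)       # sample mean vs. a + B mu
print(np.cov(y, rowvar=False))              # sample covariance, close to ...
print(B @ Sigma @ B.T)                      # ... B Sigma B^T
```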
Importance of affine transformations
The constants \(\boldsymbol a\) and \(\boldsymbol B\) (or \(a\) and \(b\) in the univariate case) are the parameters of the location-scale family \(F_{\boldsymbol y}\) created from \(F_{\boldsymbol x}\). Many important distributions are location-scale families such as the normal distribution (cf. Section 4.3 and Section 5.3) and the location-scale \(t\)-distribution (Section 4.6 and Section 5.6). Furthermore, key procedures in multivariate statistics such as orthogonal transformations (including PCA) or whitening transformations (e.g. the Mahalanobis transformation) are affine transformations.
3.2 General invertible transformation
Transformation rule
As above we assume \(x \sim F_x\) is a scalar random variable and \(\boldsymbol x\sim F_{\boldsymbol x}\) is a random vector.
As a generalisation of invertible affine transformations we now consider general invertible transformations. For a scalar random variable we assume the transformation is specified by \(y(x) = h(x)\) and the back-transformation by \(x(y) = h^{-1}(y)\). For a random vector we assume \(\boldsymbol y(\boldsymbol x) = \boldsymbol h(\boldsymbol x)\) is invertible with back-transformation \(\boldsymbol x(\boldsymbol y) = \boldsymbol h^{-1}(\boldsymbol y)\).
Density
If \(x\) is a continuous random variable with density \(f_{x}(x)\) the density of the transformed variable \(y\) can be computed exactly and is given by \[ f_y(y) =\left| D x(y) \right|\, f_x(x(y)) \] where \(D x(y)\) is the derivative of the inverse transformation \(x(y)\).
Likewise, for a continuous random vector \(\boldsymbol x\) with density \(f_{\boldsymbol x}(\boldsymbol x)\) the density for \(\boldsymbol y\) is obtained by \[ f_{\boldsymbol y}(\boldsymbol y) = |\det\left( D\boldsymbol x(\boldsymbol y) \right)| \,\, f_{\boldsymbol x}\left( \boldsymbol x(\boldsymbol y) \right) \] where \(D\boldsymbol x(\boldsymbol y)\) is the Jacobian matrix of the inverse transformation \(\boldsymbol x(\boldsymbol y)\).
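For a concrete univariate instance, take \(y = h(x) = \exp(x)\) with \(x \sim N(0,1)\); then \(x(y) = \log y\) and \(D x(y) = 1/y\), and the formula reproduces the standard log-normal density. A minimal Python sketch (numpy and scipy assumed):

```python
import numpy as np
from scipy.stats import norm, lognorm

y = np.linspace(0.1, 5.0, 50)

# y = exp(x), x ~ N(0,1): inverse x(y) = log(y) with derivative Dx(y) = 1/y
f_y = norm.pdf(np.log(y)) / y

# this is exactly the standard log-normal density
assert np.allclose(f_y, lognorm.pdf(y, s=1))
```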
Moments
The mean and variance of the transformed random variable can typically only be approximated. Assume that \(\text{E}(x) = \mu_x\) and \(\text{Var}(x) = \sigma^2_x\) are the mean and variance of the original random variable \(x\) and \(\text{E}(\boldsymbol x)=\boldsymbol \mu_{\boldsymbol x}\) and \(\text{Var}(\boldsymbol x)=\boldsymbol \Sigma_{\boldsymbol x}\) are the mean and variance of the original random vector \(\boldsymbol x\). In the delta method the transformation \(y(x)\), or \(\boldsymbol y(\boldsymbol x)\), is linearised around the mean \(\mu_x\), respectively \(\boldsymbol \mu_{\boldsymbol x}\), and the mean and variance resulting from the linear transformation are reported.
Specifically, the linear approximation for the scalar-valued function is \[ y(x) \approx y\left(\mu_x\right) + D y\left(\mu_x\right)\, \left(x-\mu_x\right) \] where \(D y(x) = y'(x)\) is the first derivative of the transformation \(y(x)\) and \(D y\left(\mu_x\right)\) is the first derivative evaluated at the mean \(\mu_x\), and for the vector-valued function \[ \boldsymbol y(\boldsymbol x) \approx \boldsymbol y\left(\boldsymbol \mu_{\boldsymbol x}\right) + D \boldsymbol y\left(\boldsymbol \mu_{\boldsymbol x}\right) \, \left(\boldsymbol x-\boldsymbol \mu_{\boldsymbol x}\right) \] where \(D \boldsymbol y(\boldsymbol x)\) is the Jacobian matrix (vector derivative) for the transformation \(\boldsymbol y(\boldsymbol x)\) and \(D \boldsymbol y\left(\boldsymbol \mu_{\boldsymbol x}\right)\) is the Jacobian matrix evaluated at the mean \(\boldsymbol \mu_{\boldsymbol x}\).
In the univariate case the delta method yields as approximation for the mean and variance of the transformed random variable \(y\) \[ \text{E}(y) \approx y\left(\mu_x\right) \] and \[ \text{Var}(y)\approx \left(D y\left(\mu_x\right)\right)^2 \, \sigma^2_x \]
For the vector random variable \(\boldsymbol y\) the delta method yields \[\text{E}(\boldsymbol y)\approx\boldsymbol y\left(\boldsymbol \mu_{\boldsymbol x}\right)\] and \[ \text{Var}(\boldsymbol y)\approx D \boldsymbol y\left(\boldsymbol \mu_{\boldsymbol x}\right) \, \boldsymbol \Sigma_{\boldsymbol x} \, D\boldsymbol y\left(\boldsymbol \mu_{\boldsymbol x}\right)^T \]
Assuming \(y(x) = a + b x\), with \(x(y) = (y-a)/b\), \(D y(x) = b\) and \(D x(y) = b^{-1}\), recovers the univariate location-scale transformation. Likewise, assuming \(\boldsymbol y(\boldsymbol x) = \boldsymbol a+ \boldsymbol B\boldsymbol x\), with \(\boldsymbol x(\boldsymbol y) = \boldsymbol B^{-1}(\boldsymbol y-\boldsymbol a)\), \(D\boldsymbol y(\boldsymbol x) = \boldsymbol B\) and \(D\boldsymbol x(\boldsymbol y) = \boldsymbol B^{-1}\), recovers the multivariate location-scale transformation.
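As an illustration of the delta method, consider again \(y = \exp(x)\) with \(x \sim N(\mu_x, \sigma^2_x)\), so \(D y(x) = \exp(x)\). A minimal Python sketch (numpy assumed; the parameter values are illustrative) compares the approximation with a Monte Carlo estimate; the approximation is accurate here because \(\sigma_x\) is small:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_x, sigma_x = 1.0, 0.1            # small sigma: the linearisation is accurate

x = rng.normal(mu_x, sigma_x, size=1_000_000)
y = np.exp(x)                       # nonlinear transformation y(x) = exp(x)

# delta method with Dy(mu_x) = exp(mu_x)
print(np.exp(mu_x), y.mean())                    # E(y) approx. y(mu_x)
print((np.exp(mu_x) * sigma_x)**2, y.var())      # Var(y) approx. Dy(mu_x)^2 sigma_x^2
```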
3.3 Exponential tilting and exponential families
Another way to change the distribution of a random variable is by exponential tilting.
Suppose there is a vector-valued function \(\boldsymbol t(x)\) where each component is a transformation of \(x\), usually a simple function such as the identity \(x\), the square \(x^2\), the logarithm \(\log(x)\) etc. These are called the canonical statistics. Typically, the dimension of \(\boldsymbol t(x)\) is small.
The exponential tilt of a base distribution \(B\) with base function \(h(x)\) (possibly unnormalised) toward the linear combination \(\boldsymbol \eta^T \boldsymbol t(x)\) of the canonical statistics \(\boldsymbol t(x)\) and the canonical parameters \(\boldsymbol \eta\) yields the distribution family \(P(\boldsymbol \eta)\) with pdmf \[ p(x|\boldsymbol \eta) = \underbrace{e^{ \boldsymbol \eta^T \boldsymbol t(x)}}_{\text{exponential tilt}}\, h(x) \, /\, z(\boldsymbol \eta) \] The normaliser or partition function \(z(\boldsymbol \eta)\) ensures that \(p(x|\boldsymbol \eta)\) integrates to one, with \[ z(\boldsymbol \eta) = \int_x \, e^{ \boldsymbol \eta^T \boldsymbol t(x)}\, h(x) \, dx \] In particular, \(z(\mathbf 0)=\int_x h(x) \, dx\) ensures that \[ p(x|\mathbf 0) = h(x)/z(\mathbf 0) = b(x) \] is a valid base pdmf. If \(h(x)\) is a pdmf then \(z(\mathbf 0)=1\) and \(b(x)=h(x)\).
A distribution family \(P(\boldsymbol \eta)\) obtained by exponential tilting is called an exponential family. The set of values of \(\boldsymbol \eta\) for which \(z(\boldsymbol \eta) < \infty\), and hence for which \(p(x|\boldsymbol \eta)\) is well defined, comprises the parameter space of the exponential family. Some choices of \(h(x)\) and \(\boldsymbol t(x)\) do not yield a finite normalising factor for any \(\boldsymbol \eta\) and hence cannot be used to form an exponential family.
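To make the construction concrete, the following Python sketch (numpy and scipy assumed; the base function and tilt are illustrative choices) tilts the unnormalised standard normal kernel \(h(x) = e^{-x^2/2}\) with canonical statistic \(\boldsymbol t(x) = x\), which shifts the mean: the tilted pdmf is the \(N(\eta, 1)\) density.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def h(x):                          # unnormalised base function (normal kernel)
    return np.exp(-x**2 / 2)

def z(eta):                        # partition function, by numerical integration
    return quad(lambda x: np.exp(eta * x) * h(x), -np.inf, np.inf)[0]

eta, x = 0.7, 1.3
p = np.exp(eta * x) * h(x) / z(eta)    # exponentially tilted pdmf

# tilting the normal kernel with t(x) = x yields the N(eta, 1) density
assert np.isclose(p, norm.pdf(x, loc=eta))
```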
The log-normaliser or log-partition function \(a(\boldsymbol \eta) = \log z(\boldsymbol \eta)\) is the cumulant generating function for the canonical statistics. Its gradient yields the mean \[ \text{E}( \boldsymbol t(x) ) = \boldsymbol \mu_{\boldsymbol t}= \nabla a(\boldsymbol \eta) \] and the Hessian matrix the variance \[ \text{Var}( \boldsymbol t(x) ) = \boldsymbol \Sigma_{\boldsymbol t}= \nabla \nabla^T a(\boldsymbol \eta) \]
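These identities can be checked on a simple example. For the Bernoulli family (canonical statistic \(t(x)=x\), \(h(x)=1\) on \(\{0,1\}\)) the log-partition function is \(a(\eta) = \log(1+e^\eta)\). A Python sketch using finite differences (numpy assumed):

```python
import numpy as np

# Bernoulli as exponential family: t(x) = x, h(x) = 1 on {0, 1}
def a(eta):                        # log-partition function log(1 + e^eta)
    return np.log1p(np.exp(eta))

eta, eps = 0.5, 1e-4
p = 1 / (1 + np.exp(-eta))         # success probability implied by eta

grad = (a(eta + eps) - a(eta - eps)) / (2 * eps)        # numerical gradient
hess = (a(eta + eps) - 2 * a(eta) + a(eta - eps)) / eps**2  # numerical Hessian

assert np.isclose(grad, p)                         # E(t(x)) = p
assert np.isclose(hess, p * (1 - p), atol=1e-6)    # Var(t(x)) = p(1-p)
```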
Many common distributions are exponential families, such as the normal distribution and the Bernoulli distribution. Exponential families are central in probability and statistics. They support effective statistical learning using likelihood and Bayesian approaches, enable data reduction via minimal sufficiency and provide the basis for generalised linear models. Furthermore, exponential families often allow results established for specific cases, such as the normal distribution, to be generalised to a broader domain.
See also (Wikipedia): exponential family — table of distributions.
3.4 Sums of random variables and convolution
Moments
Suppose we have a sum of \(n\) independent random variables \[ y = x_1 + x_2 + \ldots + x_n \] where each \(x_i \sim F_{x_i}\) has its own distribution and corresponding probability density mass function \(f_{x_i}(x)\).
With \(\boldsymbol x= (x_1, \ldots, x_n)^T\) and \(\mathbf 1_n = (1, 1, \ldots, 1)^T\) the relationship between \(y\) and \(\boldsymbol x\) can be written as the affine transformation \(y= \mathbf 1_n^T \boldsymbol x\). Assuming \(\text{E}(x_i) = \mu_i\), \(\text{Var}(x_i) = \sigma^2_i\) and \(\text{Cov}(x_i, x_j)=0\) for \(i\neq j\) the mean and variance of the random variable \(y\) equal (cf. Section 3.1) \[
\text{E}(y) = \mathbf 1_n^T \boldsymbol \mu= \sum_{i=1}^n \mu_i
\] and \[
\text{Var}(y) = \mathbf 1_n^T \, \text{Var}(\boldsymbol x) \, \mathbf 1_n = \sum_{i=1}^n \sigma^2_i
\]
Thus both the mean and the variance are additive (but note that for the variance this is only true because of the independence assumption).
Convolution
The pdmf for \(y\) is obtained by repeatedly convolving (denoted by the asterisk \(\ast\) operator) the pdmfs of the \(x_i\): \[ f_y(y) = \left(f_{x_1} \ast f_{x_2} \ast \ldots f_{x_n}\right)(y) \]
The convolution of two functions is defined as (continuous case) \[ \left(f_{x_1}\ast f_{x_2}\right)(y)=\int_x f_{x_1}(x)\, f_{x_2}(y-x) dx \] and (discrete case) \[ \left(f_{x_1}\ast f_{x_2}\right)(y)=\sum_x f_{x_1}(x)\, f_{x_2}(y-x) \] Convolution is commutative and associative, so you may convolve multiple pdmfs in any order or grouping. Furthermore, the convolution of pdmfs yields another pdmf, i.e. the resulting function integrates to one.
Many commonly used random variables can be viewed as the outcome of convolutions. For example, the sum of Bernoulli variables yields a binomial random variable and the sum of normal variables yields another normal random variable.
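The Bernoulli-to-binomial example can be checked directly with discrete convolution; a minimal Python sketch (numpy and scipy assumed):

```python
import numpy as np
from scipy.stats import binom

p = 0.3
bern = np.array([1 - p, p])        # Bernoulli(p) pmf on {0, 1}

# convolve three Bernoulli pmfs: the sum is Binomial(3, p)
pmf = np.convolve(np.convolve(bern, bern), bern)

assert np.allclose(pmf, binom.pmf(np.arange(4), n=3, p=p))
```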
See also (Wikipedia): list of convolutions of probability distributions.
Central limit theorem
The central limit theorem, first postulated by Abraham de Moivre (1667–1754) and later proved by Pierre-Simon Laplace (1749–1827), asserts that the distribution of the sum of \(n\) independent and identically distributed random variables with finite mean and finite variance converges in the limit of large \(n\) to a normal distribution (Section 4.3), even if the individual random variables are not themselves normal. In other words, for large \(n\) the convolution of \(n\) identical distributions with finite first two moments converges to the normal distribution.
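A small simulation illustrates the theorem. The following Python sketch (numpy and scipy assumed, with illustrative settings) standardises sums of exponential variables, which are individually far from normal, and compares their quantiles with standard normal quantiles:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 100                             # number of iid summands

# sums of n exponential(1) variables have mean n and variance n
s = rng.exponential(1.0, size=(100_000, n)).sum(axis=1)
z = (s - n) / np.sqrt(n)            # standardised sums

qs = [0.05, 0.25, 0.5, 0.75, 0.95]
print(np.quantile(z, qs))           # close to the normal quantiles below
print(norm.ppf(qs))
```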
3.5 Loss functions and scoring rules
Loss function
A loss or cost function \(L(x, a)\) evaluates a prediction \(a\) (for example a parameter or a probability distribution) on the basis of an observed outcome \(x\), and returns a numerical score.
A loss function measures, informally, the error between \(x\) and \(a\). During optimisation the prediction \(a\) is varied and the aim is minimisation of the error (hence a loss function has negative orientation, smaller is better).
Adding a constant to a loss function or multiplying it by a positive factor does not change the location of its minimum, so loss functions related in this way are considered equivalent.
A utility or reward function is a loss function with a reversed sign (hence it has positive orientation, larger is better).
Risk function
The risk of \(a\) under the distribution \(Q\) for \(x\) is defined as the expected loss \[ R_Q(a) = \text{E}_Q(L(x, a)) \] If there is no ambiguity we drop the reference to \(Q\) and write \[ R(a) = \text{E}(L(x, a)) \]
The risk of \(a\) under the empirical distribution \(\hat{Q}_n\) obtained from observations \(x_1, \ldots, x_n\) is the empirical risk \[ \hat{R}(a) = R_{\hat{Q}_n}(a) = \frac{1}{n} \sum_{i=1}^{n} L(x_i, a) \] where the expectation is replaced by the sample average.
Minimising \(R(a)\) finds optimal predictions
\[
a^{\ast} = \underset{a}{\arg \min}\, R(a)
\] Depending on the choice of the underlying loss \(L(x, a)\), minimising the risk provides a very general optimisation-based way to identify features of the distribution \(Q\) and to obtain parameter estimates.
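As a small worked instance (a Python sketch, numpy and scipy assumed), minimising the empirical risk numerically recovers the corresponding distributional feature; with the squared loss discussed below, this is the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
x = rng.normal(3.0, 1.5, size=1_000)          # observations x_1, ..., x_n

def empirical_risk(a):
    return np.mean((x - a)**2)                # squared loss L(x, a) = (x - a)^2

res = minimize_scalar(empirical_risk)
print(res.x, x.mean())                        # minimiser equals the sample mean
```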
Scoring rules
A scoring rule \(S(x, P)\) is a special type of loss function1 that assesses the probabilistic forecast \(P\) by assigning a numerical score based on \(P\) and the observed outcome \(x\).
The associated risk of \(P\) under \(Q\) is \[ R_Q(P) = \text{E}_{Q}\left(S(x, P)\right) \] For a proper scoring rule the risk \(R_Q(P)\) is minimised at \(P = Q\), hence \[ R_Q(P) \geq R_Q(Q) \] For a strictly proper scoring rule the minimum is achieved only at the true distribution \(Q\), so equality holds exclusively for \(P = Q\).
A proper scoring rule induces a divergence between the distributions \(Q\) and \(P\), defined as the difference between the risk and the minimum risk: \[ D(Q, P) = R_Q(P) - R_Q(Q) \geq 0 \] By construction, the divergence \(D(Q, P)\) is always non-negative and equals zero if \(P=Q\). For a strictly proper scoring rule the divergence vanishes exclusively for \(P=Q\).
Proper scoring rules are very useful as they make it possible to identify the underlying distribution and its parameters by risk minimisation or by minimisation of the associated divergences.
Proper scoring rules also have a number of further useful properties. For example, various decompositions exist for their risk, and the divergence satisfies a generalised Pythagorean theorem. Furthermore, there is a correspondence of proper scoring rules and their associated divergences with Bregman divergences.
Common loss functions
The squared loss or squared error is one of the most commonly used loss functions: \[ L(x,a) = (x-a)^2 \] The corresponding risk is the mean squared loss or mean squared error (MSE) \[ R(a) = \text{E}((x-a)^2) \] From \(R(a) = \text{E}((x-a)^2) = \text{E}(x^2) - 2 a \text{E}(x) + a^2\) it follows \(dR(a)/da = - 2 \text{E}(x) + 2 a\) and thus that the MSE is minimised at the mean \(a^{\ast} = \text{E}(x)\). The achieved minimum risk \(R(a^{\ast}) = \text{Var}(x)\) is the variance.
The 0-1 loss function can be written as \[ L(x, a) = \begin{cases} -[x = a] & \text{discrete case} \\ -\delta(x-a) & \text{continuous case} \\ \end{cases} \] employing the indicator function and Dirac delta function, respectively. The corresponding risk assuming \(x \sim Q\) and pdmf \(q(x)\) is \[ R_Q(a) = -q(a) \] which is minimised at the mode of the pdmf.
The asymmetric loss can be defined as \[ L(x, a; \tau) = \begin{cases} 2 \tau (x-a) & \text{for $x\geq a$} \\ 2 (1-\tau) (a-x) & \text{for $x < a$} \\ \end{cases} \] and the corresponding risk is minimised at the quantile \(x_{\tau}\).
For \(\tau=1/2\) it reduces to the absolute loss \[ L(x, a) = | x - a| \] whose corresponding risk is minimised at the median \(x_{1/2}\).
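A numerical check of the quantile property of the asymmetric loss (Python sketch, numpy and scipy assumed):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
x = rng.normal(size=10_000)

def pinball_risk(a, tau):
    # empirical risk of the asymmetric loss L(x, a; tau)
    return np.mean(np.where(x >= a, 2 * tau * (x - a), 2 * (1 - tau) * (a - x)))

tau = 0.9
res = minimize_scalar(pinball_risk, args=(tau,))
print(res.x, np.quantile(x, tau))   # both approximate the tau-quantile
```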
Logarithmic scoring rule
The most important scoring rule is the logarithmic scoring rule or log-loss \[ S(x, P) = - \log p(x) \]
The risk of \(P\) under \(Q\) based on the log-loss is the mean log-loss or cross-entropy \[ R_Q(P) = - \text{E}_{Q} \log p(x) = H(Q,P) \] which is uniquely minimised for \(P=Q\). Thus, the log-loss is strictly proper. Furthermore, the log-loss is notably the only local strictly proper scoring rule, as it solely depends on the value of the pdmf at the observed outcome \(x\), and not on any other features of the distribution \(P\). The minimum risk is the Shannon-Gibbs entropy of \(Q\): \[ R_Q(Q) = -\text{E}_{Q} \log q(x) = H(Q) \] The relationship \(H(Q, P) \geq H(Q)\), with equality exclusively for \(P=Q\), is known as Gibbs’ inequality.
The divergence induced by the log-loss is the Kullback-Leibler (KL) divergence \[ \begin{split} D_{\text{KL}}(Q,P) &= H(Q,P) -H(Q) \\ &= \text{E}_{Q} \log\left(\frac{q(x)}{p(x)}\right)\\ \end{split} \] The KL divergence obeys the data processing inequality, i.e. applying a transformation to the underlying random variables cannot increase the KL divergence \(D_{\text{KL}}(Q,P)\) between \(Q\) and \(P\). This property also holds for all \(f\)-divergences (of which the KL divergence is a principal example), but is notably not satisfied by divergences of other proper scoring rules (and thus other Bregman divergences).
Furthermore, the KL divergence is the only divergence induced by proper scoring rules (and thus the only Bregman divergence), as well as the only \(f\)-divergence, that is invariant against general coordinate transformations. Coordinate transformations can be viewed as a special case of data processing, and for \(D_{\text{KL}}(Q,P)\) the data-processing inequality under general invertible transformations becomes an identity.
The empirical risk of a distribution family \(P(\theta)\) based on the log-loss is the negative log-likelihood function scaled by \(1/n\): \[ \begin{split} \hat{R}(\theta) &= H(\hat{Q}_n, P(\theta)) \\ &= - \frac{1}{n} \sum_{i=1}^n \log p(x_i | \theta) \\ &= - \frac{1}{n} \ell_n(\theta)\\ \end{split} \] Minimising the empirical risk \(\hat{R}(\theta)\) is equivalent to maximising the log-likelihood function \(\ell_n(\theta)\).
Similarly, minimising the KL divergence \(D_{\text{KL}}(\hat{Q}_n,P(\theta))\) with regard to \(\theta\) is equivalent to minimising the empirical risk \(\hat{R}(\theta)\) and hence to maximum likelihood.
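For instance, minimising the empirical risk under the log-loss for a normal family \(P(\theta)\) with \(\theta = (\mu, \sigma)\) recovers the maximum likelihood estimates. A Python sketch (numpy and scipy assumed; the log-parameterisation of \(\sigma\) is chosen here for unconstrained optimisation):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(4)
x = rng.normal(2.0, 0.8, size=5_000)

def empirical_risk(theta):
    mu, log_sigma = theta
    # mean log-loss (cross-entropy) of the normal model N(mu, sigma^2)
    return -np.mean(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

res = minimize(empirical_risk, x0=[0.0, 0.0])
print(res.x[0], x.mean())                 # ML estimate of the mean
print(np.exp(res.x[1]), x.std())          # ML estimate of sigma (1/n version)
```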
Brier or quadratic scoring rule
The Brier scoring rule, also known as quadratic scoring rule, evaluates a probabilistic categorical forecast \(P\) with corresponding class probabilities \(p_1, \ldots, p_K\) given a realisation \(\boldsymbol x\) from the categorical distribution \(Q\) with class probabilities \(q_1, \ldots, q_K\). It can be written as \[ \begin{split} S(\boldsymbol x, P) &= \sum_{y=1}^K \left(x_y -p_y\right)^2 \\ &= 1 - 2 \sum_{y=1}^K x_y p_y + \sum_{y=1}^K p_y^2\\ &= 1 - 2 p_k + \sum_{y=1}^K p_y^2\\ \end{split} \] The indicator vector \(\boldsymbol x= (x_1, \ldots, x_K)^T = (0, 0, \ldots, 1, \ldots, 0)^T\) contains zeros everywhere except for a single element \(x_k=1\). Unlike the log-loss, the Brier score is not local as the pmf for \(P\) is evaluated across all \(K\) classes, not just at the realised class \(k\).
The corresponding risk is \[ \begin{split} R_Q(P) &= \text{E}_Q(S(\boldsymbol x, P)) \\ &= 1 -2 \sum_{y=1}^K q_y p_y +\sum_{y=1}^K p_y^2\\ \end{split} \] which is uniquely minimised for \(P=Q\). Thus, the Brier score is strictly proper. The minimum risk is \[ R_Q(Q) = 1 - \sum_{y=1}^K q_y^2 \]
The divergence induced by the Brier score is the squared Euclidean distance between the two pmfs: \[ \begin{split} D(Q, P) &= R_Q(P) - R_Q(Q) \\ & = \sum_{y=1}^K \left(q_y - p_y\right)^2\\ \end{split} \]
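A short numerical verification of this divergence identity (Python sketch, numpy assumed, with illustrative pmfs):

```python
import numpy as np

def brier_score(k, p):
    # S(x, P) where the realisation is class k (0-based indicator vector x)
    x = np.zeros_like(p)
    x[k] = 1.0
    return np.sum((x - p)**2)

q = np.array([0.5, 0.3, 0.2])       # true class probabilities Q
p = np.array([0.6, 0.2, 0.2])       # forecast class probabilities P

risk_p = sum(q[k] * brier_score(k, p) for k in range(len(q)))
risk_q = sum(q[k] * brier_score(k, q) for k in range(len(q)))

# the induced divergence is the squared Euclidean distance between the pmfs
assert np.isclose(risk_p - risk_q, np.sum((q - p)**2))
```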
Proper but not strictly proper scoring rules
An example of a proper, but not strictly proper, scoring rule is the squared error relative to the mean of the quoted model \(P\): \[ S(x, P) = (x- \text{E}(P))^2 \]
The corresponding risk is \[ \begin{split} R_Q(P) &= \text{E}_Q\left( (x- \text{E}(P))^2 \right)\\ & = (\text{E}(Q)-\text{E}(P))^2 + \text{Var}(Q)\\ \end{split} \] which is minimised at \(P=Q\) but also at any distribution \(P\) with the same mean as \(Q\). The minimum risk is the variance of \(Q\): \[ R_Q(Q) = \text{Var}(Q) \]
The associated divergence is the squared distance between the two means \[ \begin{split} D(Q, P) &= R_Q(P) - \text{Var}(Q) \\ &= (\text{E}(Q)-\text{E}(P))^2\\ \end{split} \] which vanishes at \(P=Q\) but also at any \(P\) with \(\text{E}(P)=\text{E}(Q)\).
The Dawid-Sebastiani scoring rule is a related scoring rule given by \[ S\left(x, P\right) = \log \text{Var}(P) + \frac{(x-\text{E}(P))^2}{\text{Var}(P)} \] It is equivalent to the log-loss applied to a normal model \(P\).
The corresponding risk is \[ \begin{split} R_Q(P) &= \log \text{Var}(P) + \frac{(\text{E}(Q)-\text{E}(P))^2}{\text{Var}(P)} + \frac{\text{Var}(Q)}{\text{Var}(P)}\\ \end{split} \] which is minimised at \(P=Q\) but also at any distribution \(P\) with \(\text{E}(P)=\text{E}(Q)\) and \(\text{Var}(P)=\text{Var}(Q)\).
The minimum risk is \[ R_Q(Q) = \log \text{Var}(Q) +1 \]
The associated divergence is \[ \begin{split} D(Q, P) &= R_Q(P) - R_Q(Q) \\ &= \frac{(\text{E}(Q)-\text{E}(P))^2}{\text{Var}(P)} +\frac{\text{Var}(Q)}{\text{Var}(P)} - \log\left( \frac{\text{Var}(Q)}{\text{Var}(P)} \right) -1 \\ \end{split} \] which vanishes at \(P=Q\) but also at any \(P\) for which \(\text{E}(P)=\text{E}(Q)\) and \(\text{Var}(P)=\text{Var}(Q)\).
Other strictly proper scoring rules
Other useful strictly proper scoring rules include:
- the continuous ranked probability score (CRPS),
- the energy score, and
- the Hyvärinen scoring rule.
See also (Wikipedia): scoring rule.
As loss functions, scoring rules are negatively oriented. However, some authors consider them as utility functions with positive orientation.↩︎