4  Evaluation

4.1 Loss functions

Loss function

A loss or cost function \(L(x, a)\) evaluates a prediction \(a\), for example a parameter or a probability distribution, on the basis of an observed outcome \(x\), and returns a numerical score.

A loss function measures, informally, the error between \(x\) and \(a\). During optimisation the prediction \(a\) is varied with the aim of minimising this error (hence a loss function has negative orientation, smaller is better).

A utility or reward function is a loss function with a reversed sign (hence it has positive orientation, larger is better).

Risk function

The risk of \(a\) under the distribution \(Q\) for \(x\) is defined as the expected loss \[ R(Q, a) = \operatorname{E}_Q(L(x, a)) \]

The risk is mixture-preserving in \(Q\), meaning that \[ R( Q_{\lambda}, a ) = (1-\lambda) R(Q_0, a) + \lambda R(Q_1, a) \] for the mixture \(Q_{\lambda}=(1-\lambda) Q_0 + \lambda Q_1\) with \(0 < \lambda < 1\) and \(Q_0 \neq Q_1\). This follows from the linearity of expectation.

The risk of \(a\) under the empirical distribution \(\hat{Q}_n\) obtained from observations \(x_1, \ldots, x_n\) is the empirical risk \[ R(\hat{Q}_n, a) = \frac{1}{n} \sum_{i=1}^{n} L(x_i, a) \] where the expectation is replaced by the sample average.
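
As a concrete illustration, here is a minimal Python sketch (not from the text) of the empirical risk, assuming observations stored in a NumPy array and a user-supplied loss function `loss(x, a)`, with the squared error as an example loss:

```python
import numpy as np

def empirical_risk(loss, x, a):
    """Empirical risk: the average loss of prediction a over the observed sample x."""
    return np.mean([loss(xi, a) for xi in np.asarray(x)])

# Example: squared error L(x, a) = (x - a)^2 evaluated at the prediction a = 1
x = np.array([1.2, 0.7, 2.1, 1.5])
print(empirical_risk(lambda xi, a: (xi - a) ** 2, x, a=1.0))
```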

Minimising risk

Minimising \(R(Q, a)\) with regard to \(a\) yields the optimal prediction
\[ a^{\ast} = \underset{a}{\arg \min}\, R(Q, a) \] with associated minimum risk \(R(Q, a^{\ast})\).

Depending on the choice of the underlying loss \(L(x, a)\), minimising the risk provides a very general optimisation-based way to identify features of the distribution \(Q\) and to obtain parameter estimates.
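
When no closed-form minimiser is available, the empirical risk can be minimised numerically. The following self-contained sketch (an illustration, not a prescribed method) uses `scipy.optimize.minimize_scalar` as the one-dimensional optimiser and, for concreteness, the squared loss of Section 4.2:

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1.2, 0.7, 2.1, 1.5])
loss = lambda xi, a: (xi - a) ** 2                     # any loss L(x, a) can be plugged in here

risk = lambda a: np.mean([loss(xi, a) for xi in x])    # empirical risk R(Q_hat_n, a)
a_star = minimize_scalar(risk).x                       # numerical risk minimiser a*
print(a_star, risk(a_star))                            # optimal prediction and minimum risk
```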

Equivalent loss functions

Scaling a loss function by a positive factor \(c > 0\) and/or adding a term \(k(x)\) that depends only on \(x\) generates a family of equivalent loss functions \[ L^{\text{equiv}}(x, a) = c L(x, a) + k(x) \] with associated risk \[ R^{\text{equiv}}(Q, a) = c R(Q, a) + \operatorname{E}_Q(k(x)) \] Equivalent losses yield the same risk minimiser \({\arg \min}_a\, R(Q, a)\) and the same loss minimiser \({\arg \min}_a\, L(x, a)\) for fixed \(x\).
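
A quick numerical check (illustrative, with the arbitrary choices \(c = 3\) and \(k(x) = |x|\), and the squared loss as base loss) that an equivalence transformation leaves the risk minimiser unchanged:

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1.2, 0.7, 2.1, 1.5])
risk = lambda a: np.mean((x - a) ** 2)                      # R(Q_hat, a) for the squared loss
risk_equiv = lambda a: 3.0 * risk(a) + np.mean(np.abs(x))   # c R(Q_hat, a) + E(k(x))

# Both risks are minimised at the same prediction a*
print(minimize_scalar(risk).x, minimize_scalar(risk_equiv).x)
```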

4.2 Common loss functions

Squared loss

The squared loss or squared error \[ L_\text{sq}(x,a) = (x-a)^2 \] is one of the most commonly used loss functions. The corresponding risk is the mean squared loss or mean squared error (MSE) \[ R_{\text{sq}}(Q, a) = \operatorname{E}_Q((x-a)^2) \] which is minimised at the mean \(a^{\ast} = \operatorname{E}(Q)\). This follows from \(R_{\text{sq}}(Q, a) = \operatorname{E}_Q(x^2) - 2 a \operatorname{E}_Q(x) + a^2\) and \(dR_{\text{sq}}(Q, a)/da = - 2 \operatorname{E}_Q(x) + 2 a\), which vanishes at \(a = \operatorname{E}_Q(x)\). The minimum risk \(R_{\text{sq}}(Q, a^{\ast}) = \operatorname{Var}(Q)\) equals the variance.
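
A short Monte Carlo check (an illustration with simulated data, not from the text) that the empirical squared-error risk is minimised at the sample mean, with minimum value equal to the sample variance:

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.random.default_rng(0).normal(loc=2.0, scale=1.5, size=10_000)

mse = lambda a: np.mean((x - a) ** 2)       # empirical mean squared error
a_star = minimize_scalar(mse).x
print(a_star, x.mean())                     # minimiser vs. sample mean (agree)
print(mse(a_star), x.var())                 # minimum risk vs. sample variance (agree)
```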

0-1 loss

The 0-1 loss function can be written as \[ L_{\text{01}}(x, a) = \begin{cases} -[x = a] & \text{discrete case} \\ -\delta(x-a) & \text{continuous case} \\ \end{cases} \] employing the indicator function and Dirac delta function, respectively. The corresponding risk assuming \(x \sim Q\) and pdmf \(q(x)\) is \[ R_{\text{01}}(Q, a) = -q(a) \] which is minimised at the mode of the pdmf.
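
In the discrete case the empirical 0-1 risk of a candidate value \(a\) is minus its relative frequency, so the risk minimiser is the sample mode. A minimal sketch with made-up data:

```python
import numpy as np

x = np.array([2, 1, 2, 3, 2, 1, 3, 2])                 # discrete observations
# Empirical 0-1 risk: R(Q_hat, a) = -(relative frequency of a)
risk01 = {a: -np.mean(x == a) for a in np.unique(x)}
a_star = min(risk01, key=risk01.get)
print(a_star)                                          # sample mode (here: 2)
```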

Asymmetric loss

The asymmetric loss can be defined as \[ L_{\text{asym}}(x, a; \tau) = \begin{cases} 2 \tau (x-a) & \text{for $x\geq a$} \\ 2 (1-\tau) (a-x) & \text{for $x < a$} \\ \end{cases} \] and the corresponding risk is minimised at the \(\tau\)-quantile \(x_{\tau}\).

Absolute loss

For \(\tau=1/2\) the asymmetric loss reduces to the absolute loss \[ L_{\text{abs}}(x, a) = | x - a| \] whose corresponding risk is minimised at the median \(x_{1/2}\).
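
Both the quantile claim and the median special case can be verified numerically. The sketch below (illustrative, with simulated data) minimises the empirical asymmetric risk for several values of \(\tau\), including the absolute-loss case \(\tau = 1/2\), and compares the minimiser with the empirical quantile:

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.random.default_rng(1).normal(size=50_000)

def asym_risk(a, tau):
    # Empirical risk of the asymmetric loss L_asym(x, a; tau)
    return np.mean(np.where(x >= a, 2 * tau * (x - a), 2 * (1 - tau) * (a - x)))

for tau in (0.25, 0.5, 0.9):   # tau = 0.5 corresponds to the absolute loss
    a_star = minimize_scalar(lambda a: asym_risk(a, tau), bounds=(-5, 5), method="bounded").x
    print(tau, round(a_star, 3), round(np.quantile(x, tau), 3))   # minimiser vs. empirical quantile
```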

4.3 Scoring rules

Proper scoring rules

A scoring rule \(S(x, P)\) is a special type of loss function1 that assesses the probabilistic forecast \(P\) by assigning a numerical score based on \(P\) and the observed outcome \(x\).

The risk of \(P\) under \(Q\) is the expected score \[ R(Q, P) = \operatorname{E}_{Q}\left(S(x, P)\right) \]

For a proper scoring rule, the risk \(R(Q, P)\) is smallest when the quoted model \(P\) matches the true model \(Q\). The minimal risk, achieved for \(P=Q\), leads to the properness inequality \[ R(Q, P) \geq R(Q, Q) \] For a strictly proper scoring rule, the minimum risk is realised only for the true model, so equality holds exclusively for \(P = Q\).

Score entropy

The minimum risk associated with a proper scoring rule is called the score entropy \(R(Q) = R(Q,Q)\). With it the properness inequality becomes \[ R(Q, P) \geq R(Q) \]

For a proper scoring rule, the score entropy \(R(Q)\) is concave in \(Q\). For a strictly proper scoring rule, the score entropy \(R(Q)\) is strictly concave. This means that \[ R( Q_{\lambda}) \geq (1-\lambda) R(Q_0) + \lambda R(Q_1) \] for the mixture \(Q_{\lambda}=(1-\lambda) Q_0 + \lambda Q_1\) with \(0 < \lambda < 1\) and \(Q_0 \neq Q_1\) (for strict concavity replace \(\geq\) by \(>\)).

This follows from the fact that the risk \(R(Q, P)\) is mixture-preserving in \(Q\). Hence, \(R(Q_{\lambda}, Q_{\lambda} ) = (1-\lambda) R(Q_0, Q_{\lambda} ) + \lambda R(Q_1, Q_{\lambda} )\). Applying properness \(R(Q_i, Q_{\lambda} ) \geq R(Q_i)\) with \(i \in \{0,1\}\) yields concavity.

Score divergence

The score divergence between the distributions \(Q\) and \(P\) equals the excess risk given by \[ D(Q, P) = R(Q, P) - R(Q) \] For a proper scoring rule the divergence is always non-negative, \(D(Q, P) \geq 0\), with \(D(Q, P)=0\) if \(P=Q\). For a strictly proper scoring rule \(D(Q, P)=0\) only when \(P=Q\).

The score divergence \(D(Q, P)\) is convex in \(Q\) for fixed \(P\) for a proper scoring rule. It is strictly convex in \(Q\) for a strictly proper scoring rule. The convexity of \(D(Q,P)\) in \(Q\) derives from the concavity of \(R(Q)\) and the fact that \(R(Q, P)\) is mixture-preserving in \(Q\).

Equivalent scoring rules

Equivalent scoring rules \[ S^{\text{equiv}}(x, P) = c S(x, P) + k(x) \] have associated equivalent score divergences \[ D^{\text{equiv}}(Q, P) = R^{\text{equiv}}(Q, P) - R^{\text{equiv}}(Q) = c D(Q, P) \] Thus, (strictly) proper scoring rules remain (strictly) proper under equivalence transformations. Furthermore, for \(c=1\) equivalent scoring rules are strongly equivalent as their divergences are identical.

Correspondence with Bregman divergences

Proper scoring rules and their associated divergences correspond to Bregman divergences, which are well known in optimisation and machine learning.

Specifically, the negative score entropy acts as the convex potential \(\Phi(Q) = -R(Q)\) generating the score (Bregman) divergence \(D(Q,P)\) via \[ D(Q,P) = \Phi(Q) -\Phi(P) - \langle \nabla \Phi(P), Q-P \rangle \]
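
A minimal numerical check of this correspondence for categorical distributions, taking as convex potential the negative Shannon entropy \(\Phi(Q) = \sum_y q_y \log q_y\) (which, anticipating the logarithmic scoring rule of Section 4.4.1, generates the Kullback-Leibler divergence); the specific pmfs are arbitrary illustrative choices:

```python
import numpy as np

q = np.array([0.2, 0.5, 0.3])                  # true pmf Q
p = np.array([0.4, 0.4, 0.2])                  # quoted pmf P

phi = lambda r: np.sum(r * np.log(r))          # convex potential: negative entropy
grad_phi = lambda r: np.log(r) + 1.0           # gradient of the potential

d_bregman = phi(q) - phi(p) - grad_phi(p) @ (q - p)   # Bregman divergence generated by phi
d_kl = np.sum(q * np.log(q / p))                      # Kullback-Leibler divergence
print(d_bregman, d_kl)                                # agree up to rounding
```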

Further properties

Proper scoring rules are highly useful because they enable identification of an underlying distribution and its parameters via risk minimisation or minimisation of the associated divergences. These approaches generalise conventional likelihood and Bayesian methods that are based on the logarithmic scoring rule (Section 4.4.1).

Proper scoring rules also enjoy several additional properties not mentioned above. For example, various decompositions exist for their risk, and the score divergence satisfies a generalised Pythagorean theorem.

4.4 Common scoring rules

Logarithmic scoring rule

The most important scoring rule is the logarithmic scoring rule or log-loss \[ S_{\text{log}}(x, P) = - \log p(x) \]

The risk of \(P\) under \(Q\) based on the log-loss is the mean log-loss \[ R_{\text{log}}(Q, P) = - \operatorname{E}_{Q} \log p(x) = H(Q, P) \] which is uniquely minimised for \(P=Q\). Thus, the log-loss is strictly proper. Moreover, the log-loss is the only local strictly proper scoring rule: it depends solely on the value of the pdmf at the observed outcome \(x\), and not on any other features of the distribution \(P\).

The mean log-loss is also known as cross-entropy denoted by \(H(Q, P)\).

The minimum risk (score entropy) equals the information entropy denoted by \(H(Q):\) \[ R_{\text{log}}(Q) = -\operatorname{E}_{Q} \log q(x) = H(Q) \] The properness inequality \(H(Q, P) \geq H(Q)\), with equality exclusively for \(P=Q\) and relating cross-entropy and information entropy, is known as Gibbs’ inequality.
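
Gibbs' inequality is easy to check numerically. The sketch below (illustrative, with randomly drawn categorical distributions) compares cross-entropy and entropy and also reports their difference, the KL divergence introduced next:

```python
import numpy as np

rng = np.random.default_rng(2)
q = rng.dirichlet(np.ones(5))                 # true pmf Q
p = rng.dirichlet(np.ones(5))                 # quoted pmf P

cross_entropy = -np.sum(q * np.log(p))        # H(Q, P), the mean log-loss under Q
entropy = -np.sum(q * np.log(q))              # H(Q)
print(cross_entropy >= entropy)               # Gibbs' inequality: True
print(cross_entropy - entropy)                # excess risk = KL divergence (see below)
```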

The score divergence induced by the log-loss is the Kullback-Leibler (KL) divergence \[ \begin{split} D_{\text{KL}}(Q,P) &= R_{\text{log}}(Q,P) - R_{\text{log}}(Q) \\ &= H(Q, P) - H(Q) \\ &= \operatorname{E}_{Q} \log\left(\frac{q(x)}{p(x)}\right)\\ \end{split} \] The KL divergence obeys the data processing inequality, i.e. applying a transformation to the underlying random variables cannot increase the KL divergence \(D_{\text{KL}}(Q,P)\) between \(Q\) and \(P\). This property also holds for all \(f\)-divergences (of which the KL divergence is a principal example), but is notably not satisfied by divergences of other proper scoring rules (and thus other Bregman divergences).

Furthermore, the KL divergence is the only divergence induced by proper scoring rules (and thus the only Bregman divergence), as well as the only \(f\)-divergence, that is invariant against general coordinate transformations. Coordinate transformations can be viewed as a special case of data processing, and for \(D_{\text{KL}}(Q,P)\) the data-processing inequality under general invertible transformations becomes an identity.

The empirical risk of a distribution family \(P(\theta)\) based on the log-loss is proportional to the log-likelihood function \[ \begin{split} R_{\text{log}}(\hat{Q}_n, P(\theta)) &= - \frac{1}{n} \sum_{i=1}^n \log p(x_i | \theta) \\ &= - \frac{1}{n} \ell_n(\theta)\\ \end{split} \] Minimising the empirical risk is thus equivalent to maximising the log-likelihood \(\ell_n(\theta)\).

Similarly, minimising the KL divergence \(D_{\text{KL}}(\hat{Q}_n,P(\theta))\) with regard to \(\theta\) is equivalent to minimising the empirical risk and hence to maximum likelihood.
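
As an illustration (not from the text), the equivalence between minimising the empirical log-loss risk and maximum likelihood can be checked for a normal location model with known variance, where the maximum likelihood estimate is the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

x = np.random.default_rng(3).normal(loc=2.0, scale=1.0, size=500)

# Empirical log-loss risk of the model N(theta, 1): minus the average log-likelihood
risk_log = lambda theta: -np.mean(norm.logpdf(x, loc=theta, scale=1.0))
theta_hat = minimize_scalar(risk_log).x
print(theta_hat, x.mean())      # empirical risk minimiser vs. MLE (the sample mean)
```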

Brier or quadratic scoring rule

The Brier scoring rule, also known as quadratic scoring rule, evaluates a probabilistic categorical forecast \(P\) with corresponding class probabilities \(p_1, \ldots, p_K\) given a realisation \(\boldsymbol x\) from the categorical distribution \(Q\) with class probabilities \(q_1, \ldots, q_K\). It can be written as \[ \begin{split} S_{\text{Brier}}(\boldsymbol x, P) &= \sum_{y=1}^K \left(x_y -p_y\right)^2 \\ &= 1 - 2 \sum_{y=1}^K x_y p_y + \sum_{y=1}^K p_y^2\\ &= 1 - 2 p_k + \sum_{y=1}^K p_y^2\\ \end{split} \] The indicator vector \(\boldsymbol x= (x_1, \ldots, x_K)^T = (0, 0, \ldots, 1, \ldots, 0)^T\) contains zeros everywhere except for a single element \(x_k=1\). Unlike the log-loss, the Brier score is not local as the pmf for \(P\) is evaluated across all \(K\) classes, not just at the realised class \(k\).

The corresponding risk is \[ \begin{split} R_{\text{Brier}}(Q,P) &= \operatorname{E}_Q(S(\boldsymbol x, P)) \\ &= 1 -2 \sum_{y=1}^K q_y p_y +\sum_{y=1}^K p_y^2\\ \end{split} \] which is uniquely minimised for \(P=Q\). Thus, the Brier score is strictly proper.

The minimum risk (score entropy) is \[ R_{\text{Brier}}(Q) = 1 - \sum_{y=1}^K q_y^2 \]

The divergence induced by the Brier score is the squared Euclidean distance between the two pmfs: \[ \begin{split} D_{\text{Brier}}(Q, P) &= R_{\text{Brier}}(Q, P) - R_{\text{Brier}}(Q) \\ & = \sum_{y=1}^K \left(q_y - p_y\right)^2\\ \end{split} \]
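
A numerical check (illustrative, with arbitrary pmfs) of the Brier risk, score entropy, and divergence formulas for a pair of categorical distributions:

```python
import numpy as np

q = np.array([0.2, 0.5, 0.3])                      # true class probabilities
p = np.array([0.4, 0.4, 0.2])                      # forecast class probabilities

risk_brier = lambda qq, pp: 1 - 2 * qq @ pp + pp @ pp   # R_Brier(Q, P)
entropy_brier = risk_brier(q, q)                        # R_Brier(Q) = 1 - sum(q^2)
d_brier = risk_brier(q, p) - entropy_brier              # score divergence
print(d_brier, np.sum((q - p) ** 2))                    # equals the squared Euclidean distance
```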

Proper but not strictly proper scoring rules

An example of a proper, but not strictly proper, scoring rule is the squared error relative to the mean of the quoted model \(P\): \[ S_{\text{sq}}(x, P) = (x- \operatorname{E}(P))^2 \]

The corresponding risk is \[ \begin{split} R_{\text{sq}}(Q, P) &= \operatorname{E}_Q\left( (x- \operatorname{E}(P))^2 \right)\\ & = (\operatorname{E}(Q)-\operatorname{E}(P))^2 + \operatorname{Var}(Q)\\ \end{split} \] which is minimised at \(P=Q\) but also at any distribution \(P\) with the same mean as \(Q\).

The minimum risk (score entropy) is the variance \[ R_{\text{sq}}(Q) = \operatorname{Var}(Q) \]

The score divergence is the squared distance between the two means \[ \begin{split} D_{\text{sq}}(Q, P) &= R_{\text{sq}}(Q, P) - R_{\text{sq}}(Q) \\ &= (\operatorname{E}(Q)-\operatorname{E}(P))^2\\ \end{split} \] which vanishes at \(P=Q\) but also at any \(P\) with \(\operatorname{E}(P)=\operatorname{E}(Q)\).
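
A Monte Carlo illustration (with arbitrary simulated data) that this divergence depends on the quoted model \(P\) only through its mean, so it vanishes for any \(P\) matching the mean of \(Q\):

```python
import numpy as np

x = np.random.default_rng(4).normal(loc=1.0, scale=1.0, size=200_000)   # x ~ Q = N(1, 1)

risk_sq = lambda e_p: np.mean((x - e_p) ** 2)   # R_sq(Q, P) depends on P only via E(P)
entropy_sq = x.var()                            # R_sq(Q) = Var(Q)

print(risk_sq(2.0) - entropy_sq)   # divergence for E(P) = 2: approx (E(Q) - E(P))^2 = 1
print(risk_sq(1.0) - entropy_sq)   # divergence for E(P) = E(Q) = 1: approx 0, whatever Var(P) is
```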

The Dawid-Sebastiani scoring rule is a related scoring rule given by \[ S_{\text{DS}}\left(x, P\right) = \log \operatorname{Var}(P) + \frac{(x-\operatorname{E}(P))^2}{\operatorname{Var}(P)} \] It is equivalent to the log-loss applied to a normal model \(P\).
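
This equivalence can be checked numerically: \(S_{\text{DS}}(x, P) = 2\, S_{\text{log}}(x, P) - \log 2\pi\) for a normal \(P\), i.e. an equivalence transformation with \(c = 2\) and constant \(k(x) = -\log 2\pi\) (these constants are spelled out here for illustration; they are not given in the text):

```python
import numpy as np
from scipy.stats import norm

x = 1.3                                  # observed outcome
mu, var = 0.5, 2.0                       # E(P) and Var(P) of the quoted normal model

s_ds = np.log(var) + (x - mu) ** 2 / var                     # Dawid-Sebastiani score
s_log = -norm.logpdf(x, loc=mu, scale=np.sqrt(var))          # log-loss of the normal model
print(s_ds, 2 * s_log - np.log(2 * np.pi))                   # identical
```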

The corresponding risk is \[ \begin{split} R_{\text{DS}}(Q, P) &= \log \operatorname{Var}(P) + \frac{(\operatorname{E}(Q)-\operatorname{E}(P))^2}{\operatorname{Var}(P)} + \frac{\operatorname{Var}(Q)}{\operatorname{Var}(P)}\\ \end{split} \] which is minimised at \(P=Q\) but also at any distribution \(P\) with \(\operatorname{E}(P)=\operatorname{E}(Q)\) and \(\operatorname{Var}(P)=\operatorname{Var}(Q)\).

The minimum risk (score entropy) is \[ R_{\text{DS}}(Q) = \log \operatorname{Var}(Q) +1 \]

The score divergence is \[ \begin{split} D_{\text{DS}}(Q, P) &= R_{\text{DS}}(Q, P) - R_{\text{DS}}(Q) \\ &= \frac{(\operatorname{E}(Q)-\operatorname{E}(P))^2}{\operatorname{Var}(P)} +\frac{\operatorname{Var}(Q)}{\operatorname{Var}(P)} - \log\left( \frac{\operatorname{Var}(Q)}{\operatorname{Var}(P)} \right) -1 \\ \end{split} \] which vanishes at \(P=Q\) but also at any \(P\) for which \(\operatorname{E}(P)=\operatorname{E}(Q)\) and \(\operatorname{Var}(P)=\operatorname{Var}(Q)\).

Other strictly proper scoring rules

Other useful strictly proper scoring rules include:

  • the continuous ranked probability score (CRPS),
  • the energy score (multivariate CRPS), and
  • the Hyvärinen scoring rule.

See also (Wikipedia): scoring rule.


  1. Treating scoring rules as loss functions implies a negative orientation. However, some authors adopt the opposite convention and treat scoring rules as positively oriented utility functions.↩︎