4  Evaluation

4.1 Loss functions

Loss function

A loss or cost function \(L(x, a)\) evaluates a prediction \(a\), for example a parameter or a probability distribution, on the basis of an observed outcome \(x\), and returns a numerical score.

A loss function measures, informally, the error between \(x\) and \(a\). During optimisation the prediction \(a\) is varied with the aim of minimising this error (hence a loss function has negative orientation, smaller is better).

A utility or reward function is a loss function with a reversed sign (hence it has positive orientation, larger is better).

Risk function

The risk of \(a\) under the distribution \(Q\) for \(x\) is defined as the expected loss \[ R(Q, a) = \operatorname{E}_Q(L(x, a)) \]

The risk is mixture-preserving in \(Q\), meaning that \[ R( Q_{\lambda}, a ) = (1-\lambda) R(Q_0, a) + \lambda R(Q_1, a) \] for the mixture \(Q_{\lambda}=(1-\lambda) Q_0 + \lambda Q_1\) with \(0 < \lambda < 1\) and \(Q_0 \neq Q_1\). This follows from the linearity of expectation.

The risk of \(a\) under the empirical distribution \(\hat{Q}_n\) obtained from observations \(x_1, \ldots, x_n\) is the empirical risk \[ R(\hat{Q}_n, a) = \frac{1}{n} \sum_{i=1}^{n} L(x_i, a) \] where the expectation is replaced by the sample average.
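For concreteness, the following is a minimal sketch in Python with NumPy of the empirical risk as a sample average of losses; the helper `empirical_risk` and the chosen loss are illustrative only.

```python
import numpy as np

def empirical_risk(loss, x, a):
    """Empirical risk R(Q_hat_n, a): average loss of prediction a over the sample x."""
    return np.mean([loss(xi, a) for xi in x])

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=1000)    # observations x_1, ..., x_n

squared_loss = lambda xi, a: (xi - a) ** 2       # L(x, a) = (x - a)^2 (see Section 4.2)
print(empirical_risk(squared_loss, x, a=2.0))    # approximately Var(Q) = 1
```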

Minimising risk

Minimising \(R(Q, a)\) with regard to \(a\) finds optimal predictions
\[ a^{\ast} = \underset{a}{\arg \min}\, R(Q, a) \] with associated minimum risk \(R(Q, a^{\ast})\).

Depending on the choice of underlying loss \(L(x, a)\), minimising the risk provides a very general optimisation-based way to identify features of the distribution \(Q\) and to obtain parameter estimates.

Equivalent loss functions

Multiplying a loss function by a positive factor \(k > 0\) and/or adding a term \(c(x)\) that depends only on the outcome \(x\) generates a family of equivalent loss functions \[ L^{\text{equiv}}(x, a) = k L(x, a) + c(x) \] with associated risk \[ R^{\text{equiv}}(Q, a) = k R(Q, a) + \operatorname{E}_Q(c(x)) \] Equivalent losses yield the same risk minimiser \({\arg \min}_a\, R(Q, a)\) and the same loss minimiser \({\arg \min}_a\, L(x, a)\) for fixed \(x\).
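A brief numeric check (illustrative, assuming NumPy) that an equivalent loss \(k L(x, a) + c(x)\) is minimised at the same \(a\) for fixed \(x\):

```python
import numpy as np

x, k, c = 1.5, 3.0, 7.0                        # fixed observation, scale k > 0, offset c(x)
a_grid = np.linspace(-5, 5, 2001)              # candidate predictions a

loss = (x - a_grid) ** 2                       # L(x, a), here the squared loss
loss_equiv = k * loss + c                      # equivalent loss k L(x, a) + c(x)

# both losses are minimised at the same a (here a = x = 1.5)
print(a_grid[np.argmin(loss)], a_grid[np.argmin(loss_equiv)])
```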

4.2 Common loss functions

Squared loss

The squared loss or squared error \[ L_\text{sq}(x,a) = (x-a)^2 \] is one of the most commonly used loss functions. The corresponding risk is the mean squared loss or mean squared error (MSE) \[ R_{\text{sq}}(Q, a) = \operatorname{E}_Q((x-a)^2) \] which is minimised at the mean \(a^{\ast} = \operatorname{E}(Q)\). This follows from \(R_{\text{sq}}(Q, a) = \operatorname{E}_Q(x^2) - 2 a \operatorname{E}_Q(x) + a^2\) and setting the derivative \(dR_{\text{sq}}(Q, a)/da = - 2 \operatorname{E}_Q(x) + 2 a\) to zero. The minimum risk \(R_{\text{sq}}(Q, a^{\ast}) = \operatorname{Var}(Q)\) equals the variance.
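A short sketch (Python with NumPy and SciPy, illustrative) confirming numerically that the empirical mean squared error is minimised at the sample mean, with minimum equal to the sample variance:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=10_000)

mse = lambda a: np.mean((x - a) ** 2)          # empirical squared-error risk
a_star = minimize_scalar(mse).x                # numerical risk minimiser

print(a_star, x.mean())                        # a* equals the sample mean
print(mse(a_star), x.var())                    # minimum risk equals the sample variance
```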

0-1 loss

The 0-1 loss function can be written as \[ L_{\text{01}}(x, a) = \begin{cases} -[x = a] & \text{discrete case} \\ -\delta(x-a) & \text{continuous case} \\ \end{cases} \] employing the indicator function and Dirac delta function, respectively. The corresponding risk assuming \(x \sim Q\) and pdmf \(q(x)\) is \[ R_{\text{01}}(Q, a) = -q(a) \] which is minimised at the mode of the pdmf.

Asymmetric loss

The asymmetric loss can be defined as \[ L_{\text{asym}}(x, a; \tau) = \begin{cases} 2 \tau (x-a) & \text{for $x\geq a$} \\ 2 (1-\tau) (a-x) & \text{for $x < a$} \\ \end{cases} \] with \(0 < \tau < 1\), and the corresponding risk is minimised at the \(\tau\)-quantile \(x_{\tau}\).

Absolute loss

For \(\tau=1/2\) it reduces to the absolute loss \[ L_{\text{abs}}(x, a) = | x - a| \] whose corresponding risk is minimised at the median \(x_{1/2}\).
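The following sketch (Python with NumPy and SciPy; the helper names are illustrative) checks numerically that minimising the empirical asymmetric risk yields the \(\tau\)-quantile and that the absolute loss yields the median:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)

def asym_loss(x, a, tau):
    """Asymmetric loss: 2*tau*(x-a) for x >= a, 2*(1-tau)*(a-x) for x < a."""
    return np.where(x >= a, 2 * tau * (x - a), 2 * (1 - tau) * (a - x))

tau = 0.9
a_star = minimize_scalar(lambda a: np.mean(asym_loss(x, a, tau))).x
print(a_star, np.quantile(x, tau))             # both approx. the 0.9-quantile (~1.28)

a_med = minimize_scalar(lambda a: np.mean(np.abs(x - a))).x
print(a_med, np.median(x))                     # tau = 1/2: the median
```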

4.3 Scoring rules

Proper scoring rules

A scoring rule \(S(x, P)\) is a special type of loss function1 that assesses the probabilistic forecast \(P\) by assigning a numerical score based on \(P\) and the observed outcome \(x\).

The risk of \(P\) under \(Q\) is the expected score \[ R(Q, P) = \operatorname{E}_{Q}\left(S(x, P)\right) \]

For a proper scoring rule, the risk \(R(Q, P)\) is smallest when the quoted model \(P\) matches the true model \(Q\). The minimal risk, achieved for \(P=Q\), leads to the properness inequality \[ R(Q, P) \geq R(Q, Q) \] For a strictly proper scoring rule, the minimum risk is realised only for the true model, so equality holds exclusively for \(P = Q\).
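As an illustrative numeric check of properness (assuming NumPy and using the logarithmic scoring rule from Section 4.4.1 with a categorical model), no quoted model \(P\) attains a smaller risk than \(P = Q\):

```python
import numpy as np

rng = np.random.default_rng(3)
q = np.array([0.2, 0.3, 0.5])                  # true categorical model Q

def risk_log(q, p):
    """Expected log-loss R(Q, P) = -sum_k q_k log p_k."""
    return -np.sum(q * np.log(p))

ps = rng.dirichlet(np.ones(3), size=10_000)    # random candidate models P on the simplex
risks = np.array([risk_log(q, p) for p in ps])

print(np.all(risks >= risk_log(q, q)))         # True: R(Q, P) >= R(Q, Q)
```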

Proper scoring rules are useful because they enable identification and approximation of data-generating distributions and their parameters via risk minimisation or minimisation of the associated scoring-rule divergences. This generalises conventional statistical approaches based on the logarithmic scoring rule (Section 4.4.1).

Scoring-rule entropy

The minimum risk associated with a proper scoring rule is called the scoring-rule entropy \(R(Q) = R(Q,Q)\). With it the properness inequality becomes \[ R(Q, P) \geq R(Q) \]

For a proper scoring rule, the entropy \(R(Q)\) is concave in \(Q\). For a strictly proper scoring rule, the entropy \(R(Q)\) is strictly concave. This means that \[ R( Q_{\lambda}) \geq (1-\lambda) R(Q_0) + \lambda R(Q_1) \] for the mixture \(Q_{\lambda}=(1-\lambda) Q_0 + \lambda Q_1\) with \(0 < \lambda < 1\) and \(Q_0 \neq Q_1\) (for strict concavity replace \(\geq\) by \(>\)).

This follows from the fact that the risk \(R(Q, P)\) is mixture-preserving in \(Q\). Hence, \(R(Q_{\lambda}) = R(Q_{\lambda}, Q_{\lambda} ) = (1-\lambda) R(Q_0, Q_{\lambda} ) + \lambda R(Q_1, Q_{\lambda} )\). Applying properness \(R(Q_i, Q_{\lambda} ) \geq R(Q_i)\) with \(i \in \{0,1\}\) yields concavity.

Scoring-rule divergence

The scoring-rule divergence between the distributions \(Q\) and \(P\) equals the excess risk given by \[ D(Q, P) = R(Q, P) - R(Q) \] For a proper scoring rule, the divergence is always non-negative, \(D(Q, P) \geq 0\), with \(D(Q, P)=0\) if \(P=Q\). For a strictly proper scoring rule \(D(Q, P)=0\) only when \(P=Q\).

The scoring-rule divergence \(D(Q, P)\) is convex in \(Q\) for fixed \(P\) for a proper scoring rule. It is strictly convex in \(Q\) for a strictly proper scoring rule. The convexity of \(D(Q,P)\) in \(Q\) derives from the concavity of \(R(Q)\) and the fact that \(R(Q, P)\) is mixture-preserving in \(Q\).

Divergences induced by proper scoring rules correspond to Bregman divergences applied to probability distributions, with the convex generator being the negative entropy.

Equivalent scoring rules

Equivalent scoring rules \[ S^{\text{equiv}}(x, P) = k S(x, P) + c(x) \] have associated equivalent divergences \[ D^{\text{equiv}}(Q, P) = R^{\text{equiv}}(Q, P) - R^{\text{equiv}}(Q) = k D(Q, P) \] Thus, (strictly) proper scoring rules remain (strictly) proper under equivalence transformations. Furthermore, for \(k=1\) equivalent scoring rules are strongly equivalent as their divergences are identical.

Further properties

Proper scoring rules also enjoy several additional properties not mentioned above. For example, various decompositions exist for their risk, and the scoring-rule divergence satisfies a generalised Pythagorean theorem.

4.4 Common scoring rules

Logarithmic scoring rule

The most important scoring rule is the logarithmic scoring rule or log-loss \[ S_{\text{log}}(x, P) = - \log p(x) \]

The risk of \(P\) under \(Q\) based on the log-loss is the mean log-loss \[ R_{\text{log}}(Q, P) = - \operatorname{E}_{Q} \log p(x) = H(Q, P) \] which is uniquely minimised for \(P=Q\). Thus, the log-loss is strictly proper. Moreover, the log-loss is the only local strictly proper scoring rule (up to equivalence), as it depends solely on the value of the pdmf at the observed outcome \(x\), and not on any other features of the distribution \(P\).

The mean log-loss is also known as cross-entropy denoted by \(H(Q, P)\).

The minimum risk (scoring-rule entropy) equals the information entropy denoted by \(H(Q):\) \[ R_{\text{log}}(Q) = -\operatorname{E}_{Q} \log q(x) = H(Q) \] The properness inequality \(H(Q, P) \geq H(Q)\), with equality exclusively for \(P=Q\) and relating cross-entropy and information entropy, is known as Gibbs’ inequality.

The divergence induced by the log-loss is the Kullback-Leibler (KL) divergence \[ \begin{split} D_{\text{KL}}(Q,P) &= R_{\text{log}}(Q,P) - R_{\text{log}}(Q) \\ &= H(Q, P) - H(Q) \\ &= \operatorname{E}_{Q} \log\left(\frac{q(x)}{p(x)}\right)\\ \end{split} \]
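A brief sketch (NumPy, illustrative) verifying the identity \(D_{\text{KL}}(Q, P) = H(Q, P) - H(Q)\) for two categorical distributions:

```python
import numpy as np

q = np.array([0.2, 0.3, 0.5])                  # true model Q
p = np.array([0.4, 0.4, 0.2])                  # quoted model P

cross_entropy = -np.sum(q * np.log(p))         # H(Q, P), mean log-loss
entropy = -np.sum(q * np.log(q))               # H(Q), information entropy
kl_direct = np.sum(q * np.log(q / p))          # E_Q log(q(x)/p(x))

print(np.isclose(kl_direct, cross_entropy - entropy))   # True
```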

The KL divergence is invariant under a change of variables. Specifically, for arbitrary one-to-one transformations between \(x\) and \(y\), even for continuous random variables, \(D_{\text{KL}}(Q_x,P_x) = D_{\text{KL}}(Q_y,P_y)\).

More generally, the KL divergence also satisfies the data processing inequality (DPI). Applying a transformation (stochastic or deterministic, possibly coarsening) that produces \(y\) from \(x\) cannot increase the divergence, so that \(D_{\text{KL}}(Q_x,P_x) \geq D_{\text{KL}}(Q_y,P_y)\). Note that a change of variables is a special case of data processing (an invertible transformation); applying the DPI in both directions yields the invariance stated above.

Divergences between distributions satisfying the DPI (and hence being invariant under a change of variables) form the class of \(f\)-divergences. The KL divergence is the only divergence induced by a proper scoring rule (i.e. the only Bregman divergence) that is also an \(f\)-divergence.

The empirical risk of a distribution family \(P(\theta)\) based on the log-loss is proportional to the negative log-likelihood function \[ \begin{split} R_{\text{log}}(\hat{Q}_n, P(\theta)) &= - \frac{1}{n} \sum_{i=1}^n \log p(x_i | \theta) \\ &= - \frac{1}{n} \ell_n(\theta)\\ \end{split} \] Minimising the empirical risk is thus equivalent to maximising the log-likelihood \(\ell_n(\theta)\).

Similarly, minimising the KL divergence \(D_{\text{KL}}(\hat{Q}_n,P(\theta))\) with regard to \(\theta\) is equivalent to minimising the empirical risk and hence to maximum likelihood.
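A minimal sketch (Python with NumPy and SciPy; the normal model with known scale is an illustrative choice) of maximum likelihood as minimisation of the empirical risk under the log-loss:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
x = rng.normal(loc=1.0, scale=2.0, size=500)

def risk_log(theta):
    """Empirical log-loss risk = negative mean log-likelihood, -(1/n) l_n(theta)."""
    return -np.mean(norm.logpdf(x, loc=theta, scale=2.0))   # scale treated as known

theta_hat = minimize_scalar(risk_log).x
print(theta_hat, x.mean())                     # MLE of the normal mean = sample mean
```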

Quadratic or Brier scoring rule

Assume the categorical distributions \(Q=\operatorname{Cat}(\boldsymbol q)\) with class probabilities \(\boldsymbol q=(q_1, \ldots, q_K)^T\) and \(P=\operatorname{Cat}(\boldsymbol p)\) with corresponding class probabilities \(\boldsymbol p= (p_1, \ldots, p_K)^T\) and an indicator vector \(\boldsymbol x= (x_1, \ldots, x_K)^T = (0, 0, \ldots, 1, \ldots, 0)^T\) containing zeros everywhere except for a single element \(x_k=1\).

The quadratic scoring rule, also known as Brier scoring rule, evaluates the categorical forecast \(P\) given a realisation \(\boldsymbol x\) from the categorical distribution \(Q\) using the squared Euclidean distance between \(\boldsymbol x\) and \(\boldsymbol p\): \[ \begin{split} S_{\text{Brier}}(\boldsymbol x, P) & = ||\boldsymbol x-\boldsymbol p||^2 \\ &= (\boldsymbol x-\boldsymbol p)^T(\boldsymbol x-\boldsymbol p) \\ &= 1 - 2 \boldsymbol x^T \boldsymbol p+ \boldsymbol p^T \boldsymbol p\\ &= 1 - 2 p_k + \boldsymbol p^T \boldsymbol p\\ \end{split} \] Unlike the log-loss, the Brier score is not local as the pmf for \(P\) is evaluated across all \(K\) classes, not just at the realised class \(k\).

The corresponding risk is \[ \begin{split} R_{\text{Brier}}(Q,P) &= \operatorname{E}_Q(S(\boldsymbol x, P)) \\ &= 1 -2 \boldsymbol q^T \boldsymbol p+ \boldsymbol p^T \boldsymbol p\\ \end{split} \] which is uniquely minimised for \(P=Q\) with \(\boldsymbol p=\boldsymbol q\). Thus, the Brier score is strictly proper.

The minimum risk (scoring-rule entropy) is \[ R_{\text{Brier}}(Q) = 1 - \boldsymbol q^T \boldsymbol q \]

The divergence induced by the Brier score is the squared Euclidean distance between \(\boldsymbol q\) and \(\boldsymbol p\): \[ \begin{split} D_{\text{Brier}}(Q, P) &= R_{\text{Brier}}(Q, P) - R_{\text{Brier}}(Q) \\ & = (\boldsymbol q-\boldsymbol p)^T(\boldsymbol q-\boldsymbol p)\\ & = ||\boldsymbol q-\boldsymbol p||^2 \\ \end{split} \]
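The following sketch (NumPy, illustrative) computes the Brier risk, entropy and divergence for two categorical distributions and checks that the divergence equals the squared Euclidean distance \(||\boldsymbol q- \boldsymbol p||^2\):

```python
import numpy as np

q = np.array([0.2, 0.3, 0.5])                  # true model Q = Cat(q)
p = np.array([0.4, 0.4, 0.2])                  # quoted model P = Cat(p)

risk = 1 - 2 * q @ p + p @ p                   # R_Brier(Q, P)
entropy = 1 - q @ q                            # R_Brier(Q), scoring-rule entropy
divergence = risk - entropy                    # D_Brier(Q, P)

print(np.isclose(divergence, np.sum((q - p) ** 2)))   # True: squared Euclidean distance
```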

Spherical scoring rule

The cosine similarity between two vectors \(\boldsymbol a\) and \(\boldsymbol b\) is the cosine of the angle \(\phi\) between the two vectors: \[ \operatorname{cos\_sim}(\boldsymbol a, \boldsymbol b) = \cos \phi(\boldsymbol a, \boldsymbol b) = \frac{\boldsymbol a\cdot \boldsymbol b}{||\boldsymbol a|| \, ||\boldsymbol b||} \] with \(\boldsymbol a\cdot \boldsymbol b= \boldsymbol a^T \boldsymbol b\), \(||\boldsymbol a|| = (\boldsymbol a^T \boldsymbol a)^{1/2}\) and \(||\boldsymbol b|| = (\boldsymbol b^T \boldsymbol b)^{1/2}\).

The cosine distance is the complement \[ \operatorname{cos\_dist}(\boldsymbol a, \boldsymbol b) = 1 -\operatorname{cos\_sim}(\boldsymbol a, \boldsymbol b) \]

The spherical scoring rule evaluates the categorical forecast \(P=\operatorname{Cat}(\boldsymbol p)\) given a realisation \(\boldsymbol x\) from the categorical distribution \(Q=\operatorname{Cat}(\boldsymbol q)\) using the negative cosine similarity between the probability vectors \(\boldsymbol x\) and \(\boldsymbol p\): \[ \begin{split} S_{\operatorname{sph}}(\boldsymbol x, P) & = -\operatorname{cos\_sim}(\boldsymbol x, \boldsymbol p)\\ &= -\boldsymbol x^T \boldsymbol p/ ||\boldsymbol p|| \\ &= -p_k / ||\boldsymbol p|| \\ \end{split} \] The spherical score is not local as the pmf for \(P\) is evaluated across all \(K\) classes, not just at the realised class \(k\). Depending on \(\boldsymbol p\), it ranges from a minimum of \(-1\) (angle \(\phi=0\), zero degrees) to a maximum of \(0\) (angle \(\phi=\pi/2\), 90 degrees).

The corresponding risk is \[ \begin{split} R_{\text{sph}}(Q,P) &= \operatorname{E}_Q(S(\boldsymbol x, P)) \\ &= -\boldsymbol q^T \boldsymbol p/ ||\boldsymbol p|| \\ &= -||\boldsymbol q|| \, \operatorname{cos\_sim}(\boldsymbol q, \boldsymbol p) \\ \end{split} \] which is proportional to the negative cosine similarity between \(\boldsymbol q\) and \(\boldsymbol p\). It is uniquely minimised for \(P=Q\) with \(\boldsymbol p=\boldsymbol q\). Therefore, the spherical scoring rule is strictly proper.

The minimum risk (scoring-rule entropy) is \[ R_{\text{sph}}(Q) = -||\boldsymbol q|| \]

The divergence induced by the spherical score is \[ \begin{split} D_{\text{sph}}(Q, P) &= R_{\text{sph}}(Q, P) - R_{\text{sph}}(Q) \\ & = -\boldsymbol q^T \boldsymbol p/ ||\boldsymbol p|| + ||\boldsymbol q|| \\ & = ||\boldsymbol q|| \, \operatorname{cos\_dist}(\boldsymbol q, \boldsymbol p) \\ \end{split} \] and hence is proportional to the cosine distance between \(\boldsymbol q\) and \(\boldsymbol p\).
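A corresponding sketch (NumPy, illustrative) for the spherical score, checking that the divergence equals \(||\boldsymbol q|| \, \operatorname{cos\_dist}(\boldsymbol q, \boldsymbol p)\):

```python
import numpy as np

q = np.array([0.2, 0.3, 0.5])                  # true model Q = Cat(q)
p = np.array([0.4, 0.4, 0.2])                  # quoted model P = Cat(p)

risk = -(q @ p) / np.linalg.norm(p)            # R_sph(Q, P)
entropy = -np.linalg.norm(q)                   # R_sph(Q), scoring-rule entropy
divergence = risk - entropy                    # D_sph(Q, P)

cos_dist = 1 - (q @ p) / (np.linalg.norm(q) * np.linalg.norm(p))
print(np.isclose(divergence, np.linalg.norm(q) * cos_dist))   # True
```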

Proper but not strictly proper scoring rules

An example of a proper, but not strictly proper, scoring rule is the squared error relative to the mean of the quoted model \(P\): \[ S_{\text{sq}}(x, P) = (x- \operatorname{E}(P))^2 \]

The corresponding risk is \[ \begin{split} R_{\text{sq}}(Q, P) &= \operatorname{E}_Q\left( (x- \operatorname{E}(P))^2 \right)\\ & = (\operatorname{E}(Q)-\operatorname{E}(P))^2 + \operatorname{Var}(Q)\\ \end{split} \] which is minimised at \(P=Q\) but also at any distribution \(P\) with the same mean as \(Q\).

The minimum risk (scoring-rule entropy) is the variance \[ R_{\text{sq}}(Q) = \operatorname{Var}(Q) \]

The scoring-rule divergence is the squared distance between the two means \[ \begin{split} D_{\text{sq}}(Q, P) &= R_{\text{sq}}(Q, P) - R_{\text{sq}}(Q) \\ &= (\operatorname{E}(Q)-\operatorname{E}(P))^2\\ \end{split} \] which vanishes at \(P=Q\) but also at any \(P\) with \(\operatorname{E}(P)=\operatorname{E}(Q)\).
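A short numeric illustration (NumPy, with normal models chosen purely as an example) that the squared-error scoring rule cannot distinguish quoted models with the same mean:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)   # x ~ Q with E(Q) = 0, Var(Q) = 1

# S_sq(x, P) = (x - E(P))^2 depends on P only through its mean E(P)
risk_sq = lambda mean_p: np.mean((x - mean_p) ** 2)   # empirical risk

# any P with E(P) = E(Q) attains the minimum risk Var(Q), regardless of its variance
print(risk_sq(mean_p=0.0))    # approx. 1 = Var(Q), e.g. for P = N(0, 1) or P = N(0, 100)
print(risk_sq(mean_p=2.0))    # approx. 5 = (0 - 2)^2 + Var(Q), for any P with E(P) = 2
```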

The Dawid-Sebastiani scoring rule is a related scoring rule given by \[ S_{\text{DS}}\left(x, P\right) = \log \operatorname{Var}(P) + \frac{(x-\operatorname{E}(P))^2}{\operatorname{Var}(P)} \] It is equivalent to the log-loss applied to a normal model \(P\).

The corresponding risk is \[ \begin{split} R_{\text{DS}}(Q, P) &= \log \operatorname{Var}(P) + \frac{(\operatorname{E}(Q)-\operatorname{E}(P))^2}{\operatorname{Var}(P)} + \frac{\operatorname{Var}(Q)}{\operatorname{Var}(P)}\\ \end{split} \] which is minimised at \(P=Q\) but also at any distribution \(P\) with \(\operatorname{E}(P)=\operatorname{E}(Q)\) and \(\operatorname{Var}(P)=\operatorname{Var}(Q)\).

The minimum risk (scoring-rule entropy) is \[ R_{\text{DS}}(Q) = \log \operatorname{Var}(Q) +1 \]

The scoring-rule divergence is \[ \begin{split} D_{\text{DS}}(Q, P) &= R_{\text{DS}}(Q, P) - R_{\text{DS}}(Q) \\ &= \frac{(\operatorname{E}(Q)-\operatorname{E}(P))^2}{\operatorname{Var}(P)} +\frac{\operatorname{Var}(Q)}{\operatorname{Var}(P)} - \log\left( \frac{\operatorname{Var}(Q)}{\operatorname{Var}(P)} \right) -1 \\ \end{split} \] which vanishes at \(P=Q\) but also at any \(P\) for which \(\operatorname{E}(P)=\operatorname{E}(Q)\) and \(\operatorname{Var}(P)=\operatorname{Var}(Q)\).
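A final sketch (NumPy, illustrative) evaluating the Dawid-Sebastiani divergence, which vanishes whenever \(P\) matches both the mean and the variance of \(Q\):

```python
import numpy as np

def d_ds(mean_q, var_q, mean_p, var_p):
    """Dawid-Sebastiani divergence D_DS(Q, P) in terms of means and variances."""
    return ((mean_q - mean_p) ** 2 / var_p
            + var_q / var_p - np.log(var_q / var_p) - 1)

print(d_ds(0.0, 1.0, 0.0, 1.0))    # 0: P matches the mean and variance of Q
print(d_ds(0.0, 1.0, 0.5, 2.0))    # > 0: P differs in mean and variance
```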

Other strictly proper scoring rules

Other useful strictly proper scoring rules include:

  • the continuous ranked probability score (CRPS),
  • the energy score (multivariate CRPS), and
  • the Hyvärinen scoring rule.

See also (Wikipedia): scoring rule.


  1. Treating scoring rules as loss functions implies a negative orientation. However, some authors adopt the opposite convention and treat scoring rules as positively oriented utility functions.↩︎