4  Risk and divergence

This chapter introduces two measures of risk and divergence: the mean log-loss, also known as cross-entropy, and the Kullback-Leibler (KL) divergence, also known as relative entropy.

4.1 Risk

Mean log-loss or cross-entropy

The risk is defined as the expected loss. For the logarithmic scoring rule \(S(x, P) = -\log p(x)\) the risk of \(P\) under \(Q\) is the mean log-loss \[ H(Q,P) = - \operatorname{E}_Q \log p(x) \] which is also known as cross-entropy (see Note 4.1).

The risk is a functional of two distributions \(Q\) and \(P\), where \(Q\) represents the data-generating process (note the expectation \(\operatorname{E}_Q\) with regard to \(Q\)) and the distribution \(P\) is the model that is evaluated on the observations. Hence, the two arguments are not interchangeable and \(H(Q, P)\) is not symmetric with regard to \(Q\) and \(P\).

The risk, like entropy, is only defined up to an affine transformation with a positive scaling factor, yielding the equivalence class \[ H(Q, P)^{\text{equiv}}= - k \operatorname{E}_Q\left(\log p(x)\right) + c \] with scale factor \(k > 0\) and offset \(c\). For fixed \(Q\) all equivalent risks have the same minimiser for \(P\).

Note 4.1: Terminology for cross-entropy

The term “cross-entropy” is used inconsistently in the literature, so we will generally use “mean log-loss” or “risk” instead.

In contemporary usage, “cross-entropy” typically refers to the mean log-loss (a risk to be minimised). However, many authors have also used “cross-entropy” to mean the KL divergence (relative entropy).

Mixture preserving

The risk is mixture preserving in \(Q\), meaning that \[ H( Q_{\lambda}, P ) = (1-\lambda) H(Q_0, P) + \lambda H(Q_1, P) \] for the mixture \(Q_{\lambda}=(1-\lambda) Q_0 + \lambda Q_1\) with \(0 < \lambda < 1\) and \(Q_0 \neq Q_1\). This follows from the linearity of expectation.
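
This is easy to check numerically; below is a minimal sketch in Python (the three-outcome pmfs and the mixture weight are arbitrary toy choices):

```python
import numpy as np

# arbitrary toy pmfs on three outcomes
q0 = np.array([0.7, 0.2, 0.1])
q1 = np.array([0.1, 0.3, 0.6])
p  = np.array([0.3, 0.3, 0.4])

lam = 0.25
q_lam = (1 - lam) * q0 + lam * q1                      # the mixture Q_lambda

lhs = -np.sum(q_lam * np.log(p))                       # H(Q_lambda, P)
rhs = -(1 - lam) * np.sum(q0 * np.log(p)) - lam * np.sum(q1 * np.log(p))
print(np.isclose(lhs, rhs))                            # True: mixture preserving in Q
```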

Hence, unlike entropy \(H(Q)\), which is strictly concave, the mean log-loss \(H(Q, P)\) is linear in \(Q\) and therefore not strictly concave.

Properness inequality

By construction, for \(P=Q\) the mean log-loss reduces to the information entropy \(H(Q) = H(Q,Q)\). Moreover, \(H(Q, P)\) is uniquely minimised for \(P=Q\) so that the mean log-loss also satisfies the properness inequality \[ H(Q, P) \geq H(Q) \] with equality only if \(P=Q\). This is also known as Gibbs’ inequality.

That information entropy is the unique lower bound of the mean log-loss can be established using Jensen’s inequality for convex functions (recall that \(-\log x\) is strictly convex): \[ H(Q, P) - H(Q) = \operatorname{E}_Q\left( -\log \frac{p(x)}{q(x)} \right) \geq -\log \operatorname{E}_Q\left( \frac{p(x)}{q(x)} \right) = -\log 1 = 0 \] with equality only if the ratio \(p(x)/q(x)\) is constant, i.e. if \(P = Q\).

The logarithmic scoring rule (log-loss) is called strictly proper because the associated risk (mean log-loss) satisfies the properness inequality.

Properness is what makes the log-scoring rule (and indeed other proper scoring rules) useful, as it allows us to identify the underlying true model by risk minimisation.
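
As a small numerical illustration (a minimal sketch; the Bernoulli setting and the parameter grid are arbitrary choices), minimising the mean log-loss over a grid of candidate Bernoulli models recovers the data-generating parameter:

```python
import numpy as np

theta_true = 0.3                                  # data-generating model Q = Ber(0.3)
q = np.array([1 - theta_true, theta_true])        # pmf of Q over {0, 1}

thetas = np.linspace(0.01, 0.99, 99)              # grid of candidate models P(theta)
risk = [-(q[0] * np.log(1 - t) + q[1] * np.log(t)) for t in thetas]

print(thetas[np.argmin(risk)])                    # 0.3: the risk minimiser is the true model
```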

If a scoring rule is (strictly) proper, then the associated entropy is (strictly) concave (see Section 3.2.2).

Interpretation

The mean log-loss can be interpreted as the expected cost or expected code length when the data are generated according to model \(Q\) (“sender”, “encoder”) but we employ model \(P\) to describe the data (“receiver”, “decoder”).

As a consequence of properness, when data are generated (encoded) under model \(Q\) and described (decoded) using model \(P\) there is always an extra cost, or penalty, for employing the approximating model \(P\) rather than the correct model \(Q\).

Mean log-loss for discrete and continuous distributions

For two discrete distributions \(Q\) and \(P\) with pmfs \(q(x)\) and \(p(x)\), \(x\in \Omega\), the mean log-loss is computed as the weighted sum \[ H(Q, P) = - \sum_{x \in \Omega} q(x) \, \log p(x) \] Similarly, for two continuous distributions \(Q\) and \(P\) with pdfs \(q(x)\) and \(p(x)\) we compute the integral \[ H(Q, P) =- \int_x q(x) \, \log p(x) \, dx \]
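
In code the discrete case is a simple weighted sum; the following minimal Python sketch (with arbitrary toy pmfs) also illustrates that for \(P=Q\) the mean log-loss reduces to the entropy \(H(Q)\) and that the properness inequality holds:

```python
import numpy as np

def mean_log_loss(q, p):
    """H(Q, P) = -sum_x q(x) log p(x) for discrete pmfs q and p (in nats)."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return -np.sum(q * np.log(p))

q = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])

print(mean_log_loss(q, p))                          # mean log-loss H(Q, P)
print(mean_log_loss(q, q))                          # H(Q, Q) = H(Q), the entropy of Q
print(mean_log_loss(q, p) >= mean_log_loss(q, q))   # True: properness inequality
```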

Like differential entropy, the mean log-loss is not invariant under a change of variables for continuous random variables: for a transformation of the random variable from \(x\) to \(y\) in general \(H(Q_y, P_y) \neq H(Q_x, P_x)\).

Examples

Example 4.1 Mean log-loss for two normal distributions:

Assume \(P_1=N(\mu_1,\sigma^2_1)\) and \(P_2=N(\mu_2,\sigma^2_2)\). The mean log-loss is \[ \begin{split} H(P_1, P_2) &= -\operatorname{E}_{P_1} \left( \log p(x |\mu_2, \sigma^2_2) \right)\\ &= \frac{1}{2} \operatorname{E}_{P_1} \left( \log(2\pi\sigma^2_2) + \frac{(x-\mu_2)^2}{\sigma^2_2} \right) \\ &= \frac{1}{2} \left( \frac{(\mu_1 - \mu_2)^2}{ \sigma^2_2 } +\frac{\sigma^2_1}{\sigma^2_2} +\log(2 \pi \sigma^2_2) \right) \\ \end{split} \] using \(\operatorname{E}_{P_1} ((x-\mu_2)^2) = (\mu_1-\mu_2)^2 + \sigma^2_1\).
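
The closed-form result can be checked against direct numerical integration (a minimal sketch using SciPy; the parameter values are arbitrary):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

mu1, s1 = 1.0, 2.0    # P1 = N(1, 4), i.e. sigma1^2 = 4
mu2, s2 = 0.0, 1.0    # P2 = N(0, 1)

# H(P1, P2) = -int q(x) log p(x) dx by quadrature
integrand = lambda x: -norm.pdf(x, mu1, s1) * norm.logpdf(x, mu2, s2)
H_num = quad(integrand, -np.inf, np.inf)[0]

# closed-form expression from Example 4.1
H_formula = 0.5 * ((mu1 - mu2)**2 / s2**2 + s1**2 / s2**2 + np.log(2 * np.pi * s2**2))

print(np.isclose(H_num, H_formula))   # True
```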

Example 4.2 Normal mean log-loss reduces to normal differential entropy:

The mean log-loss for two normal distributions \(P_1\) and \(P_2\) is (Example 4.1) \[ H(P_1,P_2) = \frac{1}{2} \left( \frac{(\mu_1 - \mu_2)^2}{ \sigma^2_2 } +\frac{\sigma^2_1}{\sigma^2_2} +\log(2 \pi \sigma^2_2) \right) \] Setting \(P_1=P_2 = P\) with \(\mu_1 = \mu_2 = \mu\) and \(\sigma^2_1 = \sigma^2_2 = \sigma^2\) the mean log-loss degenerates to \[ H(P) = \frac{1}{2} \left(\log( 2 \pi \sigma^2) +1 \right) \] which is the normal differential entropy (Example 3.9).

Example 4.3 Empirical mean log-loss and log-likelihood:

In the empirical mean log-loss the expectation is taken with regard to the empirical distribution \(\hat{Q}_n\). For a model family \(P(\boldsymbol \theta)\) it is \[ \begin{split} H(\hat{Q}_n, P(\boldsymbol \theta)) & = - \operatorname{E}_{\hat{Q}_n} (\log p(x|\boldsymbol \theta)) \\ & = -\frac{1}{n} \sum_{i=1}^n \log p(x_i | \boldsymbol \theta) \\ & = -\frac{1}{n} \ell_n ({\boldsymbol \theta}) \end{split} \] The empirical risk thus equals the negative log-likelihood \(-\ell_n ({\boldsymbol \theta})\) standardised by the sample size \(n\). Conversely, the log-likelihood \(\ell_n ({\boldsymbol \theta})\) is the negative empirical risk based on the log-loss multiplied by the sample size \(n\).

Therefore, maximising the log-likelihood is equivalent to minimising the empirical risk based on the log-loss.
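
A quick numerical check of this correspondence (a minimal sketch using NumPy and SciPy; the simulated normal sample and the candidate parameters are arbitrary):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=50)      # observed sample (n = 50)

mu, sigma = 0.5, 1.5                             # candidate model parameters theta

loglik = np.sum(norm.logpdf(x, mu, sigma))       # log-likelihood l_n(theta)
H_emp = -np.mean(norm.logpdf(x, mu, sigma))      # empirical mean log-loss H(Q_n, P(theta))

print(np.isclose(H_emp, -loglik / len(x)))       # True
```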

Example 4.4 \(\color{Red} \blacktriangleright\) Mean log-loss for an exponential family:

Assume \(P(\boldsymbol \eta)\) is an exponential family with canonical parameters \(\boldsymbol \eta\), canonical statistics \(\boldsymbol t(x)\) and log-partition function \(a(\boldsymbol \eta)\) with log-pdmf
\(\log p(x|\boldsymbol \eta) = \langle \boldsymbol \eta, \boldsymbol t(x)\rangle + \log h(x) - a(\boldsymbol \eta)\).

Then with \(\operatorname{E}_{ P(\boldsymbol \eta_1)}(\boldsymbol t(x))=\boldsymbol \mu_{\boldsymbol t}(\boldsymbol \eta_1)\) the mean log-loss is \[ \begin{split} H(P(\boldsymbol \eta_1), P(\boldsymbol \eta_2) ) &= -\operatorname{E}_{P(\boldsymbol \eta_1)}( \log p(x|\boldsymbol \eta_2) ) \\ &= a(\boldsymbol \eta_2) -\langle \boldsymbol \eta_2, \boldsymbol \mu_{\boldsymbol t}(\boldsymbol \eta_1) \rangle -\operatorname{E}_{P(\boldsymbol \eta_1)}( \log h(x) ) \end{split} \] For \(\boldsymbol \eta_1 = \boldsymbol \eta_2 = \boldsymbol \eta\) it reduces to \[ H(P(\boldsymbol \eta) ) = a(\boldsymbol \eta) -\langle \boldsymbol \eta, \boldsymbol \mu_{\boldsymbol t}(\boldsymbol \eta) \rangle -\operatorname{E}_{P(\boldsymbol \eta)}( \log h(x) ) \] the entropy of an exponential family (Example 3.10).

4.2 Divergence

Kullback-Leibler divergence

The divergence¹ between the two distributions \(Q\) and \(P\) based on the logarithmic scoring rule is the difference between the mean log-loss (risk) and the information entropy (minimum risk). This yields the Kullback-Leibler (KL) divergence between \(Q\) and \(P\): \[ \begin{split} D_{\text{KL}}(Q,P) &= H(Q, P)-H(Q) \\ & = \operatorname{E}_Q\log\left(\frac{q(x)}{p(x)}\right)\\ \end{split} \] By construction, as a consequence of properness, the KL divergence is always non-negative, \(D_{\text{KL}}(Q, P) \geq 0\). Furthermore, \(D_{\text{KL}}(Q, P) = 0\) only for \(P=Q\) (strict properness).

The use of the term “divergence” rather than “distance” is a reminder that \(Q\) and \(P\) are not interchangeable.

Using the KL divergence, we can write the mean log-loss as \[ \begin{split} H(Q, P) &= \underbrace{D_{\text{KL}}(Q,P)}_{\geq 0} + H(Q) \\ & \geq H(Q) \\ \end{split} \] which also includes the properness inequality.

The divergence is only defined up to a positive scaling factor \(k\) that effectively determines the base of the logarithm: \[ D_{\text{KL}}(Q,P)^{\text{equiv}}= H(Q, P)^{\text{equiv}}-H(Q)^{\text{equiv}} = k \operatorname{E}_Q\log\left(\frac{q(x)}{p(x)}\right) \]

Note 4.2: Terminology for KL divergence

Various notations exist for the KL divergence; we use \(D_{\text{KL}}(Q, P)\), but \(\text{KL}(Q || P)\) and \(I^{KL}(Q; P)\) are also common.

The KL divergence is also called KL information, KL information number or discrimination information. Other names include information divergence or \(\boldsymbol I\)-divergence. Some authors call \(2 D_{\text{KL}}(Q, P) = D(Q, P)\) the deviance of \(P\) from \(Q\).

KL divergence is frequently termed relative entropy, but this label is ambiguous as some authors use it for the negative of the KL divergence (i.e. for information entropy with a prior measure, see Section 6.1). Shannon (1948) also used the term “relative entropy” to mean entropy standardised by its maximum.

Convexity

By construction, \(D_{\text{KL}}(Q,P)\) is strictly convex in \(Q\) as \(H(Q)\) is strictly concave in \(Q\) and \(H(Q, P)\) is mixture-preserving.

Interpretation

The KL divergence \(D_{\text{KL}}(Q, P)\) can be interpreted as the excess risk or additional cost incurred if model \(P\) is used to describe data that actually follow \(Q\). If \(Q\) and \(P\) are identical there is no extra cost and \(D_{\text{KL}}(Q,P)=0\). However, if they differ then there is an additional cost and \(D_{\text{KL}}(Q,P)> 0\).

Invariance under a change of variables

A crucial property of KL divergence is that it is invariant with respect to reparametrisation of the sample space. Specifically, \[ D_{\text{KL}}(Q_y, P_y) = D_{\text{KL}}(Q_x, P_x) \] under a general invertible transformation of the random variable from \(x\) to \(y\).

This is a remarkable property as it holds not just for discrete but also for continuous random variables. For comparison, recall that neither entropy nor the mean log-loss is invariant under a change of variables for continuous random variables.

In the definition of KL divergence the expectation is taken over a ratio of two densities. The invariance is a consequence of the fact that a variable transformation changes both densities by the same Jacobian determinant, and these two identical factors cancel out in the ratio.²
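
The invariance can also be verified numerically, for instance under the invertible map \(y = \exp(x)\), which turns a normal distribution into a log-normal one (a minimal sketch using SciPy; the parameter values are arbitrary):

```python
import numpy as np
from scipy.stats import norm, lognorm
from scipy.integrate import quad

mu1, s1 = 0.0, 1.0   # Q_x = N(0, 1)
mu2, s2 = 1.0, 1.5   # P_x = N(1, 1.5^2)

# KL divergence on the x scale: integral of q(x) (log q(x) - log p(x))
kl_x = quad(lambda x: norm.pdf(x, mu1, s1) *
                      (norm.logpdf(x, mu1, s1) - norm.logpdf(x, mu2, s2)),
            -np.inf, np.inf)[0]

# y = exp(x) turns N(mu, s^2) into a log-normal with shape s and scale exp(mu)
kl_y = quad(lambda y: lognorm.pdf(y, s1, scale=np.exp(mu1)) *
                      (lognorm.logpdf(y, s1, scale=np.exp(mu1)) -
                       lognorm.logpdf(y, s2, scale=np.exp(mu2))),
            0.0, np.inf)[0]

print(np.isclose(kl_x, kl_y))   # True: the KL divergence is unchanged
```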

Data-processing inequality

More generally, the KL divergence also satisfies the data-processing inequality. This states that KL divergence cannot increase under a data-processing map (stochastic or deterministic, possibly coarsening) that produces \(y\) from \(x\), so that \[ D_{\text{KL}}(Q_y,P_y) \leq D_{\text{KL}}(Q_x,P_x) \] For a lossless transformation, such as an invertible change of variables, the inequality becomes an identity.
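
A simple numerical illustration of the data-processing inequality is to coarsen a categorical sample space by merging categories (a minimal sketch; the pmfs and the grouping are arbitrary choices):

```python
import numpy as np

def kl(q, p):
    """D_KL(Q, P) = sum_x q(x) log(q(x)/p(x)) for discrete pmfs."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return np.sum(q * np.log(q / p))

# pmfs over four categories
q = np.array([0.4, 0.3, 0.2, 0.1])
p = np.array([0.1, 0.2, 0.3, 0.4])

# deterministic coarsening y = T(x): merge categories {1, 2} and {3, 4}
q_y = np.array([q[0] + q[1], q[2] + q[3]])
p_y = np.array([p[0] + p[1], p[2] + p[3]])

print(kl(q_y, p_y) <= kl(q, p))   # True: coarsening cannot increase the divergence
print(kl(q, p), kl(q_y, p_y))     # approx 0.456 vs 0.339
```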

Divergences between distributions satisfying the data-processing inequality (and hence being invariant under a change of variables) form the class of \(f\)-divergences. The KL divergence is the only divergence induced by a proper scoring rule (i.e. the only Bregman divergence) that is also an \(f\)-divergence.

Further properties

The KL divergence and the mean log-loss inherit a number of further useful properties from proper scoring rules, which we will not cover in this text. For example, there are various decompositions of the risk, and the divergence satisfies a generalised Pythagorean theorem.

In summary, the KL divergence (and the underlying log-loss) stands out among divergences between distributions in that it combines many unique and desirable properties in a single quantity. It is therefore not surprising that it plays a central role in statistics and machine learning.

Examples

Example 4.5 KL divergence between two Bernoulli distributions \(B_1 = \operatorname{Ber}(\theta_1)\) and \(B_2 = \operatorname{Ber}(\theta_2)\):

The “success” probabilities for the two distributions are \(\theta_1\) and \(\theta_2\), respectively, and the complementary “failure” probabilities are \(1-\theta_1\) and \(1-\theta_2\). With this we get for the KL divergence \[ D_{\text{KL}}(B_1, B_2)=\theta_1 \log\left( \frac{\theta_1}{\theta_2}\right) + (1-\theta_1) \log\left(\frac{1-\theta_1}{1-\theta_2}\right) \]
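
A short numerical check (a minimal sketch; the success probabilities are arbitrary), which also illustrates that the KL divergence is not symmetric in its two arguments:

```python
import numpy as np

def kl_bernoulli(t1, t2):
    """D_KL(Ber(t1), Ber(t2)) in nats."""
    return t1 * np.log(t1 / t2) + (1 - t1) * np.log((1 - t1) / (1 - t2))

print(kl_bernoulli(0.2, 0.5))   # approx 0.193
print(kl_bernoulli(0.5, 0.2))   # approx 0.223 -- note the asymmetry in the arguments
```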

Example 4.6 KL divergence between two categorical distributions with \(K\) classes:

With \(Q=\operatorname{Cat}(\boldsymbol q)\) and \(P=\operatorname{Cat}(\boldsymbol p)\) and corresponding probabilities \(q_1,\dots,q_K\) and \(p_1,\dots,p_K\) satisfying \(\sum_{i=1}^K q_i = 1\) and \(\sum_{i=1}^K p_i =1\) we get:

\[\begin{equation*} D_{\text{KL}}(Q, P)=\sum_{i=1}^K q_i\log\left(\frac{q_i}{p_i}\right) \end{equation*}\]

To be explicit that there are only \(K-1\) parameters in a categorical distribution we can also write \[\begin{equation*} D_{\text{KL}}(Q, P)=\sum_{i=1}^{K-1} q_i\log\left(\frac{q_i}{p_i}\right) + q_K\log\left(\frac{q_K}{p_K}\right) \end{equation*}\] with \(q_K=\left(1- \sum_{i=1}^{K-1} q_i\right)\) and \(p_K=\left(1- \sum_{i=1}^{K-1} p_i\right)\).
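
For practical computation this is available in standard software; for example, SciPy offers scipy.special.rel_entr and scipy.stats.entropy (a minimal sketch with arbitrary pmfs):

```python
import numpy as np
from scipy.special import rel_entr
from scipy.stats import entropy

q = np.array([0.5, 0.3, 0.2])
p = np.array([0.2, 0.5, 0.3])

print(np.sum(rel_entr(q, p)))   # elementwise q_i log(q_i / p_i), summed
print(entropy(q, p))            # same value via scipy.stats.entropy (in nats)
```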

Example 4.7 KL divergence between two univariate normals with different means and common variance:

Assume \(P_1=N(\mu_1,\sigma^2)\) and \(P_2=N(\mu_2,\sigma^2)\). Setting \(\sigma^2_1 =\sigma^2_2 = \sigma^2\) in the more general case of the KL divergence between two normals (Example 4.9) yields \[D_{\text{KL}}(P_1, P_2 )=\frac{1}{2\sigma^2} (\mu_1-\mu_2)^2 \] which, apart from a scale factor, is the squared Euclidean distance or squared loss between the two means \(\mu_1\) and \(\mu_2\). Note that in this case the KL divergence is symmetric with regard to the two mean parameters.

Example 4.8 KL divergence between two univariate normals with common mean and different variances:

Assume \(P_1=N(\mu,\sigma^2_1)\) and \(P_2=N(\mu,\sigma^2_2)\). Setting \(\mu_1 = \mu_2 = \mu\) in the more general case of the KL divergence between two normals (Example 4.9) yields

\[ D_{\text{KL}}(P_1,P_2) = \frac{1}{2} \left( \frac{\sigma^2_1}{\sigma^2_2} -\log\left(\frac{\sigma^2_1}{\sigma^2_2}\right)-1 \right) \] This is a convex function of the ratio \(\sigma_1^2/\sigma^2_2\) of the two variances. Apart from the scale factor this is known as Stein’s loss between the two variances \(\sigma^2_1\) and \(\sigma^2_2\).

Example 4.9 KL divergence between two univariate normals with different means and variances:

Assume \(P_1=N(\mu_1,\sigma^2_1)\) and \(P_2=N(\mu_2,\sigma^2_2)\). Then with Example 3.9 (entropy of normal) and Example 4.1 (mean log-loss between two normals) we get \[ \begin{split} D_{\text{KL}}(P_1,P_2) &= H(P_1, P_2) - H(P_1) \\ &= \frac{1}{2} \left( \frac{(\mu_1-\mu_2)^2}{\sigma^2_2} + \frac{\sigma^2_1}{\sigma^2_2} -\log\left(\frac{\sigma^2_1}{\sigma^2_2}\right)-1 \right) \\ \end{split} \]

For the two special cases of equal variances (\(\sigma^2_1 =\sigma^2_2 = \sigma^2\)) and equal means \((\mu_1 = \mu_2 = \mu\)) see Example 4.7 and Example 4.8.

Example 4.10 \(\color{Red} \blacktriangleright\) KL divergence between two members of an exponential family:

Assume \(P(\boldsymbol \eta)\) is an exponential family with canonical parameters \(\boldsymbol \eta\), canonical statistics \(\boldsymbol t(x)\) and log-partition function \(a(\boldsymbol \eta)\) with log-pdmf
\(\log p(x|\boldsymbol \eta) = \langle \boldsymbol \eta, \boldsymbol t(x)\rangle + \log h(x) - a(\boldsymbol \eta)\).

Then with \(\operatorname{E}_{ P(\boldsymbol \eta_1)}(\boldsymbol t(x))=\boldsymbol \mu_{\boldsymbol t}(\boldsymbol \eta_1)\) the KL divergence between two distributions \(P(\boldsymbol \eta_1)\) and \(P(\boldsymbol \eta_2)\) in this family is \[ \begin{split} D_{\text{KL}}(P(\boldsymbol \eta_1), P(\boldsymbol \eta_2) ) &= \operatorname{E}_{P(\boldsymbol \eta_1)}( \log p(x|\boldsymbol \eta_1) - \log p(x|\boldsymbol \eta_2)) \\ &= \langle \boldsymbol \eta_1-\boldsymbol \eta_2, \boldsymbol \mu_{\boldsymbol t}(\boldsymbol \eta_1)\rangle - \left( a(\boldsymbol \eta_1) - a(\boldsymbol \eta_2) \right) \end{split} \]

This can also be obtained by computing \[ D_{\text{KL}}(P(\boldsymbol \eta_1), P(\boldsymbol \eta_2) ) = H(P(\boldsymbol \eta_1), P(\boldsymbol \eta_2) ) - H(P(\boldsymbol \eta_1) ) \] using the results from Example 3.10 and Example 4.4.
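
As a concrete instance (a minimal sketch; the rates are arbitrary choices), the Poisson distribution is an exponential family with canonical parameter \(\eta = \log \lambda\), canonical statistic \(t(x) = x\) and log-partition function \(a(\eta) = e^{\eta} = \lambda\). The general formula then gives \(D_{\text{KL}}(\operatorname{Poi}(\lambda_1), \operatorname{Poi}(\lambda_2)) = \lambda_1 \log(\lambda_1/\lambda_2) - \lambda_1 + \lambda_2\), which can be checked by direct summation:

```python
import numpy as np
from scipy.stats import poisson

lam1, lam2 = 3.0, 5.0

# exponential family formula: <eta1 - eta2, mu_t(eta1)> - (a(eta1) - a(eta2))
# with eta = log(lambda), mu_t(eta) = lambda, a(eta) = lambda
kl_formula = (np.log(lam1) - np.log(lam2)) * lam1 - (lam1 - lam2)

# direct evaluation of sum_x q(x) log(q(x)/p(x)), truncated at a large x
x = np.arange(0, 200)
q = poisson.pmf(x, lam1)
kl_direct = np.sum(q * (poisson.logpmf(x, lam1) - poisson.logpmf(x, lam2)))

print(np.isclose(kl_formula, kl_direct))   # True
```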

4.3 Further reading

A brief overview of proper scoring rules along with corresponding divergences is found in the supplementary Probability and Distribution Refresher notes.

A broad review of scoring rules, including a discussion of properness, is found, e.g., in Winkler (1996).

See also: Bregman divergences (Wikipedia) and \(f\)-divergences (Wikipedia)

Tip: A bit of history

Solomon Kullback (1907–1994) and Richard Leibler (1914–2003) introduced KL divergence in statistics in Kullback and Leibler (1951) under the term “discrimination information”. An earlier application of KL divergence in 1940/1941 is credited to Alan Turing (1912–1954) by Good (1979).

Negative KL divergence as entropy with regard to a prior measure was discovered in 1878 by Ludwig Boltzmann (1844–1906) in statistical mechanics (see Section 6.1).


  1. The divergence between distributions is unrelated to the divergence vector operator from vector calculus.↩︎

  2. For a proof of the invariance property of KL divergence see Worksheet E1.↩︎