4 Divergence
This chapter introduces entropy measures involving two distributions: the mean log-loss or cross-entropy, the Kullback-Leibler (KL) divergence, and the Boltzmann entropy or relative entropy.
4.1 Scoring rules and risk
Mean log-loss or cross-entropy
The risk of \(P\) under \(Q\) associated with a scoring rule \(S(x, P)\) is the expected loss \[ R_Q(P) =\operatorname{E}_Q\left( S(x, P) \right) \]
For the logarithmic scoring rule the risk is called the mean log-loss or cross-entropy1 \[ R_Q(P) = - \operatorname{E}_Q \log p(x) = H(Q,P). \] The risk \(R_Q(P)\) is thus a functional of two distributions \(Q\) and \(P\), where \(Q\) represents the data-generating process (note the expectation \(\operatorname{E}_Q\) with regard to \(Q\)) and the distribution \(P\) is the model that is evaluated on the observations.
It can be interpreted as the expected cost or expected code length when the data are generated according to model \(Q\) (“sender”, “encoder”) but we use model \(P\) to describe the data (“receiver”, “decoder”).
Computing the mean log-loss
For two discrete distributions \(Q\) and \(P\) with pmfs \(q(x)\) and \(p(x)\), \(x\in \Omega\), the mean log-loss is computed as the weighted sum \[ H(Q, P) = - \sum_{x \in \Omega} q(x) \, \log p(x) \]
For two continuous distributions \(Q\) and \(P\) with pdfs \(q(x)\) and \(p(x)\) we compute the integral \[ H(Q, P) =- \int_x q(x) \, \log p(x) \, dx \]
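As a minimal numerical sketch (in Python with numpy and scipy; the pmfs and normal parameters below are arbitrary choices, not taken from the text), both formulas can be evaluated directly, the continuous case by quadrature:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# discrete case: H(Q, P) = -sum_x q(x) log p(x)
q = np.array([0.5, 0.3, 0.2])   # pmf of the data-generating model Q
p = np.array([0.4, 0.4, 0.2])   # pmf of the approximating model P
H_discrete = -np.sum(q * np.log(p))

# continuous case: H(Q, P) = -int q(x) log p(x) dx, here for two normals
Q, P = norm(0.0, 1.0), norm(1.0, 2.0)
H_continuous, _ = quad(lambda x: -Q.pdf(x) * P.logpdf(x), -np.inf, np.inf)

print(H_discrete, H_continuous)
```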
Properties of the mean log-loss
The mean log-loss is not symmetric with regard to \(Q\) and \(P\), because the expectation is taken with reference to \(Q\).
If both distributions \(Q\) and \(P\) are identical, the mean log-loss reduces to entropy, i.e. \(H(Q, Q) = H(Q)\).
Like differential entropy, the mean log-loss for continuous random variables changes under a change of variables, say from \(x\) to \(y\); hence in general \(H(Q_y, P_y) \neq H(Q_x, P_x)\).
Gibbs’ inequality
A crucial further property of the mean log-loss \(H(Q, P)\) is that it is bounded below by the entropy of \(Q\), therefore \[ H(Q, P) \geq H(Q) \] with equality only if \(P=Q\). This is known as Gibbs’ inequality.
This lower bound can be established2 using Jensen’s inequality for convex functions (recall that \(-\log x\) is strictly convex).
As a consequence, when data are generated (encoded) under model \(Q\) and described (decoded) using model \(P\) there is always an extra cost, or penalty, to employ the approximating model \(P\) rather than the correct model \(Q\).
More generally, proper scoring rules (see Note 3.1) satisfy the inequality \(R_Q(P) \geq R_Q(Q)\). For strictly proper scoring rules the minimum is achieved uniquely at the true distribution \(Q\) (see Note 4.1).
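Gibbs’ inequality is easy to check numerically. The following sketch (with an arbitrary Dirichlet-generated pmf for \(Q\); not part of the original examples) verifies that no candidate \(P\) attains a lower mean log-loss than \(Q\) itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_entropy(q, p):
    """Mean log-loss H(Q, P) for two discrete distributions."""
    return -np.sum(q * np.log(p))

q = rng.dirichlet(np.ones(5))        # data-generating pmf Q
H_Q = cross_entropy(q, q)            # entropy H(Q) = H(Q, Q)

# H(Q, P) >= H(Q) for every candidate pmf P, with equality only at P = Q
for _ in range(1000):
    p = rng.dirichlet(np.ones(5))
    assert cross_entropy(q, p) >= H_Q
```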
Examples
Example 4.1 Mean log-loss for two normal distributions:
Assume \(P_1=N(\mu_1,\sigma^2_1)\) and \(P_2=N(\mu_2,\sigma^2_2)\). The mean log-loss is \[ \begin{split} H(P_1, P_2) &= -\operatorname{E}_{P_1} \left( \log p(x |\mu_2, \sigma^2_2) \right)\\ &= \frac{1}{2} \operatorname{E}_{P_1} \left( \log(2\pi\sigma^2_2) + \frac{(x-\mu_2)^2}{\sigma^2_2} \right) \\ &= \frac{1}{2} \left( \frac{(\mu_1 - \mu_2)^2}{ \sigma^2_2 } +\frac{\sigma^2_1}{\sigma^2_2} +\log(2 \pi \sigma^2_2) \right) \\ \end{split} \] using \(\operatorname{E}_{P_1} ((x-\mu_2)^2) = (\mu_1-\mu_2)^2 + \sigma^2_1\).
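The analytic expression can be verified by Monte Carlo; the parameter values in this sketch are arbitrary:

```python
import numpy as np
from scipy.stats import norm

mu1, s1 = 1.0, 2.0     # P1 = N(mu1, s1^2)
mu2, s2 = -0.5, 1.5    # P2 = N(mu2, s2^2)

# analytic mean log-loss H(P1, P2) from the formula above
H_analytic = 0.5 * ((mu1 - mu2)**2 / s2**2 + s1**2 / s2**2 + np.log(2 * np.pi * s2**2))

# Monte Carlo estimate: average of -log p(x | mu2, s2^2) over draws from P1
x = norm(mu1, s1).rvs(size=1_000_000, random_state=42)
H_mc = -np.mean(norm(mu2, s2).logpdf(x))

print(H_analytic, H_mc)   # the two values agree to a few decimal places
```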
Example 4.2 Normal mean log-loss reduces to normal differential entropy:
The mean log-loss for two normal distributions \(P_1\) and \(P_2\) is (Example 4.1) \[ H(P_1,P_2) = \frac{1}{2} \left( \frac{(\mu_1 - \mu_2)^2}{ \sigma^2_2 } +\frac{\sigma^2_1}{\sigma^2_2} +\log(2 \pi \sigma^2_2) \right) \] Setting \(P_1=P_2 = P\) with \(\mu_1 = \mu_2 = \mu\) and \(\sigma^2_1 = \sigma^2_2 = \sigma^2\) the mean log-loss degenerates to \[ H(P) = \frac{1}{2} \left(\log( 2 \pi \sigma^2) +1 \right) \] which is the normal differential entropy (Example 3.11).
The mean log-loss \(H(Q, P)\) is the risk associated with the logarithmic scoring rule \(S(x, P) = -\log p(x)\). Gibbs’ inequality states that the mean log-loss \(H(Q, P)\) is uniquely minimised at \(P=Q\), with the minimum risk being the entropy \(H(Q)\) of the data-generating model \(Q\). Therefore, the logarithmic scoring rule is strictly proper (see Note 3.1).
4.2 KL divergence and Boltzmann entropy
Divergence
The divergence between \(Q\) and \(P\) associated with a proper scoring rule \(S(x, P)\) is defined as the difference between the risk and the minimum risk3: \[ D(Q,P) = R_Q(P) - R_Q(Q) \geq 0 \] The use of the term “divergence” rather than “distance” is a reminder that \(Q\) and \(P\) are not interchangeable in \(D(Q, P)\).
Furthermore, note that the divergence between distributions is unrelated to the divergence operator of vector calculus.
KL divergence
For the logarithmic scoring rule this yields the Kullback-Leibler (KL) divergence between \(Q\) and \(P\): \[ \begin{split} D_{\text{KL}}(Q,P) &= H(Q, P)-H(Q) \\ & = \operatorname{E}_Q\log\left(\frac{q(x)}{p(x)}\right)\\ \end{split} \] The KL divergence is always non-negative, \(D_{\text{KL}}(Q, P) \geq 0\), as a consequence of Gibbs’ inequality.
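A brief numerical sketch (with arbitrary pmfs) computes the KL divergence both as excess risk and directly as the expected log-ratio, and compares with scipy.stats.entropy, which returns the same quantity when given two arguments:

```python
import numpy as np
from scipy.stats import entropy

q = np.array([0.5, 0.3, 0.2])   # data-generating pmf Q
p = np.array([0.4, 0.4, 0.2])   # model pmf P

# KL divergence as excess risk: H(Q, P) - H(Q)
D_KL = -np.sum(q * np.log(p)) - (-np.sum(q * np.log(q)))

# equivalently, the expectation of log(q/p) under Q
D_KL_direct = np.sum(q * np.log(q / p))

# scipy.stats.entropy(q, p) returns the same quantity (natural log by default)
print(D_KL, D_KL_direct, entropy(q, p))
```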
Various notations for the KL divergence exist; we use \(D_{\text{KL}}(Q, P)\), but \(\text{KL}(Q || P)\) and \(I^{KL}(Q; P)\) are also common.
The KL divergence is also known as KL information, KL information number or discrimination information. Further names are information divergence or \(\boldsymbol I\)-divergence. Some authors refer to \(2 D_{\text{KL}}(Q, P) = D(Q, P)\) as the deviance of \(P\) from \(Q\).
Interpretation
The KL divergence \(D_{\text{KL}}(Q, P)\) can be interpreted as the excess risk or additional cost incurred if model \(P\) is used to describe data that actually follow \(Q\). If \(Q\) and \(P\) are identical there is no extra cost and \(D_{\text{KL}}(Q,P)=0\). However, if they differ then there is an additional cost and \(D_{\text{KL}}(Q,P)> 0\).
Boltzmann entropy or relative entropy
The Boltzmann entropy of a distribution \(Q\) relative to a distribution \(P\) is given by \[ \begin{split} B(Q, P) &= H(Q) - H(Q, P) \\ &= -\operatorname{E}_Q\log\left(\frac{q(x)}{p(x)}\right) \\ \end{split} \]
Boltzmann entropy is the negative of the KL divergence, and hence it is always non-positive, \(B(Q, P) \leq 0\).
Boltzmann entropy can be interpreted as the entropy of \(Q\) relative to \(P\), and it contains entropy as a special case without a change of sign (Example 4.8).
Furthermore, Example 4.9 shows that \(B(Q, P)\) can be interpreted as a log-probability and that it is closely linked with the log-likelihood.
Properties of KL divergence and Boltzmann entropy
Boltzmann entropy and KL divergence differ only by sign and thus share a number of key properties inherited from mean log-loss:
\(D_{\text{KL}}(Q, P) \neq D_{\text{KL}}(P, Q)\), i.e. the KL divergence is not symmetric, \(Q\) and \(P\) cannot be interchanged. This follows from the same property of the mean log-loss.
\(D_{\text{KL}}(Q, P)\geq 0\); this follows from Gibbs’ inequality, which is proved via Jensen’s inequality.
\(D_{\text{KL}}(Q, P) = 0\) if and only if \(P=Q\), i.e., the KL divergence is zero if and only if \(Q\) and \(P\) are identical. Also follows from Gibbs’ inequality.
Typically, we will minimise quantities related to risk:
- KL divergence and mean log-loss
and, conversely, maximise quantities related to entropy:
- Boltzmann entropy (and log-likelihood) and Shannon-Gibbs entropy.
Invariance and data processing properties
A further crucial property of KL divergence is its invariance property:
- \(D_{\text{KL}}(Q, P)\) is invariant under general invertible variable transformations , so that \(D_{\text{KL}}(Q_y, P_y) =D_{\text{KL}}(Q_x, P_x)\) under a change of random variable from \(x\) to \(y\). Hence, KL divergence does not change when the sample space is reparametrised.4
This is a remarkable property5 6 as it holds not just for discrete but also for continuous random variables. For comparison, recall that neither differential entropy nor the mean log-loss for continuous random variables is transformation invariant.
In the definition of the KL divergence the expectation is taken over the log-ratio of two densities. The invariance arises because, under a variable transformation, the Jacobian determinant changes both densities in the same way and thus cancels out in the ratio.
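The following sketch illustrates the invariance numerically: the KL divergence between two normals (parameters chosen arbitrarily) equals the KL divergence between the corresponding log-normal distributions obtained via the invertible map \(y = e^x\):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm, lognorm

def kl_quad(Q, P, lower, upper):
    """Numerical KL divergence D_KL(Q, P) by quadrature."""
    integrand = lambda x: Q.pdf(x) * (Q.logpdf(x) - P.logpdf(x))
    value, _ = quad(integrand, lower, upper)
    return value

mu1, s1 = 0.0, 1.0
mu2, s2 = 1.0, 1.5

# KL divergence on the original scale x (two normals)
kl_x = kl_quad(norm(mu1, s1), norm(mu2, s2), -np.inf, np.inf)

# KL divergence after the invertible transformation y = exp(x) (two log-normals)
kl_y = kl_quad(lognorm(s=s1, scale=np.exp(mu1)),
               lognorm(s=s2, scale=np.exp(mu2)), 0.0, np.inf)

print(kl_x, kl_y)   # identical up to numerical error: KL divergence is invariant
```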
More broadly, the KL divergence satisfies the data processing inequality7, i.e. applying a stochastic or deterministic transformation to the underlying random variables cannot increase the KL divergence \(D_{\text{KL}}(Q,P)\) between \(Q\) and \(P\). Thus, by processing data you cannot increase information about which distribution generated the data.
Coordinate transformations can be viewed as a special case of data processing, and for \(D_{\text{KL}}(Q,P)\) the data-processing inequality under general invertible transformations becomes an identity.
Further properties
The KL divergence and mean log-loss inherit a number of further useful properties from proper scoring rules, which we will not cover in this text. For example, there are various decompositions of the risk, and the divergence satisfies a generalised Pythagorean theorem.
In summary, KL divergence stands out among divergences between distributions due to many valuable and in some cases unique properties. It is therefore not surprising that it plays a central role in statistics and machine learning.
There are inconsistent and ambiguous uses of the terms “cross-entropy” and “relative entropy” in the literature.
In these notes, “cross-entropy” denotes the mean log-loss and “relative entropy” refers to Boltzmann entropy.
Our use of “cross-entropy” follows common modern usage. However, many works in statistics and engineering also use the term to mean KL divergence or relative entropy.
The term “relative entropy” is also ambiguous. Our convention of equating relative entropy with the negative KL divergence and referring to it as Boltzmann entropy follows Akaike (1985) and is in line with the conventions in statistical physics, where this quantity is also known as relative Boltzmann-Gibbs entropy8. However, many current authors, particularly in computer science, take relative entropy to be identical with the KL divergence. Shannon (1948) also used the term relative entropy, but to denote entropy standardised by its maximum value.
4.3 KL divergence examples
Models with a single parameter
Example 4.3 KL divergence between two Bernoulli distributions \(B_1 = \operatorname{Ber}(\theta_1)\) and \(B_2 = \operatorname{Ber}(\theta_2)\):
The “success” probabilities for the two distributions are \(\theta_1\) and \(\theta_2\), respectively, and the complementary “failure” probabilities are \(1-\theta_1\) and \(1-\theta_2\). With this we get for the KL divergence \[ D_{\text{KL}}(B_1, B_2)=\theta_1 \log\left( \frac{\theta_1}{\theta_2}\right) + (1-\theta_1) \log\left(\frac{1-\theta_1}{1-\theta_2}\right) \]
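A direct implementation of this formula (with arbitrary parameter values) also illustrates the asymmetry and the zero at \(\theta_1 = \theta_2\):

```python
import numpy as np

def kl_bernoulli(theta1, theta2):
    """KL divergence between Ber(theta1) and Ber(theta2), cf. Example 4.3."""
    return (theta1 * np.log(theta1 / theta2)
            + (1 - theta1) * np.log((1 - theta1) / (1 - theta2)))

print(kl_bernoulli(0.3, 0.5))   # positive
print(kl_bernoulli(0.5, 0.3))   # different value: the KL divergence is not symmetric
print(kl_bernoulli(0.3, 0.3))   # zero, since the two distributions coincide
```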
Example 4.4 KL divergence between two univariate normals with different means and common variance:
Assume \(P_1=N(\mu_1,\sigma^2)\) and \(P_2=N(\mu_2,\sigma^2)\). Setting \(\sigma^2_1 =\sigma^2_2 = \sigma^2\) in the more general case of the KL divergence between two normals (Example 4.7) yields \[D_{\text{KL}}(P_1, P_2 )=\frac{1}{2\sigma^2} (\mu_1-\mu_2)^2 \] which, apart from a scale factor, is the squared Euclidean distance or squared loss between the two means \(\mu_1\) and \(\mu_2\). Note that in this case the KL divergence is symmetric with regard to the two mean parameters.
Example 4.5 KL divergence between two univariate normals with common mean and different variances:
Assume \(P_1=N(\mu,\sigma^2_1)\) and \(P_2=N(\mu,\sigma^2_2)\). Setting \(\mu_1 = \mu_2 = \mu\) in the more general case of the KL divergence between two normals (Example 4.7) yields
\[ D_{\text{KL}}(P_1,P_2) = \frac{1}{2} \left( \frac{\sigma^2_1}{\sigma^2_2} -\log\left(\frac{\sigma^2_1}{\sigma^2_2}\right)-1 \right) \] This is a convex function of the ratio \(\sigma_1^2/\sigma^2_2\) of the two variances. Apart from the scale factor this is known as Stein’s loss between the two variances \(\sigma^2_1\) and \(\sigma^2_2\).
Models with multiple parameters
Example 4.6 KL divergence between two categorical distributions with \(K\) classes:
With \(Q=\operatorname{Cat}(\boldsymbol q)\) and \(P=\operatorname{Cat}(\boldsymbol p)\) and corresponding probabilities \(q_1,\dots,q_K\) and \(p_1,\dots,p_K\) satisfying \(\sum_{i=1}^K q_i = 1\) and \(\sum_{i=1}^K p_i =1\) we get:
\[\begin{equation*} D_{\text{KL}}(Q, P)=\sum_{i=1}^K q_i\log\left(\frac{q_i}{p_i}\right) \end{equation*}\]
To be explicit that there are only \(K-1\) parameters in a categorical distribution we can also write \[\begin{equation*} D_{\text{KL}}(Q, P)=\sum_{i=1}^{K-1} q_i\log\left(\frac{q_i}{p_i}\right) + q_K\log\left(\frac{q_K}{p_K}\right) \end{equation*}\] with \(q_K=\left(1- \sum_{i=1}^{K-1} q_i\right)\) and \(p_K=\left(1- \sum_{i=1}^{K-1} p_i\right)\).
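A small sketch implementing this sum; scipy.special.rel_entr is used so that classes with \(q_i = 0\) contribute zero, following the usual convention \(0 \log 0 = 0\) (the probabilities below are arbitrary):

```python
import numpy as np
from scipy.special import rel_entr

def kl_categorical(q, p):
    """KL divergence between Cat(q) and Cat(p)."""
    # rel_entr(q_i, p_i) = q_i log(q_i / p_i), with the convention 0 log 0 = 0
    return np.sum(rel_entr(q, p))

q = np.array([0.6, 0.3, 0.1, 0.0])       # class probabilities of Q
p = np.array([0.25, 0.25, 0.25, 0.25])   # class probabilities of P
print(kl_categorical(q, p))
```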
Example 4.7 KL divergence between two univariate normals with different means and variances:
Assume \(P_1=N(\mu_1,\sigma^2_1)\) and \(P_2=N(\mu_2,\sigma^2_2)\). Then with Example 3.11 (entropy of normal) and Example 4.1 (mean log-loss between two normals) we get \[ \begin{split} D_{\text{KL}}(P_1,P_2) &= H(P_1, P_2) - H(P_1) \\ &= \frac{1}{2} \left( \frac{(\mu_1-\mu_2)^2}{\sigma^2_2} + \frac{\sigma^2_1}{\sigma^2_2} -\log\left(\frac{\sigma^2_1}{\sigma^2_2}\right)-1 \right) \\ \end{split} \]
For the two special cases of equal variances (\(\sigma^2_1 =\sigma^2_2 = \sigma^2\)) and equal means \((\mu_1 = \mu_2 = \mu\)) see Example 4.4 and Example 4.5.
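The general formula and its two special cases can be checked numerically; the parameter values in this sketch are arbitrary:

```python
import numpy as np

def kl_normal(mu1, var1, mu2, var2):
    """KL divergence between N(mu1, var1) and N(mu2, var2), cf. Example 4.7."""
    return 0.5 * ((mu1 - mu2)**2 / var2 + var1 / var2 - np.log(var1 / var2) - 1)

# general case
print(kl_normal(1.0, 4.0, -0.5, 2.25))

# equal variances (Example 4.4): reduces to (mu1 - mu2)^2 / (2 sigma^2)
print(kl_normal(1.0, 2.0, 3.0, 2.0), (1.0 - 3.0)**2 / (2 * 2.0))

# equal means (Example 4.5): reduces to Stein's loss between the two variances
v1, v2 = 4.0, 2.0
print(kl_normal(0.0, v1, 0.0, v2), 0.5 * (v1 / v2 - np.log(v1 / v2) - 1))
```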
Example 4.8 Boltzmann entropy reduces to entropy:
The mean log-loss \(H(Q, U)\) for a true model \(Q\) and the uniform distribution \(U\) (with constant pmf or pdf) is \[ H(Q, U) = \text{const.} \] and does not depend on \(Q\).
The Boltzmann entropy of \(Q\) relative to \(U\) is therefore \[ B(Q, U) = H(Q) + \text{const.} \] Thus, the Boltzmann entropy of \(Q\) relative to the uniform distribution \(U\) is, up to a constant, equal to the entropy of \(Q\). Crucially, both quantities share the same sign and orientation.
Example 4.9 \(\color{Red} \blacktriangleright\) Boltzmann entropy as log-probability:
Assume \(\hat{Q}=\operatorname{Cat}(\hat{\boldsymbol q})\) is an empirical categorical distribution based on observed counts \(D = \{n_1, \ldots, n_K\}\) (see Example 3.12) and \(P=\operatorname{Cat}(\boldsymbol p)\) is a second categorical distribution.
The Boltzmann entropy is then \[ \begin{split} B(\hat{Q}, P) & = H(\hat{Q}) -H(\hat{Q}, P) \\ & = H(\hat{Q}) + \sum_{k=1}^K \hat{q}_k \log p_k \\ & = H(\hat{Q}) + \frac{1}{n} \sum_{k=1}^K n_k \log p_k \\ \end{split} \]
For large \(n\) we may use the multinomial coefficient \(W_K = \binom{n}{n_1, \ldots, n_K}\) to obtain the entropy of \(\hat{Q}\) (see Example 3.12). This results in \[ \begin{split} B(\hat{Q}, P) &\approx \frac{1}{n} \left( \log W_K + \sum_{k=1}^K n_k \log p_k \right)\\ & = \frac{1}{n} \log \left( W_K \times \prod_{k=1}^K p_k^{n_k} \right)\\ & = \frac{1}{n} \log p(D| \,\boldsymbol p) \\ \end{split} \] Hence the Boltzmann entropy is directly linked to the multinomial probability of the observed counts \(D\) under the model \(P\) and equals the standardised multinomial log-likelihood.
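This link can be illustrated numerically; the counts and model probabilities below are arbitrary choices, and the small gap between the two values reflects the Stirling approximation used for \(\log W_K\):

```python
import numpy as np
from scipy.stats import multinomial

counts = np.array([60, 25, 15])   # observed counts n_1, ..., n_K
n = counts.sum()
q_hat = counts / n                # empirical categorical distribution Q_hat
p = np.array([0.5, 0.3, 0.2])     # candidate model P = Cat(p)

# exact Boltzmann entropy B(Q_hat, P) = H(Q_hat) - H(Q_hat, P)
B_exact = -np.sum(q_hat * np.log(q_hat)) + np.sum(q_hat * np.log(p))

# standardised multinomial log-likelihood (1/n) log p(D | p)
B_loglik = multinomial.logpmf(counts, n, p) / n

print(B_exact, B_loglik)   # approximately equal, both non-positive
```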
4.4 Further reading
See Akaike (1985) for a historical account of Boltzmann entropy as log-probability.
Dawid and Musio (2014) review divergences associated with general proper scoring rules.
See also: Bregman divergences (Wikipedia)
The concept of relative entropy and entropy as log-probability (see Example 4.9) is due to Ludwig Boltzmann (1844–1906).
Solomon Kullback (1907–1994) and Richard Leibler (1914–2003) introduced KL divergence in statistics in Kullback and Leibler (1951). An earlier application of KL divergence in 1940/1941 is credited to Alan Turing (1912–1954) by Good (1979).
There are multiple, inconsistent uses of the term “cross-entropy” in the literature, so we will mostly use “mean log-loss” for clarity (see Note 4.2).↩︎
For details see Worksheet E1.↩︎
These divergences derived from proper scoring rules (established in statistics) are closely related to Bregman divergences (established in optimisation and widely used under this name in machine learning).↩︎
For a proof of the invariance property see Worksheet E1.↩︎
Even more striking, this property only holds for the KL divergence, not for any other divergence induced by a proper scoring rule, making the KL divergence the sole invariant Bregman divergence.↩︎
Furthermore, among \(f\)-divergences (of which the KL divergence is a principal example) the KL divergence is the only one that is invariant under coordinate transformations.↩︎
The data processing inequality also holds for all \(f\)-divergences but is notably not satisfied by divergences of other proper scoring rules (and thus other Bregman divergences).↩︎