4 Relative entropy
This chapter introduces entropy measures involving two distributions, such as cross-entropy, Boltzmann relative entropy and Kullback-Leibler (KL) divergence.
4.1 Cross-entropy
Definition of cross-entropy
Given the logarithmic scoring rule \(S(x, P) = -\log p(x)\) we now compute its expectation assuming \(x \sim Q\), i.e. with regard to another distribution \(Q\): \[ \begin{split} H(Q, P) & =\text{E}_Q\left( S(x, P) \right)\\ & = -\text{E}_Q\left( \log p(x) \right)\\ \end{split} \] This expected score is called the cross-entropy1 2.
In the above the distribution \(Q\) represents the data-generating process (note the expectation \(\text{E}_Q\) with regard to \(Q\)) and the distribution \(P\) is the model that is evaluated on the observations (via \(-\log p(x)\)). Thus, cross-entropy is a functional of two distributions \(Q\) and \(P\).
For two discrete distributions \(Q\) and \(P\) with probability mass functions \(q(x)\) and \(p(x)\) with \(x\in \Omega\) the cross-entropy is computed as the weighted sum \[ H(Q, P) = - \sum_{x \in \Omega} \log p(x) \, q(x) \] It can be interpreted as the expected cost or expected code length when the data are generated according to model \(Q\) (“sender”, “encoder”) but we use model \(P\) to describe the data (“receiver”, “decoder”).
For two continuous distributions \(Q\) and \(P\) with densities \(q(x)\) and \(p(x)\) we compute the integral \[H(Q, P) =- \int_x \log p(x)\, q(x) \, dx\]
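As a small illustration, the discrete cross-entropy can be computed directly as a weighted sum. The following minimal Python sketch (the function name `cross_entropy` and the example probabilities are illustrative choices, not taken from the text) uses natural logarithms, so the result is in nats:

```python
import numpy as np

def cross_entropy(q, p):
    """Cross-entropy H(Q, P) = -sum_x q(x) log p(x) in nats (natural log)."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return -np.sum(q * np.log(p))

# Q is the data-generating ("sender") model, P the approximating ("receiver") model.
q = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])
print(cross_entropy(q, p))  # expected cost when describing data from Q with P
print(cross_entropy(q, q))  # equals the entropy H(Q), the minimum possible cost
```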
Example 4.1 Cross-entropy between two normals:
Assume \(F_{\text{ref}}=N(\mu_{\text{ref}},\sigma^2_{\text{ref}})\) and \(F=N(\mu,\sigma^2)\). The cross-entropy \(H(F_{\text{ref}}, F)\) is \[ \begin{split} H(F_{\text{ref}}, F) &= -\text{E}_{F_{\text{ref}}} \left( \log p(x |\mu, \sigma^2) \right)\\ &= \frac{1}{2} \text{E}_{F_{\text{ref}}} \left( \log(2\pi\sigma^2) + \frac{(x-\mu)^2}{\sigma^2} \right) \\ &= \frac{1}{2} \left( \frac{(\mu - \mu_{\text{ref}})^2}{ \sigma^2 } +\frac{\sigma^2_{\text{ref}}}{\sigma^2} +\log(2 \pi \sigma^2) \right) \\ \end{split} \] using \(\text{E}_{F_{\text{ref}}} ((x-\mu)^2) = (\mu_{\text{ref}}-\mu)^2 + \sigma^2_{\text{ref}}\).
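The closed-form expression above can be checked by Monte Carlo. The sketch below (with arbitrarily chosen parameter values) averages the logarithmic score \(-\log p(x \mid \mu, \sigma^2)\) over samples drawn from \(F_{\text{ref}}\):

```python
import numpy as np

rng = np.random.default_rng(0)

# F_ref = N(mu_ref, s2_ref) generates the data, F = N(mu, s2) is the evaluated model
mu_ref, s2_ref = 1.0, 2.0
mu, s2 = 0.0, 1.5

# Closed-form cross-entropy H(F_ref, F) from Example 4.1
closed_form = 0.5 * ((mu - mu_ref) ** 2 / s2 + s2_ref / s2 + np.log(2 * np.pi * s2))

# Monte Carlo estimate: average of -log p(x | mu, s2) over samples x ~ F_ref
x = rng.normal(mu_ref, np.sqrt(s2_ref), size=1_000_000)
monte_carlo = np.mean(0.5 * (np.log(2 * np.pi * s2) + (x - mu) ** 2 / s2))

print(closed_form, monte_carlo)  # the two values agree up to Monte Carlo error
```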
Properties of cross-entropy
- Cross-entropy is not symmetric with regard to \(Q\) and \(P\), because the expectation is taken with reference to \(Q\).
- If both distributions \(Q\) and \(P\) are identical, cross-entropy reduces to entropy, i.e. \(H(Q, Q) = H(Q)\).
- Like differential entropy, cross-entropy changes under a variable transformation of continuous random variables, say from \(x\) to \(y\), hence in general \(H(Q_y, P_y) \neq H(Q_x, P_x)\).
Gibbs’ inequality
A crucial further property of the cross-entropy \(H(Q, P)\) is that it is bounded below by the entropy of \(Q\), therefore \[ H(Q, P) \geq H(Q) \] with equality only if \(Q=P\). This is known as Gibbs’ inequality.
This follows from Jensen’s inequality. For details see Worksheet E1.
Essentially this means that when data are generated (encoded) under model \(Q\) and described (decoded) using model \(P\) there is always an extra cost, or penalty, for employing the approximating model \(P\) rather than the correct model \(Q\).
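Gibbs’ inequality is easy to verify numerically. The short Python sketch below (using Dirichlet-sampled random discrete distributions, an illustrative choice) checks that the cross-entropy never falls below the entropy:

```python
import numpy as np

rng = np.random.default_rng(1)

def entropy(q):
    return -np.sum(q * np.log(q))

def cross_entropy(q, p):
    return -np.sum(q * np.log(p))

# Random pairs of discrete distributions on four outcomes
for _ in range(5):
    q = rng.dirichlet(np.ones(4))
    p = rng.dirichlet(np.ones(4))
    print(cross_entropy(q, p) >= entropy(q))  # always True
```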
Example 4.2 \(\color{Red} \blacktriangleright\) Logarithmic scoring rule is strictly proper:
The cross-entropy \(H(Q, P)\) is the risk associated with the logarithmic scoring rule \(S(x, P) = -\log p(x)\). Gibbs’ inequality states that the cross-entropy \(H(Q, P)\) is uniquely minimised at \(P=Q\), with the minimum risk being the entropy \(H(Q)\) of the data-generating model \(Q\). Therefore, the logarithmic scoring rule is strictly proper (see Note 3.1).
Example 4.3 Normal differential entropy as lower bound of normal cross-entropy:
Revisit the cross-entropy \(H(F_{\text{ref}},F)\) in Example 4.1. Setting \(\mu_{\text{ref}} = \mu\) and \(\sigma^2_{\text{ref}} = \sigma^2\) the normal cross-entropy degenerates to the normal differential entropy \(H(F_{\text{ref}}) = \frac{1}{2} \left(\log( 2 \pi \sigma^2_{\text{ref}}) +1 \right)\) as obtained in Example 3.10.
4.2 Boltzmann relative entropy and KL divergence
Boltzmann relative entropy
The Boltzmann entropy of a distribution \(Q\) relative to a distribution \(P\) is given by \[ \begin{split} B(Q, P) &= H(Q) - H(Q, P) \\ &= -\text{E}_Q\log\left(\frac{q(x)}{p(x)}\right) \\ \end{split} \] The Boltzmann entropy is also known as relative entropy.
As a consequence of Gibbs’ inequality we see that the Boltzmann entropy is always non-positive, \(B(Q, P) \leq 0\). In Example 4.9 it is shown that \(B(Q, P)\) can be interpreted as a log-probability.
By construction, the Boltzmann entropy of \(Q\) relative to a uniform distribution \(U\) (with constant probability mass or density function) is \[ B(Q, U) = H(Q) + \text{const.} \] i.e. it is equal to the entropy of \(Q\) apart from a constant.
Definition of KL divergence
The Kullback-Leibler (KL) divergence is the negative of the Boltzmann relative entropy and is defined as \[ \begin{split} D_{\text{KL}}(Q,P) &= H(Q, P)-H(Q) \\ & = \text{E}_Q\log\left(\frac{q(x)}{p(x)}\right)\\ \end{split} \] As a consequence of Gibbs’ inequality the KL divergence is always non-negative: \(D_{\text{KL}}(Q, P) \geq 0\).3
The KL divergence \(D_{\text{KL}}(Q, P)\) can be interpreted as the additional cost if model \(P\) is used instead of \(Q\) to describe data from \(Q\). If \(Q\) and \(P\) are identical there is no extra cost and \(D_{\text{KL}}(Q,P)=0\). However, if they are not identical then there is an additional cost and \(D_{\text{KL}}(Q,P)> 0\).
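For discrete distributions the KL divergence can be computed directly; `scipy.stats.entropy` returns \(\sum_i q_i \log(q_i/p_i)\) when given two arguments (natural logarithms by default). A minimal sketch with illustrative probabilities:

```python
import numpy as np
from scipy.stats import entropy  # entropy(q, p) returns sum q*log(q/p), i.e. D_KL(Q, P)

q = np.array([0.5, 0.3, 0.2])  # data-generating distribution Q
p = np.array([0.4, 0.4, 0.2])  # approximating model P

print(entropy(q, p))  # D_KL(Q, P) > 0: the extra cost of using P instead of Q
print(entropy(p, q))  # differs in general: the KL divergence is not symmetric
print(entropy(q, q))  # 0: no extra cost when the model matches Q
```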
\(D_{\text{KL}}(Q,P)\) thus measures the divergence4 between the two distributions \(P\) and \(Q\). The use of the term “divergence” rather than “distance” is a reminder that \(Q\) and \(P\) are not interchangeable in \(D_{\text{KL}}(Q, P)\).
The KL divergence is also known as KL information, KL information number or discrimination information. Another name is information divergence or \(\boldsymbol I\)-divergence. Some authors refer to \(2 D_{\text{KL}}(Q, P) = D(Q, P)\) as the deviance of \(P\) from \(Q\).
Various notations for KL divergence exist; we use \(D_{\text{KL}}(Q, P)\) but \(\text{KL}(Q || P)\) and \(I^{KL}(Q; P)\) are also common.
In older literature KL divergence is sometimes referred to as “cross-entropy”. This use is outdated and leads to confusion with the modern definition of cross-entropy.
KL divergence is also referred to as “relative entropy” but this leads to confusion in sign with Boltzmann relative entropy. Shannon (1948) defined relative entropy as entropy standardised relative to its maximum value. In this text relative entropy always refers to Boltzmann relative entropy, with the opposite orientation compared to KL divergence.
Properties of KL divergence and Boltzmann relative entropy
Boltzmann relative entropy and KL divergence differ only by sign and thus share a number of key properties inherited from cross-entropy:
- \(D_{\text{KL}}(Q, P) \neq D_{\text{KL}}(P, Q)\), i.e. the KL divergence is not symmetric, \(Q\) and \(P\) cannot be interchanged. This follows from the same property of cross-entropy.
- \(D_{\text{KL}}(Q, P)\geq 0\). This follows from Gibbs’ inequality, which in turn is proved via Jensen’s inequality.
- \(D_{\text{KL}}(Q, P) = 0\) if and only if \(P=Q\), i.e., the KL divergence is zero if and only if \(Q\) and \(P\) are identical. Also follows from Gibbs’ inequality.
For more details and proofs of properties 2 and 3 see Worksheet E1.
Typically, we wish to minimise KL divergence \(D_{\text{KL}}(Q, P)\) and maximise Boltzmann relative entropy \(B(Q, P)\).
Invariance and data processing properties
A further crucial property of KL divergence is its invariance property:
- \(D_{\text{KL}}(Q, P)\) is invariant under general invertible variable transformations, so that \(D_{\text{KL}}(Q_y, P_y) =D_{\text{KL}}(Q_x, P_x)\) under a change of random variable from \(x\) to \(y\). Hence, KL divergence does not change when the sample space is reparametrised.
This is a remarkable property5 6 as it holds not just for discrete but also for continuous random variables. For comparison, recall that neither differential entropy nor cross-entropy for continuous random variables is transformation invariant.
In the definition of KL divergence the expectation is taken over the ratio of two densities. The invariance arises because under a variable transformation the Jacobian determinant changes both densities in the same way and thus cancels out. For more details and a more formal proof of the invariance property see Worksheet E1.
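The invariance can be illustrated numerically. The sketch below (with arbitrarily chosen parameters) compares the closed-form KL divergence between two normals with a Monte Carlo estimate of the KL divergence between the corresponding log-normal distributions obtained via the invertible transformation \(y = \exp(x)\):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Q_x = N(mu1, s1^2) and P_x = N(mu2, s2^2); y = exp(x) yields two log-normals
mu1, s1 = 0.0, 1.0
mu2, s2 = 0.5, 1.5

# Closed-form KL divergence between the two normals (see Example 4.8)
kl_x = 0.5 * ((mu2 - mu1) ** 2 / s2**2 + s1**2 / s2**2 - np.log(s1**2 / s2**2) - 1)

# Monte Carlo estimate of the KL divergence between the transformed variables
x = rng.normal(mu1, s1, size=1_000_000)
y = np.exp(x)
log_q_y = stats.lognorm.logpdf(y, s=s1, scale=np.exp(mu1))  # density of Q_y
log_p_y = stats.lognorm.logpdf(y, s=s2, scale=np.exp(mu2))  # density of P_y
kl_y = np.mean(log_q_y - log_p_y)

print(kl_x, kl_y)  # equal up to Monte Carlo error: invariance under y = exp(x)
```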
More broadly, the KL divergence satisfies the data processing inequality7 i.e. applying a stochastic or deterministic transformation to the underlying random variables cannot increase the KL divergence \(D_{\text{KL}}(Q,P)\) between \(Q\) and \(P\). Thus, by processing data you cannot increase information about which distribution generated the data.
Coordinate transformations can be viewed as a special case of data processing, and for \(D_{\text{KL}}(Q,P)\) the data-processing inequality under general invertible transformations becomes an identity.
Further properties
The KL divergence and cross-entropy inherit a number of further useful properties from proper scoring rules. We will not cover these in this text. For example, there are various decompositions for the risk and the divergence satisfies a generalised Pythagorean theorem.
In summary, KL divergence stands out among divergences between distributions due to many valuable and in some cases unique properties. It is therefore not surprising that it plays a central role in statistics and machine learning.
4.3 KL divergence examples
Models with a single parameter
Example 4.4 KL divergence between two Bernoulli distributions \(\text{Ber}(\theta_1)\) and \(\text{Ber}(\theta_2)\):
The “success” probabilities for the two distributions are \(\theta_1\) and \(\theta_2\), respectively, and the complementary “failure” probabilities are \(1-\theta_1\) and \(1-\theta_2\). With this we get for the KL divergence \[ D_{\text{KL}}(\text{Ber}(\theta_1), \text{Ber}(\theta_2))=\theta_1 \log\left( \frac{\theta_1}{\theta_2}\right) + (1-\theta_1) \log\left(\frac{1-\theta_1}{1-\theta_2}\right) \]
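A direct implementation of this formula is straightforward; the sketch below (function name and parameter values are illustrative) assumes both success probabilities lie strictly between 0 and 1:

```python
import numpy as np

def kl_bernoulli(theta1, theta2):
    """D_KL(Ber(theta1), Ber(theta2)) in nats; assumes 0 < theta1, theta2 < 1."""
    return (theta1 * np.log(theta1 / theta2)
            + (1 - theta1) * np.log((1 - theta1) / (1 - theta2)))

print(kl_bernoulli(0.3, 0.5))  # divergence of the model Ber(0.5) from Ber(0.3)
print(kl_bernoulli(0.5, 0.3))  # differs: the KL divergence is not symmetric
```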
Example 4.5 KL divergence between two univariate normals with different means and common variance:
Assume \(F_{\text{ref}}=N(\mu_{\text{ref}},\sigma^2)\) and \(F=N(\mu,\sigma^2)\). Setting \(\sigma^2_{\text{ref}} = \sigma^2\) in the more general case of the KL divergence between two normals (Example 4.8) yields \[D_{\text{KL}}(F_{\text{ref}}, F )=\frac{1}{2\sigma^2} (\mu-\mu_{\text{ref}})^2 \] which, apart from a scale factor, is the squared Euclidean distance or squared loss between the two means \(\mu_{\text{ref}}\) and \(\mu\). Note that in this case the KL divergence is symmetric with regard to the two mean parameters.
Example 4.6 KL divergence between two univariate normals with common mean and different variances:
Assume \(F_{\text{ref}}=N(\mu,\sigma^2_{\text{ref}})\) and \(F=N(\mu,\sigma^2)\). Setting \(\mu_{\text{ref}} = \mu\) in the more general case of the KL divergence between two normals (Example 4.8) yields
\[ D_{\text{KL}}(F_{\text{ref}},F) = \frac{1}{2} \left( \frac{\sigma_{\text{ref}}^2}{\sigma^2} -\log\left(\frac{\sigma_{\text{ref}}^2}{\sigma^2}\right)-1 \right) \] This is a convex function of the ratio \(\sigma_{\text{ref}}^2/\sigma^2\) of the two variances. Apart from the scale factor this is known as Stein’s loss between the two variances \(\sigma_{\text{ref}}^2\) and \(\sigma^2\).
Models with multiple parameters
Example 4.7 KL divergence between two categorical distributions with \(K\) classes:
With \(Q=\text{Cat}(\boldsymbol q)\) and \(P=\text{Cat}(\boldsymbol p)\) and corresponding probabilities \(q_1,\dots,q_K\) and \(p_1,\dots,p_K\) satisfying \(\sum_{i=1}^K q_i = 1\) and \(\sum_{i=1}^K p_i =1\) we get:
\[\begin{equation*} D_{\text{KL}}(Q, P)=\sum_{i=1}^K q_i\log\left(\frac{q_i}{p_i}\right) \end{equation*}\]
To be explicit that there are only \(K-1\) parameters in a categorical distribution we can also write \[\begin{equation*} D_{\text{KL}}(Q, P)=\sum_{i=1}^{K-1} q_i\log\left(\frac{q_i}{p_i}\right) + q_K\log\left(\frac{q_K}{p_K}\right) \end{equation*}\] with \(q_K=\left(1- \sum_{i=1}^{K-1} q_i\right)\) and \(p_K=\left(1- \sum_{i=1}^{K-1} p_i\right)\).
Example 4.8 KL divergence between two univariate normals with different means and variances:
Assume \(F_{\text{ref}}=N(\mu_{\text{ref}},\sigma^2_{\text{ref}})\) and \(F=N(\mu,\sigma^2)\). Then with Example 3.10 (entropy of normal) and Example 4.1 (cross-entropy between two normals) we get \[ \begin{split} D_{\text{KL}}(F_{\text{ref}},F) &= H(F_{\text{ref}}, F) - H(F_{\text{ref}}) \\ &= \frac{1}{2} \left( \frac{(\mu-\mu_{\text{ref}})^2}{\sigma^2} + \frac{\sigma_{\text{ref}}^2}{\sigma^2} -\log\left(\frac{\sigma_{\text{ref}}^2}{\sigma^2}\right)-1 \right) \\ \end{split} \]
For the two special cases of equal variances (\(\sigma^2_{\text{ref}} = \sigma^2\)) and equal means (\(\mu_{\text{ref}} = \mu\)) see Example 4.5 and Example 4.6.
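The general formula and its two special cases can be checked numerically. The sketch below (function name and parameter values are illustrative choices) implements the expression from Example 4.8 and compares it with the reduced forms of Example 4.5 and Example 4.6:

```python
import numpy as np

def kl_normal(mu_ref, s2_ref, mu, s2):
    """D_KL(N(mu_ref, s2_ref), N(mu, s2)) in nats, as in Example 4.8."""
    return 0.5 * ((mu - mu_ref) ** 2 / s2 + s2_ref / s2 - np.log(s2_ref / s2) - 1)

# Equal variances (Example 4.5): reduces to the scaled squared distance of the means
print(kl_normal(1.0, 2.0, 3.0, 2.0), (3.0 - 1.0) ** 2 / (2 * 2.0))

# Equal means (Example 4.6): reduces to a function of the variance ratio (Stein's loss)
r = 2.0 / 0.5
print(kl_normal(0.0, 2.0, 0.0, 0.5), 0.5 * (r - np.log(r) - 1))
```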
Example 4.9 \(\color{Red} \blacktriangleright\) Boltzmann relative entropy as log-probability:
Assume \(\hat{Q}\) is an empirical categorical distribution based on observed counts \(n_k\) (see Example 3.11) and \(P\) is a second categorical distribution.
The Boltzmann relative entropy is then \[ \begin{split} B(\hat{Q}, P) & = H(\hat{Q}) -H(\hat{Q}, P) \\ & = H(\hat{Q}) + \sum_{i=1}^K \log ( p_i) \, \hat{q}_i \\ & = H(\hat{Q}) + \frac{1}{n} \sum_{i=1}^K n_i \log p_i \\ \end{split} \]
For large \(n\) we may use the multinomial coefficient \(W = \binom{n}{n_1, \ldots, n_K}\) to obtain the entropy of \(\hat{Q}\) (see Example 3.11). This results in \[ \begin{split} B(\hat{Q}, P) &\approx \frac{1}{n} \left( \log W + \sum_{i=1}^K n_i \log p_i \right)\\ & = \frac{1}{n} \log \left( W \times \prod_{i=1}^K p_i^{n_i} \right)\\ & = \frac{1}{n} \log \text{Pr}(n_1, \ldots, n_K| \,\boldsymbol p) \\ \end{split} \] Hence the Boltzmann relative entropy is directly linked to the multinomial probability of the observed counts \(n_1, \ldots, n_K\) under the model \(P\). This derivation of the Boltzmann relative entropy as log-probability of a macrostate is due to Boltzmann (1878). See also Akaike (1985) for a historical account.
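This link can be verified numerically. The sketch below (with hypothetical counts and model probabilities chosen for illustration) compares the exact Boltzmann relative entropy with the scaled log multinomial probability computed via `scipy.stats.multinomial`:

```python
import numpy as np
from scipy.stats import multinomial

# Hypothetical observed counts n_1, ..., n_K and a model P = Cat(p)
counts = np.array([30, 50, 20])
n = counts.sum()
p = np.array([0.25, 0.5, 0.25])

# Boltzmann relative entropy B(Q_hat, P) = H(Q_hat) - H(Q_hat, P)
q_hat = counts / n
B = -np.sum(q_hat * np.log(q_hat)) + np.sum(q_hat * np.log(p))

# Approximation via the scaled log multinomial probability of the observed counts
B_approx = multinomial.logpmf(counts, n=n, p=p) / n

print(B, B_approx)  # close for large n (Stirling approximation of log W)
```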
Boltzmann relative entropy was first discovered by Boltzmann (1878)8 in physics in a discrete setting in the context of statistical mechanics (see Example 4.9). In statistics and information theory KL divergence was formally introduced by Kullback and Leibler (1951)9. Good (1979)10 credits Turing with the first statistical application in 1940/1941 in the field of cryptography.
For other scoring rules it is called the risk of \(P\) under the true model \(Q\).↩︎
This follows the current accepted usage. However, in some (typically older) literature the term cross-entropy may refer instead to the KL divergence.↩︎
For any proper scoring rule \(S(x, P)\) the associated divergence or discrepancy is defined in the same fashion: \(D(Q,P) = \text{E}_Q \left( S(x, P) \right) - \text{E}_Q \left( S(x, Q) \right) \geq 0\), i.e. the difference between the risk and the minimum risk. Such divergences closely correspond to Bregman divergences.↩︎
Note that divergence between distributions is unrelated to the divergence vector operator from vector calculus.↩︎
Even more striking, this property only holds for the KL divergence, not for any other divergence induced by a proper scoring rule, making the KL divergence the sole invariant Bregman divergence.↩︎
Furthermore, among \(f\)-divergences (of which the KL divergence is a principal example) the KL divergence is the only one that is invariant under coordinate transformations.↩︎
The data processing inequality also holds for all \(f\)-divergences but is notably not satisfied by divergences of other proper scoring rules (and thus other Bregman divergences).↩︎
Boltzmann, L. 1878. Weitere Bemerkungen über einige Probleme der mechanischen Wärmetheorie. Wien Ber. 78:7–46. https://doi.org/10.1017/CBO9781139381437.013↩︎
Kullback, S., and R. A. Leibler. 1951. On information and sufficiency. Ann. Math. Statist. 22:79–86. https://doi.org/10.1214/aoms/1177729694↩︎
Good, I. J. 1979. Studies in the history of probability. XXXVII. A. M. Turing’s statistical work in world war II. Biometrika, 66:393–396. https://doi.org/10.1093/biomet/66.2.393↩︎