4 Divergence
This chapter introduces entropy measures involving two distributions: the mean log-loss (or cross-entropy), the Kullback-Leibler (KL) divergence, and the Boltzmann entropy (or relative entropy).
4.1 Scoring rules and risk
Mean log-loss or cross-entropy
The risk of \(P\) under \(Q\) associated with a scoring rule \(S(x, P)\) is the expected loss \[ R_Q(P) =\text{E}_Q\left( S(x, P) \right) \] For the logarithmic scoring rule the risk is the mean log-loss or cross-entropy1 \[ R_Q(P) = - \text{E}_Q \log p(x) = H(Q,P). \] The risk \(R_Q(P)\) is thus a functional of two distributions \(Q\) and \(P\), where \(Q\) represents the data-generating process (note the expectation \(\text{E}_Q\) with regard to \(Q\)) and the distribution \(P\) is the model that is evaluated on the observations.
For two discrete distributions \(Q\) and \(P\) with pmfs \(q(x)\) and \(p(x)\), \(x\in \Omega\), the mean log-loss is computed as the weighted sum \[ H(Q, P) = - \sum_{x \in \Omega} \log p(x) \, q(x) \] It can be interpreted as the expected cost or expected code length when the data are generated according to model \(Q\) (“sender”, “encoder”) but we use model \(P\) to describe the data (“receiver”, “decoder”).
For two continuous distributions \(Q\) and \(P\) with pdfs \(q(x)\) and \(p(x)\) we compute the integral \[ H(Q, P) =- \int_x \log p(x)\, q(x) \, dx \]
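For illustration, here is a minimal numerical sketch (not part of the original notes) in Python using numpy, evaluating the discrete mean log-loss for two small pmfs; the particular probability values are arbitrary:

```python
import numpy as np

def cross_entropy(q, p):
    """Mean log-loss H(Q, P) = -sum_x q(x) log p(x), in nats."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return -np.sum(q * np.log(p))

# data-generating pmf q and model pmf p over the same three outcomes
q = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])

print(cross_entropy(q, p))  # H(Q, P)
print(cross_entropy(q, q))  # equals the entropy H(Q) when P = Q
```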
Properties of the mean log-loss
The mean log-loss is not symmetric with regard to \(Q\) and \(P\), because the expectation is taken with reference to \(Q\).
If both distributions \(Q\) and \(P\) are identical, the mean log-loss reduces to entropy, i.e. \(H(Q, Q) = H(Q)\).
Like differential entropy, the mean log-loss for continuous random variables changes under a change of variables, say from \(x\) to \(y\), hence in general \(H(Q_y, P_y) \neq H(Q_x, P_x)\).
Gibbs’ inequality
A crucial further property of the mean log-loss \(H(Q, P)\) is that it is bounded below by the entropy of \(Q\), therefore \[ H(Q, P) \geq H(Q) \] with equality only if \(P=Q\). This is known as Gibbs’ inequality.
This lower bound can be established2 using Jensen’s inequality for convex functions (recall that \(-\log(x)\) is strictly convex).
As a consequence, when data are generated (encoded) under model \(Q\) and described (decoded) using model \(P\) there is always an extra cost, or penalty, to employ the approximating model \(P\) rather than the correct model \(Q\).
More generally, proper scoring rules (see Note 3.1) satisfy the inequality \(R_Q(P) \geq R_Q(Q)\). For strictly proper scoring rules the minimum is achieved uniquely at the true distribution \(Q\) (see Note 4.1).
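As a simple numerical check of Gibbs’ inequality (a sketch, not part of the original notes, assuming numpy), we can fix a pmf \(Q\), draw many candidate models \(P\) at random and verify that the mean log-loss never falls below the entropy \(H(Q)\):

```python
import numpy as np

rng = np.random.default_rng(2)
q = np.array([0.5, 0.3, 0.2])          # data-generating pmf Q
H_Q = -np.sum(q * np.log(q))           # entropy H(Q)

# draw many candidate model pmfs P from a Dirichlet distribution
P = rng.dirichlet(np.ones(3), size=10_000)
H_QP = -np.sum(q * np.log(P), axis=1)  # mean log-loss H(Q, P) for each candidate

print(np.all(H_QP >= H_Q))   # True: Gibbs' inequality H(Q, P) >= H(Q)
print(H_QP.min() - H_Q)      # small and non-negative; zero only at P = Q
```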
Examples
Example 4.1 Mean log-loss for two normal distributions:
Assume \(P_1=N(\mu_1,\sigma^2_1)\) and \(P_2=N(\mu_2,\sigma^2_2)\). The mean log-loss is \[ \begin{split} H(P_1, P_2) &= -\text{E}_{P_1} \left( \log p(x |\mu_2, \sigma^2_2) \right)\\ &= \frac{1}{2} \text{E}_{P_1} \left( \log(2\pi\sigma^2_2) + \frac{(x-\mu_2)^2}{\sigma^2_2} \right) \\ &= \frac{1}{2} \left( \frac{(\mu_1 - \mu_2)^2}{ \sigma^2_2 } +\frac{\sigma^2_1}{\sigma^2_2} +\log(2 \pi \sigma^2_2) \right) \\ \end{split} \] using \(\text{E}_{P_1} ((x-\mu_2)^2) = (\mu_1-\mu_2)^2 + \sigma^2_1\).
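The closed form above can be checked by simulation. The following sketch (not part of the original notes, using numpy and scipy with arbitrary parameter values) compares the formula with a Monte Carlo estimate of \(-\text{E}_{P_1} \log p(x |\mu_2, \sigma^2_2)\):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0   # illustrative parameter values

# closed form from Example 4.1
closed_form = 0.5 * ((mu1 - mu2)**2 / s2**2 + s1**2 / s2**2 + np.log(2 * np.pi * s2**2))

# Monte Carlo estimate of -E_{P1} log p(x | mu2, s2^2)
x = rng.normal(mu1, s1, size=1_000_000)
mc_estimate = -np.mean(stats.norm.logpdf(x, loc=mu2, scale=s2))

print(closed_form, mc_estimate)   # the two values should agree closely
```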
Example 4.2 Normal mean log-loss reduces to normal differential entropy:
The mean log-loss for two normal distributions \(P_1\) and \(P_2\) is (Example 4.1) \[ H(P_1,P_2) = \frac{1}{2} \left( \frac{(\mu_1 - \mu_2)^2}{ \sigma^2_2 } +\frac{\sigma^2_1}{\sigma^2_2} +\log(2 \pi \sigma^2_2) \right) \] Setting \(P_1=P_2 = P\) with \(\mu_1 = \mu_2 = \mu\) and \(\sigma^2_1 = \sigma^2_2 = \sigma^2\) the mean log-loss degenerates to \[ H(P) = \frac{1}{2} \left(\log( 2 \pi \sigma^2) +1 \right) \] which is the normal differential entropy (Example 3.11).
The mean log-loss \(H(Q, P)\) is the risk associated with the logarithmic scoring rule \(S(x, P) = -\log p(x)\). Gibbs’ inequality states that the mean log-loss \(H(Q, P)\) is uniquely minimised at \(P=Q\), with the minimum risk being the entropy \(H(Q)\) of the data-generating model \(Q\). Therefore, the logarithmic scoring rule is strictly proper (see Note 3.1).
4.2 KL divergence and Boltzmann entropy
Definition of KL divergence
The divergence3 between \(Q\) and \(P\) associated with a proper scoring rule \(S(x, P)\) is defined as the difference between the risk and the minimum risk4: \[ D(Q,P) = R_Q(P) - R_Q(Q) \geq 0 \]
For the logarithmic scoring rule this yields the Kullback-Leibler (KL) divergence between \(Q\) and \(P\): \[ \begin{split} D_{\text{KL}}(Q,P) &= H(Q, P)-H(Q) \\ & = \text{E}_Q\log\left(\frac{q(x)}{p(x)}\right)\\ \end{split} \] The KL divergence is always non-negative, \(D_{\text{KL}}(Q, P) \geq 0\), as a consequence of Gibbs’ inequality.
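Numerically, the identity \(D_{\text{KL}}(Q,P) = H(Q, P)-H(Q)\) is easy to verify for discrete distributions. The sketch below (not part of the original notes; it uses numpy and scipy.stats.entropy, which returns the KL divergence when given two pmfs) illustrates this:

```python
import numpy as np
from scipy.stats import entropy

q = np.array([0.5, 0.3, 0.2])   # data-generating pmf
p = np.array([0.4, 0.4, 0.2])   # model pmf

H_QP = -np.sum(q * np.log(p))   # mean log-loss H(Q, P)
H_Q  = -np.sum(q * np.log(q))   # entropy H(Q)

kl_direct = np.sum(q * np.log(q / p))   # E_Q log(q/p)

print(H_QP - H_Q, kl_direct)    # both give D_KL(Q, P) >= 0
print(entropy(q, p))            # scipy's built-in KL divergence agrees
```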
The KL divergence \(D_{\text{KL}}(Q, P)\) can be interpreted as the additional risk or cost incurred if model \(P\) is used instead of \(Q\) to describe data that actually follow \(Q\). If \(Q\) and \(P\) are identical there is no extra cost and \(D_{\text{KL}}(Q,P)=0\). However, if they differ then there is an additional cost and \(D_{\text{KL}}(Q,P)> 0\).
The use of the term “divergence” rather than “distance” is a reminder that \(Q\) and \(P\) are not interchangeable in \(D_{\text{KL}}(Q, P)\).
The KL divergence is also known as KL information, KL information number or discrimination information. Further names are information divergence or \(\boldsymbol I\)-divergence. Some authors refer to twice the KL divergence, \(2 D_{\text{KL}}(Q, P)\), as the deviance of \(P\) from \(Q\).
Various notations for KL divergence exist; we use \(D_{\text{KL}}(Q, P)\), but \(\text{KL}(Q || P)\) and \(I^{KL}(Q; P)\) are also common.
Boltzmann entropy or relative entropy
The Boltzmann entropy of a distribution \(Q\) relative to a distribution \(P\) is given by \[ \begin{split} B(Q, P) &= H(Q) - H(Q, P) \\ &= -\text{E}_Q\log\left(\frac{q(x)}{p(x)}\right) \\ \end{split} \]
Boltzmann entropy is the negative of the KL divergence, and hence it is always non-positive, \(B(Q, P) \leq 0\).
Boltzmann entropy can be interpreted as relative entropy, containing entropy as a special case without a change of sign (Example 4.8).
Furthermore, Example 4.9 shows that \(B(Q, P)\) can be interpreted as a log-probability and that it is closely linked with the log-likelihood.
The terminology to refer to the negative KL divergence as Boltzmann entropy and as relative entropy follows Akaike (1985) and is in line with the conventions in statistical physics, where this quantity is also known as relative Boltzmann-Gibbs entropy5 (see Note 4.2).
There are inconsistent and ambiguous uses of the terms “cross-entropy” and “relative entropy” in the literature.
In these notes, “cross-entropy” denotes the mean log-loss and “relative entropy” refers to Boltzmann entropy.
Our use of “cross-entropy” follows common modern usage. However, many statistical and engineering sources use the term to mean KL divergence or relative entropy.
The term “relative entropy” is also ambiguous. We follow the convention in statistical mechanics and use “relative entropy” for the negative of the KL divergence, denoted here as Boltzmann entropy, since it has the correct orientation for maximisation principles and also contains entropy as a special case without a change of sign. However, many authors, particularly in computer science, equate relative entropy directly with KL divergence. Shannon (1948) used relative entropy to denote entropy standardised by its maximum value.
Properties of KL divergence and Boltzmann entropy
Boltzmann entropy and KL divergence differ only by sign and thus share a number of key properties inherited from mean log-loss:
\(D_{\text{KL}}(Q, P) \neq D_{\text{KL}}(P, Q)\), i.e. the KL divergence is not symmetric, \(Q\) and \(P\) cannot be interchanged. This follows from the same property of the mean log-loss.
\(D_{\text{KL}}(Q, P)\geq 0\). This follows from Gibbs’ inequality, which in turn is proved via Jensen’s inequality.
\(D_{\text{KL}}(Q, P) = 0\) if and only if \(P=Q\), i.e., the KL divergence is zero if and only if \(Q\) and \(P\) are identical. Also follows from Gibbs’ inequality.
Typically, we will minimise quantities related to risk:
- KL divergence and mean log-loss
and, conversely, maximise quantities related to entropy:
- Boltzmann entropy (and log-likelihood) and Shannon-Gibbs entropy.
Invariance and data processing properties
A further crucial property of KL divergence is its invariance property:
- \(D_{\text{KL}}(Q, P)\) is invariant under general invertible variable transformations, so that \(D_{\text{KL}}(Q_y, P_y) =D_{\text{KL}}(Q_x, P_x)\) under a change of random variable from \(x\) to \(y\). Hence, KL divergence does not change when the sample space is reparametrised.6
This is a remarkable property7 8 as it holds not just for discrete but also for continuous random variables. For comparison, recall that for continuous random variables neither differential entropy nor the mean log-loss is transformation invariant.
In the definition of KL divergence the expectation is taken over the log-ratio of two densities. The invariance arises because, under a variable transformation, the Jacobian determinant enters both densities in the same way and therefore cancels in the ratio.
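The following sketch (not part of the original notes; it uses numpy and scipy with arbitrary parameter values) illustrates the invariance: the closed-form KL divergence between two normals is compared with a Monte Carlo estimate of the KL divergence between the corresponding log-normal distributions obtained via the invertible transformation \(y = \exp(x)\):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu1, s1, mu2, s2 = 0.0, 1.0, 0.5, 1.5   # illustrative parameters

# closed-form KL divergence between N(mu1, s1^2) and N(mu2, s2^2) (Example 4.7)
kl_normal = 0.5 * ((mu1 - mu2)**2 / s2**2 + s1**2 / s2**2 - np.log(s1**2 / s2**2) - 1)

# transform x -> y = exp(x): the normals become log-normal distributions
q_y = stats.lognorm(s=s1, scale=np.exp(mu1))
p_y = stats.lognorm(s=s2, scale=np.exp(mu2))

# Monte Carlo estimate of D_KL(Q_y, P_y) = E_{Q_y} log(q(y)/p(y))
y = q_y.rvs(size=1_000_000, random_state=rng)
kl_lognormal = np.mean(q_y.logpdf(y) - p_y.logpdf(y))

print(kl_normal, kl_lognormal)   # invariance: the two values should agree closely
```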
More broadly, the KL divergence satisfies the data processing inequality9, i.e. applying a stochastic or deterministic transformation to the underlying random variables cannot increase the KL divergence \(D_{\text{KL}}(Q,P)\) between \(Q\) and \(P\). Thus, by processing data you cannot increase information about which distribution generated the data.
Coordinate transformations can be viewed as a special case of data processing, and for \(D_{\text{KL}}(Q,P)\) the data-processing inequality under general invertible transformations becomes an identity.
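As a small numerical illustration of the data processing inequality (again a sketch, not part of the original notes, assuming numpy and scipy), consider deterministically coarse-graining a categorical sample space by merging categories; the KL divergence can only decrease:

```python
import numpy as np
from scipy.stats import entropy

q = np.array([0.40, 0.15, 0.30, 0.15])   # data-generating pmf
p = np.array([0.25, 0.25, 0.25, 0.25])   # model pmf

# deterministic processing: merge categories {1, 2} and {3, 4}
q_coarse = np.array([q[0] + q[1], q[2] + q[3]])
p_coarse = np.array([p[0] + p[1], p[2] + p[3]])

print(entropy(q, p))                 # D_KL(Q, P) on the original sample space
print(entropy(q_coarse, p_coarse))   # smaller: processing cannot increase the divergence
```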
Further properties
The KL divergence and mean log-loss inherit a number of further useful properties from proper scoring rules. We will not cover these in this text. For example, there are various decompositions for the risk and the divergence satisfies a generalised Pythagorean theorem.
In summary, KL divergence stands out among divergences between distributions due to many valuable and in some cases unique properties. It is therefore not surprising that it plays a central role in statistics and machine learning.
4.3 KL divergence examples
Models with a single parameter
Example 4.3 KL divergence between two Bernoulli distributions \(B_1 = \text{Ber}(\theta_1)\) and \(B_2 = \text{Ber}(\theta_2)\):
The “success” probabilities for the two distributions are \(\theta_1\) and \(\theta_2\), respectively, and the complementary “failure” probabilities are \(1-\theta_1\) and \(1-\theta_2\). With this we get for the KL divergence \[ D_{\text{KL}}(B_1, B_2)=\theta_1 \log\left( \frac{\theta_1}{\theta_2}\right) + (1-\theta_1) \log\left(\frac{1-\theta_1}{1-\theta_2}\right) \]
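A direct implementation of this formula (a sketch, not part of the original notes, assuming numpy) also makes the asymmetry of the KL divergence visible:

```python
import numpy as np

def kl_bernoulli(theta1, theta2):
    """D_KL between Ber(theta1) and Ber(theta2), in nats."""
    return (theta1 * np.log(theta1 / theta2)
            + (1 - theta1) * np.log((1 - theta1) / (1 - theta2)))

print(kl_bernoulli(0.3, 0.5))   # > 0 since the distributions differ
print(kl_bernoulli(0.5, 0.3))   # different value: the KL divergence is not symmetric
print(kl_bernoulli(0.3, 0.3))   # 0 when the distributions coincide
```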
Example 4.4 KL divergence between two univariate normals with different means and common variance:
Assume \(P_1=N(\mu_1,\sigma^2)\) and \(P_2=N(\mu_2,\sigma^2)\). Setting \(\sigma^2_1 =\sigma^2_2 = \sigma^2\) in the more general case of the KL divergence between two normals (Example 4.7) yields \[D_{\text{KL}}(P_1, P_2 )=\frac{1}{2\sigma^2} (\mu_1-\mu_2)^2 \] which, apart from a scale factor, is the squared Euclidean distance or squared loss between the two means \(\mu_1\) and \(\mu_2\). Note that in this case the KL divergence is symmetric with regard to the two mean parameters.
Example 4.5 KL divergence between two univariate normals with common mean and different variances:
Assume \(P_1=N(\mu,\sigma^2_1)\) and \(P_2=N(\mu,\sigma^2_2)\). Setting \(\mu_1 = \mu_2 = \mu\) in the more general case of the KL divergence between two normals (Example 4.7) yields
\[ D_{\text{KL}}(P_1,P_2) = \frac{1}{2} \left( \frac{\sigma^2_1}{\sigma^2_2} -\log\left(\frac{\sigma^2_1}{\sigma^2_2}\right)-1 \right) \] This is a convex function of the ratio \(\sigma_1^2/\sigma^2_2\) of the two variances. Apart from the scale factor this is known as Stein’s loss between the two variances \(\sigma^2_1\) and \(\sigma^2_2\).
Models with multiple parameters
Example 4.6 KL divergence between two categorical distributions with \(K\) classes:
With \(Q=\text{Cat}(\boldsymbol q)\) and \(P=\text{Cat}(\boldsymbol p)\) and corresponding probabilities \(q_1,\dots,q_K\) and \(p_1,\dots,p_K\) satisfying \(\sum_{i=1}^K q_i = 1\) and \(\sum_{i=1}^K p_i =1\) we get:
\[\begin{equation*} D_{\text{KL}}(Q, P)=\sum_{i=1}^K q_i\log\left(\frac{q_i}{p_i}\right) \end{equation*}\]
To be explicit that there are only \(K-1\) parameters in a categorical distribution we can also write \[\begin{equation*} D_{\text{KL}}(Q, P)=\sum_{i=1}^{K-1} q_i\log\left(\frac{q_i}{p_i}\right) + q_K\log\left(\frac{q_K}{p_K}\right) \end{equation*}\] with \(q_K=\left(1- \sum_{i=1}^{K-1} q_i\right)\) and \(p_K=\left(1- \sum_{i=1}^{K-1} p_i\right)\).
Example 4.7 KL divergence between two univariate normals with different means and variances:
Assume \(P_1=N(\mu_1,\sigma^2_1)\) and \(P_2=N(\mu_2,\sigma^2_2)\). Then with Example 3.11 (entropy of normal) and Example 4.1 (mean log-loss between two normals) we get \[ \begin{split} D_{\text{KL}}(P_1,P_2) &= H(P_1, P_2) - H(P_1) \\ &= \frac{1}{2} \left( \frac{(\mu_1-\mu_2)^2}{\sigma^2_2} + \frac{\sigma^2_1}{\sigma^2_2} -\log\left(\frac{\sigma^2_1}{\sigma^2_2}\right)-1 \right) \\ \end{split} \]
For the two special cases of equal variances (\(\sigma^2_1 =\sigma^2_2 = \sigma^2\)) and equal means \((\mu_1 = \mu_2 = \mu\)) see Example 4.4 and Example 4.5.
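The general formula and its two special cases are easily checked numerically. The sketch below (not part of the original notes, with arbitrary parameter values) implements Example 4.7 and confirms that it reduces to the scaled squared distance of Example 4.4 and to Stein's loss of Example 4.5:

```python
import numpy as np

def kl_normal(mu1, var1, mu2, var2):
    """D_KL between N(mu1, var1) and N(mu2, var2) (Example 4.7)."""
    return 0.5 * ((mu1 - mu2)**2 / var2 + var1 / var2 - np.log(var1 / var2) - 1)

# equal variances (Example 4.4): reduces to the scaled squared distance of the means
print(kl_normal(0.0, 2.0, 1.5, 2.0), 1.5**2 / (2 * 2.0))

# equal means (Example 4.5): reduces to half of Stein's loss between the variances
r = 2.0 / 0.5   # variance ratio sigma1^2 / sigma2^2
print(kl_normal(1.0, 2.0, 1.0, 0.5), 0.5 * (r - np.log(r) - 1))
```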
Example 4.8 Boltzmann entropy reduces to entropy:
The mean log-loss \(H(Q, U)\) for true model \(Q\) and a uniform distribution \(U\) with constant pdmf is \[ H(Q, U) = \text{const.} \] and does not depend on \(Q\).
The Boltzmann entropy of \(Q\) relative to a uniform distribution \(U\) is therefore \[ B(Q, U) = H(Q) + \text{const.} \] Thus, up to a constant Boltzmann entropy is equal to the entropy of \(Q\), and crucially, both quantities share the same sign (and hence orientation).
Example 4.9 \(\color{Red} \blacktriangleright\) Boltzmann entropy as log-probability:
Assume \(\hat{Q}\) is an empirical categorical distribution based on observed counts \(n_k\) (see Example 3.12) and \(P\) is a second categorical distribution.
The Boltzmann entropy is then \[ \begin{split} B(\hat{Q}, P) & = H(\hat{Q}) -H(\hat{Q}, P) \\ & = H(\hat{Q}) + \sum_{i=1}^K \log ( p_i) \, \hat{q}_i \\ & = H(\hat{Q}) + \frac{1}{n} \sum_{i=1}^K n_i \log p_i \\ \end{split} \]
For large \(n\) we may approximate the entropy of \(\hat{Q}\) by \(\frac{1}{n} \log W\), where \(W = \binom{n}{n_1, \ldots, n_K}\) is the multinomial coefficient (see Example 3.12). This results in \[ \begin{split} B(\hat{Q}, P) &\approx \frac{1}{n} \left( \log W + \sum_{i=1}^K n_i \log p_i \right)\\ & = \frac{1}{n} \log \left( W \times \prod_{i=1}^K p_i^{n_i} \right)\\ & = \frac{1}{n} \log \text{Pr}(n_1, \ldots, n_K| \,\boldsymbol p) \\ \end{split} \] Hence the Boltzmann entropy is directly linked to the multinomial probability of the observed counts \(n_1, \ldots, n_K\) under the model \(P\). Note that, up to the factor \(1/n\), the Boltzmann entropy is essentially the multinomial log-likelihood.
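This link can be checked numerically. The following sketch (not part of the original notes; the counts and model probabilities are arbitrary, and scipy.stats.multinomial is assumed available) compares the exact Boltzmann entropy \(B(\hat{Q}, P) = -D_{\text{KL}}(\hat{Q}, P)\) with the scaled multinomial log-probability:

```python
import numpy as np
from scipy.stats import multinomial

counts = np.array([5000, 3000, 2000])   # observed counts n_1, ..., n_K
n = counts.sum()
p = np.array([0.45, 0.35, 0.20])        # model P = Cat(p)

# exact Boltzmann entropy B(Qhat, P) = H(Qhat) - H(Qhat, P) = -D_KL(Qhat, P)
q_hat = counts / n
B_exact = np.sum(q_hat * np.log(p / q_hat))

# large-n approximation: (1/n) log Pr(n_1, ..., n_K | p)
B_approx = multinomial.logpmf(counts, n=n, p=p) / n

print(B_exact, B_approx)   # both non-positive; they agree up to a term of order log(n)/n
```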
The concept of relative entropy and the insight that entropy is a log-probability (see Example 4.9) are due to Boltzmann; see Akaike (1985) for a historical account.
In statistics KL divergence was formally introduced by Kullback and Leibler (1951), even though Good (1979) credits Turing with an earlier application in 1940/1941 in the field of cryptography.
Dawid and Musio (2014) review entropy and divergence associated with general proper scoring rules.
There are multiple, inconsistent uses of the term “cross-entropy” in the literature, so we will mostly use “mean log-loss” for clarity (see Note 4.2).↩︎
For details see Worksheet E1.↩︎
Note that divergence between distributions is unrelated to the divergence vector operator from vector calculus.↩︎
Such divergences closely correspond to Bregman divergences.↩︎
For a proof of the invariance property see Worksheet E1.↩︎
Even more striking, this property only holds for the KL divergence, not for any other divergence induced by a proper scoring rule, making the KL divergence the sole invariant Bregman divergence.↩︎
Furthermore, the KL divergence is also an \(f\)-divergence (indeed a principal example of this class), and the invariance against coordinate transformations is shared by all \(f\)-divergences.↩︎
The data processing inequality also holds for all \(f\)-divergences but is notably not satisfied by divergences of other proper scoring rules (and thus other Bregman divergences).↩︎