4 Relative entropy
4.1 Cross-entropy
Definition of cross-entropy
If we modify the definition of entropy such that the expectation is taken with regard to a different distribution \(Q\) we arrive at the cross-entropy1 \[ \begin{split} H(Q, P) & =\text{E}_Q\left( S(x, P) \right)\\ & = -\text{E}_Q\left( \log p(x) \right)\\ \end{split} \] i.e. the expected logarithmic score when the data are generated by \(Q\) and model \(P\) is evaluated on the observations. Thus, cross-entropy is a functional of two distributions \(Q\) and \(P\).
For two discrete distributions \(Q\) and \(P\) with probability mass functions \(q(x)\) and \(p(x)\) with \(x\in \Omega\) the cross-entropy is computed as the weighted sum \[ H(Q, P) = - \sum_{x \in \Omega} \log p(x) \, q(x) \] It can be interpreted as the expected cost, or expected code length, when the data are generated according to model \(Q\) but we use model \(P\) to describe the data.
For two continuous distributions \(Q\) and \(P\) with densities \(q(x)\) and \(p(x)\) we compute the integral \[H(Q, P) =- \int_x \log p(x)\, q(x) \, dx\]
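A minimal numerical sketch of the discrete weighted sum above (in Python with NumPy; the function name cross_entropy is our own choice for illustration):

```python
import numpy as np

def cross_entropy(q, p):
    """Cross-entropy H(Q, P) = -sum_x log p(x) q(x) for two discrete
    distributions given as probability vectors over the same outcomes."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    nz = q > 0                       # outcomes with q(x) = 0 contribute nothing
    return -np.sum(np.log(p[nz]) * q[nz])

q = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])
print(cross_entropy(q, p))   # H(Q, P)
print(cross_entropy(q, q))   # H(Q, Q) = H(Q), the entropy of Q
```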
Note that
- Cross-entropy is not symmetric with regard to \(Q\) and \(P\), because the expectation is taken with reference to \(Q\).
- By construction if both distributions \(Q\) and \(P\) are identical cross-entropy reduces to entropy, i.e. \(H(Q, Q) = H(Q)\).
- Like entropy, cross-entropy changes under variable transformation for continuous random variables, say from \(x\) to \(y\), hence in general \(H(Q_y, P_y) \neq H(Q_x, P_x)\).
Example 4.1 Cross-entropy between two normals:
Assume \(F_{\text{ref}}=N(\mu_{\text{ref}},\sigma^2_{\text{ref}})\) and \(F=N(\mu,\sigma^2)\). The cross-entropy \(H(F_{\text{ref}}, F)\) is \[ \begin{split} H(F_{\text{ref}}, F) &= -\text{E}_{F_{\text{ref}}} \left( \log p(x |\mu, \sigma^2) \right)\\ &= \frac{1}{2} \text{E}_{F_{\text{ref}}} \left( \log(2\pi\sigma^2) + \frac{(x-\mu)^2}{\sigma^2} \right) \\ &= \frac{1}{2} \left( \frac{(\mu - \mu_{\text{ref}})^2}{ \sigma^2 } +\frac{\sigma^2_{\text{ref}}}{\sigma^2} +\log(2 \pi \sigma^2) \right) \\ \end{split} \] using \(\text{E}_{F_{\text{ref}}} ((x-\mu)^2) = (\mu_{\text{ref}}-\mu)^2 + \sigma^2_{\text{ref}}\).
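As a check, a short sketch (Python with NumPy/SciPy; the parameter values are arbitrary choices for illustration) comparing the closed-form expression above with a Monte Carlo estimate of \(-\text{E}_{F_{\text{ref}}} \left( \log p(x |\mu, \sigma^2) \right)\):

```python
import numpy as np
from scipy import stats

def cross_entropy_normal(mu_ref, s2_ref, mu, s2):
    """Closed-form cross-entropy H(F_ref, F) between two univariate normals."""
    return 0.5 * ((mu - mu_ref)**2 / s2 + s2_ref / s2 + np.log(2 * np.pi * s2))

mu_ref, s2_ref = 1.0, 2.0
mu, s2 = 0.0, 1.5

# Monte Carlo check: average the log score -log p(x) over draws from F_ref
rng = np.random.default_rng(1)
x = rng.normal(mu_ref, np.sqrt(s2_ref), size=1_000_000)
mc = -np.mean(stats.norm.logpdf(x, loc=mu, scale=np.sqrt(s2)))

print(cross_entropy_normal(mu_ref, s2_ref, mu, s2))  # exact value
print(mc)                                            # Monte Carlo estimate
```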
Gibbs’ inequality
A crucial property of the cross-entropy \(H(Q, P)\) is that it is bounded below by the entropy of \(Q\), therefore \[ H(Q, P) \geq H(Q) \] with equality only if \(Q=P\). This is known as Gibbs’ inequality.
This follows from Jensen’s inequality. For details see Worksheet E1.
Essentially this means that when data are generated under model \(Q\) and encoded with model \(P\) there is always an extra cost, or penalty, to use model \(P\) rather than the correct model \(Q\).
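This extra cost is easy to verify numerically. A small sketch (Python with NumPy; assuming randomly drawn probability vectors) that checks \(H(Q, P) \geq H(Q)\):

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(q):
    return -np.sum(q * np.log(q))

def cross_entropy(q, p):
    return -np.sum(q * np.log(p))

# draw random probability vectors Q and P and verify Gibbs' inequality
for _ in range(5):
    q = rng.dirichlet(np.ones(4))
    p = rng.dirichlet(np.ones(4))
    print(cross_entropy(q, p) >= entropy(q))   # True in every case
```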
Example 4.2 \(\color{Red} \blacktriangleright\) Scoring rules and Gibbs’ inequality:
The logarithmic scoring rule \(S(x, P) = -\log p(x)\) is called proper because the corresponding expected score, i.e. the cross-entropy \(H(Q, P)\), satisfies the Gibbs’ inequality, and thus the expected score is minimised for \(P=Q\).
Example 4.3 Entropy as lower bound of cross-entropy:
If \(\mu_{\text{ref}} = \mu\) and \(\sigma^2_{\text{ref}} = \sigma^2\) then the cross-entropy \(H(F_{\text{ref}},F)\) in Example 4.1 reduces to the differential entropy \(H(F_{\text{ref}}) = \frac{1}{2} \left(\log( 2 \pi \sigma^2_{\text{ref}}) +1 \right)\).
See also Example 3.11.
4.2 Boltzmann relative entropy and KL divergence
Boltzmann relative entropy
The Boltzmann relative entropy of a distribution \(Q\) relative to a distribution \(P\) is given by \[ \begin{split} B(Q, P) &= -\text{E}_Q\log\left(\frac{q(x)}{p(x)}\right) \\ &= H(Q) - H(Q, P) \\ \end{split} \]
As a consequence of Gibbs’ inequality \(B(Q, P) \leq 0\). In Example 4.8 it is shown that \(B(Q, P)\) can be interpreted as a log-probability.
By construction, the Boltzmann relative entropy of \(Q\) relative to a uniform distribution \(U\) with constant probability mass function / density is \[ B(Q, U) = H(Q) + \text{const.} \] i.e. the entropy of \(Q\) apart from a constant.
Definition of KL divergence
The KL divergence is defined as the negative of the Boltzmann relative entropy: \[ \begin{split} D_{\text{KL}}(Q,P) & = \text{E}_Q\log\left(\frac{q(x)}{p(x)}\right)\\ &= H(Q, P)-H(Q) \\ & = \text{E}_Q \left( S(x, P) - S(x, Q) \right) \\ \end{split} \] As a consequence of Gibbs’ inequality \(D_{\text{KL}}(Q, P) \geq 0\).
The KL divergence \(D_{\text{KL}}(Q,P)\) is the expected difference in logarithmic scores when the data are generated by \(Q\) and both models \(P\) and \(Q\) are evaluated on the observations. \(D_{\text{KL}}(Q, P)\) can be interpreted as the additional cost if \(P\) is used instead of \(Q\) to describe data from \(Q\). If \(Q\) and \(P\) are identical there is no extra cost and \(D_{\text{KL}}(Q,P)=0\). Conversely, if they are not identical then there is an additional cost and \(D_{\text{KL}}(Q,P)> 0\).
\(D_{\text{KL}}(Q,P)\) thus serves as a measure of the divergence2 of the two distributions \(P\) and \(Q\). The use of the term “divergence” rather than “distance” is a reminder that \(Q\) and \(P\) are not interchangeable in \(D_{\text{KL}}(Q, P)\).
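The score-difference form of the definition suggests a simple Monte Carlo estimator: sample from \(Q\) and average \(\log q(x) - \log p(x)\). A sketch (Python with NumPy/SciPy; the two normal distributions are arbitrary choices for illustration):

```python
import numpy as np
from scipy import stats

# Monte Carlo estimate of D_KL(Q, P) as the expected difference in
# logarithmic scores, E_Q( log q(x) - log p(x) ), for two normals.
Q = stats.norm(loc=0.0, scale=1.0)   # data-generating model Q
P = stats.norm(loc=1.0, scale=2.0)   # approximating model P

rng = np.random.default_rng(42)
x = Q.rvs(size=1_000_000, random_state=rng)
kl_mc = np.mean(Q.logpdf(x) - P.logpdf(x))
print(kl_mc)   # close to the exact value given by the formula in Example 4.7
```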
Properties of KL divergence and Boltzmann relative entropy
Boltzmann relative entropy and KL divergence differ only by sign and thus share a number of key properties inherited from cross-entropy:
- \(D_{\text{KL}}(Q, P) \neq D_{\text{KL}}(P, Q)\), i.e. the KL divergence is not symmetric: \(Q\) and \(P\) cannot be interchanged. This follows from the same property of cross-entropy.
- \(D_{\text{KL}}(Q, P)\geq 0\). This follows from Gibbs’ inequality, which in turn is proved via Jensen’s inequality.
- \(D_{\text{KL}}(Q, P) = 0\) if and only if \(P=Q\), i.e., the KL divergence is zero if and only if \(Q\) and \(P\) are identical. Also follows from Gibbs’ inequality.
For more details and proofs of properties 2 and 3 see Worksheet E1.
A further crucial property of KL divergence is the invariance property:
- \(D_{\text{KL}}(Q, P)\) is invariant under general variable transformations, with \(D_{\text{KL}}(Q_y, P_y) =D_{\text{KL}}(Q_x, P_x)\) under a change of variables from \(x\) to \(y\).
Thus, KL divergence does not change when the sample space is reparameterised.
In the definition of KL divergence the expectation is taken over a ratio of densities (or a ratio of probabilities for discrete random variables). This creates the invariance under variable transformation, as the Jacobian determinants that change both densities cancel out.
For more details and proof of the invariance property see Worksheet E1.
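This cancellation can be seen numerically. A sketch (Python with NumPy/SciPy) that transforms \(x\) to \(y = \exp(x)\), so that two normal distributions become the corresponding lognormal distributions, and shows that the log density ratio, and hence the KL divergence, is unchanged:

```python
import numpy as np
from scipy import stats

# For y = exp(x) the Jacobian 1/y enters both transformed densities and
# cancels in their ratio, so the log density ratio is unchanged.
Qx = stats.norm(loc=0.0, scale=1.0)
Px = stats.norm(loc=0.5, scale=1.5)

# distributions of y = exp(x) under Qx and Px are lognormal
Qy = stats.lognorm(s=1.0, scale=np.exp(0.0))
Py = stats.lognorm(s=1.5, scale=np.exp(0.5))

rng = np.random.default_rng(0)
x = Qx.rvs(size=100_000, random_state=rng)
y = np.exp(x)

ratio_x = Qx.logpdf(x) - Px.logpdf(x)
ratio_y = Qy.logpdf(y) - Py.logpdf(y)
print(np.allclose(ratio_x, ratio_y))   # True: pointwise identical
print(ratio_x.mean(), ratio_y.mean())  # identical Monte Carlo KL estimates
```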
Origin of Boltzmann relative entropy and KL divergence and naming conventions
Boltzmann relative entropy was first discovered by Boltzmann (1878)3 in physics in a discrete setting in the context of statistical mechanics (see Example 4.8). In statistics and information theory KL divergence was formally introduced by Kullback and Leibler (1951)4. Good (1979)5 credits Turing with the first statistical application in 1940/1941 in the field of cryptography.
The KL divergence is also known as KL information or KL information number, named after two of the original authors (Kullback and Leibler), who themselves referred to this quantity as discrimination information. Another common name is information divergence or, in short, \(\boldsymbol I\)-divergence. Some authors (e.g. Efron) call twice the KL divergence \(2 D_{\text{KL}}(Q, P) = D(Q, P)\) the deviance of \(P\) from \(Q\). In the more general context of scoring rules the divergence is also called discrepancy.
Furthermore, KL divergence is also commonly referred to as relative entropy. However, this usage is rather confusing as the Boltzmann relative entropy is defined with the opposite (negative) sign (and is usually maximised) whereas the KL divergence is positive (and is usually minimised). Moreover, Shannon (1948) defined relative entropy yet differently again, as the ratio of entropy relative to its maximum value.
Especially in older literature the KL divergence is referred to as “cross-entropy”. This use is discouraged to avoid confusion with the related but different definition of cross-entropy above.
There also exist various notations for KL divergence in the literature. Here we use \(D_{\text{KL}}(Q, P)\) but you will also often find \(\text{KL}(Q || P)\) and \(I^{KL}(Q; P)\).
In these notes we refer to either Boltzmann relative entropy or KL divergence and assign relative entropy a negative sign.
4.3 KL divergence examples
Models with a single parameter
Example 4.4 KL divergence between two Bernoulli distributions \(\text{Ber}(\theta_1)\) and \(\text{Ber}(\theta_2)\):
The “success” probabilities for the two distributions are \(\theta_1\) and \(\theta_2\), respectively, and the complementary “failure” probabilities are \(1-\theta_1\) and \(1-\theta_2\). With this we get for the KL divergence \[ D_{\text{KL}}(\text{Ber}(\theta_1), \text{Ber}(\theta_2))=\theta_1 \log\left( \frac{\theta_1}{\theta_2}\right) + (1-\theta_1) \log\left(\frac{1-\theta_1}{1-\theta_2}\right) \]
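A direct implementation of this formula (Python with NumPy; assuming \(0 < \theta_1, \theta_2 < 1\); the function name kl_bernoulli is ours):

```python
import numpy as np

def kl_bernoulli(theta1, theta2):
    """D_KL( Ber(theta1), Ber(theta2) ), assuming 0 < theta1, theta2 < 1."""
    return (theta1 * np.log(theta1 / theta2)
            + (1 - theta1) * np.log((1 - theta1) / (1 - theta2)))

print(kl_bernoulli(0.3, 0.5))   # D_KL(Ber(0.3), Ber(0.5))
print(kl_bernoulli(0.5, 0.3))   # different value: KL divergence is not symmetric
```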
Example 4.5 KL divergence between two univariate normals with different means and common variance:
Assume \(F_{\text{ref}}=N(\mu_{\text{ref}},\sigma^2)\) and \(F=N(\mu,\sigma^2)\).
Then we get \[D_{\text{KL}}(F_{\text{ref}}, F )=\frac{1}{2} \left(\frac{(\mu-\mu_{\text{ref}})^2}{\sigma^2}\right)\]
Thus, up to the scale factor \(\frac{1}{2\sigma^2}\), the squared Euclidean distance between the means appears as a special case of KL divergence. Note that in this case the KL divergence is symmetric.
Models with multiple parameters
Example 4.6 KL divergence between two categorical distributions with \(K\) classes:
With \(Q=\text{Cat}(\boldsymbol q)\) and \(P=\text{Cat}(\boldsymbol p)\) and corresponding probabilities \(q_1,\dots,q_K\) and \(p_1,\dots,p_K\) satisfying \(\sum_{i=1}^K q_i = 1\) and \(\sum_{i=1}^K p_i =1\) we get:
\[\begin{equation*} D_{\text{KL}}(Q, P)=\sum_{i=1}^K q_i\log\left(\frac{q_i}{p_i}\right) \end{equation*}\]
To be explicit that there are only \(K-1\) parameters in a categorical distribution we can also write \[\begin{equation*} D_{\text{KL}}(Q, P)=\sum_{i=1}^{K-1} q_i\log\left(\frac{q_i}{p_i}\right) + q_K\log\left(\frac{q_K}{p_K}\right) \end{equation*}\] with \(q_K=\left(1- \sum_{i=1}^{K-1} q_i\right)\) and \(p_K=\left(1- \sum_{i=1}^{K-1} p_i\right)\).
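A direct implementation of the weighted sum above (Python with NumPy; the function name kl_categorical is ours):

```python
import numpy as np

def kl_categorical(q, p):
    """D_KL( Cat(q), Cat(p) ) for probability vectors q and p."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    nz = q > 0                      # classes with q_i = 0 contribute nothing
    return np.sum(q[nz] * np.log(q[nz] / p[nz]))

q = [0.5, 0.3, 0.2]
p = [0.4, 0.4, 0.2]
print(kl_categorical(q, p))
print(kl_categorical(q, q))   # zero: D_KL(Q, Q) = 0
```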
Example 4.7 KL divergence between two univariate normals with different means and variances:
Assume \(F_{\text{ref}}=N(\mu_{\text{ref}},\sigma^2_{\text{ref}})\) and \(F=N(\mu,\sigma^2)\). Then \[ \begin{split} D_{\text{KL}}(F_{\text{ref}},F) &= H(F_{\text{ref}}, F) - H(F_{\text{ref}}) \\ &= \frac{1}{2} \left( \frac{(\mu-\mu_{\text{ref}})^2}{\sigma^2} + \frac{\sigma_{\text{ref}}^2}{\sigma^2} -\log\left(\frac{\sigma_{\text{ref}}^2}{\sigma^2}\right)-1 \right) \\ \end{split} \]
If the variances are equal we recover the previous Example 4.5 as a special case.
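A sketch of the closed-form expression (Python with NumPy; the function name kl_normal and the parameter values are ours), including the equal-variance special case of Example 4.5:

```python
import numpy as np

def kl_normal(mu_ref, s2_ref, mu, s2):
    """D_KL( N(mu_ref, s2_ref), N(mu, s2) ) for univariate normals."""
    return 0.5 * ((mu - mu_ref)**2 / s2 + s2_ref / s2
                  - np.log(s2_ref / s2) - 1)

print(kl_normal(0.0, 1.0, 1.0, 4.0))   # different means and variances
print(kl_normal(0.0, 1.0, 1.0, 1.0))   # equal variances: (mu - mu_ref)^2 / (2 s2)
```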
Example 4.8 \(\color{Red} \blacktriangleright\) Boltzmann relative entropy as log-probability:
Assume \(\hat{Q}\) is an empirical categorical distribution based on observed counts \(n_k\) (see Example 3.12) and \(P\) is a second categorical distribution.
The Boltzmann relative entropy is then \[ \begin{split} B(\hat{Q}, P) & = H(\hat{Q}) -H(\hat{Q}, P) \\ & = H(\hat{Q}) + \sum_{i=1}^K \log ( p_i) \, \hat{q}_i \\ & = H(\hat{Q}) + \frac{1}{n} \sum_{i=1}^K n_i \log p_i \\ \end{split} \]
For large \(n\) we may use the multinomial coefficient \(W = \binom{n}{n_1, \ldots, n_K}\) to obtain the entropy of \(\hat{Q}\) (see Example 3.12). This results in \[ \begin{split} B(\hat{Q}, P) &\approx \frac{1}{n} \left( \log W + \sum_{i=1}^K n_i \log p_i \right)\\ & = \frac{1}{n} \log \left( W \times \prod_{i=1}^K p_i^{n_i} \right)\\ & = \frac{1}{n} \log \text{Pr}(n_1, \ldots, n_K| \,\boldsymbol p) \\ \end{split} \] Hence the Boltzmann relative entropy is directly linked to the multinomial probability of the observed counts \(n_1, \ldots, n_K\) under the model \(P\). This derivation of the Boltzmann relative entropy as the log-probability of a macrostate is due to Boltzmann (1878). See also Akaike (1985) for a historical account.
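A numerical check of this approximation (Python with NumPy/SciPy; the counts and probabilities are arbitrary choices for illustration):

```python
import numpy as np
from scipy import stats

# Check that B(Q_hat, P) is approximately (1/n) log Pr(n_1, ..., n_K | p)
# for large n, with Q_hat the empirical categorical distribution.
p = np.array([0.2, 0.3, 0.5])            # model P
counts = np.array([2500, 2900, 4600])    # observed counts, n = 10000
n = counts.sum()
q_hat = counts / n                       # empirical distribution Q_hat

B = -np.sum(q_hat * np.log(q_hat / p))   # exact Boltzmann relative entropy
logprob = stats.multinomial.logpmf(counts, n=n, p=p) / n

print(B)          # negative, since B(Q_hat, P) <= 0
print(logprob)    # close to B; the difference is of order log(n)/n
```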
This follows the current and widely accepted usage of the term cross-entropy. However, in some, typically older, literature cross-entropy may refer instead to the related but different KL divergence discussed further below.↩︎
Note that divergence between distributions is not related to and should not be confused with the divergence vector operator used in vector calculus.↩︎
Boltzmann, L. 1878. Weitere Bemerkungen über einige Probleme der mechanischen Wärmetheorie. Wien Ber. 78:7–46. https://doi.org/10.1017/CBO9781139381437.013↩︎
Kullback, S., and R. A. Leibler. 1951. On information and sufficiency. Ann. Math. Statist. 22 79–86. https://doi.org/10.1214/aoms/1177729694↩︎
Good, I. J. 1979. Studies in the history of probability. XXXVII. A. M. Turing’s statistical work in world war II. Biometrika, 66:393–396. https://doi.org/10.1093/biomet/66.2.393↩︎