2  Entropy and KL information

2.1 Information theory and statistics

Information theory and statistical learning are closely linked. The purpose of this chapter is to introduce various important information criteria used in statistics and machine learning. These are all based on entropy and provide the foundation for the method of maximum likelihood as well as for Bayesian learning. They also provide the basis for the asymptotic validity of maximum likelihood estimation.

The concept of entropy was first introduced in 1865 by Rudolf Clausius (1822–1888) in the context of thermodynamics. In physics entropy measures the distribution of energy: if energy is concentrated then the entropy is low, and conversely if energy is spread out then the entropy is large. The total energy is conserved (first law of thermodynamics) but with time it will diffuse and thus entropy will increase with time (second law of thermodynamics).

The modern probabilistic definition of entropy was discovered in the 1870s by Ludwig Boltzmann (1844–1906) and Josiah W. Gibbs (1839–1903). In statistical mechanics entropy is proportional to the logarithm of the number of microstates (i.e. particular configurations of the system) compatible with the observed macrostate. Typically, in systems where the energy is spread out there are very large numbers of compatible configurations, which corresponds to large entropy; conversely, if the energy is concentrated there are only few such configurations, which corresponds to low entropy.

In the 1940–1950’s the notion of entropy turned out to be central also in information theory, a field pioneered by mathematicians such as Ralph Hartley (1888–1970), Solomon Kullback (1907–1994), Alan Turing (1912–1954), Richard Leibler (1914–2003), Irving J. Good (1916–2009), Claude Shannon (1916–2001), and Edwin T. Jaynes (1922–1998), and later further explored by Shun’ichi Amari (1936–), Imre Csiszár (1938–), Bradley Efron (1938–), Philip Dawid (1946–) and many others.

Of the above, Turing and Good were affiliated with the University of Manchester.

\[\begin{align*} \left. \begin{array}{cc} \\ \textbf{Entropy} \\ \\ \end{array} \right. \left. \begin{array}{cc} \\ \nearrow \\ \searrow \\ \\ \end{array} \right. \begin{array}{ll} \text{Shannon Entropy} \\ \\ \text{KL information} \\ \end{array} \begin{array}{ll} \text{(Shannon 1948)} \\ \\ \text{(Kullback-Leibler 1951)} \\ \end{array} \end{align*}\]

\[\begin{align*} \left. \begin{array}{ll} \text{Fisher information} \\ \\ \text{Mutual Information} \\ \end{array} \right. \begin{array}{ll} \rightarrow\text{ Likelihood theory} \\ \\ \rightarrow\text{ Information theory} \\ \end{array} \begin{array}{ll} \text{(Fisher 1922)} \\ \\ \text{(Shannon 1948, Lindley 1953)} \\ \end{array} \end{align*}\]

2.2 Shannon entropy and differential entropy

The logarithm and units of information storage

In this module the logarithmic function \(\log(x)\) without explicitly stated base always denotes the natural logarithm. For logarithms with respect to base 2 and 10 we write \(\log_2(x)\) and \(\log_{10}(x)\), respectively.

Assume we have a discrete variable \(x\) for a system with \(K\) possible states \(\Omega = \{\omega_1, \ldots, \omega_K\}\) and we have individual storage units available with the capability to index a limited number \(a\) of different states. Following the principle underlying common numeral systems, such as the decimal, binary or hexadecimal numbers, using \(S = \log_a K\) of such information storage units is sufficient to describe and store the system configuration.

The storage requirement \(S\) can also be interpreted as the code length or the cost needed to describe the system state using an alphabet of size \(a\).

The above tacitly assumes that all \(K\) states are treated equally so the storage size / code length / cost requirement associated with each state is constant and the same for all possible \(K\) states. With this in mind we can write \[ S = -\log \left( \frac{1}{K} \right) \] where \(1/K\) is the equal probability of each of the \(K\) states.

Example 2.1 Information storage units:

For \(a=2\) the storage units are called “bits” (binary information units), and a single bit can store 2 states. Hence to describe the \(K=256\) possible states in a system \(8=\log_2 256\) bits (or 1 byte) of storage are sufficient.

For \(a=10\) the units are “dits” (decimal information units), so to describe \(K=100\) possible states \(2=\log_{10} 100\) dits are sufficient, where a single dit can store 10 states.

Finally, if the natural logarithm is used (\(a=e\)) the storage units are called “nits” (natural information units). In the following we will use “nits” and natural logarithm throughout.
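A small numerical sketch of these storage units (assuming Python with numpy; the code is illustrative only and not part of the original text). The units differ only by a rescaling of the logarithm:

```python
import numpy as np

K = 256  # number of possible states

# storage requirement S = log_a(K) for different alphabet sizes a
bits = np.log2(K)   # a = 2  -> bits
dits = np.log10(K)  # a = 10 -> dits
nits = np.log(K)    # a = e  -> nits (natural logarithm, used throughout)

print(bits, dits, nits)  # 8.0  2.408...  5.545...

# changing base is a rescaling: 1 bit = log(2) nits
print(np.isclose(nits, bits * np.log(2)))  # True
```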

Surprise or surprisal and logarithmic scoring rule

In practice, the \(K\) states may not all be equally probable, so assume there is a discrete distribution \(P\) with probability mass function \(p(x)\) to model the state probabilities. In this case, instead of using the same code length to describe each state, we may use variable code lengths, with more probable states assigned shorter codes and less probable states having longer codes. More specifically, generalising from the previous we may use the negative logarithm to map the probability of a state \(x\) to a corresponding cost and code length: \[ S(x, P) = -\log p(x) \] As we will see below (Example 2.8, Example 2.9 and Example 2.10) using logarithmic cost allows for expected code lengths that are potentially much smaller than the fixed length \(\log K\), and hence leads to a more space-saving representation.

The negative logarithm of the probability \(p(x)\) of an event \(x\) is known as the surprise or surprisal. The surprise to observe a certain event (with \(p(x)=1\)) is zero, and conversely the surprise to observe an event that is certain not to happen (with \(p(x)=0\)) is infinite.

We will apply \(S(x, P)\) to both discrete and continuous variables \(x\) and corresponding distributions \(P\) and then call it logarithmic score or logarithmic scoring rule (see also Example 2.4). As densities can take on values larger than 1 the logarithmic score \(S(x, P)\) may therefore become negative when \(P\) is a continuous distribution.

Example 2.2 Log-odds ratio and surprise:

The commonly used log-odds ratio of the probability \(p\) of an event is the difference of the surprise of the complementary event (with probability \(1-p\)) and the surprise of the event:

\[ \begin{split} \text{logit}(p) &= \log\left( \frac{p}{1-p} \right) \\ &= -\log(1-p) - ( -\log p)\\ \end{split} \]

Example 2.3 Logarithmic score and normal distribution:

If we quote in the logarithmic scoring rule the normal distribution \(P = N(\mu, \sigma^2)\) with density \(p(x |\mu, \sigma^2)= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\) we get as score \[ S\left(x,N(\mu, \sigma^2 )\right) = \frac{1}{2} \left( \log(2\pi\sigma^2) + \frac{(x-\mu)^2}{\sigma^2}\right) \] For fixed variance \(\sigma^2\) this is equivalent to the squared error from the parameter \(\mu\).
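A short numerical sketch of this score (assuming Python with numpy and scipy; the helper log_score_normal is our own naming), which also shows that the score of a continuous distribution can be negative:

```python
import numpy as np
from scipy.stats import norm

def log_score_normal(x, mu, sigma2):
    """Logarithmic score S(x, N(mu, sigma2)) = -log p(x | mu, sigma2)."""
    return -norm.logpdf(x, loc=mu, scale=np.sqrt(sigma2))

# agrees with the closed form 0.5 * (log(2 pi sigma2) + (x - mu)^2 / sigma2)
x, mu, sigma2 = 1.3, 0.0, 2.0
closed_form = 0.5 * (np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)
print(np.isclose(log_score_normal(x, mu, sigma2), closed_form))  # True

# for a very concentrated density the score can be negative
print(log_score_normal(0.0, 0.0, 1e-4))  # approx -3.69
```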

Example 2.4 \({\color{Red} \blacktriangleright}\) General scoring rules:

The function \(S(x, P) = -\log p(x)\) is an important example of a scoring rule for a probabilistic forecast represented by model \(P\) evaluated on the observation \(x\). 1

While one can devise many different scoring rules the logarithmic scoring rule stands out as it has a number of unique and favourable properties (e.g. Hartley 1928, Shannon 1948, Good 1952, Bernardo 1979). In particular, it is the only scoring rule that is both proper, i.e. the expected score is minimised when the quoted model \(P\) is identical to the data generating model, and local in that the score depends only on the value of the density/probability mass function at \(x\).

Entropy of a distribution

The entropy of the distribution \(P\) is defined as the functional \[ \begin{split} H(P) &= \text{E}_P\left( S(x, P) \right) \\ &= - \text{E}_P\left(\log p(x)\right) \\ \end{split} \] i.e. as the expected logarithmic score when the data are generated by \(P\) and the model \(P\) is evaluated on the observations. As will be clear from the examples, entropy measures the spread of the probability mass across a distribution. If the probability mass is locally concentrated the entropy will be low, and conversely, if the probability mass is spread out the entropy will be large.

The entropy of a discrete probability distribution \(P\) with probability mass function \(p(x)\) with \(x \in \Omega\) is called Shannon entropy (1948) 2. In statistical physics, the Shannon entropy is known as Gibbs entropy (1878):

\[ H(P) = - \sum_{x \in \Omega} \log p(x) \, p(x) \] The entropy of a discrete distribution is the expected surprise. We can also interpret it as the expected cost or expected code length when the data are generated according to model \(P\) and we are also using model \(P\) to describe the data. Furthermore, it also has a combinatorial interpretation (see Example 2.12).

As \(p(x) \in [0,1]\) and hence \(-\log p(x) \geq 0\), by construction Shannon entropy is bounded below and must be greater than or equal to 0.

Applying the definition of entropy to a continuous probability distribution \(P\) with density \(p(x)\) yields the differential entropy: \[ H(P) = -\text{E}_P(\log p(x)) = - \int_x \log p(x) \, p(x) \, dx \] Because the logarithm is taken of a density, which in contrast to a probability can assume values larger than one, differential entropy can be negative.
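A small sketch of both definitions (assuming Python with numpy and scipy; the helper shannon_entropy is our own naming), computing a Shannon entropy directly from the defining sum and a differential entropy by numerical integration (anticipating Example 2.6):

```python
import numpy as np
from scipy.integrate import quad

def shannon_entropy(p):
    """Shannon entropy (in nits) of a probability vector p, using 0*log(0) = 0."""
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(np.log(p) * p)

print(shannon_entropy([0.5, 0.25, 0.25]))  # approx 1.0397

# differential entropy of U(0, a) via numerical integration equals log(a)
a = 0.5
H, _ = quad(lambda x: -np.log(1 / a) * (1 / a), 0, a)
print(H, np.log(a))  # both approx -0.6931 (negative, as a < 1)
```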

Furthermore, since for continuous random variables the shape of the density typically changes under variable transformation, say from \(x\) to \(y\), the differential entropy will change as well under such a transformation so that \(H(P_y) \neq H(P_x)\).

2.3 Entropy examples

Models with single parameter

Example 2.5 The Shannon entropy of the geometric distribution \(F_x = \text{Geom}(\theta)\) with probability mass function \(p(x|\theta) = \theta (1-\theta)^{x-1}\), \(\theta \in [0,1]\), support \(x \in \{1, 2, \ldots \}\) and \(\text{E}(x)= 1/\theta\) is \[ \begin{split} H(F_x) &= - \text{E}\left( \log \theta + (x-1) \log(1-\theta) \right)\\ &= -\log \theta - \left(\frac{1}{\theta}-1\right)\log(1-\theta)\\ &= -\frac{\theta \log \theta + (1-\theta) \log(1-\theta) }{\theta} \end{split} \] Using the identity \(0\times\log(0)=0\) we see that the entropy of the geometric distribution for \(\theta = 1\) equals 0, i.e. it achieves the minimum possible Shannon entropy. Conversely, as \(\theta \rightarrow 0\) it diverges to infinity.
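A numerical check of this closed form against a direct truncated evaluation of the defining sum (a sketch assuming Python with numpy; the truncation point max_x is an arbitrary choice):

```python
import numpy as np

def geom_entropy_closed(theta):
    """Closed-form Shannon entropy of Geom(theta) from Example 2.5."""
    return -(theta * np.log(theta) + (1 - theta) * np.log(1 - theta)) / theta

def geom_entropy_sum(theta, max_x=5000):
    """Direct (truncated) evaluation of -sum_x log p(x) p(x) over x = 1, 2, ..."""
    x = np.arange(1, max_x + 1)
    log_p = np.log(theta) + (x - 1) * np.log(1 - theta)
    return -np.sum(np.exp(log_p) * log_p)

theta = 0.3
print(geom_entropy_closed(theta))  # approx 2.036
print(geom_entropy_sum(theta))     # approx 2.036
```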

Example 2.6 Consider the uniform distribution \(F_x = U(0, a)\) with \(a>0\), support from \(0\) to \(a\) and density \(p(x) = 1/a\). The corresponding differential entropy is \[ \begin{split} H( F_x ) &= - \int_0^a \log\left(\frac{1}{a}\right) \, \frac{1}{a} dx \\ &= \log a \int_0^a \frac{1}{a} dx \\ &= \log a \,. \end{split} \] Note that for \(0 < a < 1\) the differential entropy is negative.

Example 2.7 Starting with the uniform distribution \(F_x = U(0, a)\) from Example 2.6 the variable \(x\) is changed to \(y = x^2\) yielding the distribution \(F_y\) with support from \(0\) to \(a^2\) and density \(p(y) = 1/\left(2 a \sqrt{y}\right)\).

The corresponding differential entropy is \[ \begin{split} H( F_y ) &= \int_0^{a^2} \log \left(2 a \sqrt{y}\right) \, 1/\left(2 a \sqrt{y}\right) dy \\ &= \left[ \sqrt{y}/a \, \left(\log \left( 2 a \sqrt{y} \right)-1\right) \right]_{y=0}^{y=a^2} \\ &= \log \left(2 a^2\right) -1 \,. \end{split} \] This is negative for \(0 < a < \sqrt{e/2}\approx 1.1658\). As expected \(H( F_y ) \neq H( F_x )\) as differential entropy is not invariant against variable transformations.
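A Monte Carlo sketch of this non-invariance (assuming Python with numpy): both differential entropies are estimated as sample averages of the negative log-density and compared with the closed forms above.

```python
import numpy as np

rng = np.random.default_rng(3)
a = 2.0
x = rng.uniform(0, a, size=1_000_000)
y = x ** 2  # transformed variable

# Monte Carlo estimates of H = -E(log density)
H_x = np.mean(-np.log(np.full_like(x, 1 / a)))    # density of U(0, a) is 1/a
H_y = np.mean(-np.log(1 / (2 * a * np.sqrt(y))))  # density of y from Example 2.7

print(H_x, np.log(a))               # approx 0.693
print(H_y, np.log(2 * a ** 2) - 1)  # approx 1.079, differs from H_x
```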

Models with multiple parameters

Example 2.8 The Shannon entropy of the categorical distribution \(P\) with \(K\) categories with class probabilities \(p_1, \ldots, p_K\) is \[ H(P) = - \sum_{k=1}^{K } \log(p_k)\, p_k \]

As \(P\) is discrete \(H(P)\) is bounded below by 0. Furthermore, it is also bounded above by \(\log K\). This can be seen by maximising Shannon entropy with regard to the \(p_k\) under the constraint \(\sum_{k=1}^K p_k= 1\), e.g., by constrained optimisation using Lagrange multipliers. Hence for a categorical distribution \(P\) with \(K\) categories we have \[ 0 \leq H(P) \leq \log K \] The maximum is achieved for the discrete uniform distribution (Example 2.9) and the minimum for a concentrated categorical distribution (Example 2.10).

Example 2.9 Entropy for the discrete uniform distribution \(U_K\):

Let \(p_1=p_2= \ldots = p_K = \frac{1}{K}\). Then \[H(U_K) = - \sum_{k=1}^{K}\log\left(\frac{1}{K}\right)\, \frac{1}{K} = \log K\]

Note that \(\log K\) is the largest value the Shannon entropy can assume with \(K\) classes and indicates maximum spread of probability mass.

Example 2.10 Entropy for a categorical distribution with concentrated probability mass:

Let \(p_1=1\) and \(p_2=p_3=\ldots=p_K=0\). Using \(0\times\log(0)=0\) we obtain for the Shannon entropy \[H(P) = -\left(\log(1)\times 1 + \log(0)\times 0 + \dots\right) = 0\]

Note that 0 is the smallest value that Shannon entropy can assume and that it corresponds to maximum concentration of probability mass.

Example 2.11 Differential entropy of the normal distribution:

The log density of the univariate normal \(N(\mu, \sigma^2)\) distribution is \(\log p(x |\mu, \sigma^2) = -\frac{1}{2} \left( \log(2\pi\sigma^2) + \frac{(x-\mu)^2}{\sigma^2} \right)\) with \(\sigma^2 > 0\). The corresponding differential entropy is with \(\text{E}((x-\mu)^2) = \sigma^2\) \[ \begin{split} H(P) & = -\text{E}\left( \log p(x |\mu, \sigma^2) \right)\\ & = \frac{1}{2} \left( \log(2 \pi \sigma^2)+1\right) \,. \\ \end{split} \] Note that \(H(P)\) only depends on the variance parameter and not on the mean parameter. This is intuitively clear as only the variance controls the concentration of the probability mass. The entropy grows with the variance as the probability mass becomes more spread out and less concentrated around the mean. For \(\sigma^2 < 1/(2 \pi e) \approx 0.0585\) the differential entropy is negative.

Example 2.12 \({\color{Red} \blacktriangleright}\) Entropy of a categorical distribution and the multinomial coefficient:

Let \(\hat{Q}\) be the empirical categorical distribution with \(\hat{q}_k = n_k/n\) the observed frequencies with \(n_k\) counts in class \(k\) and \(n=\sum_{k=1}^K n_k\) total counts.

The number of possible permutations of \(n\) items of \(K\) distinct types is given by the multinomial coefficient \[ W = \binom{n}{n_1, \ldots, n_K} = \frac {n!}{n_1! \times n_2! \times\ldots \times n_K! } \]

It turns out that for large \(n\) both quantities are directly linked: \[ H(\hat{Q}) \approx \frac{1}{n} \log W \]

Recall the de Moivre–Stirling formula which for large \(n\) allows us to approximate the factorial by \[ \log n! \approx n \log n -n \] With this \[ \begin{split} \log W &= \log n! - \sum_{k=1}^K \log n_k!\\ & \approx n \log n -n - \sum_{k=1}^K (n_k \log n_k -n_k) \\ & = \sum_{k=1}^K n_k \log n - \sum_{k=1}^K n_k \log n_k\\ & = - n \sum_{k=1}^K \frac{n_k}{n} \log\left( \frac{n_k}{n} \right)\\ & = -n \sum_{k=1}^K \log (\hat{q}_k) \, \hat{q}_k\\ & = n H(\hat{Q}) \end{split} \]

The above combinatorial derivation of entropy is one of the cornerstones of statistical mechanics and is credited to Boltzmann (1877) and Gibbs (1878). The number of elements \(n_1, \ldots, n_K\) in each of the \(K\) classes corresponds to the macrostate and any of the \(W\) different allocations of the \(n\) elements to the \(K\) classes to an underlying microstate. The multinomial coefficient, and hence entropy, is largest when there are only small differences (or none) among the \(n_i\), i.e. when samples are equally spread across the \(K\) bins.

In statistics the above derivation of entropy was rediscovered by Wallis (1962).
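The link between the multinomial coefficient and the entropy of the empirical distribution is easy to check numerically, e.g. with the following sketch (assuming Python with numpy and scipy; \(\log W\) is computed via the log-gamma function to avoid overflow):

```python
import numpy as np
from scipy.special import gammaln

def log_multinomial_coeff(counts):
    """log W = log( n! / (n_1! * ... * n_K!) ) computed via log-gamma."""
    counts = np.asarray(counts)
    n = counts.sum()
    return gammaln(n + 1) - np.sum(gammaln(counts + 1))

counts = np.array([500, 300, 200])  # n = 1000
n = counts.sum()
q_hat = counts / n

H_emp = -np.sum(np.log(q_hat) * q_hat)    # entropy of the empirical distribution
print(H_emp)                              # approx 1.0297
print(log_multinomial_coeff(counts) / n)  # approx 1.023, close to H_emp
```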

2.4 \({\color{Red} \blacktriangleright}\) Maximum entropy principle to characterise distributions

Both Shannon entropy and differential entropy are useful to characterise distributions:

As seen in the examples above, large entropy implies that the distribution is spread out whereas small entropy indicates that the distribution is concentrated.

Correspondingly, maximum entropy distributions can be considered minimally informative about a random variable. The higher the entropy the more spread out (and hence more uninformative) the distribution. Conversely, low entropy implies that the probability mass is concentrated and thus the distribution is more informative about the random variable.

Examples:

  1. The discrete uniform distribution is the maximum entropy distribution among all discrete distributions on a finite set of \(K\) states.

  2. The maximum entropy distribution of a continuous random variable with support \((-\infty, \infty)\) and with specified mean and variance is the normal distribution.

  3. The maximum entropy distribution among all continuous distributions supported on \([0, \infty)\) with a specified mean is the exponential distribution.

Using maximum entropy to characterise maximally uninformative distributions was advocated by Edwin T. Jaynes (1922–1998) (who also proposed to use maximum entropy in the context of finding Bayesian priors). The maximum entropy principle in statistical physics goes back to Boltzmann.

A list of maximum entropy distributions is given here: https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution.

Many distributions commonly used in statistical modelling are exponential families. Intriguingly, these distributions are all maximum entropy distributions, so there is a very close link between the principle of maximum entropy and common model choices in statistics and machine learning.

2.5 Cross-entropy and KL divergence

Definition of cross-entropy

If we modify the definition of entropy such that the expectation is taken with regard to a different distribution \(Q\) we arrive at the cross-entropy3 \[ \begin{split} H(Q, P) & =\text{E}_Q\left( S(x, P) \right)\\ & = -\text{E}_Q\left( \log p(x) \right)\\ \end{split} \] i.e. the expected logarithmic score when the data are generated by \(Q\) and model \(P\) is evaluated on the observations. Thus, cross-entropy is a functional of two distributions \(Q\) and \(P\).

For two discrete distributions \(Q\) and \(P\) with probability mass functions \(q(x)\) and \(p(x)\) with \(x\in \Omega\) the cross-entropy is computed as the weighted sum \[ H(Q, P) = - \sum_{x \in \Omega} \log p(x) \, q(x) \] It can be interpreted as the expected cost or expected code length when the data are generated according to model \(Q\) but we use model \(P\) to describe the data.

For two continuous distributions \(Q\) and \(P\) with densities \(q(x)\) and \(p(x)\) we compute the integral \[H(Q, P) =- \int_x \log p(x)\, q(x) \, dx\]

Note that

  • Cross-entropy is not symmetric with regard to \(Q\) and \(P\), because the expectation is taken with reference to \(Q\).
  • By construction if both distributions \(Q\) and \(P\) are identical cross-entropy reduces to entropy, i.e. \(H(Q, Q) = H(Q)\).
  • Like entropy cross-entropy changes under variable transformation for continuous random variables, say from \(x\) to \(y\), hence \(H(Q_y, P_y) \neq H(Q_x, P_x)\).

A crucial property of the cross-entropy \(H(Q, P)\) is that it is bounded below by the entropy of \(Q\), therefore \[ H(Q, P) \geq H(Q) \] with equality only if \(Q=P\). This is known as Gibbs’ inequality. For a proof see Worksheet E1.

Essentially this means that when data are generated under model \(Q\) and encoded with model \(P\) there is always an extra cost, or penalty, to use model \(P\) rather than the correct model \(Q\).
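A quick numerical illustration of Gibbs’ inequality for discrete distributions (a sketch assuming Python with numpy; the random probability vectors are drawn from a Dirichlet distribution purely for convenience):

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(q):
    """Shannon entropy H(Q) = -sum_x log q(x) q(x)."""
    return -np.sum(np.log(q) * q)

def cross_entropy(q, p):
    """Cross-entropy H(Q, P) = -sum_x log p(x) q(x)."""
    return -np.sum(np.log(p) * q)

# Gibbs' inequality H(Q, P) >= H(Q) holds for any pair of distributions
for _ in range(5):
    q = rng.dirichlet(np.ones(4))
    p = rng.dirichlet(np.ones(4))
    print(cross_entropy(q, p) >= entropy(q))  # always True
```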

Example 2.13 \({\color{Red} \blacktriangleright}\) Scoring rules and Gibbs’ inequality:

The logarithmic scoring rule \(S(x, P) = -\log p(x)\) is called proper because the corresponding expected score, i.e. the cross-entropy \(H(Q, P)\), satisfies the Gibbs’ inequality, and thus the expected score is minimised for \(P=Q\).

Definition and properties of KL divergence

The KL divergence is defined as \[ \begin{split} D_{\text{KL}}(Q,P) &= H(Q, P)-H(Q) \\ & = \text{E}_Q \left( S(x, P) - S(x, Q) \right) \\ & = \text{E}_Q\log\left(\frac{q(x)}{p(x)}\right) \\ \end{split} \] Hence, KL divergence \(D_{\text{KL}}(Q,P)\) is simply a recalibrated cross-entropy, but it is arguably more fundamental than both entropy and cross-entropy.

The KL divergence \(D_{\text{KL}}(Q,P)\) is the expected difference in logarithmic scores when the data are generated by \(Q\) and models \(P\) and \(Q\) are evaluated on the observations. \(D_{\text{KL}}(Q, P)\) can be interpreted as the additional cost if \(P\) is used instead of \(Q\) to describe data from \(Q\). If \(Q\) and \(P\) are identical there is no extra cost and \(D_{\text{KL}}(Q,P)=0\). Conversely, if they are not identical then there is an additional cost and \(D_{\text{KL}}(Q,P)> 0\).

\(D_{\text{KL}}(Q,P)\) thus serves as a measure of the divergence4 of distribution \(P\) from distribution \(Q\). The use of the term “divergence” rather than “distance” is a reminder that the distributions \(Q\) and \(P\) are not interchangeable in \(D_{\text{KL}}(Q, P)\).

The KL divergence has a number of important properties inherited from cross-entropy:

  1. \(D_{\text{KL}}(Q, P) \neq D_{\text{KL}}(P, Q)\), i.e. the KL divergence is not symmetric, \(Q\) and \(P\) cannot be interchanged. This follows from the same property of cross-entropy.
  2. \(D_{\text{KL}}(Q, P)\geq 0\). This follows from Gibbs’ inequality, which in turn can be proved via Jensen’s inequality.
  3. \(D_{\text{KL}}(Q, P) = 0\) if and only if \(P=Q\), i.e., the KL divergence is zero if and only if \(Q\) and \(P\) are identical. Also follows from Gibbs’ inequality.

For more details and proofs of properties 2 and 3 see Worksheet E1.

Invariance property of KL divergence

A further crucial property of KL divergence is the invariance property:

  1. \(D_{\text{KL}}(Q, P)\) is invariant under general variable transformations, with \(D_{\text{KL}}(Q_y, P_y) =D_{\text{KL}}(Q_x, P_x)\) under a change of variables from \(x\) to \(y\).

Thus, KL divergence does not change when the sample space is reparameterised.

In the definition of KL divergence the expectation is taken over a ratio of densities (or ratio of probabilities for discrete random variables). This creates the invariance under variable transformation as the Jacobian determinant appearing in both densities cancels out in the ratio.

For more details and proof of the invariance property see Worksheet E1.

Origin of KL divergence and naming conventions

Historically, KL divergence was first discovered by Boltzmann (1878) 5 in physics in a discrete setting (see Example 2.20).

In statistics and information theory it was introduced by Kullback and Leibler (1951) 6 but note that Good (1979) 7 credits Turing with the first statistical application in 1940/1941 in the field of cryptography.

The KL divergence is also known as KL information or KL information number named after two of the original authors (Kullback and Leibler) who themselves referred to this quantity as discrimination information. Another common name is information divergence or, for short, \(\symbfit I\)-divergence. Some authors (e.g. Efron) call twice the KL divergence \(2 D_{\text{KL}}(Q, P) = D(Q, P)\) the deviance of \(P\) from \(Q\). In the more general context of scoring rules the divergence is also called discrepancy. Furthermore, KL divergence is also very frequently called relative entropy. However, especially in older literature the KL divergence is sometimes referred to as “cross-entropy”, but this usage is discouraged to avoid confusion with the related but distinct definition of cross-entropy above.

Perhaps it should be called Boltzmann-Turing-Kullback-Leibler information divergence or short BTKL divergence.

There also exist various notations for KL divergence in the literature. Here we use \(D_{\text{KL}}(Q, P)\) but you often find as well \(\text{KL}(Q || P)\) and \(I^{KL}(Q; P)\).

Application in statistics

In statistics the typical roles of the distribution \(Q\) and \(P\) in \(D_{\text{KL}}(Q, P)\) are:

  • \(Q\) is the (unknown) underlying true model for the data generating process
  • \(P\) is the approximating model (typically a parametric distribution family)

Optimising (i.e. minimising) the KL divergence with regard to \(P\) amounts to approximation and optimising with regard to \(Q\) to imputation. Later we will see how this leads to the method of maximum likelihood and to Bayesian learning, respectively.

2.6 Cross-entropy and KL divergence examples

Models with a single parameter

Example 2.14 KL divergence between two Bernoulli distributions \(\text{Ber}(\theta_1)\) and \(\text{Ber}(\theta_2)\):

The “success” probabilities for the two distributions are \(\theta_1\) and \(\theta_2\), respectively, and the complementary “failure” probabilities are \(1-\theta_1\) and \(1-\theta_2\). With this we get for the KL divergence \[ D_{\text{KL}}(\text{Ber}(\theta_1), \text{Ber}(\theta_2))=\theta_1 \log\left( \frac{\theta_1}{\theta_2}\right) + (1-\theta_1) \log\left(\frac{1-\theta_1}{1-\theta_2}\right) \]
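A numerical sketch verifying this closed form and the lack of symmetry (assuming Python with numpy and scipy; note that scipy.stats.entropy with two arguments returns the KL divergence between discrete probability vectors):

```python
import numpy as np
from scipy.stats import entropy  # entropy(q, p) returns KL(q || p) for discrete vectors

def kl_bernoulli(theta1, theta2):
    """Closed-form D_KL(Ber(theta1), Ber(theta2)) from Example 2.14."""
    return (theta1 * np.log(theta1 / theta2)
            + (1 - theta1) * np.log((1 - theta1) / (1 - theta2)))

theta1, theta2 = 0.2, 0.6
print(kl_bernoulli(theta1, theta2))                         # approx 0.335
print(entropy([theta1, 1 - theta1], [theta2, 1 - theta2]))  # same value
print(kl_bernoulli(theta2, theta1))                         # approx 0.382: not symmetric
```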

Example 2.15 KL divergence between two univariate normals with different means and common variance:

Assume \(F_{\text{ref}}=N(\mu_{\text{ref}},\sigma^2)\) and \(F=N(\mu,\sigma^2)\).

Then we get \[D_{\text{KL}}(F_{\text{ref}}, F )=\frac{1}{2} \left(\frac{(\mu-\mu_{\text{ref}})^2}{\sigma^2}\right)\]

Thus, in this case the KL divergence is proportional to the squared Euclidean distance between the means. Note that in this case the KL divergence is symmetric in \(\mu\) and \(\mu_{\text{ref}}\).

Models with multiple parameters

Example 2.16 KL divergence between two categorical distributions with \(K\) classes:

With \(Q=\text{Cat}(\symbfit q)\) and \(P=\text{Cat}(\symbfit p)\) and corresponding probabilities \(q_1,\dots,q_K\) and \(p_1,\dots,p_K\) satisfying \(\sum_{i=1}^K q_i = 1\) and \(\sum_{i=1}^K p_i =1\) we get:

\[\begin{equation*} D_{\text{KL}}(Q, P)=\sum_{i=1}^K q_i\log\left(\frac{q_i}{p_i}\right) \end{equation*}\]

To be explicit that there are only \(K-1\) parameters in a categorical distribution we can also write \[\begin{equation*} D_{\text{KL}}(Q, P)=\sum_{i=1}^{K-1} q_i\log\left(\frac{q_i}{p_i}\right) + q_K\log\left(\frac{q_K}{p_K}\right) \end{equation*}\] with \(q_K=\left(1- \sum_{i=1}^{K-1} q_i\right)\) and \(p_K=\left(1- \sum_{i=1}^{K-1} p_i\right)\).

Example 2.17 Cross-entropy between two normals:

Assume \(F_{\text{ref}}=N(\mu_{\text{ref}},\sigma^2_{\text{ref}})\) and \(F=N(\mu,\sigma^2)\). The cross-entropy \(H(F_{\text{ref}}, F)\) is \[ \begin{split} H(F_{\text{ref}}, F) &= -\text{E}_{F_{\text{ref}}} \left( \log p(x |\mu, \sigma^2) \right)\\ &= \frac{1}{2} \text{E}_{F_{\text{ref}}} \left( \log(2\pi\sigma^2) + \frac{(x-\mu)^2}{\sigma^2} \right) \\ &= \frac{1}{2} \left( \frac{(\mu - \mu_{\text{ref}})^2}{ \sigma^2 } +\frac{\sigma^2_{\text{ref}}}{\sigma^2} +\log(2 \pi \sigma^2) \right) \\ \end{split} \] using \(\text{E}_{F_{\text{ref}}} ((x-\mu)^2) = (\mu_{\text{ref}}-\mu)^2 + \sigma^2_{\text{ref}}\).

Example 2.18 Entropy as lower bound of cross-entropy:

If \(\mu_{\text{ref}} = \mu\) and \(\sigma^2_{\text{ref}} = \sigma^2\) then the cross-entropy \(H(F_{\text{ref}},F)\) in Example 2.17 degenerates to the differential entropy \(H(F_{\text{ref}}) = \frac{1}{2} \left(\log( 2 \pi \sigma^2_{\text{ref}}) +1 \right)\).

See also Example 2.11.

Example 2.19 KL divergence between two univariate normals with different means and variances:

Assume \(F_{\text{ref}}=N(\mu_{\text{ref}},\sigma^2_{\text{ref}})\) and \(F=N(\mu,\sigma^2)\). Then \[ \begin{split} D_{\text{KL}}(F_{\text{ref}},F) &= H(F_{\text{ref}}, F) - H(F_{\text{ref}}) \\ &= \frac{1}{2} \left( \frac{(\mu-\mu_{\text{ref}})^2}{\sigma^2} + \frac{\sigma_{\text{ref}}^2}{\sigma^2} -\log\left(\frac{\sigma_{\text{ref}}^2}{\sigma^2}\right)-1 \right) \\ \end{split} \]

If variances are equal then we recover the previous Example 2.15 as special case.
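The closed form can be checked by Monte Carlo using the definition of KL divergence as the expected difference of log-densities under \(F_{\text{ref}}\) (a sketch assuming Python with numpy; the parameter values are arbitrary):

```python
import numpy as np

def kl_normal(mu_ref, var_ref, mu, var):
    """Closed-form D_KL(N(mu_ref, var_ref), N(mu, var)) from Example 2.19."""
    return 0.5 * ((mu - mu_ref) ** 2 / var + var_ref / var
                  - np.log(var_ref / var) - 1)

# Monte Carlo check: D_KL = E_Q( log q(x) - log p(x) ) with x ~ Q = N(mu_ref, var_ref)
rng = np.random.default_rng(1)
mu_ref, var_ref, mu, var = 1.0, 2.0, 0.0, 1.0
x = rng.normal(mu_ref, np.sqrt(var_ref), size=1_000_000)

log_q = -0.5 * (np.log(2 * np.pi * var_ref) + (x - mu_ref) ** 2 / var_ref)
log_p = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

print(np.mean(log_q - log_p))               # approx 0.653
print(kl_normal(mu_ref, var_ref, mu, var))  # 0.6534...
```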

Example 2.20 \({\color{Red} \blacktriangleright}\) KL divergence as negative log-probability:

Assume \(\hat{Q}\) is an empirical categorical distribution based on observed counts \(n_k\) (see Example 2.12) and \(P\) is a second categorical distribution.

The KL divergence is then \[ \begin{split} D_{\text{KL}}(\hat{Q}, P) & =H(\hat{Q}, P) - H(\hat{Q})\\ & = -\sum_{i=1}^K \log ( p_i) \, \hat{q}_i - H(\hat{Q})\\ & = -\frac{1}{n} \sum_{i=1}^K n_i \log p_i - H(\hat{Q})\\ \end{split} \]

For large \(n\) we may use the multinomial coefficient \(W = \binom{n}{n_1, \ldots, n_K}\) to obtain the entropy of \(\hat{Q}\) (see Example 2.12). This results in \[ \begin{split} D_{\text{KL}}(\hat{Q}, P) &\approx -\frac{1}{n} \left( \sum_{i=1}^K n_i \log p_i + \log W \right)\\ & = -\frac{1}{n} \log \left( W \times \prod_{i=1}^K p_i^{n_i} \right)\\ & = -\frac{1}{n} \log \text{Pr}(n_1, \ldots, n_K| \symbfit p) \\ \end{split} \] Hence the KL divergence is directly linked to the multinomial probability of the observed counts \(n_1, \ldots, n_K\) under the model \(P\). This derivation of KL divergence as negative log-probability of a macrostate is due to Boltzmann (1878).
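Boltzmann’s identification of the KL divergence with a negative log-probability can be illustrated numerically (a sketch assuming Python with numpy and scipy; the counts and model probabilities are arbitrary choices):

```python
import numpy as np
from scipy.stats import entropy, multinomial

counts = np.array([500, 300, 200])
n = counts.sum()
q_hat = counts / n             # empirical distribution Q-hat
p = np.array([0.4, 0.4, 0.2])  # model P

kl = entropy(q_hat, p)  # D_KL(Q-hat, P)

# Boltzmann: D_KL(Q-hat, P) approx -(1/n) log Pr(n_1, ..., n_K | p)
log_prob = multinomial.logpmf(counts, n, p)
print(kl)             # approx 0.0253
print(-log_prob / n)  # approx 0.0323, approaches the KL divergence as n grows
```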

2.7 Expected Fisher information

Definition of expected Fisher information

KL information measures the divergence of two distributions. Previously we have seen examples of KL divergence between two distributions belonging to the same family. We now consider the KL divergence of two such distributions separated in parameter space only by some small \(\symbfit \varepsilon\).

Specifically, we consider the function \[h(\symbfit \varepsilon) = D_{\text{KL}}(F_{\symbfit \theta}, F_{\symbfit \theta+\symbfit \varepsilon}) = \text{E}_{F_{\symbfit \theta}}\left( \log f(\symbfit x| \symbfit \theta) - \log f(\symbfit x| \symbfit \theta+\symbfit \varepsilon) \right)\] where \(\symbfit \theta\) is kept constant and \(\symbfit \varepsilon\) is varying. Assuming that \(f(\symbfit x| \symbfit \theta)\) is twice differentiable with regard to \(\symbfit \theta\) we can approximate \(h(\symbfit \varepsilon)\) quadratically by \(h(\symbfit \varepsilon) \approx h(0) + \nabla h(0)^T\symbfit \varepsilon+ \frac{1}{2} \symbfit \varepsilon^T \, \nabla \nabla^T h(0) \,\symbfit \varepsilon\).

From the properties of the KL divergence we know that \(D_{\text{KL}}(F_{\symbfit \theta}, F_{\symbfit \theta+\symbfit \varepsilon})\geq 0\) and that it becomes zero only if \(\symbfit \varepsilon=0\). Thus, by construction the function \(h(\symbfit \varepsilon)\) achieves a true minimum at \(\symbfit \varepsilon=0\) (with \(h(0)=0\)), has a vanishing gradient at \(\symbfit \varepsilon=0\) and a positive definite Hessian matrix at \(\symbfit \varepsilon=0\). Therefore in the quadratic approximation of \(h(\symbfit \varepsilon)\) around \(\symbfit \varepsilon=0\) above the first two terms (constant and linear) vanish and only the quadratic term remains: \[ h(\symbfit \varepsilon) \approx \frac{1}{2} \symbfit \varepsilon^T \, \nabla \nabla^T h(0) \, \symbfit \varepsilon \] The Hessian matrix of \(h(\symbfit \varepsilon)\) evaluated at \(\symbfit \varepsilon=0\) is the negative expected Hessian matrix of the log-density at \(\symbfit \theta\) \[\symbfit I^{\text{Fisher}}(\symbfit \theta) = \nabla \nabla^T h(0) = -\text{E}_{F_{\symbfit \theta}} \nabla \nabla^T \log f(\symbfit x| \symbfit \theta)\] It is called the expected Fisher information at \(\symbfit \theta\), or short Fisher information. Hence, the KL divergence can be locally approximated by \[ D_{\text{KL}}(F_{\symbfit \theta}, F_{\symbfit \theta+\symbfit \varepsilon})\approx \frac{1}{2} \symbfit \varepsilon^T \symbfit I^{\text{Fisher}}(\symbfit \theta) \symbfit \varepsilon \]

We may also vary the first argument in the KL divergence. It is straightforward to show that this leads to the same approximation to second order in \(\symbfit \varepsilon\): \[ \begin{split} D_{\text{KL}}(F_{\symbfit \theta+\symbfit \varepsilon}, F_{\symbfit \theta}) &\approx \frac{1}{2}\symbfit \varepsilon^T \symbfit I^{\text{Fisher}}(\symbfit \theta)\, \symbfit \varepsilon\\ \end{split} \]

Hence, the KL divergence, while generally not symmetric in its arguments, is still locally symmetric.

Computing the expected Fisher information involves no observed data; it is purely a property of the model family \(F_{\symbfit \theta}\). In the next chapter we will study a related quantity, the observed Fisher information, which in contrast to the expected Fisher information is a function of the observed data.

Example 2.21 \({\color{Red} \blacktriangleright}\) Fisher information as metric tensor:

In the field of information geometry8 sets of distributions are studied using tools from differential geometry. It turns out that distribution families are manifolds and that the expected Fisher information matrix plays the role of the (symmetric!) metric tensor on this manifold.

Additivity of Fisher information

We may wish to compute the expected Fisher information based on a set of independent identically distributed (iid) random variables.

Assume that a random variable \(x \sim F_{\symbfit \theta}\) has log-density \(\log f(x| \symbfit \theta)\) and expected Fisher information \(\symbfit I^{\text{Fisher}}(\symbfit \theta)\). The expected Fisher information \(\symbfit I_{x_1, \ldots, x_n}^{\text{Fisher}}(\symbfit \theta)\) for a set of iid random variables \(x_1, \ldots, x_n \sim F_{\symbfit \theta}\) is computed from the joint log-density \(\log f(x_1, \ldots, x_n) = \sum_{i=1}^n \log f(x_i| \symbfit \theta)\). This yields \[ \begin{split} \symbfit I_{x_1, \ldots, x_n}^{\text{Fisher}}(\symbfit \theta) &= -\text{E}_{F_{\symbfit \theta}} \nabla \nabla^T \sum_{i=1}^n \log f(x_i| \symbfit \theta)\\ &= \sum_{i=1}^n \symbfit I^{\text{Fisher}}(\symbfit \theta) =n \symbfit I^{\text{Fisher}}(\symbfit \theta) \\ \end{split} \] Hence, the expected Fisher information for a set of \(n\) iid random variables is \(n\) times the Fisher information of a single variable.

Invariance property of the Fisher information

Like KL divergence the expected Fisher information is invariant against change of parameterisation of the sample space, say from variable \(x\) to \(y\) and from distribution \(F_x\) to \(F_y\). This is easy to see as the KL divergence itself is invariant against such reparameterisation. Hence the function \(h(\symbfit \varepsilon)\) above is invariant and thus also its curvature, and hence the expected Fisher information.

More formally, when the sample space is changed the density gains a factor in the form of the Jacobian determinant for this transformation. However, as this factor does not depend on the model parameters it does not change the first and second derivative of the log-density with regard to the model parameters.

\({\color{Red} \blacktriangleright}\) Transformation of Fisher information when model parameters change

The Fisher information \(\symbfit I^{\text{Fisher}}(\symbfit \theta)\) depends on the parameter \(\symbfit \theta\). If we use a different parameterisation of the underlying parametric distribution family, say \(\symbfit \zeta\) with a map \(\symbfit \theta(\symbfit \zeta)\) from \(\symbfit \zeta\) to \(\symbfit \theta\), then the Fisher information changes according to the chain rule in calculus.

To find the resulting Fisher information in terms of the new parameter \(\symbfit \zeta\) we need to use the Jacobian matrix \(D \symbfit \theta(\symbfit \zeta)\). This matrix contains the gradients for each component of the map \(\symbfit \theta(\symbfit \zeta)\) in its rows: \[ D \symbfit \theta(\symbfit \zeta) = \begin{pmatrix}\nabla^T \theta_1(\symbfit \zeta)\\ \nabla^T \theta_2(\symbfit \zeta) \\ \vdots \\ \end{pmatrix} \]

With the above the Fisher information for \(\symbfit \theta\) is then transformed to the Fisher information for \(\symbfit \zeta\) applying the chain rule for the Hessian matrix: \[ \symbfit I^{\text{Fisher}}(\symbfit \zeta) = (D \symbfit \theta(\symbfit \zeta))^T \, \symbfit I^{\text{Fisher}}(\symbfit \theta) \rvert_{\symbfit \theta= \symbfit \theta(\symbfit \zeta)} \, D \symbfit \theta(\symbfit \zeta) \] This type of transformation is also known as covariant transformation, in this case for the Fisher information metric tensor.
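As a one-parameter illustration of this covariant transformation (a hypothetical sketch assuming Python with numpy), consider the Bernoulli model reparameterised from \(\theta\) to the log-odds \(\zeta\), using the Bernoulli Fisher information \(1/(\theta(1-\theta))\) derived in Example 2.22 below; the Jacobian is then simply the scalar derivative \(d\theta/d\zeta\):

```python
import numpy as np

def fisher_theta(theta):
    """Expected Fisher information of Ber(theta) in the theta parameterisation."""
    return 1 / (theta * (1 - theta))

def fisher_zeta(zeta):
    """Covariant transformation to the log-odds zeta, with theta(zeta) = 1/(1+exp(-zeta))."""
    theta = 1 / (1 + np.exp(-zeta))
    d_theta_d_zeta = theta * (1 - theta)  # scalar Jacobian of the map theta(zeta)
    return d_theta_d_zeta ** 2 * fisher_theta(theta)

zeta = 0.5
theta = 1 / (1 + np.exp(-zeta))
print(fisher_zeta(zeta))    # 0.2350...
print(theta * (1 - theta))  # same: I_Fisher(zeta) simplifies to theta (1 - theta)
```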

2.8 Expected Fisher information examples

Models with a single parameter

Example 2.22 Expected Fisher information for the Bernoulli distribution:

The log-probability mass function of the Bernoulli \(\text{Ber}(\theta)\) distribution is \[ \log p(x | \theta) = x \log(\theta) + (1-x) \log(1-\theta) \] where \(\theta\) is the probability of “success”. The second derivative with regard to the parameter \(\theta\) is \[ \frac{d^2}{d\theta^2} \log p(x | \theta) = -\frac{x}{\theta^2}- \frac{1-x}{(1-\theta)^2} \] Since \(\text{E}(x) = \theta\) we get as Fisher information \[ \begin{split} I^{\text{Fisher}}(\theta) & = -\text{E}\left(\frac{d^2}{d\theta^2} \log p(x | \theta) \right)\\ &= \frac{\theta}{\theta^2}+ \frac{1-\theta}{(1-\theta)^2} \\ &= \frac{1}{\theta(1-\theta)}\\ \end{split} \]

Example 2.23 Quadratic approximations of the KL divergence between two Bernoulli distributions:

From Example 2.14 we have as KL divergence \[ D_{\text{KL}}\left (\text{Ber}(\theta_1), \text{Ber}(\theta_2) \right)=\theta_1 \log\left( \frac{\theta_1}{\theta_2}\right) + (1-\theta_1) \log\left(\frac{1-\theta_1}{1-\theta_2}\right) \] and from Example 2.22 the corresponding expected Fisher information.

The quadratic approximation implies that \[ D_{\text{KL}}\left( \text{Ber}(\theta), \text{Ber}(\theta + \varepsilon) \right) \approx \frac{\varepsilon^2}{2} I^{\text{Fisher}}(\theta) = \frac{\varepsilon^2}{2 \theta (1-\theta)} \] and also that \[ D_{\text{KL}}\left( \text{Ber}(\theta+\varepsilon), \text{Ber}(\theta) \right) \approx \frac{\varepsilon^2}{2} I^{\text{Fisher}}(\theta) = \frac{\varepsilon^2}{2 \theta (1-\theta)} \]

In Worksheet E1 this is verified by using a second order Taylor series applied to the KL divergence.
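The quality of the quadratic approximation can also be inspected numerically (a sketch assuming Python with numpy; the value of \(\theta\) and the perturbations are arbitrary):

```python
import numpy as np

def kl_bernoulli(t1, t2):
    return t1 * np.log(t1 / t2) + (1 - t1) * np.log((1 - t1) / (1 - t2))

theta = 0.3
fisher = 1 / (theta * (1 - theta))  # expected Fisher information from Example 2.22

for eps in [0.1, 0.01, 0.001]:
    exact = kl_bernoulli(theta, theta + eps)
    approx = 0.5 * eps ** 2 * fisher
    print(eps, exact, approx)
# for eps = 0.001 both are approx 2.38e-06; the agreement improves as eps -> 0
```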

Example 2.24 Expected Fisher information for the normal distribution \(N(\mu, \sigma^2)\) with known variance.

The log-density is \[ \log f(x | \mu, \sigma^2) = -\frac{1}{2} \log(\sigma^2) -\frac{1}{2 \sigma^2} (x-\mu)^2 - \frac{1}{2}\log(2 \pi) \] The second derivative with respect to \(\mu\) is \[ \frac{d^2}{d\mu^2} \log f(x | \mu, \sigma^2) = -\frac{1}{\sigma^2} \] Therefore the expected Fisher information is \[ \symbfit I^{\text{Fisher}}\left(\mu\right) = \frac{1}{\sigma^2} \]

Models with multiple parameters

Example 2.25 Expected Fisher information for the normal distribution \(N(\mu, \sigma^2)\).

The log-density is \[ \log f(x | \mu, \sigma^2) = -\frac{1}{2} \log(\sigma^2) -\frac{1}{2 \sigma^2} (x-\mu)^2 - \frac{1}{2}\log(2 \pi) \] The gradient with respect to \(\mu\) and \(\sigma^2\) (!) is the vector \[ \nabla \log f(x | \mu, \sigma^2) = \begin{pmatrix} \frac{1}{\sigma^2} (x-\mu) \\ - \frac{1}{2 \sigma^2} + \frac{1}{2 \sigma^4} (x- \mu)^2 \\ \end{pmatrix} \] Hint for calculating the gradient: replace \(\sigma^2\) by \(v\) and then take the partial derivative with regard to \(v\), then substitute back.

The corresponding Hessian matrix is \[ \nabla \nabla^T \log f(x | \mu, \sigma^2) = \begin{pmatrix} -\frac{1}{\sigma^2} & -\frac{1}{\sigma^4} (x-\mu)\\ -\frac{1}{\sigma^4} (x-\mu) & \frac{1}{2\sigma^4} - \frac{1}{\sigma^6}(x- \mu)^2 \\ \end{pmatrix} \] As \(\text{E}(x) = \mu\) we have \(\text{E}(x-\mu) =0\). Furthermore, with \(\text{E}( (x-\mu)^2 ) =\sigma^2\) we see that \(\text{E}\left(\frac{1}{\sigma^6}(x- \mu)^2\right) = \frac{1}{\sigma^4}\). Therefore the expected Fisher information matrix as the negative expected Hessian matrix is \[ \symbfit I^{\text{Fisher}}\left(\mu,\sigma^2\right) = \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix} \]
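A Monte Carlo sketch (assuming Python with numpy) confirming that the negative expected Hessian matches the matrix above: the analytic Hessian entries from this example are averaged over simulated samples.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, var = 1.0, 2.0
x = rng.normal(mu, np.sqrt(var), size=1_000_000)

# entries of the negative Hessian of the log-density, evaluated at each sample
h11 = np.full_like(x, 1 / var)
h12 = (x - mu) / var ** 2
h22 = -1 / (2 * var ** 2) + (x - mu) ** 2 / var ** 3

I_hat = np.array([[h11.mean(), h12.mean()],
                  [h12.mean(), h22.mean()]])
print(I_hat)  # approx [[1/var, 0], [0, 1/(2 var^2)]] = [[0.5, 0], [0, 0.125]]
```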

Example 2.26 \({\color{Red} \blacktriangleright}\) Expected Fisher information of the categorical distribution:

The log-probability mass function for the categorical distribution with \(K\) classes and \(K-1\) free parameters \(\pi_1, \ldots, \pi_{K-1}\) is \[ \begin{split} \log p(\symbfit x| \pi_1, \ldots, \pi_{K-1} ) & =\sum_{k=1}^{K-1} x_k \log \pi_k + x_K \log \pi_K \\ & =\sum_{k=1}^{K-1} x_k \log \pi_k + \left( 1 - \sum_{k=1}^{K-1} x_k \right) \log \left( 1 - \sum_{k=1}^{K-1} \pi_k \right) \\ \end{split} \]

From the log-probability mass function we compute the Hessian matrix of second order partial derivatives \(\nabla \nabla^T \log p(\symbfit x| \pi_1, \ldots, \pi_{K-1} )\) with regard to \(\pi_1, \ldots, \pi_{K-1}\):

  • The diagonal entries of the Hessian matrix (with \(i=1, \ldots, K-1\)) are \[ \frac{\partial^2}{\partial \pi_i^2} \log p(\symbfit x|\pi_1, \ldots, \pi_{K-1}) = -\frac{x_i}{\pi_i^2}-\frac{x_K}{\pi_K^2} \]

  • the off-diagonal entries are (with \(j=1, \ldots, K-1\) and \(j \neq i\)) \[ \frac{\partial^2}{\partial \pi_i \partial \pi_j} \log p(\symbfit x|\pi_1, \ldots, \pi_{K-1}) = -\frac{ x_K}{\pi_K^2} \]

Recalling that \(\text{E}(x_i) = \pi_i\) we obtain the expected Fisher information matrix for a categorical distribution as a \((K-1) \times (K-1)\) dimensional matrix \[ \begin{split} \symbfit I^{\text{Fisher}}\left( \pi_1, \ldots, \pi_{K-1} \right) &= -\text{E}\left( \nabla \nabla^T \log p(\symbfit x| \pi_1, \ldots, \pi_{K-1}) \right) \\ & = \begin{pmatrix} \frac{1}{\pi_1} + \frac{1}{\pi_K} & \cdots & \frac{1}{\pi_K} \\ \vdots & \ddots & \vdots \\ \frac{1}{\pi_K} & \cdots & \frac{1}{\pi_{K-1}} + \frac{1}{\pi_K} \\ \end{pmatrix}\\ & = \text{Diag}\left( \frac{1}{\pi_1} , \ldots, \frac{1}{\pi_{K-1}} \right) + \frac{1}{\pi_K} \symbfup 1\\ \end{split} \]

For \(K=2\) and \(\pi_1=\theta\) this reduces to the expected Fisher information of a Bernoulli variable, see Example 2.22. \[ \begin{split} I^{\text{Fisher}}(\theta) & = \left(\frac{1}{\theta} + \frac{1}{1-\theta} \right) \\ &= \frac{1}{\theta (1-\theta)} \\ \end{split} \]

Example 2.27 \({\color{Red} \blacktriangleright}\) Quadratic approximation of KL divergence of the categorical distribution and the Neyman and Pearson divergence:

We now consider the local approximation of the KL divergence \(D_{\text{KL}}(Q, P)\) between the categorical distribution \(Q=\text{Cat}(\symbfit q)\) with probabilities \(\symbfit q=(q_1, \ldots, q_K)^T\) with the categorical distribution \(P=\text{Cat}(\symbfit p)\) with probabilities \(\symbfit p= (p_1, \ldots, p_K)^T\).

From Example 2.16 we already know the KL divergence and from Example 2.26 the corresponding expected Fisher information.

First, we keep the first argument \(Q\) fixed and assume that \(P\) is a perturbed version of \(Q\) with \(\symbfit p= \symbfit q+\symbfit \varepsilon\). Note that the perturbations \(\symbfit \varepsilon=(\varepsilon_1, \ldots, \varepsilon_K)^T\) satisfy \(\sum_{k=1}^K \varepsilon_k = 0\) because \(\sum_{k=1}^K q_k=1\) and \(\sum_{k=1}^K p_k=1\). Thus \(\varepsilon_K = -\sum_{k=1}^{K-1} \varepsilon_k\). Then \[ \begin{split} D_{\text{KL}}(\text{Cat}(\symbfit q), \text{Cat}(\symbfit q+\symbfit \varepsilon)) & \approx \frac{1}{2} (\varepsilon_1, \ldots, \varepsilon_{K-1}) \, \symbfit I^{\text{Fisher}}\left( q_1, \ldots, q_{K-1} \right) \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_{K-1}\\ \end{pmatrix} \\ &= \frac{1}{2} \left( \sum_{k=1}^{K-1} \frac{\varepsilon_k^2}{q_k} + \frac{ \left(\sum_{k=1}^{K-1} \varepsilon_k\right)^2}{q_K} \right) \\ &= \frac{1}{2} \sum_{k=1}^{K} \frac{\varepsilon_k^2}{q_k}\\ &= \frac{1}{2} \sum_{k=1}^{K} \frac{(q_k-p_k)^2}{q_k}\\ & = \frac{1}{2} D_{\text{Neyman}}(Q, P)\\ \end{split} \] Similarly, if we keep \(P\) fixed and consider \(Q\) as a perturbed version of \(P\) we get \[ \begin{split} D_{\text{KL}}(\text{Cat}(\symbfit p+\symbfit \varepsilon), \text{Cat}(\symbfit p)) &\approx \frac{1}{2} \sum_{k=1}^{K} \frac{(q_k-p_k)^2}{p_k}\\ &= \frac{1}{2} D_{\text{Pearson}}(Q, P) \end{split} \] Note that in both approximations we divide by the probabilities of the distribution that is kept fixed.

Note the appearance of the Pearson \(\chi^2\) divergence and the Neyman \(\chi^2\) divergence in the above. Both are, like the KL divergence, part of the family of \(f\)-divergences. The Neyman \(\chi^2\) divergence is also known as the reverse Pearson divergence as \(D_{\text{Neyman}}(Q, P) = D_{\text{Pearson}}(P, Q)\).
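A numerical comparison of the exact KL divergence with the Neyman and Pearson approximations for two nearby categorical distributions (a sketch assuming Python with numpy and scipy; the probability vectors are arbitrary):

```python
import numpy as np
from scipy.stats import entropy

q = np.array([0.5, 0.3, 0.2])
p = np.array([0.52, 0.29, 0.19])  # nearby distribution, probabilities still sum to 1

kl = entropy(q, p)                  # exact D_KL(Q, P)
neyman = np.sum((q - p) ** 2 / q)   # Neyman chi-squared divergence
pearson = np.sum((q - p) ** 2 / p)  # Pearson chi-squared divergence

print(kl)             # approx 8.19e-04
print(0.5 * neyman)   # approx 8.17e-04
print(0.5 * pearson)  # approx 8.20e-04
```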


  1. We use the convention that scoring rules are negatively oriented (e.g. Dawid 2007) with the aim to minimise the score (cost, code length, surprise). However, some authors prefer the positively oriented convention with a reversed sign in the definition of \(S(x, P)\) so the score represents a reward that is maximised (e.g. Gneiting and Raftery 2007).↩︎

  2. Shannon, C. E. 1948. A mathematical theory of communication. Bell System Technical Journal 27:379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x↩︎

  3. This follows the current and widely accepted usage of the term cross-entropy. However, in some typically older literature cross-entropy may refer instead to the related but different KL divergence discussed further below.↩︎

  4. Note that divergence between distributions is not related to and should not be confused with the divergence vector operator used in vector calculus.↩︎

  5. Boltzmann, L. 1878. Weitere Bemerkungen über einige Probleme der mechanischen Wärmetheorie. Wien Ber. 78:7–46. https://doi.org/10.1017/CBO9781139381437.013↩︎

  6. Kullback, S., and R. A. Leibler. 1951. On information and sufficiency. Ann. Math. Statist. 22 79–86. https://doi.org/10.1214/aoms/1177729694↩︎

  7. Good, I. J. 1979. Studies in the history of probability. XXXVII. A. M. Turing’s statistical work in world war II. Biometrika, 66:393–396. https://doi.org/10.1093/biomet/66.2.393↩︎

  8. A recent review is given, e.g., in: Nielsen, F. 2020. An elementary introduction to information geometry. Entropy 22:1100. https://doi.org/10.3390/e22101100↩︎