2 From entropy to maximum likelihood

2.1 Entropy

2.1.1 Overview

In this chapter we discuss various information criteria and their connection to maximum likelihood.

The modern notion of (relative) entropy, or “disorder”, was first developed in the 1870s by the physicist L. Boltzmann (1844–1906) in the context of thermodynamics. The probabilistic interpretation of statistical mechanics and entropy was subsequently developed further by J. W. Gibbs (1839–1903).

In the 1940s and 1950s the notion of entropy turned out to be central in information theory, a field pioneered by mathematicians such as R. Hartley (1888–1970), S. Kullback (1907–1994), A. Turing (1912–1954), R. Leibler (1914–2003), I. J. Good (1916–2009), C. Shannon (1916–2001), and E. T. Jaynes (1922–1998), and later further explored by S. Amari (1936–), I. Csiszár (1938–), B. Efron (1938–), A. P. Dawid (1946–) and many others.

\[\begin{align*} \left. \begin{array}{cc} \\ \textbf{Entropy} \\ \\ \end{array} \right. \left. \begin{array}{cc} \\ \nearrow \\ \searrow \\ \\ \end{array} \right. \begin{array}{ll} \text{Shannon Entropy} \\ \\ \text{Relative Entropy} \\ \end{array} \begin{array}{ll} \text{(Shannon 1948)} \\ \\ \text{(Kullback-Leibler 1951)} \\ \end{array} \end{align*}\]

\[\begin{align*} \left. \begin{array}{ll} \text{Fisher information} \\ \\ \text{Mutual Information} \\ \end{array} \right. \begin{array}{ll} \rightarrow\text{ Likelihood theory} \\ \\ \rightarrow\text{ Information theory} \\ \end{array} \begin{array}{ll} \text{(Fisher 1922)} \\ \\ \text{(Shannon 1948, Lindley 1953)} \\ \end{array} \end{align*}\]

2.1.2 Surprise, surprisal or Shannon information

The surprise of observing an event of probability \(p\) is defined as \(-\log(p)\). This quantity is also called surprisal or Shannon information.

Thus, the surprise of observing an event that is certain to happen (with \(p=1\)) is zero, and conversely the surprise of observing an event that is certain not to happen (with \(p=0\)) is infinite.

The log-odds of an event can be viewed as the difference between the surprise of the complementary event and the surprise of the event itself: \[ \log\left( \frac{p}{1-p} \right) = -\log(1-p) - ( -\log(p)) \]

In this module we always use the natural logarithm by default, and will explicitly write \(\log_2\) and \(\log_{10}\) for logarithms with respect to base 2 and 10, respectively.

Surprise and entropy computed with the natural logarithm (\(\log\)) are given in “nats” (natural information units). Using \(\log_2\) leads to “bits” and using \(\log_{10}\) to “bans” (also called “hartleys”).
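As a quick illustration, the snippet below (a minimal Python sketch, with an arbitrary probability value chosen for the example) shows how the same surprise is expressed in nats, bits and bans by changing the base of the logarithm:

```python
import math

p = 0.25  # illustrative probability of the observed event

surprise_nats = -math.log(p)     # natural logarithm -> nats
surprise_bits = -math.log2(p)    # base 2 -> bits
surprise_bans = -math.log10(p)   # base 10 -> bans (hartleys)

# conversion: bits = nats / log(2), bans = nats / log(10)
print(surprise_nats, surprise_bits, surprise_bans)
```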

2.1.3 Shannon entropy

Assume we have a categorical distribution \(P\) with \(K\) classes/categories. The corresponding class probabilities are \(p_1, \ldots, p_K\) with \(\text{Pr}(\text{"class k"}) = p_k\) and \(\sum_{k=1}^K p_k= 1\). The probability mass function (PMF) is \(p(x = \text{"class k"})= p_k\).

As the random variable \(x\) is discrete the categorical distribution \(P\) is a discrete distribution; indeed, \(P\) is often simply referred to as “the” discrete distribution.

The Shannon entropy (1948) 1 of the distribution \(P\) is defined as the expected surprise, i.e. the negative expected log-probability \[ \begin{split} H(P) &=-\text{E}_P\left(\log p(x)\right) \\ &= - \sum_{k=1}^{K}p_k \log(p_k) \\ \end{split} \] As all \(p_k \in [0,1]\) by construction, the Shannon entropy must be greater than or equal to 0.

Furthermore, it is bounded above by \(\log K\). This can be seen by maximising the Shannon entropy as a function of the \(p_k\) under the constraint \(\sum_{k=1}^K p_k= 1\), e.g. by constrained optimisation using Lagrange multipliers. The maximum is achieved when \(P\) is the discrete uniform distribution (see Example 2.1).

Hence for any categorical distribution \(P\) with \(K\) categories we have \[\log K \geq H(P) \geq 0\]

In statistical physics, the Shannon entropy is known as Gibbs entropy (1878).

Example 2.1 Discrete uniform distribution \(U_K\): let \(p_1=p_2= \ldots = p_K = \frac{1}{K}\). Then \[H(U_K) = - \sum_{k=1}^{K}\frac{1}{K} \log\left(\frac{1}{K}\right) = \log K\]

Note this is the largest value the Shannon entropy can assume with \(K\) classes.

Example 2.2 Concentrated probability mass: let \(p_1=1\) and \(p_2=p_3=\ldots=p_K=0\). Using the convention \(0\times\log(0)=0\) we obtain for the Shannon entropy \[H(P) = -\left(1\times\log(1) + 0\times\log(0) + \dots\right) = 0\]

Note that 0 is the smallest value that Shannon entropy can assume, and corresponds to maximum concentration.
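The two examples can be verified numerically. The following minimal Python sketch (illustrative only, with the convention \(0 \times \log(0) = 0\) implemented explicitly and \(K=4\) chosen arbitrarily) computes the Shannon entropy of a uniform and of a fully concentrated categorical distribution:

```python
import math

def shannon_entropy(p):
    """Shannon entropy (in nats) of a categorical distribution, using 0*log(0) = 0."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

K = 4
uniform = [1 / K] * K                 # Example 2.1: maximum entropy, H = log(K)
concentrated = [1.0, 0.0, 0.0, 0.0]   # Example 2.2: minimum entropy, H = 0

print(shannon_entropy(uniform), math.log(K))   # both approximately 1.386
print(shannon_entropy(concentrated))           # 0.0
```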

Thus, large entropy implies that the distribution is spread out whereas small entropy means the distribution is concentrated.

Correspondingly, maximum entropy distributions can be considered minimally informative about a random variable.

This interpretation is also supported by the close link of Shannon entropy with multinomial coefficients counting the permutations of \(n\) items (samples) of \(K\) distinct types (classes).

Example 2.3 Large sample asymptotics of the log-multinomial coefficient and link to Shannon entropy:

The number of possible permutations of \(n\) items of \(K\) distinct types, with \(n_1\) of type 1, \(n_2\) of type 2 and so on, is given by the multinomial coefficient \[ W = \binom{n}{n_1, \ldots, n_K} = \frac {n!}{n_1! \times n_2! \times\ldots \times n_K! } \] with \(\sum_{k=1}^K n_k = n\) and \(K \leq n\).

Now recall the de Moivre–Stirling formula, which for large \(n\) allows us to approximate the factorial by \[ \log n! \approx n \log n -n \]

With this \[ \begin{split} \log W &= \log \binom{n}{n_1, \ldots, n_K} \\ &= \log n! - \sum_{k=1}^K \log n_k!\\ & \approx n \log n -n - \sum_{k=1}^K (n_k \log n_k -n_k) \\ & = n \log n - \sum_{k=1}^K n_k \log n_k\\ & = \sum_{k=1}^K n_k \log n - \sum_{k=1}^K n_k \log n_k\\ & = - n \sum_{k=1}^K \frac{n_k}{n} \log\left( \frac{n_k}{n} \right)\\ \end{split} \] and thus \[ \begin{split} \frac{1}{n}\log \binom{n}{n_1, \ldots, n_K} &\approx - \sum_{k=1}^K \hat{p}_k \log \hat{p}_k \\ &=H(\hat{P}) \end{split} \] where \(\hat{P}\) is the empirical categorical distribution with \(\hat{p}_k = \frac{n_k}{n}\).
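This large sample approximation is easy to check numerically. The sketch below (an illustration in Python with arbitrarily chosen counts) compares \(\frac{1}{n}\log W\), computed exactly via the log-gamma function, with the Shannon entropy of the empirical distribution:

```python
import math

def log_multinomial_coefficient(counts):
    """log( n! / (n_1! * ... * n_K!) ) computed exactly via the log-gamma function."""
    n = sum(counts)
    return math.lgamma(n + 1) - sum(math.lgamma(nk + 1) for nk in counts)

counts = [400, 300, 200, 100]        # n_k for K = 4 classes, n = 1000 (illustrative values)
n = sum(counts)
phat = [nk / n for nk in counts]     # empirical class probabilities

lhs = log_multinomial_coefficient(counts) / n    # (1/n) log W
rhs = -sum(pk * math.log(pk) for pk in phat)     # H(Phat)

print(lhs, rhs)   # the two values agree closely for large n
```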

The combinatorial derivation of Shannon entropy is now credited to Wallis (1962) but had already been used almost a century earlier by Boltzmann (1877), who discovered it in his work on statistical mechanics (recall that \(S = k_B \log W\) is the Boltzmann entropy).

2.1.4 Differential entropy

Shannon entropy is only defined for discrete random variables.

Differential Entropy results from applying the definition of Shannon entropy to a continuous random variable \(x\) with density \(p(x)\): \[ H(P) = -\text{E}_P(\log p(x)) = - \int_x p(x) \log p(x) \, dx \] Despite being given by essentially the same formula, the different name is justified because differential entropy behaves differently from Shannon entropy: the logarithm is now taken of a density, which, in contrast to a probability, can assume values larger than one. As a consequence, differential entropy is not bounded below by zero and can be negative.

Example 2.4 Consider the uniform distribution \(U(0, a)\) with \(a>0\), support from \(0\) to \(a\) and density \(p(x) = 1/a\). As \(-\int_0^a p(x) \log p(x) dx =- \int_0^a \frac{1}{a} \log(\frac{1}{a}) dx = \log a\) the differential entropy is \[H( U(0, a) ) = \log a \,.\] Note that for \(a < 1\) the differential entropy is negative.

Example 2.5 The log density of the univariate normal \(N(\mu, \sigma^2)\) distribution is \(\log p(x |\mu, \sigma^2) = -\frac{1}{2} \left( \log(2\pi\sigma^2) + \frac{(x-\mu)^2}{\sigma^2} \right)\) with \(\sigma^2 > 0\). Using \(\text{E}((x-\mu)^2) = \sigma^2\), the corresponding differential entropy is \[ \begin{split} H(P) & = -\text{E}\left( \log p(x |\mu, \sigma^2) \right)\\ & = \frac{1}{2} \left( \log(2 \pi \sigma^2)+1\right) \,. \\ \end{split} \] Interestingly, \(H(P)\) depends only on the variance and not on the mean, and the entropy grows with the variance. Note that for \(\sigma^2 < 1/(2 \pi e) \approx 0.0585\) the differential entropy is negative.
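As a sanity check of this formula, the following Python sketch (a Monte Carlo illustration with arbitrarily chosen parameter values) estimates the differential entropy of a normal distribution as the average surprise over simulated data and compares it with the closed-form expression:

```python
import math
import random

mu, sigma2 = 1.5, 0.5                # illustrative parameter values
sigma = math.sqrt(sigma2)

def log_density(x):
    """Log-density of N(mu, sigma2)."""
    return -0.5 * (math.log(2 * math.pi * sigma2) + (x - mu) ** 2 / sigma2)

random.seed(1)
n = 100_000
samples = [random.gauss(mu, sigma) for _ in range(n)]

mc_entropy = -sum(log_density(x) for x in samples) / n       # Monte Carlo estimate of -E(log p(x))
exact_entropy = 0.5 * (math.log(2 * math.pi * sigma2) + 1)   # closed-form differential entropy

print(mc_entropy, exact_entropy)   # the two values are close for large n
```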

2.1.5 Maximum entropy principle to characterise distributions

Both Shannon entropy and differential entropy are useful for characterising distributions via the principle of maximum entropy:

  1. The discrete uniform distribution is the maximum entropy distribution among all discrete distributions on \(K\) categories.

  2. The normal distribution is the maximum entropy distribution among all continuous distributions with support \((-\infty, \infty)\) and a specified mean and variance.

  3. The exponential distribution is the maximum entropy distribution among all continuous distributions with support \([0, \infty)\) and a specified mean.

The higher the entropy the more spread out (and more uninformative) is a distribution.

Using maximum entropy to characterise maximally uninformative distributions was advocated by E. T. Jaynes (who also proposed using maximum entropy to construct Bayesian priors). The maximum entropy principle in statistical physics goes back to Boltzmann.

A list of maximum entropy distributions is given here: https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution .

Many distributions commonly used in statistical modelling are exponential families. Intriguingly, these distributions are all maximum entropy distributions, so there is a very close link between the principle of maximum entropy and common model choices in statistics and machine learning.

2.1.6 Cross-entropy

If in the definition of Shannon entropy (and differential entropy) the expectation over the log-density (say \(g(x)\) of distribution \(G\)) is taken with regard to a different distribution \(F\) over the same state space we arrive at the cross-entropy \[ H(F, G) =-\text{E}_F\left( \log g(x) \right) \] For discrete distributions \(F\) and \(G\) with class probabilities \(f_1, \ldots, f_K\) and \(g_1, \ldots, g_K\) the cross-entropy is computed as the weighted sum \(H(F, G) = - \sum_{k=1}^{K} f_k \log g_k\). For continuous distributions \(F\) and \(G\) with densities \(f(x)\) and \(g(x)\) we compute the integral \(H(F, G) =- \int_x f(x) \log g(x) \, dx\).

Therefore, cross-entropy is a measure linking two distributions \(F\) and \(G\).

Note that

  • Cross-entropy is not symmetric with regard to \(F\) and \(G\), because the expectation is taken with reference to \(F\).
  • By construction \(H(F, F) = H(F)\). Thus if both distributions are identical the cross-entropy reduces to the Shannon entropy and the differential entropy, respectively.

A crucial property of the cross-entropy \(H(F, G)\) is that it is bounded below by the entropy of \(F\), therefore \[ H(F, G) \geq H(F) \] with equality for \(F=G\). This is known as Gibbs’ inequality.

Equivalently we can write \[ \underbrace{H(F, G)-H(F)}_{\text{relative entropy}} \geq 0 \]
In fact, this difference, known as the KL divergence or relative entropy, turns out to be more fundamental than both cross-entropy and entropy. It will be studied in detail in the next section.
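Gibbs’ inequality and the asymmetry of the cross-entropy are easy to verify numerically for discrete distributions. The following Python sketch (illustrative only, with arbitrarily chosen class probabilities) implements the weighted-sum formula for \(H(F, G)\) from above:

```python
import math

def cross_entropy(f, g):
    """H(F, G) = -sum_k f_k log(g_k) for two categorical distributions."""
    return -sum(fk * math.log(gk) for fk, gk in zip(f, g) if fk > 0)

f = [0.5, 0.3, 0.2]   # class probabilities of F (illustrative)
g = [0.2, 0.5, 0.3]   # class probabilities of G (illustrative)

H_FG = cross_entropy(f, g)   # cross-entropy H(F, G)
H_F  = cross_entropy(f, f)   # H(F, F) reduces to the Shannon entropy H(F)

print(H_FG, H_F, H_FG >= H_F)        # Gibbs' inequality: H(F, G) >= H(F)
print(cross_entropy(g, f), H_FG)     # generally different: cross-entropy is not symmetric
```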

Example 2.6 Cross-entropy between two normals:

Assume \(F_{\text{ref}}=N(\mu_{\text{ref}},\sigma^2_{\text{ref}})\) and \(F=N(\mu,\sigma^2)\). The cross-entropy \(H(F_{\text{ref}}, F)\) is \[ \begin{split} H(F_{\text{ref}}, F) &= -\text{E}_{F_{\text{ref}}} \left( \log p(x |\mu, \sigma^2) \right)\\ &= \frac{1}{2} \text{E}_{F_{\text{ref}}} \left( \log(2\pi\sigma^2) + \frac{(x-\mu)^2}{\sigma^2} \right) \\ &= \frac{1}{2} \left( \frac{(\mu - \mu_{\text{ref}})^2}{ \sigma^2 } +\frac{\sigma^2_{\text{ref}}}{\sigma^2} +\log(2 \pi \sigma^2) \right) \\ \end{split} \] using \(\text{E}_{F_{\text{ref}}} ((x-\mu)^2) = (\mu_{\text{ref}}-\mu)^2 + \sigma^2_{\text{ref}}\).

Example 2.7 If \(\mu_{\text{ref}} = \mu\) and \(\sigma^2_{\text{ref}} = \sigma^2\) then the cross-entropy \(H(F_{\text{ref}},F)\) in Example 2.6 reduces to the differential entropy \(H(F_{\text{ref}}) = \frac{1}{2} \left(\log( 2 \pi \sigma^2_{\text{ref}}) +1 \right)\).

2.2 Kullback-Leibler divergence

2.2.1 Definition

The Kullback-Leibler (KL) divergence is also known as relative entropy and as discrimination information.

The relative entropy measures the divergence of a distribution \(G\) from the distribution \(F\) and is defined as \[ \begin{split} D_{\text{KL}}(F,G) &= \text{E}_F\log\left(\frac{dF}{dG}\right) \\ & = \text{E}_F\log\left(\frac{f(x)}{g(x)}\right) \\ & = \underbrace{-\text{E}_F(\log g(x))}_{\text{cross-entropy}} - (\underbrace{-\text{E}_F (\log f(x)) }_\text{(differential) entropy}) \\ & = H(F, G)-H(F) \\ \end{split} \]

  • \(D_{\text{KL}}(F, G)\) measures the amount of information lost if \(G\) is used to approximate \(F\).
  • If \(F\) and \(G\) are identical (and no information is lost) then \(D_{\text{KL}}(F,G)=0\).

(Note: here “divergence” measures the dissimilarity between probability distributions. This type of divergence is not related to, and should not be confused with, the divergence (div) operator used in vector analysis.)

The use of the term “divergence” rather than “distance” serves to emphasise that the distributions \(F\) and \(G\) are not interchangeable in \(D_{\text{KL}}(F, G)\).

There exist various notations for KL divergence in the literature. Here we use \(D_{\text{KL}}(F, G)\) but you often find as well \(\text{KL}(F || G)\) and \(I^{KL}(F; G)\).

Some authors (e.g. Efron) call twice the KL divergence \(2 D_{\text{KL}}(F, G) = D(F, G)\) the deviance of \(G\) from \(F\).

2.2.2 Properties of KL divergence

  1. \(D_{\text{KL}}(F, G) \neq D_{\text{KL}}(G, F)\), i.e. the KL divergence is not symmetric, \(F\) and \(G\) cannot be interchanged.
  2. \(D_{\text{KL}}(F, G) = 0\) if and only if \(F=G\), i.e., the KL divergence is zero if and only if \(F\) and \(G\) are identical.
  3. \(D_{\text{KL}}(F, G)\geq 0\), proof via Jensen’s inequality.
  4. \(D_{\text{KL}}(F, G)\) remains invariant under coordinate transformations, i.e. it is an invariant geometric quantity.

Note that in the KL divergence the expectation is taken over a ratio of densities (or ratio of probabilities for discrete random variables). This is what creates the transformation invariance.

For more details and proofs of properties 3 and 4 see Worksheet E1.

2.2.3 Origin of KL divergence and application in statistics

Historically, in physics (negative) relative entropy was discovered by Boltzmann (1878). 2 In statistics and information theory it was introduced by Kullback and Leibler (1951). 3

In statistics the typical roles of the distribution \(F\) and \(G\) in \(D_{\text{KL}}(F, G)\) are:

  • \(F\) is the (unknown) underlying true model for the data generating process
  • \(G\) is the approximating model (typically a distribution family indexed by parameters)

Optimising (i.e. minimising) the KL divergence with regard to \(G\) amounts to approximation, whereas optimising with regard to \(F\) amounts to imputation. Later we will see how this leads to the method of maximum likelihood and to Bayesian learning, respectively.

2.2.4 KL divergence examples

Example 2.8 KL divergence between two Bernoulli distributions \(\text{Ber}(\theta_1)\) and \(\text{Ber}(\theta_2)\):

The “success” probabilities for the two distributions are \(\theta_1\) and \(\theta_2\), respectively, and the complementary “failure” probabilities are \(1-\theta_1\) and \(1-\theta_2\). With this we get for the KL divergence \[ D_{\text{KL}}(\text{Ber}(\theta_1), \text{Ber}(\theta_2))=\theta_1 \log\left( \frac{\theta_1}{\theta_2}\right) + (1-\theta_1) \log\left(\frac{1-\theta_1}{1-\theta_2}\right) \]
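A short Python sketch (illustrative parameter values only) evaluates this formula and makes the asymmetry of the KL divergence explicit:

```python
import math

def kl_bernoulli(theta1, theta2):
    """D_KL( Ber(theta1), Ber(theta2) ) in nats, as in Example 2.8."""
    return (theta1 * math.log(theta1 / theta2)
            + (1 - theta1) * math.log((1 - theta1) / (1 - theta2)))

print(kl_bernoulli(0.3, 0.5))   # divergence of Ber(0.5) from Ber(0.3)
print(kl_bernoulli(0.5, 0.3))   # different value: the KL divergence is not symmetric
print(kl_bernoulli(0.4, 0.4))   # zero when both distributions are identical
```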

Example 2.9 KL divergence between two univariate normals with different means and variances:

Assume \(F_{\text{ref}}=N(\mu_{\text{ref}},\sigma^2_{\text{ref}})\) and \(F=N(\mu,\sigma^2)\). Then \[ \begin{split} D_{\text{KL}}(F_{\text{ref}},F) &= H(F_{\text{ref}}, F) - H(F_{\text{ref}}) \\ &= \frac{1}{2} \left( \frac{(\mu-\mu_{\text{ref}})^2}{\sigma^2} + \frac{\sigma_{\text{ref}}^2}{\sigma^2} -\log\left(\frac{\sigma_{\text{ref}}^2}{\sigma^2}\right)-1 \right) \\ \end{split} \]

Example 2.10 KL divergence between two univariate normals with different means and common variance:

An important special case of the previous Example 2.9 occurs if the variances are equal. Then we get \[D_{\text{KL}}(N(\mu_{\text{ref}}, \sigma^2), N(\mu, \sigma^2) )=\frac{1}{2} \left(\frac{(\mu-\mu_{\text{ref}})^2}{\sigma^2}\right)\]
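The following Python sketch (with arbitrarily chosen means and variances) evaluates the general formula from Example 2.9 and checks that it reduces to the simple expression above when the variances are equal:

```python
import math

def kl_normal(mu_ref, s2_ref, mu, s2):
    """D_KL( N(mu_ref, s2_ref), N(mu, s2) ) as in Example 2.9."""
    return 0.5 * ((mu - mu_ref) ** 2 / s2 + s2_ref / s2 - math.log(s2_ref / s2) - 1)

# general case (Example 2.9), illustrative values
print(kl_normal(0.0, 1.0, 1.0, 2.0))

# equal variances (Example 2.10): reduces to (mu - mu_ref)^2 / (2 sigma^2)
mu_ref, mu, s2 = 0.0, 1.0, 2.0
print(kl_normal(mu_ref, s2, mu, s2), (mu - mu_ref) ** 2 / (2 * s2))
```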

2.3 Local quadratic approximation and expected Fisher information

2.3.1 Definition of expected Fisher information

The KL divergence measures the divergence between two distributions. We may thus use relative entropy to measure the divergence between two distributions in the same family, separated in parameter space only by some small \(\boldsymbol \varepsilon\).

First, we consider \(D_{\text{KL}}(F_{\boldsymbol \theta}, F_{\boldsymbol \theta+\boldsymbol \varepsilon}) = \text{E}_{F_{\boldsymbol \theta}}\left( \log f(\boldsymbol x| \boldsymbol \theta) - \log f(\boldsymbol x| \boldsymbol \theta+\boldsymbol \varepsilon) \right) = h(\boldsymbol \varepsilon)\) where \(\boldsymbol \theta\) is kept constant and \(\boldsymbol \varepsilon\) is varying. From the properties of the KL divergence we know that \(D_{\text{KL}}(F_{\boldsymbol \theta}, F_{\boldsymbol \theta+\boldsymbol \varepsilon})\geq 0\) and that it becomes zero only if \(\boldsymbol \varepsilon=0\). Thus, by construction the function \(h(\boldsymbol \varepsilon)\) achieves a true minimum of \(h(0)=0\) at \(\boldsymbol \varepsilon=0\), with a vanishing gradient \(\nabla h(0)=0\) and a positive definite Hessian matrix \(\nabla \nabla^T h(0)\). Therefore we can approximate it by a quadratic function around \(\boldsymbol \varepsilon=0\): \[ h(\boldsymbol \varepsilon) \approx \frac{1}{2} \boldsymbol \varepsilon^T \, \nabla \nabla^T h(0) \, \boldsymbol \varepsilon \] The Hessian matrix \(\nabla \nabla^T h(0)\) is computed as \(\boldsymbol I^{\text{Fisher}}(\boldsymbol \theta) = -\text{E}_{F_{\boldsymbol \theta}} \nabla \nabla^T \log f(\boldsymbol x| \boldsymbol \theta)\), i.e. it equals the negative expected Hessian matrix of the log-density at \(\boldsymbol \theta\). It is called the expected Fisher information at \(\boldsymbol \theta\). The KL divergence can thus be locally approximated by \[ D_{\text{KL}}(F_{\boldsymbol \theta}, F_{\boldsymbol \theta+\boldsymbol \varepsilon})\approx \frac{1}{2} \boldsymbol \varepsilon^T \boldsymbol I^{\text{Fisher}}(\boldsymbol \theta) \boldsymbol \varepsilon \]

As a second possibility we may also vary the first argument in the KL divergence. It is straightforward to show that this leads to the same approximation to second order in \(\boldsymbol \varepsilon\): \[ \begin{split} D_{\text{KL}}(F_{\boldsymbol \theta+\boldsymbol \varepsilon}, F_{\boldsymbol \theta}) &\approx \frac{1}{2}\boldsymbol \varepsilon^T \boldsymbol I^{\text{Fisher}}(\boldsymbol \theta)\, \boldsymbol \varepsilon\\ \end{split} \]

Computing the expected Fisher information involves no observed data; it is purely a property of the model, or more precisely of the model family indexed by \(\boldsymbol \theta\). In the next chapter we will study a related quantity, the observed Fisher information, which in contrast is a function of the observed data.

2.3.2 Examples

Example 2.11 Expected Fisher information for the Bernoulli distribution:

The log-probability mass function of the Bernoulli \(\text{Ber}(\theta)\) distribution is \[ \log p(x | \theta) = x \log(\theta) + (1-x) \log(1-\theta) \] where \(\theta\) is the probability of “success”. The second derivative with regard to the parameter \(\theta\) is \[ \frac{d^2}{d\theta^2} \log p(x | \theta) = -\frac{x}{\theta^2}- \frac{1-x}{(1-\theta)^2} \] Since \(\text{E}(x) = \theta\) we get as Fisher information \[ \begin{split} I^{\text{Fisher}}(\theta) & = -\text{E}\left(\frac{d^2}{d\theta^2} \log p(x | \theta) \right)\\ &= \frac{\theta}{\theta^2}+ \frac{1-\theta}{(1-\theta)^2} \\ &= \frac{1}{\theta(1-\theta)}\\ \end{split} \]

Example 2.12 Quadratic approximations of the KL divergence between two Bernoulli distributions:

From Example 2.8 we have as KL divergence \[ D_{\text{KL}}\left (\text{Ber}(\theta_1), \text{Ber}(\theta_2) \right)=\theta_1 \log\left( \frac{\theta_1}{\theta_2}\right) + (1-\theta_1) \log\left(\frac{1-\theta_1}{1-\theta_2}\right) \] and from Example 2.11 the corresponding expected Fisher information.

The quadratic approximation implies that \[ D_{\text{KL}}\left( \text{Ber}(\theta), \text{Ber}(\theta + \varepsilon) \right) \approx \frac{\varepsilon^2}{2} I^{\text{Fisher}}(\theta) = \frac{\varepsilon^2}{2 \theta (1-\theta)} \] and also that \[ D_{\text{KL}}\left( \text{Ber}(\theta+\varepsilon), \text{Ber}(\theta) \right) \approx \frac{\varepsilon^2}{2} I^{\text{Fisher}}(\theta) = \frac{\varepsilon^2}{2 \theta (1-\theta)} \]

In Worksheet E1 this is verified by using a second order Taylor series applied to the KL divergence.
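Alternatively, the quality of the quadratic approximation can be explored numerically. The Python sketch below (illustrative values for \(\theta\) and \(\varepsilon\)) compares the exact Bernoulli KL divergence with \(\frac{\varepsilon^2}{2} I^{\text{Fisher}}(\theta)\) for shrinking \(\varepsilon\):

```python
import math

def kl_bernoulli(theta1, theta2):
    """Exact D_KL( Ber(theta1), Ber(theta2) ) from Example 2.8."""
    return (theta1 * math.log(theta1 / theta2)
            + (1 - theta1) * math.log((1 - theta1) / (1 - theta2)))

theta = 0.3
fisher = 1 / (theta * (1 - theta))   # expected Fisher information from Example 2.11

for eps in [0.1, 0.01, 0.001]:
    exact = kl_bernoulli(theta, theta + eps)
    approx = 0.5 * eps ** 2 * fisher
    print(eps, exact, approx)   # the approximation improves as eps shrinks
```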

Example 2.13 Expected Fisher information for the normal distribution \(N(\mu, \sigma^2)\).

The log-density is \[ \log f(x | \mu, \sigma^2) = -\frac{1}{2} \log(\sigma^2) -\frac{1}{2 \sigma^2} (x-\mu)^2 - \frac{1}{2}\log(2 \pi) \] The gradient with respect to \(\mu\) and \(\sigma^2\) (!) is the vector \[ \nabla \log f(x | \mu, \sigma^2) = \begin{pmatrix} \frac{1}{\sigma^2} (x-\mu) \\ - \frac{1}{2 \sigma^2} + \frac{1}{2 \sigma^4} (x- \mu)^2 \\ \end{pmatrix} \] Hint for calculating the gradient: replace \(\sigma^2\) by \(v\) and then take the partial derivative with regard to \(v\), then substitute back.

The corresponding Hessian matrix is \[ \nabla \nabla^T \log f(x | \mu, \sigma^2) = \begin{pmatrix} -\frac{1}{\sigma^2} & -\frac{1}{\sigma^4} (x-\mu)\\ -\frac{1}{\sigma^4} (x-\mu) & \frac{1}{2\sigma^4} - \frac{1}{\sigma^6}(x- \mu)^2 \\ \end{pmatrix} \] As \(\text{E}(x) = \mu\) we have \(\text{E}(x-\mu) =0\). Furthermore, with \(\text{E}( (x-\mu)^2 ) =\sigma^2\) we see that \(\text{E}\left(\frac{1}{\sigma^6}(x- \mu)^2\right) = \frac{1}{\sigma^4}\). Therefore the expected Fisher information matrix as the negative expected Hessian matrix is \[ \boldsymbol I^{\text{Fisher}}\left(\mu,\sigma^2\right) = \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix} \]
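The local quadratic approximation of the KL divergence can also be checked for the normal model. The sketch below (in Python, with illustrative parameter and perturbation values) compares the exact KL divergence from Example 2.9 with the quadratic form \(\frac{1}{2} \boldsymbol \varepsilon^T \boldsymbol I^{\text{Fisher}}(\mu, \sigma^2) \boldsymbol \varepsilon\) using the diagonal Fisher matrix above:

```python
import math

def kl_normal(mu_ref, s2_ref, mu, s2):
    """Exact D_KL( N(mu_ref, s2_ref), N(mu, s2) ) from Example 2.9."""
    return 0.5 * ((mu - mu_ref) ** 2 / s2 + s2_ref / s2 - math.log(s2_ref / s2) - 1)

mu, s2 = 1.0, 2.0               # base parameters (illustrative)
eps_mu, eps_s2 = 0.01, 0.02     # small perturbation of (mu, sigma^2)

exact = kl_normal(mu, s2, mu + eps_mu, s2 + eps_s2)

# (1/2) eps^T I^Fisher eps with I^Fisher = diag(1/sigma^2, 1/(2 sigma^4))
approx = 0.5 * (eps_mu ** 2 / s2 + eps_s2 ** 2 / (2 * s2 ** 2))

print(exact, approx)   # nearly identical for small perturbations
```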

Example 2.14 Expected Fisher information for a set of independent and identically distributed random variables.

Assume that a random variable \(x \sim F_{\boldsymbol \theta}\) has log-density \(\log f(x| \boldsymbol \theta)\) and expected Fisher information \(\boldsymbol I^{\text{Fisher}}(\boldsymbol \theta)\). The expected Fisher information \(\boldsymbol I_{x_1, \ldots, x_n}^{\text{Fisher}}(\boldsymbol \theta)\) for a set of iid random variables \(x_1, \ldots, x_n \sim F_{\boldsymbol \theta}\) is computed from the joint log-density \(\log f(x_1, \ldots, x_n | \boldsymbol \theta) = \sum_{i=1}^n \log f(x_i| \boldsymbol \theta)\). This yields \(\boldsymbol I_{x_1, \ldots, x_n}^{\text{Fisher}}(\boldsymbol \theta) = -\text{E}_{F_{\boldsymbol \theta}} \nabla \nabla^T \sum_{i=1}^n \log f(x_i| \boldsymbol \theta) = \sum_{i=1}^n \boldsymbol I^{\text{Fisher}}(\boldsymbol \theta) = n \boldsymbol I^{\text{Fisher}}(\boldsymbol \theta)\).

2.4 Entropy learning and maximum likelihood

2.4.1 The relative entropy between true model and approximating model

Assume we have observations \(D = \{x_1, \ldots, x_n\}\). The data is sampled from \(F\), the true but unknown data generating distribution. We also specify a family of distributions \(G_{\boldsymbol \theta}\) indexed by \(\boldsymbol \theta\) to approximate \(F\).

The relative entropy \(D_{\text{KL}}(F,G_{\boldsymbol \theta})\) then measures the divergence of the approximation \(G_{\boldsymbol \theta}\) from the unknown true model \(F\). It can be written as: \[ \begin{split} D_{\text{KL}}(F,G_{\boldsymbol \theta}) &= H(F,G_{\boldsymbol \theta}) - H(F) \\ &= \underbrace{- \text{E}_{F}\log g_{\boldsymbol \theta}(x)}_{\text{cross-entropy}} -(\underbrace{-\text{E}_{F}\log f(x)}_{\text{entropy of $F$, does not depend on $\boldsymbol \theta$}})\\ \end{split} \]

However, since we do not know \(F\) we cannot actually compute this divergence. Nonetheless, we may use the empirical distribution \(\hat{F}_n\) (a function of the observed data) as an approximation for \(F\), and in this way we arrive at an approximation for \(D_{\text{KL}}(F,G_{\boldsymbol \theta})\) that becomes more and more accurate with growing sample size.


Recall the “Law of Large Numbers”:

  • By the strong law of large numbers the empirical distribution \(\hat{F}_n\) based on observed data \(D=\{x_1, \ldots, x_n\}\) converges to the true underlying distribution \(F\) as \(n \rightarrow \infty\) almost surely: \[ \hat{F}_n\overset{a. s.}{\to} F \]

  • For \(n \rightarrow \infty\) the average \(\text{E}_{\hat{F}_n}(h(x)) = \frac{1}{n} \sum_{i=1}^n h(x_i)\) converges to the expectation \(\text{E}_{F}(h(x))\).


Hence, for large sample size \(n\) we can approximate cross-entropy and as a result the KL divergence. The cross-entropy \(H(F, G_{\boldsymbol \theta})\) is approximated by the empirical cross-entropy where the expectation is taken with regard to \(\hat{F}_n\) rather than \(F\): \[ \begin{split} H(F, G_{\boldsymbol \theta}) & \approx H(\hat{F}_n, G_{\boldsymbol \theta}) \\ & = - \text{E}_{\hat{F}_n} (\log g(x|\boldsymbol \theta)) \\ & = -\frac{1}{n} \sum_{i=1}^n \log g(x_i | \boldsymbol \theta) \\ & = -\frac{1}{n} l_n ({\boldsymbol \theta}| D) \end{split} \] This turns out to be equal to the negative log-likelihood standardised by the sample size \(n\)! Or in other words, the log-likelihood is the negative empirical cross-entropy multiplied by sample size \(n\).
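To illustrate this approximation (a Python sketch with simulated data and arbitrarily chosen parameters, not part of the formal development), we may draw a sample from a “true” normal model \(F\) and compare the empirical cross-entropy \(-\frac{1}{n} l_n(\boldsymbol \theta | D)\) for a fixed candidate model \(G_{\boldsymbol \theta}\) with the exact cross-entropy \(H(F, G_{\boldsymbol \theta})\) from Example 2.6:

```python
import math
import random

mu_ref, s2_ref = 0.0, 1.0   # "true" data generating model F = N(0, 1)
mu, s2 = 0.5, 2.0           # fixed candidate model G_theta = N(0.5, 2)

def log_g(x):
    """Log-density of the candidate model G_theta."""
    return -0.5 * (math.log(2 * math.pi * s2) + (x - mu) ** 2 / s2)

random.seed(42)
n = 100_000
data = [random.gauss(mu_ref, math.sqrt(s2_ref)) for _ in range(n)]

empirical = -sum(log_g(x) for x in data) / n   # H(Fhat_n, G_theta) = -(1/n) l_n(theta | D)
exact = 0.5 * ((mu - mu_ref) ** 2 / s2 + s2_ref / s2 + math.log(2 * math.pi * s2))

print(empirical, exact)   # the empirical cross-entropy approaches H(F, G_theta) as n grows
```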

From the link of the multinomial coefficient with Shannon entropy (Example 2.3) we already know that for large sample size \[ H(\hat{F}) \approx \frac{1}{n} \log \binom{n}{n_1, \ldots, n_K} \]

The KL divergence \(D_{\text{KL}}(F,G_{\boldsymbol \theta})\) can therefore be approximated by \[ D_{\text{KL}}(F,G_{\boldsymbol \theta}) \approx -\frac{1}{n} \left( \log \binom{n}{n_1, \ldots, n_K} + l_n ({\boldsymbol \theta}| D) \right) \] Thus, with the KL divergence we obtain not just the log-likelihood (the cross-entropy part) but also the multiplicity factor taking account of the possible orderings of the data (the entropy part).

2.4.2 Minimum KL divergence and maximum likelihood

If we knew \(F\) we would simply minimise \(D_{\text{KL}}(F, G_{\boldsymbol \theta})\) to find the particular model \(G_{\boldsymbol \theta}\) that is closest to the true model. Equivalently, we would minimise the cross-entropy \(H(F, G_{\boldsymbol \theta})\). However, since we actually don’t know \(F\) this is not possible.

However, for large sample size \(n\) when the empirical distribution \(\hat{F}_n\) is a good approximation for \(F\), we can use the results from the previous section. Thus, instead of minimising the KL divergence \(D_{\text{KL}}(F, G_{\boldsymbol \theta})\) we simply minimise \(H(\hat{F}_n, G_{\boldsymbol \theta})\) which is the same as maximising the log-likelihood \(l_n ({\boldsymbol \theta}| D)\).

Conversely, this implies that maximising the likelihood with regard to \(\boldsymbol \theta\) is equivalent (asymptotically, for large \(n\)) to minimising the KL divergence of the approximating model from the unknown true model!

\[ \begin{split} \hat{\boldsymbol \theta}^{ML} &= \underset{\boldsymbol \theta}{\arg \max}\,\, l_n(\boldsymbol \theta| D) \\ &= \underset{\boldsymbol \theta}{\arg \min}\,\, H(\hat{F}_n, G_{\boldsymbol \theta}) \\ &\approx \underset{\boldsymbol \theta}{\arg \min}\,\, D_{\text{KL}}(F, G_{\boldsymbol \theta}) \\ \end{split} \]

Therefore, the reasoning behind the method of maximum likelihood is that it minimises a large sample approximation of the KL divergence of the candidate model \(G_{\boldsymbol \theta}\) from the unknown true model \(F\).

As a consequence of this close link between maximum likelihood and relative entropy, maximum likelihood inherits for large \(n\) (and only then!) the optimality properties of the KL divergence. These will be discussed in more detail later in the course.
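The equivalence can be illustrated with a small simulation (a Python sketch under simplifying assumptions: a normal model with known variance and a grid search over the mean, chosen purely for illustration). The maximiser of the log-likelihood coincides with the minimiser of the empirical cross-entropy and, for large \(n\), is close to the minimiser of the true KL divergence, which by Example 2.10 is the true mean:

```python
import math
import random

mu_true, s2 = 1.0, 1.0   # true model F = N(1, 1); candidate family G_mu = N(mu, 1)

random.seed(0)
n = 10_000
data = [random.gauss(mu_true, math.sqrt(s2)) for _ in range(n)]

def neg_log_lik(mu):
    """Negative log-likelihood = n * empirical cross-entropy H(Fhat_n, G_mu)."""
    return sum(0.5 * (math.log(2 * math.pi * s2) + (x - mu) ** 2 / s2) for x in data)

def kl_true(mu):
    """Exact D_KL(F, G_mu) = (mu - mu_true)^2 / (2 sigma^2), cf. Example 2.10."""
    return (mu - mu_true) ** 2 / (2 * s2)

grid = [i / 100 for i in range(0, 201)]    # candidate values for mu in [0, 2]
mu_ml = min(grid, key=neg_log_lik)         # maximum likelihood = minimum empirical cross-entropy
mu_kl = min(grid, key=kl_true)             # minimiser of the true KL divergence

print(mu_ml, mu_kl, sum(data) / n)         # mu_ml ~ sample mean, mu_kl = mu_true
```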

2.4.3 Further connections

Since minimising the KL divergence contains ML estimation as a special case, you may wonder whether there is a broader justification for using relative entropy in statistical data analysis.

Indeed, the KL divergence has a strong geometrical interpretation that forms the basis of information geometry. In this field the manifold of distributions is studied using tools from differential geometry, and the expected Fisher information plays an important role as the metric tensor in the space of distributions.

Furthermore, the KL divergence is also linked to probabilistic forecasting via the framework of so-called scoring rules: the only local proper scoring rule is the negative log-probability (“surprise”). The expected “surprise” is the cross-entropy, and relative entropy is the corresponding natural divergence connected with the log scoring rule.

Another intriguing property of the KL divergence is that the relative entropy \(D_{\text{KL}}(F, G)\) is the only divergence measure that is both a Bregman divergence and an \(f\)-divergence. Note that \(f\)-divergences and Bregman divergences (in turn related to proper scoring rules) are two large classes of measures of similarity and divergence between probability distributions.

Finally, not only maximum likelihood estimation but also the Bayesian update rule (discussed later in this module) can be viewed as a special case of entropy learning.