6  Maximum entropy

This chapter introduces information entropy with prior measure, equal to the negative KL divergence and closely linked to Boltzmann entropy. Subsequently, the principle of maximum entropy is used to characterise distributions, naturally leading to exponential families.

6.1 Information entropy with prior measure

Entropy of a distribution with regard to a reference

The information entropy with prior measure of the distribution \(Q\) with reference to the distribution \(P\) is given by \[ \begin{split} G(Q, P) &= H(Q) - H(Q, P) \\ &= -\operatorname{E}_Q\log\left(\frac{q(x)}{p(x)}\right) \\ &= - D_{\text{KL}}(Q, P) \end{split} \] Note that this quantity equals the negative of the KL divergence between \(Q\) and \(P\) (Section 4.2).

We denote information entropy with prior measure by \(G(Q,P)\), not to be confused with the mean log-loss \(H(Q, P)\).
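To make the definition concrete, here is a minimal numerical sketch in Python (not part of the text; the function name entropy_with_prior and the example probability vectors are purely illustrative) that computes \(G(Q, P)\) for discrete distributions given as probability vectors:

```python
# Minimal sketch: G(Q, P) = H(Q) - H(Q, P) = -KL(Q, P) for discrete
# distributions given as probability vectors (illustrative helper name).
import numpy as np

def entropy_with_prior(q, p):
    """G(Q, P) = -sum_k q_k * log(q_k / p_k), using the convention 0 log 0 = 0."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0
    return -np.sum(q[mask] * np.log(q[mask] / p[mask]))

q = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])
print(entropy_with_prior(q, p))   # negative, since G(Q, P) <= 0
print(entropy_with_prior(p, p))   # 0.0: the maximum, attained only for Q = P
```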

Standard information entropy \(H(Q)\) (see Section 3.2) is a special case of information entropy with prior measure \(G(Q, P)\) when the reference distribution \(P\) is uniform (see Example 6.1).

Standard information entropy \(H(Q)\) can be derived by combinatorial arguments (see Example 3.11). Information entropy with prior measure \(G(Q,P)\) is a natural generalisation of standard information entropy \(H(Q)\) and follows a related derivation (see Example 6.2).

Example 6.1 Information entropy with uniform prior measure:

Assume a distribution \(Q\) and the uniform distribution \(U\) with constant pdmf \(u(x)=c\) over the domain of \(Q\).

The mean log-loss \(H(Q, U)\) is \[ H(Q, U) = -\operatorname{E}_Q(\log u(x)) = -\log c = \text{const.} \] and does not depend on \(Q\).

The information entropy of \(Q\) with reference to \(U\) is therefore \[ G(Q, U) = H(Q) - H(Q, U) = H(Q) + \text{const.} \] Thus, the standard information entropy \(H(Q)\) is equivalent (up to an additive constant) to the information entropy \(G(Q,U)\) if the reference measure is the uniform distribution.

Note 6.1: Other names for information entropy with prior measure

Information entropy with regard to a reference distribution is known in physics as generalised Boltzmann-Gibbs-Shannon entropy (generalised BGS entropy).

Occasionally, it is called relative BGS entropy but that is ambiguous because relative entropy commonly refers to KL divergence.

Many other proposals for “generalised” entropies exist in the literature. Hence it is recommended to use “information entropy with respect to a prior measure” for \(G(Q, P)\) rather than “relative” or “generalised” entropy.

Properties

Information entropy with prior measure shares its properties with KL divergence since the only difference is the opposite sign.

As a result, it is always non-positive, \(G(Q, P) \leq 0\), with the maximum \(G(Q, P) = 0\) achieved only for \(Q=P\).

In addition, the negative sign compared to KL divergence means that information entropy with prior measure \(G(Q, P)\) is strictly concave in \(Q\), exactly like information entropy \(H(Q)\). This ensures that optimisation with regard to \(Q\) yields a unique maximum (if one exists).

Furthermore, information entropy with prior measure is invariant under an invertible change of variables from \(x\) to \(y\) so that \[ G(Q_y, P_y) = G(Q_x, P_x) \] It also obeys the data-processing inequality with \[ G(Q_y,P_y) \geq G(Q_x,P_x) \] so that the entropy is non-decreasing under a data-processing map from \(x\) to \(y\) (with equality for lossless transformations such as a change of variables).
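The invariance property can be checked numerically. The following sketch (the normal distributions and the affine map \(y = 2x + 3\) are arbitrary illustrative choices) evaluates \(G(Q, P) = -\operatorname{E}_Q\log(q(x)/p(x))\) by numerical integration before and after the change of variables:

```python
# Sketch: invariance of G(Q, P) under the invertible map y = 2x + 3.
# Q_x = N(0, 1), P_x = N(1, 2^2) transform to Q_y = N(3, 2^2), P_y = N(5, 4^2).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def G(q_dist, p_dist):
    """G(Q, P) = -E_Q log(q(x)/p(x)), computed by numerical integration."""
    integrand = lambda x: q_dist.pdf(x) * (q_dist.logpdf(x) - p_dist.logpdf(x))
    val, _ = quad(integrand, -np.inf, np.inf)
    return -val

print(G(norm(0, 1), norm(1, 2)))   # approx -0.443
print(G(norm(3, 2), norm(5, 4)))   # same value: G is invariant
```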

Interpretation

\(G(Q, P)\) measures the spread of probability mass relative to a reference distribution \(P\).

This generalises from information entropy \(H(Q)\) which measures the spread of probability mass relative to the uniform distribution.

\(\color{Red} \blacktriangleright\) Entropy as log-probability

Boltzmann’s key insight into the foundations of statistical mechanics was that entropy is a logarithmic measure of the “size” or “volume” of a macrostate (see Section 3.3).

Taking this volume to be proportional to the number of microstates compatible with the observed macrostate yields information entropy (see Example 3.11). Recalling that probability is a normalised positive measure, a natural generalisation is to measure the volume of a macrostate by its probability. This leads to information entropy with prior measure (Example 6.2) and the general notion of entropy as log-probability.

Specifically, for a macrostate represented by counts \(D=\{n_1, \ldots, n_K\}\) for \(K\) classes with \(n = \sum_{k=1}^K n_k\), or equivalently by \(\hat{Q} = \operatorname{Cat}(\hat{\boldsymbol q})\) with \(\hat{q}_k = n_k/n\), the logarithm of the multiplicity \(W_K\) yields, for large \(n\), the information entropy \[ \frac{1}{n} \log W_K \approx H(\hat{Q}) \] If instead we use the logarithm of the probability of the macrostate, the analogous relation leads to the information entropy with prior measure \(P\) with \[ \frac{1}{n} \log \operatorname{Pr}(D| \,\boldsymbol p) \approx G(\hat{Q}, P) \] where \(\boldsymbol p= (p_1, \ldots, p_K)^T\) are the prior probabilities of each class with \(P = \operatorname{Cat}(\boldsymbol p)\). See Example 6.2 for details.

The above link between probability and entropy written in terms of KL divergence as \[ \operatorname{Pr}(D| \,\boldsymbol p) \approx e^{-n D_{\text{KL}}(\hat{Q}, P)} \] forms the basis of large deviations theory (in probability) and the method of types (in information theory).
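Both approximations are easy to verify numerically. In the sketch below (the counts and the prior class probabilities are made-up illustration values) the multinomial coefficient is evaluated via log-gamma functions and the macrostate probability via scipy.stats.multinomial:

```python
# Sketch: (1/n) log W_K ~ H(Qhat) and (1/n) log Pr(D | p) ~ G(Qhat, P) for large n.
import numpy as np
from scipy.special import gammaln
from scipy.stats import multinomial

counts = np.array([500, 300, 200])            # macrostate D = {n_1, ..., n_K}
n = counts.sum()
qhat = counts / n                              # empirical distribution Qhat
p = np.array([0.4, 0.4, 0.2])                  # prior measure P = Cat(p)

log_WK = gammaln(n + 1) - gammaln(counts + 1).sum()   # log multinomial coefficient
H_qhat = -np.sum(qhat * np.log(qhat))
G_qhat_P = -np.sum(qhat * np.log(qhat / p))

print(log_WK / n, H_qhat)                                # approximately equal
print(multinomial.logpmf(counts, n, p) / n, G_qhat_P)    # approximately equal
```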

Example 6.2 \(\color{Red} \blacktriangleright\) Combinatorial derivation of information entropy with prior measure:

This is an extension of the combinatorial derivation of the standard information entropy (Example 3.11).

Assume \(\hat{Q}=\operatorname{Cat}(\hat{\boldsymbol q})\) is an empirical categorical distribution based on observed counts \(D = \{n_1, \ldots, n_K\}\), representing the macrostate, and \(P=\operatorname{Cat}(\boldsymbol p)\) is a second categorical distribution with class probabilities \(\boldsymbol p=(p_1, \ldots, p_K)^T\). For large \(n\) we may use the multinomial coefficient \(W_K = \binom{n}{n_1, \ldots, n_K}\) to obtain the entropy of \(\hat{Q}\) via \(\log W_K \approx n H(\hat{Q})\) (Example 3.11).

The multinomial log-probability of \(D\) given model \(\boldsymbol p\), scaled by \(1/n\), then yields \[ \begin{split} \frac{1}{n} \log p(D| \,\boldsymbol p) &= \frac{1}{n} \log \left( W_K \times \prod_{k=1}^K p_k^{n_k} \right)\\ & = \frac{1}{n} \left( \log W_K + \sum_{k=1}^K n_k \log p_k \right)\\ & \approx H(\hat{Q}) + \frac{1}{n} \sum_{k=1}^K n_k \log p_k \\ & = H(\hat{Q}) + \sum_{k=1}^K \hat{q}_k \log p_k \\ &=H(\hat{Q}) -H(\hat{Q}, P)\\ &= G(\hat{Q}, P)\\ \end{split} \] and hence \[ \frac{1}{n} \log p(D| \,\boldsymbol p) \approx G(\hat{Q}, P) \]

Standard information entropy \(H(\hat{Q})\) implicitly assumes that every possible microstate, regardless of its associated macrostate, has the same probability (equal to \(1/K^n\)) and that the probabilities of each class are equal (\(p_k= 1/K\)). In contrast, information entropy \(G(\hat{Q}, P)\) with prior measure \(P\) only assumes that microstates belonging to the same macrostate have identical probability (equal to \(\prod_{k=1}^K p_k^{n_k}\)) and also allows for unequal class probabilities \(p_k\).

6.2 Maximum entropy principle to characterise distributions

Rationale of maximum entropy

In physics, systems at equilibrium are characterised by maximum entropy, subject to constraints. For example, in statistical mechanics maximum entropy characterises the most typical macrostate (see Section 3.3).

As both entropy with prior measure \(G(Q, P)\) and standard entropy \(H(Q)\) are strictly concave in \(Q\) (with the reference \(P\) fixed or implicitly set to the uniform distribution), the maximum-entropy distribution is unique (if it exists).

As discussed in Chapter 3 and Section 6.1, large entropy implies that the distribution is spread out (relative to the reference \(P\)). Correspondingly, maximum-entropy distributions can be considered minimally informative about a random variable.

Intriguingly, many distributions (and distribution families) commonly used in statistical modelling are characterised by maximum entropy. For example:

  1. The discrete uniform distribution is the unique maximum-entropy distribution among all categorical distributions. See Example 6.3.

  2. The exponential distribution is the unique maximum-entropy distribution among all continuous distributions supported on \([0, \infty)\) with a specified mean. See Example 6.4.

  3. The normal distribution is the unique maximum-entropy distribution among all continuous distributions supported on \((-\infty, \infty)\) with a specified mean and variance. A numerical check is sketched after this list.

  4. More generally, exponential families also follow from the principle of maximum entropy subject to specified expectation parameters. See Example 6.5.
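As a numerical illustration of the third item above, the following sketch (the Laplace and uniform comparison distributions, matched to the same mean and variance, are arbitrary choices) compares differential entropies and confirms that the normal distribution attains the largest value:

```python
# Sketch: among distributions with the same mean and variance, the normal
# distribution has the largest differential entropy.
import numpy as np
from scipy.stats import norm, laplace, uniform

mu, sigma = 0.0, 1.5

candidates = {
    "normal":  norm(mu, sigma),
    "laplace": laplace(mu, sigma / np.sqrt(2)),                           # variance 2b^2 = sigma^2
    "uniform": uniform(mu - sigma * np.sqrt(3), 2 * sigma * np.sqrt(3)),  # variance w^2/12 = sigma^2
}

for name, dist in candidates.items():
    print(name, dist.entropy())   # the normal gives the largest entropy
```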

Therefore, there is a very close link between the principle of maximum entropy and widely used choices for models in statistics and machine learning.

Examples

Example 6.3 Discrete uniform distribution as maximum-entropy distribution:

Let \(Q=\operatorname{Cat}(\boldsymbol q)\) be a general categorical distribution with \(K\) classes and corresponding class probabilities \(\boldsymbol q=(q_1, \ldots, q_K)^T\) and \(P=\operatorname{Cat}(\boldsymbol p)\) be the discrete uniform distribution with equal class probabilities \(\boldsymbol p=(1/K, \ldots, 1/K)^T\).

We now show that the entropy \(H(Q)\) is maximised uniquely for \(Q=P\), i.e. if \(Q\) is the discrete uniform distribution.

First, note that \(H(Q)\) equals \(G(Q, P)\) up to an additive constant since \(P\) is the uniform distribution (Example 6.1). Hence, maximising \(H(Q)\) is equivalent to maximising \(G(Q, P)\) with regard to \(Q\). The entropy \(G(Q, P)\) achieves its maximum value of zero uniquely for \(Q=P\); thus \(H(Q)\) is also maximised uniquely for \(Q=P\).

Alternatively, we can also show directly that \(H(P) \geq H(Q)\) with equality only for \(Q=P\), i.e. that the entropy of \(Q\) is always smaller than the entropy of \(P\) unless \(Q=P\). The entropy of the discrete uniform distribution \(P\) is \(H(P) = \log K\) (see Example 3.4). The mean log-loss between \(Q\) and \(P\) is \(H(Q, P) = - \sum_{k=1}^K q_k \log p_k = \log K\). Gibbs’ inequality states that \(H(Q, P) \geq H(Q)\), with equality only if \(P=Q\). Since in our case \(H(Q, P) = H(P) = \log K\) it follows directly that \(H(P) \geq H(Q)\).

The uniqueness of the maximum follows from Gibbs’ inequality and from the strict concavity of information entropy.
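The same conclusion can be reached by direct numerical optimisation. The sketch below (the choice \(K = 4\) and the starting point are arbitrary) maximises \(H(\boldsymbol q)\) over the probability simplex and recovers the discrete uniform distribution:

```python
# Sketch: maximising H(q) over the probability simplex recovers q_k = 1/K.
import numpy as np
from scipy.optimize import minimize

K = 4

def neg_entropy(q):
    q = np.clip(q, 1e-12, None)   # avoid log(0)
    return np.sum(q * np.log(q))

constraints = [{"type": "eq", "fun": lambda q: q.sum() - 1.0}]
bounds = [(0.0, 1.0)] * K
q0 = np.array([0.7, 0.1, 0.1, 0.1])   # arbitrary starting point

res = minimize(neg_entropy, q0, bounds=bounds, constraints=constraints)
print(res.x)                   # approx (1/K, ..., 1/K)
print(-res.fun, np.log(K))     # maximum entropy approx log K
```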

Example 6.4 Exponential distribution as maximum-entropy distribution:

Assume \(Q\) is a continuous distribution for \(x\) with support \([0, \infty)\) and with specified mean \(\operatorname{E}_Q(x) = \theta\). Furthermore, \(P=\operatorname{Exp}(\theta)\) is an exponential distribution with scale parameter \(\theta\), log-density \(\log p(x | \theta) = -x/\theta -\log \theta\) and \(\operatorname{E}_P(x) = \theta\).

We now show that the entropy \(H(Q)\) is maximised uniquely for \(Q=P\), i.e. if \(Q\) is an exponential distribution with specified mean.

The information entropy of \(P\) is \(H(P) = -\operatorname{E}_P( \log p(x | \theta) ) = 1 +\log \theta\) as \(\operatorname{E}_P(x) = \theta\). The mean log-loss is \(H(Q, P) = -\operatorname{E}_Q (\log p(x | \theta)) = 1 +\log \theta\) as \(\operatorname{E}_Q(x) = \theta\). Gibbs’ inequality states that \(H(Q, P) \geq H(Q)\). Since in our case \(H(Q, P) = H(P)\) it follows directly that \(H(P) \geq H(Q)\).

Therefore, the exponential distribution \(P\) achieves maximum entropy, and the distribution \(Q\) will have lower entropy than \(P\), unless \(Q=P\). The maximum is unique because of Gibbs’ inequality and the strict concavity of information entropy.
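The Gibbs-inequality argument can be checked numerically. In the sketch below the competing distribution \(Q\) is a gamma distribution with the same mean \(\theta\), an arbitrary illustrative choice:

```python
# Sketch: H(Q, P) = H(P) = 1 + log(theta) >= H(Q) for P = Exp(theta) and
# any Q on [0, inf) with mean theta (here Q = Gamma(shape=2, scale=theta/2)).
import numpy as np
from scipy.integrate import quad
from scipy.stats import expon, gamma

theta = 2.0
P = expon(scale=theta)                 # mean theta
Q = gamma(a=2, scale=theta / 2)        # also mean theta

H_P = P.entropy()                                               # equals 1 + log(theta)
H_Q = Q.entropy()
H_QP, _ = quad(lambda x: -Q.pdf(x) * P.logpdf(x), 0, np.inf)    # mean log-loss H(Q, P)

print(H_P, 1 + np.log(theta))   # equal
print(H_QP)                     # also 1 + log(theta), since E_Q(x) = theta
print(H_Q)                      # strictly smaller: the exponential is max-ent
```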

Example 6.5 \(\color{Red} \blacktriangleright\) Exponential families as maximum-entropy distributions:

Exponential families (see Section 2.4) are also characterised by achieving maximum entropy with specified expectation parameters.

Specifically, assume \(B\) is a base distribution with pdmf \(b(x)\) and \(P= P(\boldsymbol \eta)\) is an exponential family constructed by exponential tilting \(B\) with log-pdmf \(p(x| \boldsymbol \eta) = \langle \boldsymbol \eta, \boldsymbol t(x) \rangle + \log b(x) - a(\boldsymbol \eta)\). The expectation parameters are specified as \(\operatorname{E}_P(\boldsymbol t(x)) = \boldsymbol \mu_{\boldsymbol t}\) where \(\boldsymbol t(x)\) are the canonical statistics. Let \(Q\) be a distribution on the same domain as \(B\) and \(P\) subject to the constraint that \(\operatorname{E}_Q(\boldsymbol t(x)) = \boldsymbol \mu_{\boldsymbol t}\).

We now show that the entropy \(G(Q, B)\) is maximised uniquely for \(Q=P\), i.e. if \(Q\) is an exponential family constructed from \(B\) with the specified expectation parameters.

The entropy of \(P\) is \(H(P) = -\langle \boldsymbol \eta, \boldsymbol \mu_{\boldsymbol t} \rangle + H(P, B) + a(\boldsymbol \eta)\). The mean log-loss between \(Q\) and \(P\) is \(H(Q, P) = -\langle \boldsymbol \eta, \boldsymbol \mu_{\boldsymbol t} \rangle + H(Q, B) + a(\boldsymbol \eta)\). Gibbs’ inequality states that \(H(Q, P) \geq H(Q)\). Since in our case \(H(Q, P) = H(P)-H(P,B)+H(Q,B)\) it follows directly that \(H(P)-H(P,B) \geq H(Q)-H(Q,B)\) and thus \(G(P, B) \geq G(Q, B)\).

Therefore, the exponential family \(P\) achieves maximum entropy (with reference measure \(B\)) and the distribution \(Q\) will have lower entropy than \(P\), unless \(Q=P\). The maximum is unique because of Gibbs’ inequality and the strict concavity of information entropy.
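The result can be illustrated numerically on a finite domain. In the sketch below the base distribution \(b\), the canonical statistic \(t(x) = x\) and the target mean are made-up illustration values; maximising \(G(Q, B)\) under the moment constraint recovers the exponentially tilted distribution \(p(x) \propto b(x)\, e^{\eta\, t(x)}\):

```python
# Sketch: constrained maximisation of G(Q, B) subject to E_Q(t(x)) = mu_t
# coincides with exponential tilting of the base distribution B.
import numpy as np
from scipy.optimize import minimize, brentq

x = np.arange(10.0)                         # finite domain
b = np.exp(-0.5 * x); b /= b.sum()          # base distribution B
t = x                                       # canonical statistic t(x) = x
mu_t = 4.0                                  # specified expectation parameter

# Direct constrained maximisation of G(q, b) = -sum_k q_k log(q_k / b_k)
def neg_G(q):
    q = np.clip(q, 1e-12, None)
    return np.sum(q * np.log(q / b))

constraints = [{"type": "eq", "fun": lambda q: q.sum() - 1.0},
               {"type": "eq", "fun": lambda q: q @ t - mu_t}]
res = minimize(neg_G, b, bounds=[(0.0, 1.0)] * len(x), constraints=constraints)

# Exponential tilting: choose eta so that E_P(t(x)) = mu_t
def tilted(eta):
    w = b * np.exp(eta * t)
    return w / w.sum()

eta = brentq(lambda e: tilted(e) @ t - mu_t, -10.0, 10.0)

print(np.max(np.abs(res.x - tilted(eta))))   # close to zero: same distribution
```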

6.3 Further reading

See Akaike (1985) for a historical account of Boltzmann entropy as log-probability and the discovery of information entropy with prior measure by Boltzmann in 1878.

Jaynes (2003) provides a broad discussion of the principle of maximum entropy as a foundation for statistical learning.

See also: Maximum-entropy probability distributions (Wikipedia)

Tip: A bit of history

The concept of entropy as log-probability is due to Ludwig Boltzmann (1844–1906). Boltzmann discovered the form of information entropy with prior measure in 1878, following the earlier 1877 formulation of standard information entropy.

Max Planck (1858–1947) was the first to explicitly state the fundamental relationship between entropy and log-probability (with credit to Boltzmann), using it to derive the spectral density of black-body radiation in 1900.

Maximum entropy originated in thermodynamics as an empirical characterisation of equilibrium states. Josiah W. Gibbs (1839–1903) used the principle of maximum entropy subject to constraints to explicitly derive the canonical distribution (the physics name for exponential families) to represent equilibrium states in statistical mechanics.

Solomon Kullback (1907–1994) and Richard Leibler (1914–2003) advocated the principle of minimum KL divergence in statistics.

Edwin T. Jaynes (1922–1998) unified information theory and statistical mechanics and promoted the principle of maximum entropy as a general method for statistical learning and updating uncertainty.