6  Principle of maximum entropy

6.1 \(\color{Red} \blacktriangleright\) Maximum entropy principle to characterise distributions

Both Shannon entropy and differential entropy are useful to characterise distributions.

As discussed in Chapter 3, large entropy implies that the distribution is spread out whereas small entropy indicates that the distribution is concentrated.

Correspondingly, maximum entropy distributions can be considered minimally informative about a random variable. The higher the entropy, the more spread out (and hence the less informative) the distribution. Conversely, low entropy implies that the probability mass is concentrated and thus the distribution is more informative about the random variable.
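
The contrast between spread-out and concentrated distributions is easy to verify numerically. The following minimal sketch (assuming NumPy and SciPy are available; the probabilities are chosen purely for illustration) compares the Shannon entropy of a uniform and a highly concentrated categorical distribution.

```python
import numpy as np
from scipy.stats import entropy

# Two categorical distributions over four classes.
spread = np.array([0.25, 0.25, 0.25, 0.25])        # uniform: maximally spread out
concentrated = np.array([0.97, 0.01, 0.01, 0.01])  # nearly all mass on one class

print(entropy(spread))        # log(4) ~ 1.386 nats: uninformative about the outcome
print(entropy(concentrated))  # ~ 0.168 nats: highly informative about the outcome
```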

Examples:

  1. The discrete uniform distribution is the maximum entropy distribution among all discrete distributions over a finite set of \(K\) outcomes.

  2. The maximum entropy distribution of a continuous random variable with support \((-\infty, \infty)\) and a specified mean and variance is the normal distribution.

  3. The maximum entropy distribution among all continuous distributions supported on \([0, \infty)\) with a specified mean is the exponential distribution (examples 2 and 3 are checked numerically in the sketch after this list).
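
Examples 2 and 3 can be checked numerically by comparing differential entropies of distributions with matched moments. The sketch below (assuming SciPy is available) uses a Laplace and a gamma distribution as arbitrary comparison choices; they are not singled out by the theory.

```python
from scipy import stats

# Example 2: distributions on the real line with mean 0 and variance 1.
print(stats.norm(loc=0, scale=1).entropy())           # normal: ~ 1.419 nats
print(stats.laplace(loc=0, scale=2**-0.5).entropy())  # same variance: ~ 1.347 nats

# Example 3: distributions on [0, infinity) with mean 1.
print(stats.expon(scale=1).entropy())                 # exponential: 1 + log(1) = 1 nat
print(stats.gamma(a=2, scale=0.5).entropy())          # same mean: ~ 0.884 nats
```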

Using maximum entropy to characterise maximally uninformative distributions was advocated by Edwin T. Jaynes (1922–1998), who also proposed using maximum entropy to construct Bayesian priors. The maximum entropy principle in statistical physics goes back to Ludwig Boltzmann.

A list of maximum entropy distributions is given at https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution.

Many distributions commonly used in statistical modelling are exponential families. Intriguingly, these distributions are all maximum entropy distributions, so there is a very close link between the principle of maximum entropy and common model choices in statistics and machine learning.

Example 6.1 \(\color{Red} \blacktriangleright\) Discrete uniform distribution as maximum entropy distribution:

Assume \(G\) is a categorical distribution with \(K\) classes and probabilities \(g_i\). We now show that \(G\) has maximum entropy when \(G\) is the discrete uniform distribution.

Let \(P=U_K\) be the discrete uniform distribution with equal probabilities \(p_i=1/K\). The entropy of \(P\) is \(H(P) = \log K\) (see also Example 3.9). The cross-entropy is \(H(G, P) = -\text{E}_G \log p_i = -\sum_{i=1}^K g_i \log(1/K) = \log K\). Note that cross-entropy and entropy are identical, \(H(G, P) = H(P)\).

From Gibbs’ inequality we know that \(H(G, P) \geq H(G)\). Since in our case \(H(G, P) = H(P)\) it follows directly that \(H(P) \geq H(G)\), i.e. the discrete uniform distribution \(P\) achieves the maximum entropy. Furthermore, equality \(H(G) = H(P)\) holds only if \(G = P\), i.e. only if \(g_i = 1/K\); any other distribution \(G\) has strictly lower entropy.
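
The argument can be illustrated numerically. In the sketch below (assuming NumPy is available) the probabilities \(g_i\) are an arbitrary choice; any non-uniform \(G\) would do.

```python
import numpy as np

K = 5
g = np.array([0.4, 0.3, 0.15, 0.1, 0.05])  # arbitrary categorical distribution G
p = np.full(K, 1 / K)                      # discrete uniform U_K

H_G = -np.sum(g * np.log(g))   # entropy H(G)
H_GP = -np.sum(g * np.log(p))  # cross-entropy H(G, P)

print(np.log(K))  # H(P) = log(5) ~ 1.609
print(H_GP)       # ~ 1.609: equals log K for any G
print(H_G)        # ~ 1.392: strictly smaller since G is not uniform
```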

Example 6.2 \(\color{Red} \blacktriangleright\) Exponential distribution as maximum entropy distribution:

Assume \(G\) is a continuous distribution for \(x\) with support \([0, \infty)\) and with specified mean \(\text{E}(x) = \theta\). We now show that \(G\) has maximum entropy when \(G\) is the exponential distribution.

The log-density of the exponential distribution \(P\) with scale parameter \(\theta\) is \(\log p(x | \theta) = -x/\theta - \log \theta\). The differential entropy of \(P\) is \(H(P) = -\text{E}_P \log p(x | \theta) = 1 + \log \theta\) as \(\text{E}_P(x) = \theta\). The cross-entropy is \(H(G, P) = -\text{E}_G \log p(x | \theta) = 1 + \log \theta\) as \(\text{E}_G(x) = \theta\). Note that cross-entropy and entropy are identical, \(H(G, P) = H(P)\).

From Gibbs’ inequality we know that \(H(G, P) \geq H(G)\). Since in our case \(H(G, P) = H(P)\) it follows directly that \(H(P) \geq H(G)\), i.e. the exponential distribution \(P\) achieves the maximum entropy. Furthermore, equality \(H(G) = H(P)\) holds only if \(G = P\); any other distribution with support \([0, \infty)\) and mean \(\theta\) has strictly lower entropy.
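
As a numerical illustration (assuming NumPy and SciPy are available), the sketch below takes \(G\) to be a gamma distribution with mean \(\theta\), an arbitrary choice of a distribution on \([0, \infty)\) with the required mean, and confirms \(H(G, P) = H(P) \geq H(G)\).

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

theta = 2.0
P = stats.expon(scale=theta)           # exponential distribution with mean theta
G = stats.gamma(a=3, scale=theta / 3)  # gamma distribution, also with mean theta

# Cross-entropy H(G, P) = -E_G log p(x | theta), by numerical integration.
H_GP, _ = quad(lambda x: -G.pdf(x) * P.logpdf(x), 0, np.inf)

print(1 + np.log(theta))  # H(P) = 1 + log(theta) ~ 1.693
print(H_GP)               # ~ 1.693: equals H(P) because E_G(x) = theta
print(G.entropy())        # H(G) ~ 1.442: smaller, as Gibbs' inequality predicts
```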