3 Entropy
Entropy, a fundamental concept that originated in physics, plays a crucial role in information theory and statistical learning. This chapter introduces entropy via the route of logarithmic scoring rules.
3.1 Shannon entropy and differential entropy
The logarithm and units of information storage
Assume we observe a discrete variable \(x\) with \(K\) possible states \(\Omega = \{\omega_1, \ldots, \omega_K\}\). Furthermore, we have a number of identical memory units each of which can record \(A\) different states. Following the principle of common numeral systems, such as decimal, binary or hexadecimal numbers, we see that \(S = \log_A K\) such memory units are sufficient to describe the state of \(x\).
The storage requirement \(S\) is the code length or the cost needed to describe the state of \(x\) using an alphabet of size \(A\).
In the above we have tacitly assumed that all \(K\) states are treated equally so that the storage size, code length, and cost requirement associated with each state is constant and the same for all possible \(K\) states. This can be made more explicit by writing \[ S = -\log_A \left( \frac{1}{K} \right) \] where \(1/K\) is the equal probability of each of the \(K\) states.
Example 3.1 Information storage units:
For \(A=2\) the storage units are called “bits” (binary information units), and a single bit can store 2 states. Hence to describe the \(K=256\) possible states in a system \(8=\log_2 256\) bits (or 1 byte) of storage are sufficient.
For \(A=10\) the units are “dits” (decimal information units), so to describe \(K=100\) possible states \(2=\log_{10} 100\) dits are sufficient, where a single dit can store 10 states.
Finally, if the natural logarithm is used (\(A=e\)) the storage units are called “nits” (natural information units). In the following we will use “nits” and natural logarithm throughout.
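To make the storage calculation concrete, here is a minimal sketch in Python (standard library only; the function name storage_units is ours, not a standard API):

```python
import math

def storage_units(K, A):
    """Number of base-A memory units needed to describe K equally likely states."""
    return math.log(K, A)

print(storage_units(256, 2))       # 8.0 bits (1 byte) for K = 256 states
print(storage_units(100, 10))      # 2.0 dits for K = 100 states
print(storage_units(256, math.e))  # about 5.55 nits for K = 256 states
```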
Surprise or surprisal and logarithmic scoring rule
In practice, the \(K\) states may not all be equally probable. Assume there is a discrete distribution \(P\) with probability mass function \(p(x)\) that models the state probabilities. In this case, instead of using the same code length to describe each state, we may use variable code lengths, assigning shorter codes to more probable states and longer codes to less probable states. More specifically, generalising from the above, we may use the negative logarithm to map the probability of a state \(x\) to a corresponding cost and code length: \[ S(x, P) = -\log p(x) \] As we will see below (Example 3.8, Example 3.9 and Example 3.10) using logarithmic cost allows for expected code lengths that are potentially much smaller than the fixed length \(\log K\), and hence leads to a more space-saving representation.
The negative logarithm of the probability \(p(x)\) of an event \(x\) is known as the surprise or surprisal. The surprise to observe a certain event (with \(p(x)=1\)) is zero, and conversely the surprise to observe an event that is certain not to happen (with \(p(x)=0\)) is infinite.
We will apply \(S(x, P)\) to both discrete and continuous variables \(x\) and corresponding distributions \(P\) and refer to it as the logarithmic score or logarithmic scoring rule (see also Example 3.4). As densities can take on values larger than 1, the logarithmic score \(S(x, P)\) may become negative when \(P\) is a continuous distribution.
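As a small illustration, the following sketch (plain Python; the helper name surprisal is ours) computes the surprise \(-\log p(x)\) in nits for a few discrete events:

```python
import math

def surprisal(p_x):
    """Logarithmic score -log p(x) in nits; infinite for an impossible event."""
    return float("inf") if p_x == 0 else -math.log(p_x)

print(surprisal(1 / 6))  # fair die outcome: log(6), about 1.79 nits of surprise
print(surprisal(1.0))    # certain event: surprise 0
print(surprisal(0.0))    # impossible event: infinite surprise
```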
Example 3.2 Log-odds ratio and surprise:
The log-odds ratio of the probability \(p\) of an event is the difference of the surprise of the complementary event (with probability \(1-p\)) and the surprise of the event:
\[ \begin{split} \text{logit}(p) &= \log\left( \frac{p}{1-p} \right) \\ &= -\log(1-p) - ( -\log p)\\ \end{split} \]
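A quick numerical check of this identity (plain Python; the value \(p = 0.8\) is an arbitrary example):

```python
import math

p = 0.8
logit = math.log(p / (1 - p))                      # log-odds ratio
surprise_diff = -math.log(1 - p) - (-math.log(p))  # surprise(1-p) minus surprise(p)
print(logit, surprise_diff)                        # both approximately 1.386
```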
Example 3.3 Logarithmic score and normal distribution:
If we quote the normal distribution \(P = N(\mu, \sigma^2)\) with density \(p(x |\mu, \sigma^2)= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\) in the logarithmic scoring rule we get as score \[ S\left(x,N(\mu, \sigma^2 )\right) = \frac{1}{2} \left( \log(2\pi\sigma^2) + \frac{(x-\mu)^2}{\sigma^2}\right) \] For fixed variance \(\sigma^2\) this is, up to scale and an additive constant, the squared error \((x-\mu)^2\) of \(x\) from the parameter \(\mu\).
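This can be verified numerically, for instance with scipy (a sketch; note that scipy.stats.norm is parameterised by the standard deviation, hence the square root):

```python
import numpy as np
from scipy.stats import norm

mu, sigma2 = 1.0, 2.0  # arbitrary example values
x = 0.5

# Logarithmic score S(x, N(mu, sigma^2)) = -log p(x | mu, sigma^2)
score_logpdf = -norm.logpdf(x, loc=mu, scale=np.sqrt(sigma2))
score_formula = 0.5 * (np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)

print(score_logpdf, score_formula)  # both approximately 1.328
```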
Example 3.4 \(\color{Red} \blacktriangleright\) General scoring rules:
The function \(S(x, P) = -\log p(x)\) is an important example of a scoring rule for a probabilistic forecast represented by model \(P\) evaluated on the observation \(x\) [1].
While one can devise many different scoring rules the logarithmic scoring rule stands out as it has a number of unique and favourable properties (e.g. Hartley 1928, Shannon 1948, Good 1952, Bernardo 1979). In particular, it is the only scoring rule that is both proper, i.e. the expected score is minimised when the quoted model \(P\) is identical to the data generating model, and local in that the score depends only on the value of the density/probability mass function at \(x\).
Entropy of a distribution
The entropy of the distribution \(P\) is defined as the functional \[ \begin{split} H(P) &= \text{E}_P\left( S(x, P) \right) \\ &= - \text{E}_P\left(\log p(x)\right) \\ \end{split} \] i.e. as the expected logarithmic score when the data are generated by \(P\) and the model \(P\) is evaluated on the observations.
As will become clear, entropy quantifies the spread of the probability mass within a distribution. When the probability mass is concentrated in a specific area, the entropy will be low; conversely, when the probability mass is more widely distributed, the entropy will be high.
The entropy of a discrete probability distribution \(P\) with probability mass function \(p(x)\), \(x \in \Omega\), is called Shannon entropy (1948) [2]. In statistical physics, the Shannon entropy is known as Gibbs entropy (1878):
\[ H(P) = - \sum_{x \in \Omega} \log p(x) \, p(x) \] The entropy of a discrete distribution is the expected surprise. We can also interpret it as the expected cost or expected code length when the data are generated according to model \(P\) and we are also using model \(P\) to describe the data. Furthermore, it also has a combinatorial interpretation (see Example 3.12).
As \(p(x) \in [0,1]\) and hence \(-\log p(x) \geq 0\), the Shannon entropy is by construction bounded below and must be greater than or equal to 0.
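A minimal sketch of the Shannon entropy of a probability mass function, in nits and with the convention \(0 \times \log(0) = 0\) handled explicitly (plain Python; the function name is ours):

```python
import math

def shannon_entropy(p):
    """Shannon entropy H(P) = -sum_x log p(x) p(x) in nits, using 0*log(0) = 0."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

print(shannon_entropy([0.5, 0.5]))         # log(2), about 0.693: mass spread out
print(shannon_entropy([0.9, 0.05, 0.05]))  # about 0.394: mass concentrated
print(shannon_entropy([1.0, 0.0, 0.0]))    # 0: maximum concentration
```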
Applying the definition of entropy to a continuous probability distribution \(P\) with density \(p(x)\) yields the differential entropy: \[ H(P) = - \int_x \log p(x) \, p(x) \, dx \] Because the logarithm is taken of a density, which in contrast to a probability can assume values larger than one, differential entropy can be negative.
Furthermore, since for continuous random variables the shape of the density typically changes under variable transformation, say from \(x\) to \(y\), the differential entropy will change as well under such a transformation so that \(H(P_y) \neq H(P_x)\).
3.2 Entropy examples
Models with single parameter
Example 3.5 The Shannon entropy of the geometric distribution \(F_x = \text{Geom}(\theta)\) with probability mass function \(p(x|\theta) = \theta (1-\theta)^{x-1}\), \(\theta \in [0,1]\), support \(x \in \{1, 2, \ldots \}\) and \(\text{E}(x)= 1/\theta\) is \[ \begin{split} H(F_x) &= - \text{E}\left( \log \theta + (x-1) \log(1-\theta) \right)\\ &= -\log \theta - \left(\frac{1}{\theta}-1\right)\log(1-\theta)\\ &= -\frac{\theta \log \theta + (1-\theta) \log(1-\theta) }{\theta} \end{split} \] Using the identity \(0\times\log(0)=0\) we see that the entropy of the geometric distribution for \(\theta = 1\) equals 0, i.e. it achieves the minimum possible Shannon entropy. Conversely, as \(\theta \rightarrow 0\) it diverges to infinity.
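The closed-form expression can be checked against a truncated numerical sum of the expected surprise (plain Python; the value of \(\theta\) and the truncation point are arbitrary choices for illustration):

```python
import math

theta = 0.3

# Closed form: H = -(theta*log(theta) + (1-theta)*log(1-theta)) / theta
h_formula = -(theta * math.log(theta) + (1 - theta) * math.log(1 - theta)) / theta

# Truncated sum of -log p(x) * p(x) over the support x = 1, 2, 3, ...
h_numeric = 0.0
for x in range(1, 2000):
    p = theta * (1 - theta) ** (x - 1)
    h_numeric += -math.log(p) * p

print(h_formula, h_numeric)  # both approximately 2.036
```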
Example 3.6 Consider the uniform distribution \(F_x = U(0, a)\) with \(a>0\), support from \(0\) to \(a\) and density \(p(x) = 1/a\). The corresponding differential entropy is \[ \begin{split} H( F_x ) &= - \int_0^a \log\left(\frac{1}{a}\right) \, \frac{1}{a} dx \\ &= \log a \int_0^a \frac{1}{a} dx \\ &= \log a \,. \end{split} \] Note that for \(0 < a < 1\) the differential entropy is negative.
Example 3.7 Starting with the uniform distribution \(F_x = U(0, a)\) from Example 3.6 the variable \(x\) is changed to \(y = x^2\) yielding the distribution \(F_y\) with support from \(0\) to \(a^2\) and density \(p(y) = 1/\left(2 a \sqrt{y}\right)\).
The corresponding differential entropy is \[ \begin{split} H( F_y ) &= \int_0^{a^2} \log \left(2 a \sqrt{y}\right) \, 1/\left(2 a \sqrt{y}\right) dy \\ &= \left[ \sqrt{y}/a \, \left(\log \left( 2 a \sqrt{y} \right)-1\right) \right]_{y=0}^{y=a^2} \\ &= \log \left(2 a^2\right) -1 \,. \end{split} \] This is negative for \(0 < a < \sqrt{e/2}\approx 1.1658\). As expected, \(H( F_y ) \neq H( F_x )\), since differential entropy is not invariant under variable transformations.
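Both results, and the fact that they differ, can be confirmed by numerical integration, e.g. with scipy.integrate.quad (a sketch; quad may warn about the integrable singularity of \(p(y)\) at 0 but the values are accurate):

```python
import numpy as np
from scipy.integrate import quad

a = 2.0  # arbitrary example value

# Differential entropy of U(0, a): -integral of log(1/a) * (1/a) over (0, a) = log(a)
h_x, _ = quad(lambda x: -np.log(1 / a) * (1 / a), 0, a)

# After the transformation y = x^2 the density is p(y) = 1 / (2 a sqrt(y)) on (0, a^2)
def p_y(y):
    return 1.0 / (2 * a * np.sqrt(y))

h_y, _ = quad(lambda y: -np.log(p_y(y)) * p_y(y), 0, a**2)

print(h_x, np.log(a))             # both log(2), about 0.693
print(h_y, np.log(2 * a**2) - 1)  # both log(8) - 1, about 1.079
```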
Models with multiple parameters
Example 3.8 The Shannon entropy of the categorical distribution \(P\) with \(K\) categories with class probabilities \(p_1, \ldots, p_K\) is \[ H(P) = - \sum_{k=1}^{K } \log(p_k)\, p_k \]
As \(P\) is discrete \(H(P)\) is bounded below by 0. Furthermore, it is also bounded above by \(\log K\) (cf. Example 6.1). Hence for a categorical distribution \(P\) with \(K\) categories we have \[ 0 \leq H(P) \leq \log K \] The maximum is achieved for the discrete uniform distribution (Example 3.9) and the minimum for a concentrated categorical distribution (Example 3.10).
Example 3.9 Entropy for the discrete uniform distribution \(U_K\):
Let \(p_1=p_2= \ldots = p_K = \frac{1}{K}\). Then \[H(U_K) = - \sum_{k=1}^{K}\log\left(\frac{1}{K}\right)\, \frac{1}{K} = \log K\]
Note that \(\log K\) is the largest value the Shannon entropy can assume with \(K\) classes (cf. Example 6.1) and indicates maximum spread of probability mass.
Example 3.10 Entropy for a categorical distribution with concentrated probability mass:
Let \(p_1=1\) and \(p_2=p_3=\ldots=p_K=0\). Using \(0\times\log(0)=0\) we obtain for the Shannon entropy \[H(P) = -\left( \log(1)\times 1 + \log(0)\times 0 + \dots \right) = 0\]
Note that 0 is the smallest value that Shannon entropy can assume and that it corresponds to maximum concentration of probability mass.
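These bounds can also be illustrated numerically (a sketch assuming numpy; the random draws are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
K = 5

def entropy(p):
    """Shannon entropy in nits over the positive entries, i.e. using 0*log(0) = 0."""
    p = p[p > 0]
    return -np.sum(np.log(p) * p)

# Randomly drawn categorical distributions always satisfy 0 <= H(P) <= log(K).
for _ in range(5):
    p = rng.dirichlet(np.ones(K))
    assert 0 <= entropy(p) <= np.log(K)

print(entropy(np.full(K, 1 / K)), np.log(K))  # uniform: maximum, log(5) about 1.609
print(entropy(np.array([1.0, 0, 0, 0, 0])))   # concentrated: minimum, 0
```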
Example 3.11 Differential entropy of the normal distribution:
The log density of the univariate normal \(N(\mu, \sigma^2)\) distribution is \(\log p(x |\mu, \sigma^2) = -\frac{1}{2} \left( \log(2\pi\sigma^2) + \frac{(x-\mu)^2}{\sigma^2} \right)\) with \(\sigma^2 > 0\). The corresponding differential entropy is, with \(\text{E}((x-\mu)^2) = \sigma^2\), \[ \begin{split} H(P) & = -\text{E}\left( \log p(x |\mu, \sigma^2) \right)\\ & = \frac{1}{2} \left( \log(2 \pi \sigma^2)+1\right) \,. \\ \end{split} \] Note that \(H(P)\) only depends on the variance parameter and not on the mean parameter. This is intuitively clear as only the variance controls the concentration of the probability mass. The entropy grows with the variance as the probability mass becomes more spread out and less concentrated around the mean. For \(\sigma^2 < 1/(2 \pi e) \approx 0.0585\) the differential entropy is negative.
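The closed form agrees with the differential entropy reported by scipy for the normal distribution (a sketch; norm.entropy uses the natural logarithm and takes the standard deviation as scale):

```python
import numpy as np
from scipy.stats import norm

for sigma2 in [0.01, 1.0, 4.0]:  # arbitrary example variances
    h_formula = 0.5 * (np.log(2 * np.pi * sigma2) + 1)
    h_scipy = norm.entropy(loc=0.0, scale=np.sqrt(sigma2))
    print(sigma2, h_formula, h_scipy)

# For sigma^2 = 0.01 < 1/(2 pi e) the differential entropy is negative (about -0.884).
```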
Example 3.12 \(\color{Red} \blacktriangleright\) Entropy of a categorical distribution and the multinomial coefficient:
Let \(\hat{Q}\) be the empirical categorical distribution with \(\hat{q}_k = n_k/n\) the observed frequencies, where \(n_k\) is the number of counts in class \(k\) and \(n=\sum_{k=1}^K n_k\) the total number of counts.
The number of possible permutations of \(n\) items of \(K\) distinct types is given by the multinomial coefficient \[ W = \binom{n}{n_1, \ldots, n_K} = \frac {n!}{n_1! \times n_2! \times\ldots \times n_K! } \]
It turns out that for large \(n\) the entropy \(H(\hat{Q})\) and the multinomial coefficient \(W\) are directly linked: \[ H(\hat{Q}) \approx \frac{1}{n} \log W \]
Recall the de Moivre–Stirling formula, which for large \(n\) allows us to approximate the factorial by \[ \log n! \approx n \log n -n \] With this \[ \begin{split} \log W &= \log n! - \sum_{k=1}^K \log n_k!\\ & \approx n \log n -n - \sum_{k=1}^K (n_k \log n_k -n_k) \\ & = \sum_{k=1}^K n_k \log n - \sum_{k=1}^K n_k \log n_k\\ & = - n \sum_{k=1}^K \frac{n_k}{n} \log\left( \frac{n_k}{n} \right)\\ & = -n \sum_{k=1}^K \log (\hat{q}_k) \, \hat{q}_k \\ & = n H(\hat{Q}) \end{split} \]
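This approximation is easy to check numerically: math.lgamma(n + 1) gives \(\log n!\) exactly, so the exact \(\frac{1}{n}\log W\) can be compared with \(H(\hat{Q})\) (a sketch; the counts are an arbitrary example):

```python
import math

counts = [500, 300, 150, 50]  # n_k for K = 4 classes
n = sum(counts)

# Exact log multinomial coefficient: log W = log n! - sum_k log n_k!
log_W = math.lgamma(n + 1) - sum(math.lgamma(nk + 1) for nk in counts)

# Shannon entropy of the empirical distribution with q_k = n_k / n (in nits)
H_hat = -sum((nk / n) * math.log(nk / n) for nk in counts)

print(log_W / n, H_hat)  # about 1.132 vs about 1.142: close for large n
```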
The above combinatorial derivation of entropy is one of the cornerstones of statistical mechanics and is credited to Boltzmann (1877) and Gibbs (1878). The number of elements \(n_1, \ldots, n_K\) in each of the \(K\) classes corresponds to the macrostate and any of the \(W\) different allocations of the \(n\) elements to the \(K\) classes to an underlying microstate. The multinomial coefficient, and hence entropy, is largest when there are only small differences (or none) among the \(n_k\), i.e. when samples are equally spread across the \(K\) bins.
In statistics the above derivation of entropy was rediscovered by Wallis (1962).
A bit of history
The concept of entropy was first introduced in 1865 by Rudolf Clausius (1822–1888) in the context of thermodynamics. In physics entropy measures how energy is distributed: if energy is concentrated then the entropy is low, and conversely if energy is spread out the entropy is large. The total energy is conserved (first law of thermodynamics), but over time it will diffuse and thus entropy will increase with time (second law of thermodynamics).
The modern probabilistic definition of entropy was discovered in the 1870s by Ludwig Boltzmann (1844–1906) and Josiah W. Gibbs (1839–1903). In statistical mechanics entropy is proportional to the logarithm of the number of microstates (i.e. particular configurations of the system) compatible with the observed macrostate. Typically, in systems where the energy is spread out there are very large numbers of compatible configurations, hence this corresponds to large entropy; conversely, if the energy is concentrated there are only a few such configurations, and thus this corresponds to low entropy.
In the 1940s and 1950s the notion of entropy also turned out to be central in information theory, a field pioneered by mathematicians such as Ralph Hartley (1888–1970), Solomon Kullback (1907–1994), Alan Turing (1912–1954), Richard Leibler (1914–2003), Irving J. Good (1916–2009), Claude Shannon (1916–2001), and Edwin T. Jaynes (1922–1998), and later further explored by Shun’ichi Amari (1936–), Imre Csiszár (1938–), Bradley Efron (1938–), Philip Dawid (1946–) and many others.
Of the above, Turing and Good were affiliated with the University of Manchester.
[1] We use the convention that scoring rules are negatively oriented (e.g. Dawid 2007) with the aim to minimise the score (cost, code length, surprise). However, some authors prefer the positively oriented convention with a reversed sign in the definition of \(S(x, P)\) so the score represents a reward that is maximised (e.g. Gneiting and Raftery 2007).
[2] Shannon, C. E. 1948. A mathematical theory of communication. Bell System Technical Journal 27:379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x