3 Entropy
Entropy, a fundamental concept that originated in physics, plays a crucial role in information theory and statistical learning. This chapter introduces entropy via the route of scoring rules.
3.1 Information storage and scoring rules
Information storage with constant code length
We study a discrete variable \(x\) that can assume \(K\) possible states \(\Omega = \{\omega_1, \ldots, \omega_K\}\). Our aim is to record the current state of the variable \(x\) using one or more information storage units, each of which can register one of \(A\) different states. \(A\) is called the alphabet size of the unit and can be larger or smaller than \(K\).
Following the principle of common numeral systems, such as the decimal, binary or hexadecimal systems, it is easy to verify that in order to describe the state of \(x\) we require \[ S = \log_A K \] storage units. \(S\) is called the code length or the cost to describe the state of \(x\).
In the above we have tacitly assumed that all \(K\) states are equally probable and that the storage size, code length and cost associated with each state \(\omega_i \in \Omega\) is constant and the same for all \(K\) states. This can be made more explicit by writing \[ S = -\log_A \left( \frac{1}{K} \right) \] where \(1/K\) is the equal probability of each of the \(K\) states.
Example 3.1 Information storage units:
For an alphabet of size \(A=2\) (e.g. 0, 1) the storage units are called “bits” (binary digits). Hence to describe \(K=256\) possible states \(\log_2 256 = 8\) bits (or 1 byte) of storage are sufficient.
For \(A=10\) the units are “dits” (decimal digits), so to describe \(K=100\) possible states \(\log_{10} 100 = 2\) dits are sufficient, where a single dit has an alphabet of size \(A=10\) (e.g. Arabic numerals).
Finally, if the natural logarithm is used (\(A=e\)) the storage units are called “nits” (natural digits). In the following we will use “nits” and the natural logarithm throughout.
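As a small numerical illustration (not part of the text), the following Python sketch, assuming only the standard library, reproduces the code lengths from Example 3.1; the chosen values of \(A\) and \(K\) are purely illustrative.

```python
import math

# Fixed code length S = log_A(K) for a few alphabet sizes (cf. Example 3.1).
for A, K, unit in [(2, 256, "bits"), (10, 100, "dits"), (math.e, 256, "nits")]:
    S = math.log(K, A)  # logarithm of K to base A
    print(f"K = {K}, A = {A:.4g}: S = {S:.4f} {unit}")
```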
Variable code length and scoring rules
In practice, the \(K\) states are often not equally probable. For example, \(x\) might be a random variable describing the letters in a text message. Hence, rather than assuming equal probabilities we use a discrete distribution \(P\) with probability mass function \(p(x)\) to model the state probabilities.
In this case, instead of using the same code length for each state, we use variable code lengths, with more probable states assigned shorter codes and less probable states assigned longer codes. Specifically, generalising from the above, we may wish to employ the negative logarithm to map the probability of a state \(x\) to a corresponding cost and code length: \[ S(x, P) = -\log p(x) \] This is called the logarithmic score or logarithmic scoring rule1.
As we will see below (Example 3.8, Example 3.9 and Example 3.10), using variable code lengths allows for expected code lengths that are potentially much smaller than using a fixed length \(\log K\) for all states, and hence leads to a more space-saving representation.
We will apply the logarithmic score \(S(x, P) = -\log p(x)\) to both discrete and continuous variables \(x\) and corresponding distributions \(P\). As densities can take on values larger than 1 the logarithmic score may become negative when \(P\) is a continuous distribution.
While there are other possible scoring rules, we will rely exclusively on the logarithmic scoring rule, as it is the default scoring rule underlying classical likelihood and Bayesian inference methods due to its unique characteristics (see Note 3.1).
The function \(S(x, P) = -\log p(x)\) is an example of a scoring rule for a probabilistic forecast represented by model \(P\) evaluated on the observation \(x\).
The logarithmic scoring rule is uniquely characterised by a number of favourable properties (e.g. Hartley 1928, Shannon 1948, Good 1952, Bernardo 1979). In particular, it is the only scoring rule that is both
- proper, i.e. the expected score is minimised when the quoted model \(P\) is identical to the data generating model, and
- local in that the score depends only on the value of the probability mass or density function at \(x\).
See also Example 4.2.
Example 3.2 Surprise:
The negative logarithm of a probability \(-\log p\) is called the surprise or surprisal. The surprise to observe a certain outcome (with probability \(p=1\)) is zero, and conversely the surprise to observe an outcome that cannot happen (with probability \(p=0\)) is infinite.
Example 3.3 Log-odds ratio:
The odds of an event with probability \(p\) are given by the ratio \(\frac{p}{1-p}\).
The log-odds ratio is therefore the difference of the surprise of the complementary event (with probability \(1-p\)) and the surprise of the event (with probability \(p\)): \[ \begin{split} \text{logit}(p) &= \log\left( \frac{p}{1-p} \right) \\ &= -\log(1-p) - ( -\log p)\\ \end{split} \]
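A one-line numerical check of this identity (an illustrative Python sketch; the value of \(p\) is chosen arbitrarily):

```python
import math

# Log-odds as the difference of two surprises (Example 3.3), for an arbitrary p.
p = 0.8
logit = math.log(p / (1 - p))
surprise_difference = -math.log(1 - p) - (-math.log(p))
print(logit, surprise_difference)  # both approx 1.3863
```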
Example 3.4 Logarithmic score and normal distribution:
If we quote in the logarithmic scoring rule the normal distribution \(P = N(\mu, \sigma^2)\) with density \(p(x |\mu, \sigma^2)= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\) we get as score \[ S\left(x,N(\mu, \sigma^2 )\right) = \frac{1}{2} \left( \log(2\pi\sigma^2) + \frac{(x-\mu)^2}{\sigma^2}\right) \] For fixed variance \(\sigma^2\) the logarithmic score is thus equivalent to the squared distance between \(x\) and \(\mu\).
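The closed form above can be checked against the log-density directly; the following sketch assumes scipy is available and uses illustrative parameter values.

```python
import numpy as np
from scipy.stats import norm

# Logarithmic score of a quoted normal N(mu, sigma^2) at an observation x:
# closed form from Example 3.4 versus -log p(x) evaluated directly.
mu, sigma2, x = 1.0, 4.0, 2.5
closed_form = 0.5 * (np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)
via_density = -norm.logpdf(x, loc=mu, scale=np.sqrt(sigma2))
print(closed_form, via_density)  # both approx 1.8934
```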
3.2 Expected score and entropy
Entropy of a distribution
Given the scoring rule \(S(x, P)\) we can compute its expectation with regard to \(P\): \[ \begin{split} H(P) &= \text{E}_P\left( S(x, P) \right) \\ &= - \text{E}_P\left(\log p(x)\right) \\ \end{split} \] This expected score is called the entropy of the distribution \(P\).
In the above the distribution \(P\) occurs in two distinct roles. First, it is the model generating the data (note the expectation \(\text{E}_P\) with regard to \(P\)). Second, it is also the model that is evaluated on the observations (via \(-\log p(x)\)). Thus, entropy is a functional of the distribution \(P\).
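Because the entropy is an expected score, it can be approximated by averaging the log score \(-\log p(x)\) over samples drawn from \(P\) itself. The following Monte Carlo sketch (numpy and scipy assumed; \(P = N(0,1)\) is chosen purely for illustration) anticipates the closed-form result of Example 3.11.

```python
import numpy as np
from scipy.stats import norm

# Entropy as expected score: average -log p(x) over draws x ~ P, here P = N(0, 1).
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
monte_carlo = np.mean(-norm.logpdf(x))
exact = 0.5 * (np.log(2 * np.pi) + 1)  # closed form, see Example 3.11
print(monte_carlo, exact)  # both approx 1.419
```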
Shannon-Gibbs entropy
The entropy of a discrete probability distribution \(P\) with probability mass function \(p(x)\), \(x \in \Omega\), is called the Shannon entropy (1948)2. In statistical physics, the Shannon entropy is known as the Gibbs entropy (1878):
\[ H(P) = - \sum_{x \in \Omega} \log p(x) \, p(x) \] The entropy of a discrete distribution is the expected surprise. We can also interpret it as the expected cost or expected code length when the data are generated according to model \(P\) (“sender”, “encoder” ) and we are using the same model \(P\) to describe the data (“receiver”, “decoder”).
Furthermore, Shannon-Gibbs entropy also has a combinatorial interpretation (see Example 3.12).
As \(p(x) \in [0,1]\) and hence \(-\log p(x) \geq 0\), the Shannon-Gibbs entropy is by construction bounded below and must be greater than or equal to 0.
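The following sketch (numpy assumed; the probabilities are illustrative) computes the Shannon-Gibbs entropy of a skewed discrete distribution and compares it with the fixed code length \(\log K\), illustrating the saving promised by variable code lengths.

```python
import numpy as np

# Shannon-Gibbs entropy (expected code length in nits) of a skewed
# discrete distribution, compared with the fixed code length log K.
p = np.array([0.7, 0.1, 0.1, 0.05, 0.05])
H = -np.sum(p * np.log(p))
print(H, np.log(len(p)))  # approx 1.010 versus log 5 approx 1.609
```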
Differential entropy
Applying the definition of entropy to a continuous probability distribution \(P\) with density \(p(x)\) yields the differential entropy: \[ H(P) = - \int_x \log p(x) \, p(x) \, dx \] Differential entropy can be negative because the logarithm is applied to a density, which, unlike a probability, can take on values greater than one.
Moreover, for continuous random variables the shape of the density typically changes under a variable transformation, say from \(x\) to \(y\). The differential entropy then changes as well, i.e. it is not invariant under such a transformation, so that in general \(H(P_y) \neq H(P_x)\).
Traditionally, entropy has often been considered as a measure of order, with large entropy corresponding to disorder and low entropy to order. However, this classic interpretation is now viewed as outdated as there are numerous systems that appear ordered but have large entropy, and vice versa.
A better intuition about entropy is the notion of spread or dispersal, which is now preferred in Physics3 and which also aligns best with the interpretation of entropy in Statistics and Machine Learning.
As will become clear from the examples, entropy quantifies the spread of the probability mass within a distribution. When the probability mass is concentrated within a specific area, the entropy is low; conversely, when the probability mass is more broadly distributed, the entropy is high.
3.3 Entropy examples
Models with a single parameter
Example 3.5 The Shannon-Gibbs entropy of the geometric distribution \(F_x = \text{Geom}(\theta)\) with probability mass function \(p(x|\theta) = \theta (1-\theta)^{x-1}\), \(\theta \in (0,1]\), support \(x \in \{1, 2, \ldots \}\) and \(\text{E}(x)= 1/\theta\) is \[ \begin{split} H(F_x) &= - \text{E}\left( \log \theta + (x-1) \log(1-\theta) \right)\\ &= -\log \theta - \left(\frac{1}{\theta}-1\right)\log(1-\theta)\\ &= -\frac{\theta \log \theta + (1-\theta) \log(1-\theta) }{\theta} \end{split} \] Using the identity \(0\times\log(0)=0\) we see that the entropy of the geometric distribution for \(\theta = 1\) equals 0, i.e. it achieves the minimum possible Shannon-Gibbs entropy. Conversely, as \(\theta \rightarrow 0\) it diverges to infinity.
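The closed form can be verified by truncating the infinite sum directly (a Python sketch with numpy; \(\theta = 0.3\) is an arbitrary choice and the truncation point is ample for this value):

```python
import numpy as np

# Example 3.5 numerically: truncated sum -sum p(x) log p(x) versus the closed form.
theta = 0.3
x = np.arange(1, 200)  # truncation; the remaining mass is negligible for theta = 0.3
p = theta * (1 - theta) ** (x - 1)
direct = -np.sum(p * np.log(p))
closed = -(theta * np.log(theta) + (1 - theta) * np.log(1 - theta)) / theta
print(direct, closed)  # both approx 2.0362
```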
Example 3.6 Consider the uniform distribution \(F_x = U(0, a)\) with \(a>0\), support from \(0\) to \(a\) and density \(p(x) = 1/a\). The corresponding differential entropy is \[ \begin{split} H( F_x ) &= - \int_0^a \log\left(\frac{1}{a}\right) \, \frac{1}{a} dx \\ &= \log a \int_0^a \frac{1}{a} dx \\ &= \log a \,. \end{split} \] Note that for \(0 < a < 1\) the differential entropy is negative.
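The same result follows from numerical integration; the sketch below (scipy assumed) uses \(a = 0.5\) to show that the differential entropy is indeed negative in that case.

```python
import numpy as np
from scipy.integrate import quad

# Differential entropy of U(0, a) by numerical integration of -log p(x) p(x).
a = 0.5
p = lambda x: 1.0 / a
H, _ = quad(lambda x: -np.log(p(x)) * p(x), 0, a)
print(H, np.log(a))  # both approx -0.6931, i.e. negative for a < 1
```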
Example 3.7 Starting with the uniform distribution \(F_x = U(0, a)\) from Example 3.6 the variable \(x\) is changed to \(y = x^2\) yielding the distribution \(F_y\) with support from \(0\) to \(a^2\) and density \(p(y) = 1/\left(2 a \sqrt{y}\right)\).
The corresponding differential entropy is \[ \begin{split} H( F_y ) &= \int_0^{a^2} \log \left(2 a \sqrt{y}\right) \, 1/\left(2 a \sqrt{y}\right) dy \\ &= \left[ \sqrt{y}/a \, \left(\log \left( 2 a \sqrt{y} \right)-1\right) \right]_{y=0}^{y=a^2} \\ &= \log \left(2 a^2\right) -1 \,. \end{split} \] This is negative for \(0 < a < \sqrt{e/2}\approx 1.1658\). As expected \(H( F_y ) \neq H( F_x )\) as differential entropy is not invariant against variable transformations.
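Again this can be confirmed numerically; the sketch below (scipy assumed, \(a = 1\) chosen for illustration) integrates the transformed density, which has an integrable singularity at zero, and contrasts the result with \(H(F_x) = \log a\).

```python
import numpy as np
from scipy.integrate import quad

# Example 3.7 numerically: differential entropy of y = x^2 for x ~ U(0, a).
a = 1.0
p_y = lambda y: 1.0 / (2 * a * np.sqrt(y))
H_y, _ = quad(lambda y: -np.log(p_y(y)) * p_y(y), 0, a**2)
print(H_y, np.log(2 * a**2) - 1)  # both approx -0.3069
print(np.log(a))                  # H(F_x) = 0, so H(F_y) != H(F_x)
```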
Models with multiple parameters
Example 3.8 Entropy of the categorical distribution \(P\) with \(K\) categories.
Assuming class probabilities \(p_1, \ldots, p_K\) the Shannon-Gibbs entropy is \[ H(P) = - \sum_{k=1}^{K } \log(p_k)\, p_k \]
As \(P\) is discrete \(H(P)\) is bounded below by 0. Furthermore, it is also bounded above by \(\log K\) (cf. Example 6.1). Hence for a categorical distribution \(P\) with \(K\) categories we have \[ 0 \leq H(P) \leq \log K \] The maximum is achieved for the discrete uniform distribution (Example 3.9) and the minimum for a concentrated categorical distribution (Example 3.10).
Example 3.9 Entropy of the discrete uniform distribution \(U_K\):
Let \(p_1=p_2= \ldots = p_K = \frac{1}{K}\). Then \[H(U_K) = - \sum_{k=1}^{K}\log\left(\frac{1}{K}\right)\, \frac{1}{K} = \log K\]
Note that \(\log K\) is the largest value the Shannon-Gibbs entropy can assume with \(K\) classes (cf. Example 6.1) and indicates maximum spread of probability mass.
Example 3.10 Entropy of a categorical distribution with concentrated probability mass:
Let \(p_1=1\) and \(p_2=p_3=\ldots=p_K=0\). Using \(0\times\log(0)=0\) we obtain for the Shannon-Gibbs entropy \[H(P) = -\left(\log(1)\times 1 + \log(0)\times 0 + \dots\right) = 0\]
Note that 0 is the smallest value that Shannon-Gibbs entropy can assume and that it corresponds to maximum concentration of probability mass.
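Examples 3.8 to 3.10 can be summarised in a few lines of code; the sketch below (numpy assumed, \(K = 4\) chosen arbitrarily) uses the convention \(0\times\log(0)=0\) and recovers the two bounds \(\log K\) and \(0\).

```python
import numpy as np

# Entropy bounds for a categorical distribution with K categories (Examples 3.8-3.10).
K = 4
uniform = np.full(K, 1.0 / K)
concentrated = np.array([1.0, 0.0, 0.0, 0.0])

def H(p):
    nz = p[p > 0]  # convention 0 * log(0) = 0
    return -np.sum(nz * np.log(nz))

print(H(uniform), np.log(K))  # both approx 1.3863 (maximum spread)
print(H(concentrated))        # 0.0 (maximum concentration)
```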
Example 3.11 Differential entropy of the normal distribution:
The log density of the univariate normal \(N(\mu, \sigma^2)\) distribution is \(\log p(x |\mu, \sigma^2) = -\frac{1}{2} \left( \log(2\pi\sigma^2) + \frac{(x-\mu)^2}{\sigma^2} \right)\) with \(\sigma^2 > 0\). With \(\text{E}\left((x-\mu)^2\right) = \sigma^2\) the corresponding differential entropy is \[ \begin{split} H(P) & = -\text{E}\left( \log p(x |\mu, \sigma^2) \right)\\ & = \frac{1}{2} \left( \log(2 \pi \sigma^2)+1\right) \,. \\ \end{split} \] Note that \(H(P)\) only depends on the variance parameter and not on the mean parameter. This is intuitively clear as only the variance controls the concentration of the probability mass. The entropy grows with the variance as the probability mass becomes more spread out and less concentrated around the mean. For \(\sigma^2 < 1/(2 \pi e) \approx 0.0585\) the differential entropy is negative.
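For reference, the closed form agrees with the built-in entropy of scipy's normal distribution; the variance below is illustrative and chosen small enough to give a negative value.

```python
import numpy as np
from scipy.stats import norm

# Differential entropy of N(mu, sigma^2): closed form versus scipy's entropy().
sigma2 = 0.04
closed_form = 0.5 * (np.log(2 * np.pi * sigma2) + 1)
print(closed_form, norm(scale=np.sqrt(sigma2)).entropy())  # both approx -0.1905
```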
Example 3.12 \(\color{Red} \blacktriangleright\) Entropy of a categorical distribution and the multinomial coefficient:
Let \(\hat{Q}\) be the empirical categorical distribution with observed frequencies \(\hat{q}_k = n_k/n\), where \(n_k\) is the number of counts in class \(k\) and \(n=\sum_{k=1}^K n_k\) is the total number of counts.
The number of possible permutations of \(n\) items of \(K\) distinct types is given by the multinomial coefficient \[ W = \binom{n}{n_1, \ldots, n_K} = \frac {n!}{n_1! \times n_2! \times\ldots \times n_K! } \]
It turns out that for large \(n\) both quantities are directly linked: \[ H(\hat{Q}) \approx \frac{1}{n} \log W \]
Recall the de Moivre-Stirling formula which for large \(n\) allows us to approximate the factorial by \[ \log n! \approx n \log n -n \] With this \[ \begin{split} \log W &= \log n! - \sum_{k=1}^K \log n_k!\\ & \approx n \log n -n - \sum_{k=1}^K (n_k \log n_k -n_k) \\ & = \sum_{k=1}^K n_k \log n - \sum_{k=1}^K n_k \log n_k\\ & = - n \sum_{k=1}^K \frac{n_k}{n} \log\left( \frac{n_k}{n} \right)\\ & = -n \sum_{k=1}^K \log (\hat{q}_k) \, \hat{q}_k \\ & = n H(\hat{Q}) \end{split} \]
The above combinatorial derivation of entropy is one of the cornerstones of statistical mechanics and is credited to Boltzmann (1877) and Gibbs (1878). The number of elements \(n_1, \ldots, n_K\) in each of the \(K\) classes corresponds to the macrostate and any of the \(W\) different allocations of the \(n\) elements to the \(K\) classes to an underlying microstate. The multinomial coefficient, and hence entropy, is largest when there are only small differences (or none) among the \(n_k\), i.e. when the individual elements are equally spread across the \(K\) bins.
In statistics the above derivation of entropy was rediscovered by Wallis (1962).
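A quick numerical check of the approximation \(H(\hat{Q}) \approx \frac{1}{n} \log W\) (scipy assumed; the counts below are illustrative) uses the log-gamma function to evaluate the log factorials exactly:

```python
import numpy as np
from scipy.special import gammaln

# Combinatorial link (Example 3.12): (1/n) log W versus the entropy of Q_hat.
counts = np.array([500, 300, 200])
n = counts.sum()
log_W = gammaln(n + 1) - gammaln(counts + 1).sum()  # log of the multinomial coefficient
q_hat = counts / n
H_hat = -np.sum(q_hat * np.log(q_hat))
print(log_W / n, H_hat)  # approx 1.023 versus 1.030
```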
A bit of history
The concept of entropy was first introduced in 1865 by Rudolf Clausius (1822-1888) in the context of thermodynamics. In physics entropy measures the distribution of energy: if energy is concentrated then the entropy is low, and conversely if energy is spread out the entropy is large. The total energy is conserved (first law of thermodynamics) but with time it will diffuse and thus entropy will increase with time (second law of thermodynamics).
The modern probabilistic definition of entropy was discovered in the 1870s by Ludwig Boltzmann (1844–1906) and Josiah W. Gibbs (1839–1903). In statistical mechanics entropy is proportional to the logarithm of the number of microstates (i.e. particular configurations of the system) compatible with the observed macrostate. Typically, in systems where the energy is spread out there are very large numbers of compatible configurations, which corresponds to large entropy; conversely, if the energy is concentrated there are only few such configurations, which corresponds to low entropy.
In the 1940s and 1950s the notion of entropy turned out to be central also in information theory, a field pioneered by mathematicians such as Ralph Hartley (1888–1970), Solomon Kullback (1907–1994), Alan Turing (1912–1954), Richard Leibler (1914–2003), Irving J. Good (1916–2009), Claude Shannon (1916–2001), and Edwin T. Jaynes (1922–1998), and later further explored by Shun’ichi Amari (1936–), Imre Csiszár (1938–), Bradley Efron (1938–), Philip Dawid (1946–) and many others.
Of the above, Turing and Good were affiliated with the University of Manchester.
We follow the convention that scoring rules are negatively oriented (e.g. Dawid 2007) with the aim to minimise the score (cost, code length, surprise). However, some authors prefer the positively oriented convention with a reversed sign in the definition of \(S(x, P)\) so the score represents a reward that is maximised (e.g. Gneiting and Raftery 2007).↩︎
Shannon, C. E. 1948. A mathematical theory of communication. Bell System Technical Journal 27:379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x↩︎
For example, see: H. S. Leff. 2007. Entropy, its language, and interpretation. Found. Phys. 37: 1744–1766, D. S. Lemons. 2013. A student’s guide to entropy. CUP, and https://en.wikipedia.org/wiki/Entropy_(energy_dispersal)↩︎