3 Entropy
Entropy, a fundamental concept that originated in physics, plays a crucial role in information theory and statistical learning. This chapter introduces entropy via the route of scoring rules.
3.1 Information storage and scoring rules
Information storage units
When we record information on paper or on a computer, we need an alphabet, whose symbols act as units of stored information:
For an alphabet of size \(A=2\) (e.g. symbols \(0\) and \(1\)) the storage unit is called a bit (binary digit). Representing \(K=256\) possible states requires \(\log_2 256 = 8\) bits (or 1 byte) of storage.
For an alphabet of size \(A=10\) (e.g. Arabic numerals) the unit is called a dit (decimal digit). Representing \(K=100\) possible states requires \(\log_{10} 100 = 2\) dits.
Bits and dits are units used to quantify information content. One dit is equal to \(\log_2 10 \approx 3.32\) bits.
If the natural logarithm is used (\(A=e\)) the unit of information is called the nat. This is the standard unit of information used in statistics and is equal to \(\log_2 e \approx 1.44\) bits.
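To make the unit conversions concrete, here is a minimal sketch in Python (the variable names are illustrative only); the conversions are simply changes of logarithm base:

```python
import math

# Unit conversions are changes of logarithm base.
dit_in_bits = math.log2(10)      # 1 dit = log2(10) ≈ 3.32 bits
nat_in_bits = math.log2(math.e)  # 1 nat = log2(e)  ≈ 1.44 bits

print(f"1 dit = {dit_in_bits:.2f} bits")
print(f"1 nat = {nat_in_bits:.2f} bits")
```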
Constant code length
We now assume a discrete variable \(x\) with \(K\) possible outcomes \(\Omega = \{\omega_1, \ldots, \omega_K\}\). In order to record a realised state of \(x\) using an alphabet of size \(A\) we require \[ S = \log_A K \] units of information. The base \(A\) can be larger or smaller than \(K\). We simply set \(A=e\) and use the natural logarithm and nats as units of information throughout. \(S\) is called the code length or the cost to describe the state of \(x\).
In the above we have tacitly assumed that storage size, code length, and cost requirements are fixed and identical for all possible \(K\) states. This can be made more explicit by writing \[ S = -\log \left( \frac{1}{K} \right) \] where \(1/K\) is the equal probability of each of the \(K\) states.
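A small numerical check of the constant code length \(S = \log_A K = -\log_A(1/K)\), reproducing the bit and dit examples above (the helper `code_length` is a hypothetical name, not from the text):

```python
import math

def code_length(K, A=math.e):
    """Constant code length S = log_A(K) = -log_A(1/K) for K equally probable states."""
    return math.log(K, A)

print(code_length(256, A=2))   # ≈ 8 bits, i.e. 1 byte
print(code_length(100, A=10))  # ≈ 2 dits
print(code_length(256))        # ≈ 5.55 nats (natural logarithm, the default here)
```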
Variable code length
However, in practice, the \(K\) states are often not equally probable. For example \(x\) might be describing the letters in a text message, each occurring with different frequency (e.g. the most common letter in the English language is “e”). Hence, rather than assuming equal probabilities we use a discrete distribution \(P\) with probability mass function \(p(x)\) to describe the different probabilities. As a result, instead of constant code length for each state, we use variable code lengths, with more probable states having shorter codes and less probable states assigned longer codes.
Specifically, generalising from the case of constant code length above, we employ the negative logarithm to map the probability of a state \(x\) to a corresponding cost and code length: \[ S(x, P) = -\log p(x) \] This is called the log-loss or the logarithmic scoring rule1 — see also Note 3.1 and Note 3.2.
As we will see below (Example 3.8, Example 3.9 and Example 3.10) using variable code lengths allows for expected code lengths that are potentially much smaller than using a fixed length \(\log K\) for all states, and hence leads to a more space-saving representation.
We will apply the logarithmic scoring rule \(S(x, P) = -\log p(x)\) to both discrete and continuous variables \(x\) and corresponding distributions \(P\). As densities can take on values larger than one the logarithmic loss may become negative when \(P\) is a continuous distribution.
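The following sketch (my own illustration; `log_loss_discrete` is a hypothetical helper, and SciPy's `norm` is used for the continuous case) evaluates the log-loss for a discrete and a continuous quoted distribution and shows that the score can be negative when the density exceeds one:

```python
import numpy as np
from scipy.stats import norm

def log_loss_discrete(x, pmf):
    """Logarithmic score S(x, P) = -log p(x) for a discrete P given as a dict."""
    return -np.log(pmf[x])

# Discrete example: more probable outcomes receive smaller scores.
pmf = {"a": 0.5, "b": 0.3, "c": 0.2}
print(log_loss_discrete("a", pmf))  # -log(0.5) ≈ 0.69 nats
print(log_loss_discrete("c", pmf))  # -log(0.2) ≈ 1.61 nats

# Continuous example: a narrow normal density exceeds one at its mode,
# so the log-loss -log p(x) becomes negative.
print(-norm.logpdf(0.0, loc=0.0, scale=0.1))  # ≈ -1.38 nats
```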
The function \(S(x, P)\) is a scoring rule which is a type of loss function that assesses a probabilistic forecast \(P\) given the observed outcome \(x\).
A desirable property is that the scoring rule is proper, i.e. that its risk, the expected score \(R_Q(P) =\operatorname{E}_Q\left( S(x, P) \right)\), is minimised when the quoted model \(P\) equals the true data-generating model \(Q\). If the minimum is achieved exclusively at \(P=Q\) then the scoring rule is strictly proper.
Proper scoring rules are widely used in statistics and machine learning because they permit identification of best-approximating models by risk minimisation.
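As a concrete (and purely illustrative) check of propriety, the sketch below evaluates the risk of the logarithmic scoring rule for a Bernoulli model over a grid of quoted parameters; the expected score is smallest when the quoted parameter matches the data-generating one:

```python
import numpy as np

def bernoulli_log_risk(p, q):
    """Risk R_Q(P) = E_Q(-log p(x)) for quoted Ber(p) under data-generating Ber(q)."""
    return -(q * np.log(p) + (1 - q) * np.log(1 - p))

q = 0.3                             # true data-generating parameter
grid = np.linspace(0.01, 0.99, 99)  # candidate quoted parameters
risks = bernoulli_log_risk(grid, q)

print(f"risk minimised at p = {grid[np.argmin(risks)]:.2f} (true q = {q})")
```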
The logarithmic scoring rule \(S(x, P) = -\log p(x)\) is uniquely characterised. In particular, it is the only scoring rule that is both
- strictly proper and
- local, with the score depending only on the value of the probability mass or density function of \(P\) at \(x\) and not on any other features of \(P\).
For statistics the logarithmic scoring rule is of prime importance because it underpins likelihood inference and Bayesian learning.
See also Note 4.1.
Example 3.1 Surprise:
The negative logarithm of a probability or density is called the surprise or surprisal. The logarithmic scoring rule thus returns the surprise for the outcomes of a random variable.
For a discrete random variable, the surprise \(-\log p\) to observe an outcome with probability \(p=1\) is zero, and conversely the surprise to observe an outcome with probability \(p=0\) is infinite.
Example 3.2 Logit and log-odds:
The odds of an event with probability \(p\) are given by the ratio \(\frac{p}{1-p}\).
The log-odds are therefore the difference of the surprise of the complementary event (with probability \(1-p\)) and the surprise of the event (with probability \(p\)): \[ \begin{split} \operatorname{logit}p &= \log\left( \frac{p}{1-p} \right) \\ &= -\log(1-p) - ( -\log p)\\ \end{split} \] The log-odds function is also known as the logit function. It maps the interval \([0,1]\) to the interval \([-\infty, +\infty]\). Its inverse is the logistic function \((\exp x)/(1+\exp x)\), mapping the interval \([-\infty, +\infty]\) back to the interval \([0,1]\).
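A short sketch of the logit and its logistic inverse (the function names are my own), written so that the logit appears explicitly as a difference of surprisals:

```python
import numpy as np

def logit(p):
    """Log-odds as surprise of the complementary event minus surprise of the event."""
    return -np.log(1 - p) - (-np.log(p))   # equals log(p / (1 - p))

def logistic(x):
    """Inverse of the logit, mapping the real line back to (0, 1)."""
    return np.exp(x) / (1 + np.exp(x))

p = 0.8
print(logit(p))            # log(4) ≈ 1.39
print(logistic(logit(p)))  # recovers 0.8
```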
Example 3.3 Logarithmic scoring rule for a normal model:
If we quote in the log-loss \(S(x, P) = -\log p(x)\) the normal distribution \(P = N(\mu, \sigma^2)\) with density \(p(x |\mu, \sigma^2)= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\) we get \[ S\left(x,P\right) = \frac{1}{2} \left( \log(2\pi\sigma^2) + \frac{(x-\mu)^2}{\sigma^2}\right) \] For a normal model with fixed variance the log-loss is thus equivalent to the squared distance between \(x\) and \(\mu\).
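A quick numerical check (using SciPy's `norm` as an independent reference) that the explicit expression above agrees with the negative log-density of the normal distribution:

```python
import numpy as np
from scipy.stats import norm

def normal_log_loss(x, mu, sigma2):
    """Log-loss -log p(x | mu, sigma2) for the normal model, written out explicitly."""
    return 0.5 * (np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)

x, mu, sigma2 = 1.5, 0.0, 2.0
print(normal_log_loss(x, mu, sigma2))                  # ≈ 1.83
print(-norm.logpdf(x, loc=mu, scale=np.sqrt(sigma2)))  # same value via SciPy
```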
Example 3.4 Dawid-Sebastiani scoring rule:
The Dawid-Sebastiani scoring rule is given by \[ S\left(x, P\right) = \log \operatorname{Var}(P) + \frac{(x-\operatorname{E}(P))^2}{\operatorname{Var}(P)} \] For normal \(P\) this is equivalent to the log-loss (Example 3.3).
Note that adding a constant to a loss function or multiplying it by a positive factor does not change the location of its minimum, so such loss functions are considered equivalent.
Unlike the logarithmic loss, the Dawid-Sebastiani scoring rule is only proper and not strictly proper. Since all \(P\) with the same mean and variance yield identical scores, there are multiple \(P\) achieving the minimum risk (Note 3.1).
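The following sketch (an illustration under the assumption of a normal quoted distribution) confirms that the Dawid-Sebastiani score and the normal log-loss differ only by a positive factor and an additive constant, and are therefore equivalent as loss functions:

```python
import numpy as np

def dawid_sebastiani(x, mean, var):
    """Dawid-Sebastiani score: depends on P only through its mean and variance."""
    return np.log(var) + (x - mean) ** 2 / var

def normal_log_loss(x, mu, sigma2):
    return 0.5 * (np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)

x, mu, sigma2 = 1.5, 0.0, 2.0
ds = dawid_sebastiani(x, mu, sigma2)
ll = normal_log_loss(x, mu, sigma2)

# ds = 2 * ll - log(2*pi): a positive rescaling plus a constant shift.
print(np.isclose(ds, 2 * ll - np.log(2 * np.pi)))  # True
```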
3.2 Scoring rules and entropy
Entropy of a distribution
Assume that \(P\) is the data-generating model. The expectation of the scoring rule \(S(x, P)\) with regard to \(P\) is the risk \(R_P(P)\) of the model \(P\) under itself: \[ \begin{split} R_P(P) &= \operatorname{E}_P\left( S(x, P) \right) \\ &= - \operatorname{E}_P\left(\log p(x)\right) = H(P) \end{split} \] For the logarithmic loss this yields the information entropy \(H(P)\) of the distribution \(P\).2 The notation \(H(P)\) is a reminder that entropy is a functional of the distribution \(P\).
The entropy of a distribution is thus the corresponding expected surprise.
Interpretation of information entropy
Information entropy can be interpreted as the expected cost or expected code length when the data are generated according to model \(P\) (“sender”, “encoder”) and the same model \(P\) is used to describe the data (“receiver”, “decoder”). In fact, it turns out that the entropy of \(P\) is the smallest such expected cost possible for data generated under \(P\) and decoded by any model (see Chapter 4).
Furthermore, information entropy quantifies the spread of the probability mass of the distribution \(P\). When the probability mass is concentrated within a specific area, the entropy is low. Conversely, when the probability mass is more broadly distributed, the entropy is high. This interpretation follows directly from the combinatorial derivation of entropy, see Example 3.12 for details.
Traditionally, entropy was also seen as a measure of disorder, but that view is now outdated, see Section 3.4.
Shannon-Gibbs entropy
The information entropy \(H(P)\) of a discrete probability distribution \(P\) with pmf \(p(x)\) with \(x \in \Omega\) is called Shannon entropy: \[ H(P) = - \sum_{x} p(x) \log p(x) \] In statistical physics, it is known as Gibbs entropy, sharing the same mathematical form.3
By construction, since all \(p(x) \in [0,1]\) and hence \(-\log p(x) \geq 0\), the Shannon-Gibbs entropy is non-negative.
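A minimal sketch of the Shannon-Gibbs entropy of a discrete pmf (the helper name and the masking used to implement the convention \(0 \times \log 0 = 0\) are my own choices):

```python
import numpy as np

def shannon_entropy(p):
    """Shannon-Gibbs entropy H(P) = -sum_x p(x) log p(x) in nats, with 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nonzero = p > 0
    return -np.sum(p[nonzero] * np.log(p[nonzero]))

print(shannon_entropy([0.5, 0.3, 0.2]))  # ≈ 1.03 nats, non-negative
print(shannon_entropy([0.25] * 4))       # log(4) ≈ 1.39 nats (uniform over 4 states)
print(shannon_entropy([1.0, 0.0, 0.0]))  # 0, all mass concentrated on one state
```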
Differential entropy
The entropy \(H(P)\) of a continuous distribution \(P\) with pdf \(p(x)\) is called differential entropy: \[ H(P) = - \int_x p(x) \log p(x) \, dx \] Differential entropy can be negative because the logarithm is taken of a density value, and unlike probabilities, densities are not constrained to be less than or equal to one.
Also, under a variable change, such as from \(x\) to \(y\), the shape of the pdf typically changes, so that differential entropy changes as well and is not invariant: \(H(P_y) \neq H(P_x)\).
3.3 Entropy examples
Models with single parameter
Example 3.5 Entropy of the geometric distribution:
With \(P = \operatorname{Geom}(\theta)\) and associated pmf \(p(x|\theta) = \theta (1-\theta)^{x-1}\), \(\theta \in (0,1]\), support \(x \in \{1, 2, \ldots \}\) and \(\operatorname{E}(x)= 1/\theta\) the Shannon-Gibbs entropy is \[ \begin{split} H(P) &= - \operatorname{E}\left( \log \theta + (x-1) \log(1-\theta) \right)\\ &= -\log \theta - \left(\frac{1}{\theta}-1\right)\log(1-\theta)\\ &= -\frac{\theta \log \theta + (1-\theta) \log(1-\theta) }{\theta} \end{split} \] Using the identity \(0\times\log 0=0\) we see that the entropy of the geometric distribution for \(\theta = 1\) equals 0, i.e. it achieves the minimum possible Shannon-Gibbs entropy. Conversely, as \(\theta \rightarrow 0\) it diverges to infinity.
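A numerical check of the closed-form result (the truncation point `max_x` is an arbitrary choice; the omitted tail is negligible for the chosen \(\theta\)):

```python
import numpy as np

def geometric_entropy_closed_form(theta):
    return -(theta * np.log(theta) + (1 - theta) * np.log(1 - theta)) / theta

def geometric_entropy_by_sum(theta, max_x=1000):
    """Truncated sum -sum_x p(x) log p(x); the tail beyond max_x is negligible here."""
    x = np.arange(1, max_x + 1)
    pmf = theta * (1 - theta) ** (x - 1)
    return -np.sum(pmf * np.log(pmf))

theta = 0.3
print(geometric_entropy_closed_form(theta))  # ≈ 2.04 nats
print(geometric_entropy_by_sum(theta))       # agrees to high accuracy
```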
Example 3.6 Entropy of the uniform distribution:
Consider the uniform distribution \(P_x = \operatorname{Unif}(0, a)\) with \(a>0\), support \(x \in [0, a]\) and density \(p(x) = 1/a\). The corresponding differential entropy is \[ \begin{split} H( P_x ) &= - \int_0^a \frac{1}{a} \, \log\left(\frac{1}{a}\right) \, dx \\ &= \frac{1}{a} \, \log a \,\int_0^a dx \\ &= \log a \,. \end{split} \] Note that for \(0 < a < 1\) the differential entropy is negative.
Example 3.7 Entropy and variable change:
Starting with the uniform distribution \(P_x = \operatorname{Unif}(0, a)\) from Example 3.6 the variable \(x\) is changed to \(y = x^2\) yielding the distribution \(P_y\) with support from \(0\) to \(a^2\) and density \(p(y) = 1/\left(2 a \sqrt{y}\right)\).
The corresponding differential entropy is \[ \begin{split} H( P_y ) &= \int_0^{a^2} 1/\left(2 a \sqrt{y}\right) \, \log \left(2 a \sqrt{y}\right) dy \\ &= \left[ \sqrt{y}/a \, \left(\log \left( 2 a \sqrt{y} \right)-1\right) \right]_{y=0}^{y=a^2} \\ &= \log \left(2 a^2\right) -1 \,. \end{split} \] This is negative for \(0 < a < \sqrt{e/2}\approx 1.1658\). As expected \(H( P_y ) \neq H( P_x )\) as differential entropy is not invariant against variable transformations.
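The sketch below checks both closed-form results by Monte Carlo, using the fact that entropy is the expected surprise; the sample size and seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
a = 2.0
x = rng.uniform(0, a, size=1_000_000)

# H(P_x): the density 1/a is constant, so the surprise is constant log(a).
print(-np.log(1 / a), np.log(a))    # both ≈ 0.693

# H(P_y) for y = x^2 with density p(y) = 1/(2 a sqrt(y)): average surprise over samples.
y = x**2
h_y = np.mean(-np.log(1 / (2 * a * np.sqrt(y))))
print(h_y, np.log(2 * a**2) - 1)    # both ≈ 1.08; entropy is not invariant
```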
Models with multiple parameters
Example 3.8 Entropy of the categorical distribution:
For \(P=\operatorname{Cat}(\boldsymbol p)\) with class probabilities \(\boldsymbol p=(p_1, \ldots, p_K)^\top\) the Shannon-Gibbs entropy of \(P\) is \[ H(P) = - \sum_{k=1}^{K } p_k \log p_k \] The maximum entropy (\(\log K\)) is achieved for the discrete uniform distribution (Example 3.9) and the minimum entropy (\(0\)) for a concentrated categorical distribution (Example 3.10). Hence for a categorical distribution \(P\) with \(K\) categories we have \[ 0 \leq H(P) \leq \log K \]
Example 3.9 Entropy of the discrete uniform distribution:
The discrete uniform distribution is \(U_K=\operatorname{Cat}(\boldsymbol p=\mathbf 1_K/K)\) so that \(p_1=p_2= \ldots = p_K = \frac{1}{K}\). Then \[ H(U_K) = - \sum_{k=1}^{K} \frac{1}{K} \log\left(\frac{1}{K}\right) = \log K \]
Note that \(\log K\) is the largest value the Shannon-Gibbs entropy can assume with \(K\) classes (cf. Example 6.1) and indicates maximum spread of probability mass.
Example 3.10 Entropy of a categorical distribution with concentrated probability mass:
Let \(p_1=1\) and \(p_2=p_3=\ldots=p_K=0\). Using \(0\times\log 0=0\) we obtain for the Shannon-Gibbs entropy of \(P=\operatorname{Cat}(\boldsymbol p)\) \[ H(P) = -\left(1 \times \log 1 + 0 \times \log 0 + \dots\right) = 0 \]
Note that zero is the smallest value that Shannon-Gibbs entropy can assume and that it corresponds to maximum concentration of probability mass.
Example 3.11 Differential entropy of the normal distribution:
The log-density of the univariate normal distribution \(P = N(\mu, \sigma^2)\) is \(\log p(x |\mu, \sigma^2) = -\frac{1}{2} \left( \log(2\pi\sigma^2) + \frac{(x-\mu)^2}{\sigma^2} \right)\) with \(\sigma^2 > 0\). Using \(\operatorname{E}\left((x-\mu)^2\right) = \sigma^2\), the corresponding differential entropy is \[ \begin{split} H(P) & = -\operatorname{E}\left( \log p(x |\mu, \sigma^2) \right)\\ & = \frac{1}{2} \left( \log(2 \pi \sigma^2)+1\right) \,. \\ \end{split} \] Crucially, the entropy of a normal distribution depends only on its variance \(\sigma^2\), not on its mean \(\mu\). This is intuitively clear as the variance controls the concentration of the probability mass. A large variance means that the probability mass is more spread out and thus less concentrated around the mean.
For \(\sigma^2 < 1/(2 \pi e) \approx 0.0585\) the differential entropy is negative.
3.4 Entropy in physics
Traditionally, in physics entropy has been considered as a measure of disorder. However, this interpretation is now viewed as outdated and in fact misleading as many apparently ordered systems have large entropy and some disordered-looking systems have low entropy.
A better and more useful intuition is that entropy measures spread or dispersal. This notion is now preferred in physics and it also aligns with the interpretation of entropy in statistics and machine learning.
Specifically, in physics entropy measures the dispersal of energy in a system. If energy is concentrated and capacity for work is high, the entropy is low. Conversely, if energy is spread out and capacity for work is low, the entropy is large. The total energy itself is conserved4 (first law of thermodynamics). However with time it will diffuse and thus entropy will increase and capacity for work will decrease (second law of thermodynamics).
A key insight in statistical mechanics is that entropy is proportional to the logarithm of the number of microstates compatible with the observed macrostate (see Example 3.12). Typically, there is a large number of high entropy configurations (energy spread out) but only a small number with low entropy (concentrated energy). This also explains why entropy generally increases, simply because it is far more likely to be in a high entropy state.
Example 3.12 \(\color{Red} \blacktriangleright\) Entropy and the multinomial coefficient:
Assume there are \(n\) items allocated to \(K\) classes, with \(n_k\) elements in class \(k\) and \(n=\sum_{k=1}^K n_k\), and let \(\hat{Q} = \operatorname{Cat}(\hat{\boldsymbol q})\) be the corresponding empirical categorical distribution with observed frequencies \(\hat{q}_k = n_k/n\) and entropy \(H(\hat{Q})\).
The observed counts \(D =\{n_1, \ldots, n_K\}\) correspond to the macrostate, and each of the \(W_K\) different arrangements of the \(n\) elements into the \(K\) classes corresponds to an underlying microstate.
The number of microstates compatible with a macrostate is given by the multinomial coefficient \[ \begin{split} W_K &= \binom{n}{n_1, \ldots, n_K} \\ &= \frac {n!}{n_1! \times n_2! \times\ldots \times n_K! } \end{split} \] For large \(n\), the de Moivre-Stirling formula allows us to approximate the log-factorial by \[ \log n! \approx n \log n -n \] With this we get \[ \begin{split} \log W_K &= \log n! - \sum_{k=1}^K \log n_k!\\ & \approx n \log n -n - \sum_{k=1}^K (n_k \log n_k -n_k) \\ & = \sum_{k=1}^K n_k \log n - \sum_{k=1}^K n_k \log n_k\\ & = - n \sum_{k=1}^K \frac{n_k}{n} \log\left( \frac{n_k}{n} \right)\\ & = -n \sum_{k=1}^K \hat{q}_k \log \hat{q}_k \\ & = n H(\hat{Q}) \end{split} \]
Therefore, the multinomial coefficient and entropy are directly linked as \[ H(\hat{Q}) \approx \frac{1}{n} \log W_K \]
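The approximation can be verified numerically; the sketch below (counts chosen arbitrarily) computes \(\log W_K\) exactly via the log-gamma function and compares \(\frac{1}{n}\log W_K\) with the entropy of the empirical distribution:

```python
import math
import numpy as np

def log_multinomial_coefficient(counts):
    """log W_K = log n! - sum_k log n_k!, via log-gamma to avoid overflow."""
    n = sum(counts)
    return math.lgamma(n + 1) - sum(math.lgamma(nk + 1) for nk in counts)

def shannon_entropy(p):
    p = np.asarray(p, dtype=float)
    nonzero = p > 0
    return -np.sum(p[nonzero] * np.log(p[nonzero]))

counts = [500, 300, 200]        # macrostate: n = 1000 items in K = 3 classes
n = sum(counts)
q_hat = np.array(counts) / n    # empirical class probabilities

print(log_multinomial_coefficient(counts) / n)  # ≈ 1.02
print(shannon_entropy(q_hat))                   # ≈ 1.03, close for large n
```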
This combinatorial derivation of entropy is central to the interpretation of entropy in terms of spread both in statistical mechanics and in statistics.
The multinomial coefficient, and hence entropy, is largest when the individual elements are uniformly spread out with similar \(n_k\) across the \(K\) classes. This serves as a basis for the principle of maximum entropy (Chapter 6) and indeed maximum likelihood (Chapter 7).
3.5 Further reading
See Leff (2007) and Entropy (energy dispersal) (Wikipedia) for a discussion of the interpretation of entropy as a measure of dispersal.
Akaike (1985) provides an account, from a statistical point of view, of some key historical developments of entropy in statistical mechanics.
For a justification of the logarithmic scoring rule see, e.g., Good (1952) and Bernardo (1979).
Dawid and Musio (2014) is a recent introduction to proper scoring rules in statistics.
See also the supplementary Probability and Distribution Refresher notes for further info on proper scoring rules.
The concept of entropy was first introduced in 1865 by Rudolf Clausius (1822-1888) in thermodynamics.
The probabilistic and combinatorial underpinnings of entropy in statistical mechanics were established in the 1870s by Josiah W. Gibbs (1839–1903) and Ludwig Boltzmann (1844–1906).
By the mid-20th century the notion of information entropy arrived in statistics. Ralph V. L. Hartley (1888–1970) and Irving J. Good (1916–2009) argued for logarithmic measures of information and Claude Shannon (1916–2001) introduced information entropy in Shannon (1948). Jaynes (2003) credits Graham Wallis (1962) for the rediscovery of the combinatorial derivation of information entropy.
We follow the convention that scoring rules are loss functions and negatively oriented, so the goal is to minimise the score (cost, code length). However, some authors define \(S(x, P)\) with the opposite sign as a utility function with a positive orientation, in which case the aim is to maximise the score (reward).↩︎
For a proper scoring rule \(R_P(P)\) is the minimum risk with regard to the risk \(R_Q(P) = \operatorname{E}_Q\left( S(x, P) \right)\). See Note 3.1 and Chapter 4.↩︎
Physical entropies carry an additional factor, the Boltzmann constant \(k_B=1.380649 \times 10^{-23} J/K\) (joule per kelvin), to give entropy units of energy/temperature, whereas information entropies lack a physical unit.↩︎
Energy conservation itself arises as a consequence of the time-translation symmetry of physical laws, see Noether’s theorem (Wikipedia).↩︎