3 Entropy
Entropy, a fundamental concept that originated in physics, plays a crucial role in information theory and statistical learning. This chapter introduces entropy via the route of scoring rules.
3.1 Information storage and scoring rules
Information storage units
When we record information on paper or on a computer, we need an alphabet, whose symbols act as units of stored information:
For an alphabet of size \(A=2\) (e.g. symbols \(0\) and \(1\)) the storage unit is called bit (binary digit). To represent \(K=256\) possible states requires \(\log_2 256 = 8\) bits (or 1 byte) of storage.
For an alphabet of size \(A=10\) (e.g. Arabic numerals) the unit is called dit (decimal digit). To represent \(K=100\) possible states requires \(\log_{10} 100 = 2\) dits.
Bits and dits are thus units for quantifying information content. One dit is equal to \(\log_2 10 \approx 3.32\) bits.
If the natural logarithm is used (\(A=e\)) the unit of information is called nat. This is the standard unit of information used in statistics. It is equal to \(\log_2 e \approx 1.44\) bits.
Constant code length
We now assume a discrete variable \(x\) with \(K\) possible outcomes \(\Omega = \{\omega_1, \ldots, \omega_K\}\). In order to record a realised state of \(x\) using an alphabet of size \(A\) we require \[ S = \log_A K \] units of information. The base \(A\) can be larger or smaller than \(K\). We simply set \(A=e\) and use the natural logarithm and nats as units of information throughout. \(S\) is called the code length or the cost to describe the state of \(x\).
In the above we have tacitly assumed that storage size, code length, and cost requirements are fixed and identical for all possible \(K\) states. This can be made more explicit by writing \[ S = -\log \left( \frac{1}{K} \right) \] where \(1/K\) is the equal probability of each of the \(K\) states.
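For illustration, here is a minimal Python sketch (assuming NumPy is available) that computes the fixed code length \(S = \log_A K\) for the examples above and the conversion factors between bits, dits and nats:

```python
import numpy as np

def code_length(K, A):
    # fixed code length S = log_A(K): units needed to label K equally probable states
    return np.log(K) / np.log(A)

print(code_length(256, 2))     # 8 bits (1 byte)
print(code_length(100, 10))    # 2 dits
print(code_length(256, np.e))  # ~5.55 nats

# conversion factors between units
print(np.log2(10))             # 1 dit = log2(10) ~ 3.32 bits
print(np.log2(np.e))           # 1 nat = log2(e)  ~ 1.44 bits
```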
Variable code length
However, in practice the \(K\) states are often not equally probable. For example, \(x\) might describe the letters in a text message, each occurring with a different frequency (e.g. the most common letter in the English language is “e”). Hence, rather than assuming equal probabilities we use a discrete distribution \(P\) with probability mass function \(p(x)\) to describe the different probabilities. As a result, instead of a constant code length for each state, we use variable code lengths, with more probable states assigned shorter codes and less probable states longer codes.
Specifically, generalising from the case of constant code length above, we employ the negative logarithm to map the probability of a state \(x\) to a corresponding cost and code length: \[ S(x, P) = -\log p(x) \] This is called the log-loss or the logarithmic scoring rule¹ (see also Note 3.1 and Note 3.2).
As we will see below (Example 3.8, Example 3.9 and Example 3.10), using variable code lengths allows for an expected code length that is potentially much smaller than the fixed length \(\log K\) for all states, and hence leads to a more space-saving representation.
We will apply the logarithmic scoring rule \(S(x, P) = -\log p(x)\) to both discrete and continuous variables \(x\) and corresponding distributions \(P\). As densities can take on values larger than one the logarithmic loss may become negative when \(P\) is a continuous distribution.
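As a minimal sketch (assuming NumPy and SciPy; the probabilities and the \(N(0, 0.01)\) model below are chosen purely for illustration), the following evaluates the log-loss for a discrete pmf and for a continuous density, where the score can indeed become negative:

```python
import numpy as np
from scipy import stats

# discrete case: surprise -log p(x) for an illustrative three-state pmf
p = {"a": 0.7, "b": 0.2, "c": 0.1}
for x, px in p.items():
    print(x, -np.log(px))           # more probable states receive shorter "codes"

# continuous case: densities may exceed 1, so the log-loss can be negative
P = stats.norm(loc=0.0, scale=0.1)  # N(0, 0.01)
print(-P.logpdf(0.0))               # -log p(0) = -log(3.99...) < 0
```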
The function \(S(x, P)\) is a scoring rule which is a type of loss function that assesses a probabilistic forecast \(P\) given the observed outcome \(x\).
A desirable property is that the scoring rule is proper, i.e. its risk, the expected score \(R_Q(P) =\text{E}_Q\left( S(x, P) \right)\), is minimised when the quoted model \(P\) equals the true data-generating model \(Q\). If the minimum is achieved exclusively at \(P=Q\) then the scoring rule is strictly proper.
Proper scoring rules are widely used in statistics and machine learning because they permit identification of best-approximating models by risk minimisation (Dawid and Musio 2014).
See the supplementary Probability and Distribution Refresher notes for examples of (strictly) proper scoring rules used in statistics and machine learning.
The theory of proper scoring rules (established in statistics) is closely related to that of Bregman divergences (established in optimisation).
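As a small numerical illustration of strict propriety of the log score (a sketch assuming NumPy; the Bernoulli model and the value \(q=0.3\) are chosen only for illustration), the expected log-loss under the true model is minimised exactly at the true parameter:

```python
import numpy as np

# risk of the log score when the data follow Bernoulli(q) and we quote Bernoulli(p):
#   R_q(p) = -q log(p) - (1-q) log(1-p)
q = 0.3
ps = np.linspace(0.01, 0.99, 99)
risk = -q * np.log(ps) - (1 - q) * np.log(1 - ps)
print(ps[np.argmin(risk)])  # ~0.30: the risk is minimised at p = q (strict propriety)
```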
The logarithmic scoring rule \(S(x, P) = -\log p(x)\) is uniquely characterised (e.g. Good (1952) and Bernardo (1979)). In particular, it is the only scoring rule that is both
- strictly proper and
- local with the score depending only on the value of the pdmf at \(x\) and not any other features of \(P\).
For statistics the logarithmic scoring rule is of prime importance because it underpins likelihood inference and Bayesian learning.
See also Note 4.1.
Example 3.1 Surprise:
The negative logarithm of a probability is called the surprise or surprisal. The surprise \(-\log p\) to observe a certain outcome (with probability \(p=1\)) is zero, and conversely the surprise to observe an outcome that cannot happen (with probability \(p=0\)) is infinite. The logarithmic scoring rule thus returns the surprise for the outcomes of a discrete random variable.
Example 3.2 Logit and log-odds:
The odds of an event with probability \(p\) are given by the ratio \(\frac{p}{1-p}\).
The log-odds are therefore the difference of the surprise of the complementary event (with probability \(1-p\)) and the surprise of the event (with probability \(p\)): \[ \begin{split} \text{logit}(p) &= \log\left( \frac{p}{1-p} \right) \\ &= -\log(1-p) - ( -\log p)\\ \end{split} \] The log-odds function is also known as the logit function. It maps the interval \([0,1]\) to the interval \([-\infty, +\infty]\). Its inverse is the logistic function \(\exp(x)/(1+\exp(x))\) mapping the interval \([-\infty, +\infty]\) to the interval \([0,1]\).
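A minimal sketch (assuming NumPy; the value \(p=0.8\) is illustrative) of the logit and its logistic inverse:

```python
import numpy as np

def logit(p):
    # log-odds: difference of the two surprisals, -log(1-p) - (-log p)
    return np.log(p / (1 - p))

def logistic(x):
    # inverse of the logit, mapping the real line back to (0, 1)
    return np.exp(x) / (1 + np.exp(x))

p = 0.8
print(logit(p))            # ~1.386
print(logistic(logit(p)))  # 0.8 recovered
```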
Example 3.3 Logarithmic scoring rule for a normal model:
If we quote in the log-loss \(S(x, P) = -\log p(x)\) the normal distribution \(P = N(\mu, \sigma^2)\) with density \(p(x |\mu, \sigma^2)= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\) we get \[ S\left(x,P\right) = \frac{1}{2} \left( \log(2\pi\sigma^2) + \frac{(x-\mu)^2}{\sigma^2}\right) \] For a normal model with fixed variance the log-loss is thus equivalent to the squared distance between \(x\) and \(\mu\).
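As a quick numerical check (a sketch assuming NumPy and SciPy; the values of \(x\), \(\mu\) and \(\sigma^2\) are arbitrary illustrative choices), the closed-form expression agrees with the negative log-density:

```python
import numpy as np
from scipy import stats

x, mu, sigma2 = 1.5, 1.0, 2.0  # arbitrary illustrative values
closed_form = 0.5 * (np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)
via_logpdf = -stats.norm(loc=mu, scale=np.sqrt(sigma2)).logpdf(x)
print(closed_form, via_logpdf)  # both ~1.328
```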
Example 3.4 Dawid-Sebastiani scoring rule:
The Dawid-Sebastiani scoring rule is given by \[ S\left(x, P\right) = \log \text{Var}(P) + \frac{(x-\text{E}(P))^2}{\text{Var}(P)} \] For normal \(P\) this is equivalent to the log-loss (Example 3.3).
Note that adding a constant to a loss function, or multiplying it by a positive factor, does not change the location of its minimum, so such loss functions are considered equivalent.
Unlike the log-loss, which is strictly proper, the Dawid-Sebastiani scoring rule is proper but not strictly proper. Since all \(P\) with the same mean and variance yield identical scores, there are multiple \(P\) achieving the minimum risk (Note 3.1).
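A minimal sketch (assuming NumPy) of the Dawid-Sebastiani score; it depends on \(P\) only through its mean and variance, which is why distinct distributions with matched first two moments cannot be separated by this rule:

```python
import numpy as np

def ds_score(x, mean, var):
    # Dawid-Sebastiani score: depends on P only via E(P) and Var(P)
    return np.log(var) + (x - mean) ** 2 / var

# any two quoted models with mean 0 and variance 1 (e.g. a standard normal and a
# rescaled Laplace) receive exactly the same score for every outcome x, hence the
# rule is proper but not strictly proper
for x in [-1.0, 0.0, 2.5]:
    print(ds_score(x, mean=0.0, var=1.0))
```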
3.2 Scoring rules and entropy
Entropy of a distribution
Assume that \(P\) is the model generating the data \(x\). The expectation of the scoring rule \(S(x, P)\) with regard to \(P\) is the risk \(R_P(P)\) of the model \(P\) under itself: \[ \begin{split} R_P(P) &= \text{E}_P\left( S(x, P) \right) \\ &= - \text{E}_P\left(\log p(x)\right) = H(P) \end{split} \] For the log-loss this yields the information entropy \(H(P)\) of the distribution \(P\).² The notation \(H(P)\) is a reminder that entropy is a functional of the distribution \(P\).
For an overview of the two main interpretations of entropy, disorder versus dispersal, see Note 3.3.
Traditionally, entropy has been considered as a measure of disorder. However, this interpretation is now viewed as outdated and in fact misleading as many apparently ordered systems have large entropy and some disordered-looking systems have low entropy.
A better and more useful intuition is that entropy measures spread or dispersal. This notion is now preferred in physics³ and it also aligns with the interpretation of entropy in statistics and machine learning.
As will become clear from the examples, information entropy quantifies the spread of the probability mass. When the probability mass is concentrated within a specific area, the entropy is low; conversely, when the probability mass is more broadly distributed, the entropy is high.
Shannon-Gibbs entropy
The information entropy \(H(P)\) of a discrete probability distribution \(P\) with pmf \(p(x)\) with \(x \in \Omega\) is called Shannon entropy (Shannon 1948): \[ H(P) = - \sum_{x} \log p(x) \, p(x) \] In statistical physics the same mathematical form is known as the Gibbs entropy.
However, physical entropies carry an additional factor, the Boltzmann constant \(k=1.380649 \times 10^{-23}\, \text{J/K}\), which gives entropy the physical unit of energy per temperature, whereas information entropies carry no physical unit.
The entropy of a discrete distribution is the expected surprise. We can also interpret it as the expected cost or expected code length when the data are generated according to model \(P\) (“sender”, “encoder”) and we are using the same model \(P\) to describe the data (“receiver”, “decoder”).
By construction, since all \(p(x) \in [0,1]\) and hence \(-\log p(x) \geq 0\), the Shannon-Gibbs entropy is non-negative.
Importantly, the Shannon-Gibbs entropy also has a combinatorial interpretation (see Example 3.12 for details).
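A minimal sketch (assuming NumPy) of the Shannon-Gibbs entropy as the expected surprise; the two example pmfs are illustrative and show that spread-out mass yields high entropy and concentrated mass low entropy:

```python
import numpy as np

def shannon_entropy(p):
    # H(P) = -sum_x log p(x) p(x) in nats; terms with p(x) = 0 contribute 0
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(np.log(p[nz]) * p[nz])

print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # log(4) ~ 1.386: spread-out mass
print(shannon_entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.168: concentrated mass
```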
Differential entropy
The entropy \(H(P)\) of a continuous distribution \(P\) with pdf \(p(x)\) is called differential entropy: \[ H(P) = - \int_x \log p(x) \, p(x) \, dx \] Differential entropy can be negative because the logarithm is taken of a density value, and unlike probabilities, densities are not constrained to be less than or equal to one.
Also, under a variable change, such as from \(x\) to \(y\), the shape of the pdf typically changes, so that differential entropy changes as well and is not invariant: \(H(P_y) \neq H(P_x)\).
3.3 Entropy examples
Models with single parameter
Example 3.5 Entropy of the geometric distribution:
With \(P = \text{Geom}(\theta)\) and associated pmf \(p(x|\theta) = \theta (1-\theta)^{x-1}\), \(\theta \in [0,1]\), support \(x \in \{1, 2, \ldots \}\) and \(\text{E}(x)= 1/\theta\) the Shannon-Gibbs entropy is \[ \begin{split} H(P) &= - \text{E}\left( \log \theta + (x-1) \log(1-\theta) \right)\\ &= -\log \theta+ \left(\frac{1}{\theta}-1\right)\log(1-\theta)\\ &= -\frac{\theta \log \theta + (1-\theta) \log(1-\theta) }{\theta} \end{split} \] Using the convention \(0\times\log(0)=0\) we see that the entropy of the geometric distribution for \(\theta = 1\) equals 0, i.e. it achieves the minimum possible Shannon-Gibbs entropy. Conversely, as \(\theta \rightarrow 0\) the entropy diverges to infinity.
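As a numerical check (a sketch assuming NumPy; the choice \(\theta=0.3\) and the truncation point are illustrative), the closed-form entropy agrees with a direct truncated sum over the support:

```python
import numpy as np

def geom_entropy_closed(theta):
    # H = -(theta log(theta) + (1-theta) log(1-theta)) / theta
    return -(theta * np.log(theta) + (1 - theta) * np.log(1 - theta)) / theta

def geom_entropy_sum(theta, xmax=10_000):
    # direct (truncated) sum of -log p(x) p(x) over x = 1, 2, ..., xmax,
    # working with log p(x) to stay numerically stable for small tail probabilities
    x = np.arange(1, xmax + 1)
    logp = np.log(theta) + (x - 1) * np.log(1 - theta)
    return -np.sum(logp * np.exp(logp))

theta = 0.3
print(geom_entropy_closed(theta), geom_entropy_sum(theta))  # both ~2.036
```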
Example 3.6 Entropy of the uniform distribution:
Consider the uniform distribution \(P_x = \text{Unif}(0, a)\) with \(a>0\), support \(x \in [0, a]\) and density \(p(x) = 1/a\). The corresponding differential entropy is \[ \begin{split} H( P_x ) &= - \int_0^a \log\left(\frac{1}{a}\right) \, \frac{1}{a} dx \\ &= \log a \int_0^a \frac{1}{a} dx \\ &= \log a \,. \end{split} \] Note that for \(0 < a < 1\) the differential entropy is negative.
Example 3.7 Entropy and variable change:
Starting with the uniform distribution \(P_x = \text{Unif}(0, a)\) from Example 3.6 the variable \(x\) is changed to \(y = x^2\) yielding the distribution \(P_y\) with support from \(0\) to \(a^2\) and density \(p(y) = 1/\left(2 a \sqrt{y}\right)\).
The corresponding differential entropy is \[ \begin{split} H( P_y ) &= \int_0^{a^2} \log \left(2 a \sqrt{y}\right) \, 1/\left(2 a \sqrt{y}\right) dy \\ &= \left[ \sqrt{y}/a \, \left(\log \left( 2 a \sqrt{y} \right)-1\right) \right]_{y=0}^{y=a^2} \\ &= \log \left(2 a^2\right) -1 \,. \end{split} \] This is negative for \(0 < a < \sqrt{e/2}\approx 1.1658\). As expected \(H( P_y ) \neq H( P_x )\) as differential entropy is not invariant against variable transformations.
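The two results can be checked numerically (a sketch assuming NumPy; the value \(a=0.5\) and the sample size are illustrative), estimating \(H(P_y)\) by Monte Carlo as the average surprise \(-\log p(y)\) under samples from \(P_y\):

```python
import numpy as np

a = 0.5  # illustrative; for 0 < a < 1 the entropy H(P_x) = log(a) is negative
print(np.log(a))  # ~ -0.693

# Monte Carlo estimate of H(P_y) = E(-log p(y)) for y = x^2, p(y) = 1/(2 a sqrt(y))
rng = np.random.default_rng(1)
y = rng.uniform(0.0, a, size=1_000_000) ** 2
print(np.mean(-np.log(1.0 / (2 * a * np.sqrt(y)))))  # ~ -1.693
print(np.log(2 * a ** 2) - 1)                        # closed form, also ~ -1.693
```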
Models with multiple parameters
Example 3.8 Entropy of the categorical distribution \(P\) with \(K\) categories:
Assuming class probabilities \(p_1, \ldots, p_K\) the Shannon-Gibbs entropy is \[ H(P) = - \sum_{k=1}^{K } \log(p_k)\, p_k \] The maximum entropy (\(\log K\)) is achieved for the discrete uniform distribution (Example 3.9) and the minimum entropy (\(0\)) for a concentrated categorical distribution (Example 3.10). Hence for a categorical distribution \(P\) with \(K\) categories we have \[ 0 \leq H(P) \leq \log K \]
Example 3.9 Entropy of the discrete uniform distribution \(U_K\):
Let \(p_1=p_2= \ldots = p_K = \frac{1}{K}\). Then \[H(U_K) = - \sum_{k=1}^{K}\log\left(\frac{1}{K}\right)\, \frac{1}{K} = \log K\]
Note that \(\log K\) is the largest value the Shannon-Gibbs entropy can assume with \(K\) classes (cf. Example 6.1) and indicates maximum spread of probability mass.
Example 3.10 Entropy of a categorical distribution with concentrated probability mass:
Let \(p_1=1\) and \(p_2=p_3=\ldots=p_K=0\). Using the convention \(0\times\log(0)=0\) we obtain for the Shannon-Gibbs entropy \[ H(P) = -\left( \log(1)\times 1 + \log(0)\times 0 + \dots \right) = 0 \]
Note that 0 is the smallest value that Shannon-Gibbs entropy can assume and that it corresponds to maximum concentration of probability mass.
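A minimal sketch of the two extreme cases (assuming SciPy, whose `scipy.stats.entropy` computes the Shannon entropy in nats by default and applies the convention \(0 \times \log(0) = 0\); the choice \(K=5\) is illustrative):

```python
import numpy as np
from scipy.stats import entropy  # Shannon entropy in nats by default

K = 5
print(entropy(np.full(K, 1.0 / K)), np.log(K))  # uniform: maximum entropy log(K) ~ 1.609
print(entropy([1.0, 0.0, 0.0, 0.0, 0.0]))       # concentrated: minimum entropy 0
```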
Example 3.11 Differential entropy of the normal distribution:
The log-density of the univariate normal distribution \(P = N(\mu, \sigma^2)\) is \(\log p(x |\mu, \sigma^2) = -\frac{1}{2} \left( \log(2\pi\sigma^2) + \frac{(x-\mu)^2}{\sigma^2} \right)\) with \(\sigma^2 > 0\). Using \(\text{E}\left((x-\mu)^2\right) = \sigma^2\), the corresponding differential entropy is \[ \begin{split} H(P) & = -\text{E}\left( \log p(x |\mu, \sigma^2) \right)\\ & = \frac{1}{2} \left( \log(2 \pi \sigma^2)+1\right) \,. \\ \end{split} \] Crucially, the entropy of a normal distribution depends only on its variance \(\sigma^2\), not on its mean \(\mu\). This is intuitively clear as the variance controls the concentration of the probability mass. A large variance means that the probability mass is more spread out and thus less concentrated around the mean.
For \(\sigma^2 < 1/(2 \pi e) \approx 0.0585\) the differential entropy is negative.
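A quick numerical check (a sketch assuming NumPy and SciPy; the variances and the mean are arbitrary illustrative values) of the closed form against SciPy's built-in differential entropy of the normal distribution:

```python
import numpy as np
from scipy import stats

for sigma2 in [2.0, 0.01]:  # 0.01 < 1/(2*pi*e), so the second entropy is negative
    closed_form = 0.5 * (np.log(2 * np.pi * sigma2) + 1)
    via_scipy = stats.norm(loc=3.0, scale=np.sqrt(sigma2)).entropy()
    print(closed_form, float(via_scipy))  # agree, and do not depend on the mean
```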
Example 3.12 \(\color{Red} \blacktriangleright\) Entropy and the multinomial coefficient:
Let \(\hat{Q}\) be the empirical categorical distribution with class probabilities \(\hat{q}_k = n_k/n\), where \(n_k\) is the observed count in class \(k\) and \(n=\sum_{k=1}^K n_k\) is the total count.
The number of possible permutations of \(n\) items of \(K\) distinct types is given by the multinomial coefficient \[ W = \binom{n}{n_1, \ldots, n_K} = \frac {n!}{n_1! \times n_2! \times\ldots \times n_K! } \]
It turns out that for large \(n\) the entropy \(H(\hat{Q})\) and the multinomial coefficient \(W\) are directly linked: \[ H(\hat{Q}) \approx \frac{1}{n} \log W \]
Recall the de Moivre-Stirling formula, which for large \(n\) allows us to approximate the factorial by \[ \log n! \approx n \log n -n \] With this, and using \(\sum_{k=1}^K n_k = n\), \[ \begin{split} \log W &= \log n! - \sum_{k=1}^K \log n_k!\\ & \approx n \log n -n - \sum_{k=1}^K (n_k \log n_k -n_k) \\ & = \sum_{k=1}^K n_k \log n - \sum_{k=1}^K n_k \log n_k\\ & = - n \sum_{k=1}^K \frac{n_k}{n} \log\left( \frac{n_k}{n} \right)\\ & = -n \sum_{k=1}^K \log (\hat{q}_k) \, \hat{q}_k \\ & = n H(\hat{Q}) \end{split} \]
The above combinatorial derivation of entropy is one of the cornerstones of statistical mechanics; see Akaike (1985) for a review of historical developments. The numbers of elements \(n_1, \ldots, n_K\) in each of the \(K\) classes correspond to the macrostate, and any of the \(W\) different allocations of the \(n\) elements to the \(K\) classes corresponds to an underlying microstate. The multinomial coefficient, and hence the entropy, is largest when there are only small differences (or none) among the \(n_k\), i.e. when the individual elements are equally spread across the \(K\) bins.
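A numerical illustration of \(H(\hat{Q}) \approx \frac{1}{n} \log W\) (a sketch assuming NumPy and SciPy; the class counts are hypothetical), computing \(\log W\) exactly via log-gamma functions:

```python
import numpy as np
from scipy.special import gammaln  # log n! = gammaln(n + 1)
from scipy.stats import entropy

counts = np.array([500, 300, 200])  # hypothetical counts for K = 3 classes
n = counts.sum()

log_W = gammaln(n + 1) - np.sum(gammaln(counts + 1))  # log multinomial coefficient
print(log_W / n)            # ~1.023
print(entropy(counts / n))  # H(Q_hat) ~ 1.030: close for large n
```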
Jaynes (2003) cites Graham Wallis (1962) for the rediscovery of the combinatorial foundations of information entropy.
The concept of entropy was first introduced in 1865 by Rudolf Clausius (1822-1888) in the context of thermodynamics. In physics entropy measures the dispersal of energy in a system. If energy is concentrated (and capacity for work is high) then the entropy is low, and conversely if energy is spread out (and capacity for work is low) the entropy is large. The total energy is conserved⁴ (first law of thermodynamics) but with time it will diffuse and thus entropy will increase with time (and capacity for work will decrease) (second law of thermodynamics).
The modern probabilistic definition of entropy was discovered in the 1870s by Josiah W. Gibbs (1839–1903) and Ludwig Boltzmann (1844–1906). In statistical mechanics entropy is proportional to the logarithm of the number of microstates (i.e. particular configurations of the system) compatible with the observed macrostate. Typically, in systems where the energy is spread out there are very large numbers of compatible configurations hence this corresponds to large entropy, and conversely, if the energy is concentrated there are only few such configurations, and thus it corresponds to low entropy.
In the 1940s and 1950s the notion of entropy turned out to be central also in information theory, a field pioneered by mathematicians such as Ralph Hartley (1888–1970), Solomon Kullback (1907–1994), Alan Turing (1912–1954), Richard Leibler (1914–2003), Irving J. Good (1916–2009), Claude Shannon (1916–2001), and Edwin T. Jaynes (1922–1998), and later further explored by Shun’ichi Amari (1936–), Imre Csiszár (1938–), Bradley Efron (1938–), Philip Dawid (1946–) and many others.
Of the above, Turing and Good were affiliated with the University of Manchester in the 1940–50s.
1. We follow the convention that scoring rules are loss functions and negatively oriented, so the goal is to minimise the score (cost, code length). However, some authors define \(S(x, P)\) with the opposite sign as a utility function with a positive orientation, in which case the aim is to maximise the score (reward).
2. For a proper scoring rule \(R_P(P)\) is the minimum risk with regard to the risk \(R_Q(P) = \text{E}_Q\left( S(x, P) \right)\). See Note 3.1 and Chapter 4.
3. See for example Leff (2007) and Entropy (energy dispersal) (Wikipedia).
4. Energy conservation itself arises as a consequence of the time-translation symmetry of physical laws, see Noether's theorem (Wikipedia).