3  Entropy

Entropy, a fundamental concept that originated in physics, plays a crucial role in information theory and statistical learning. This chapter introduces entropy via the route of scoring rules.

3.1 Information storage and scoring rules

Information storage units

When we record information on paper or on a computer, we need an alphabet, whose symbols act as units of stored information:

  • For an alphabet of size \(A=2\) (e.g. the symbols \(0\) and \(1\)) the storage unit is called a bit (binary digit). Representing \(K=256\) possible states requires \(\log_2 256 = 8\) bits (i.e. 1 byte) of storage.

  • For an alphabet of size \(A=10\) (e.g. the Arabic numerals) the unit is called a dit (decimal digit). Representing \(K=100\) possible states requires \(\log_{10} 100 = 2\) dits.

The bit and the dit are units used to quantify information content. One dit equals \(\log_2 10 \approx 3.32\) bits.

If the natural logarithm is used (\(A=e\)) the unit of information is called a nat. This is the standard unit of information used in statistics. One nat equals \(\log_2 e \approx 1.44\) bits.

Information storage with constant code length

We now assume a discrete variable \(x\) with \(K\) possible outcomes \(\Omega = \{\omega_1, \ldots, \omega_K\}\). In order to record a realised state of \(x\) using an alphabet of size \(A\) we require \[ S = \log_A K \] units of information. The base \(A\) can be larger or smaller than \(K\). We simply set \(A=e\) and use the natural logarithm and nats as units of information throughout. \(S\) is called the code length or the cost to describe the state of \(x\).

In the above we have tacitly assumed that storage size, code length, and cost requirements are fixed and identical for all possible \(K\) states. This can be made more explicit by writing \[ S = -\log \left( \frac{1}{K} \right) \] where \(1/K\) is the equal probability of each of the \(K\) states.
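As a quick numerical illustration (a minimal sketch, not part of the text), the constant code length \(S = \log_A K\) can be evaluated for different alphabet sizes; the value of \(K\) is chosen arbitrarily:

```python
# Minimal sketch: constant code length S = log_A(K) for K equally
# probable states, expressed in bits (A=2), dits (A=10) and nats (A=e).
# The value of K is an arbitrary example.
import math

K = 256  # number of possible states

bits = math.log(K, 2)   # 8.0
dits = math.log(K, 10)  # about 2.41
nats = math.log(K)      # about 5.55

print(f"{bits:.2f} bits, {dits:.2f} dits, {nats:.2f} nats")

# conversion factors: 1 dit ~ 3.32 bits, 1 nat ~ 1.44 bits
print(math.log2(10), math.log2(math.e))
```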

Variable code length and scoring rules

However, in practice, the \(K\) states are often not equally probable. For example, \(x\) might describe the letters in a text message, each occurring with a different frequency (e.g. the most common letter in the English language is “e”). Hence, rather than assuming equal probabilities we use a discrete distribution \(P\) with probability mass function \(p(x)\) to describe the different probabilities. As a result, instead of a constant code length for each state, we use variable code lengths, with more probable states having shorter codes and less probable states assigned longer codes.

Specifically, generalising from the case of constant code length above, we employ the negative logarithm to map the probability of a state \(x\) to a corresponding cost and code length: \[ S(x, P) = -\log p(x) \] This is called logarithmic loss or logarithmic scoring rule1 — see also Note 3.1 and Note 3.2.

As we will see below (Example 3.7, Example 3.8 and Example 3.9) using variable code lengths allows for expected code lengths that are potentially much smaller than using a fixed length \(\log K\) for all states, and hence leads to a more space-saving representation.

We will apply the logarithmic scoring rule \(S(x, P) = -\log p(x)\) to both discrete and continuous variables \(x\) and corresponding distributions \(P\). As densities can take on values larger than 1 the logarithmic loss may become negative when \(P\) is a continuous distribution.
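The following minimal sketch (with distributions and values chosen purely for illustration) evaluates the logarithmic score \(S(x, P) = -\log p(x)\) once for a discrete and once for a continuous quoted distribution, showing that the score can become negative in the continuous case:

```python
# Minimal sketch of the logarithmic scoring rule S(x, P) = -log p(x);
# distributions and observed values are chosen purely for illustration.
import math

# Discrete P: probability mass function over three states
p = {"a": 0.5, "b": 0.3, "c": 0.2}
x = "b"
print(-math.log(p[x]))  # about 1.20, non-negative since p[x] <= 1

# Continuous P: normal density with a small standard deviation;
# the density can exceed 1, so the score may be negative.
mu, sigma = 0.0, 0.1
x = 0.05
density = math.exp(-(x - mu) ** 2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)
print(-math.log(density))  # about -1.26, negative here
```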

Note 3.1: \(\color{Red} \blacktriangleright\) General scoring rules

The function \(S(x, P)\) is a scoring rule, a loss function that assesses a probabilistic forecast \(P\) given the observed outcome \(x\).

A desirable property is that the scoring rule is proper, i.e. that its risk, the expected score \(\text{E}_Q\left( S(x, P) \right)\), is minimised when the quoted model \(P\) equals the true data-generating model \(Q\). If the minimum is unique then the scoring rule is strictly proper.

Proper scoring rules are widely used in statistics and machine learning because they permit identification of best-approximating models by risk minimisation.

The theory of proper scoring rules is closely related to that of Bregman divergences.

Note 3.2: \(\color{Red} \blacktriangleright\) Logarithmic scoring rule

The logarithmic scoring rule \(S(x, P) = -\log p(x)\) is uniquely characterised (e.g. Hartley 1928, Shannon 1948, Good 1952, Bernardo 1979). In particular, it is the only scoring rule that is both

  • strictly proper and
  • local with the score depending only on the value of the pdmf at \(x\).

While there are other suitable scoring rules (some of which are common in machine learning), we will exclusively use the logarithmic scoring rule because it underpins classical likelihood inference and Bayesian learning.

See also Example 4.2.

Example 3.1 Surprise:

The negative logarithm of a probability is called the surprise or surprisal. The surprise \(-\log p\) to observe a certain outcome (with probability \(p=1\)) is zero, and conversely the surprise to observe an outcome that cannot happen (with probability \(p=0\)) is infinite.

Example 3.2 Logit and log-odds:

The odds of an event with probability \(p\) are given by the ratio \(\frac{p}{1-p}\).

The log-odds are therefore the difference of the surprise of the complementary event (with probability \(1-p\)) and the surprise of the event (with probability \(p\)): \[ \begin{split} \text{logit}(p) &= \log\left( \frac{p}{1-p} \right) \\ &= -\log(1-p) - ( -\log p)\\ \end{split} \] The log-odds function is also known as the logit function. It maps the interval \([0,1]\) to the interval \([-\infty, +\infty]\). Its inverse is the logistic function \(\exp(x)/(1+\exp(x))\) mapping the interval \([-\infty, +\infty]\) to the interval \([0,1]\).
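A minimal sketch (not from the text) of the logit and logistic functions, confirming that they are inverses of each other; the value of \(p\) is an arbitrary example:

```python
# Minimal sketch of the logit (log-odds) function and its inverse,
# the logistic function; the value of p is an arbitrary example.
import math

def logit(p: float) -> float:
    """Log-odds: the surprise of the complementary event minus the surprise of the event."""
    return math.log(p / (1 - p))

def logistic(x: float) -> float:
    """Inverse of the logit, mapping the real line back to (0, 1)."""
    return math.exp(x) / (1 + math.exp(x))

p = 0.8
x = logit(p)
print(x)            # about 1.386
print(logistic(x))  # recovers 0.8
```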

Example 3.3 Logarithmic score and normal distribution:

If we quote in the logarithmic scoring rule the normal distribution \(P = N(\mu, \sigma^2)\) with density \(p(x |\mu, \sigma^2)= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\) we get as score \[ S\left(x,N(\mu, \sigma^2 )\right) = \frac{1}{2} \left( \log(2\pi\sigma^2) + \frac{(x-\mu)^2}{\sigma^2}\right) \] For fixed variance \(\sigma^2\) the logarithmic score is thus equivalent to the squared distance between \(x\) and \(\mu\).
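As a quick check (a minimal sketch with arbitrarily chosen values), the closed-form score above can be compared with the negative log density computed directly:

```python
# Minimal sketch (x, mu, sigma^2 chosen arbitrarily): the closed-form score
# S(x, N(mu, sigma^2)) = 0.5 * (log(2*pi*sigma^2) + (x - mu)^2 / sigma^2)
# agrees with the negative log density of the normal distribution.
import math

def normal_log_score(x, mu, sigma2):
    return 0.5 * (math.log(2 * math.pi * sigma2) + (x - mu) ** 2 / sigma2)

def normal_neg_log_density(x, mu, sigma2):
    density = math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
    return -math.log(density)

x, mu, sigma2 = 1.7, 1.0, 2.0
print(normal_log_score(x, mu, sigma2))       # identical values
print(normal_neg_log_density(x, mu, sigma2))
```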

3.2 Expected score and entropy

Entropy of a distribution

Given the scoring rule \(S(x, P)\) we can compute its expectation assuming \(x \sim P\), i.e. that the quoted model and the model for \(x\) are identical: \[ \begin{split} H(P) &= \text{E}_P\left( S(x, P) \right) \\ &= - \text{E}_P\left(\log p(x)\right) \\ \end{split} \] For the logarithmic scoring rule this is the entropy of the distribution \(P\).2

Note that in the above the distribution \(P\) serves two distinct purposes. First, it acts as the data-generating model (as indicated by the expectation \(\text{E}_P\) with respect to \(P\)). Second, it is the model that is evaluated on the observations (through \(-\log p(x)\)). Therefore, entropy can be viewed as a functional of the distribution \(P\).

Shannon-Gibbs entropy

The entropy of a discrete probability distribution \(P\) with probability mass function \(p(x)\) with \(x \in \Omega\) is called Shannon entropy (1948)3. In statistical physics, the Shannon entropy is known as the Gibbs entropy (1878):

\[ H(P) = - \sum_{x \in \Omega} \log p(x) \, p(x) \] The entropy of a discrete distribution is the expected surprise. We can also interpret it as the expected cost or expected code length when the data are generated according to model \(P\) (“sender”, “encoder” ) and we are using the same model \(P\) to describe the data (“receiver”, “decoder”).

Furthermore, the Shannon-Gibbs entropy also has a combinatorial interpretation (see Example 3.11).

Since \(p(x) \in [0,1]\) we have \(-\log p(x) \geq 0\), so the Shannon-Gibbs entropy is bounded below by zero, i.e. it is non-negative.
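The interpretation of entropy as expected surprise can be illustrated numerically; the following sketch (with probabilities chosen arbitrarily) averages the surprise \(-\log p(x)\) over samples drawn from \(P\) and compares the result with \(H(P)\):

```python
# Minimal sketch (probabilities chosen arbitrarily): the Shannon-Gibbs entropy
# H(P) = -sum_x log p(x) p(x) equals the expected surprise, here approximated
# by averaging -log p(x) over samples drawn from P.
import math
import random

random.seed(0)
states = ["a", "b", "c"]
probs = [0.6, 0.3, 0.1]

entropy = -sum(p * math.log(p) for p in probs)  # H(P) in nats, about 0.898

n = 100_000
samples = random.choices(states, weights=probs, k=n)
surprise = {s: -math.log(p) for s, p in zip(states, probs)}
avg_surprise = sum(surprise[s] for s in samples) / n  # close to H(P)

print(entropy, avg_surprise)
```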

Differential entropy

Applying the definition of entropy to a continuous probability distribution \(P\) with density \(p(x)\) yields the differential entropy: \[ H(P) = - \int_x \log p(x) \, p(x) \, dx \] Differential entropy can be negative because the logarithm is applied to a density, which, unlike a probability, can take on values greater than one.

Moreover, for continuous random variables the shape of the density typically changes under a variable transformation, say from \(x\) to \(y\). The differential entropy changes as well; it is not invariant under such transformations, so that in general \(H(P_y) \neq H(P_x)\).

Note 3.3: \(\color{Red} \blacktriangleright\) Interpretation of entropy — spread versus disorder

Traditionally, entropy has been considered as a measure of order, with large entropy corresponding to disorder and low entropy to order. However, this interpretation is now viewed as outdated and in fact misleading as many apparently ordered systems have large entropy and some disordered-looking systems have low entropy.

A better and more useful intuition is that entropy measures spread or dispersal. This notion is now preferred in Physics4 and it also aligns with the interpretation of entropy in Statistics and Machine Learning.

As will become clear from the examples, entropy quantifies the spread of the probability mass. When the probability mass is concentrated within a specific area, the entropy is low; conversely, when the probability mass is more broadly distributed, the entropy is high.

3.3 Entropy examples

Models with single parameter

Example 3.4 The Shannon-Gibbs entropy of the geometric distribution \(F_x = \text{Geom}(\theta)\) with probability mass function \(p(x|\theta) = \theta (1-\theta)^{x-1}\), \(\theta \in [0,1]\), support \(x \in \{1, 2, \ldots \}\) and \(\text{E}(x)= 1/\theta\) is \[ \begin{split} H(F_x) &= - \text{E}\left( \log \theta + (x-1) \log(1-\theta) \right)\\ &= -\log \theta - \left(\frac{1}{\theta}-1\right)\log(1-\theta)\\ &= -\frac{\theta \log \theta + (1-\theta) \log(1-\theta) }{\theta} \end{split} \] Using the identity \(0\times\log(0)=0\) we see that the entropy of the geometric distribution for \(\theta = 1\) equals 0, i.e. it achieves the minimum possible Shannon-Gibbs entropy. Conversely, as \(\theta \rightarrow 0\) it diverges to infinity.
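A quick numerical check (a minimal sketch, \(\theta\) chosen arbitrarily) comparing the closed-form entropy with a truncated direct evaluation of the expected surprise:

```python
# Minimal sketch (theta chosen arbitrarily): the closed-form entropy of the
# geometric distribution compared with a truncated direct sum over its pmf.
import math

theta = 0.3

# closed form: H = -(theta*log(theta) + (1-theta)*log(1-theta)) / theta
closed_form = -(theta * math.log(theta) + (1 - theta) * math.log(1 - theta)) / theta

# direct evaluation of -sum_x p(x) log p(x); the tail beyond x = 500 is negligible here
direct = 0.0
for x in range(1, 500):
    p = theta * (1 - theta) ** (x - 1)
    direct -= p * math.log(p)

print(closed_form, direct)  # both about 2.036 nats for theta = 0.3
```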

Example 3.5 Consider the uniform distribution \(F_x = \text{Unif}(0, a)\) with \(a>0\), support \(x \in [0, a]\) and density \(p(x) = 1/a\). The corresponding differential entropy is \[ \begin{split} H( F_x ) &= - \int_0^a \log\left(\frac{1}{a}\right) \, \frac{1}{a} dx \\ &= \log a \int_0^a \frac{1}{a} dx \\ &= \log a \,. \end{split} \] Note that for \(0 < a < 1\) the differential entropy is negative.

Example 3.6 Starting with the uniform distribution \(F_x = \text{Unif}(0, a)\) from Example 3.5 the variable \(x\) is changed to \(y = x^2\) yielding the distribution \(F_y\) with support from \(0\) to \(a^2\) and density \(p(y) = 1/\left(2 a \sqrt{y}\right)\).

The corresponding differential entropy is \[ \begin{split} H( F_y ) &= \int_0^{a^2} \log \left(2 a \sqrt{y}\right) \, 1/\left(2 a \sqrt{y}\right) dy \\ &= \left[ \sqrt{y}/a \, \left(\log \left( 2 a \sqrt{y} \right)-1\right) \right]_{y=0}^{y=a^2} \\ &= \log \left(2 a^2\right) -1 \,. \end{split} \] This is negative for \(0 < a < \sqrt{e/2}\approx 1.1658\). As expected \(H( F_y ) \neq H( F_x )\) as differential entropy is not invariant against variable transformations.
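Both differential entropies can be verified by Monte Carlo; the sketch below (with \(a\) chosen arbitrarily) estimates \(\text{E}(-\log p)\) under \(F_x\) and \(F_y\):

```python
# Minimal sketch (a chosen arbitrarily): Monte Carlo check of the differential
# entropies H(F_x) = log(a) for Unif(0, a) and H(F_y) = log(2 a^2) - 1 for y = x^2.
import math
import random

random.seed(1)
a = 2.0
n = 200_000

xs = [random.uniform(0.0, a) for _ in range(n)]

# H(F_x) = E(-log p(x)) with constant density p(x) = 1/a
h_x = sum(-math.log(1.0 / a) for _ in xs) / n

# H(F_y) = E(-log p(y)) with y = x^2 and density p(y) = 1 / (2 a sqrt(y));
# note sqrt(y) = x since x >= 0
h_y = sum(-math.log(1.0 / (2 * a * x)) for x in xs) / n

print(h_x, math.log(a))               # both about 0.693
print(h_y, math.log(2 * a**2) - 1)    # both about 1.079
```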

Models with multiple parameters

Example 3.7 Entropy of the categorical distribution \(P\) with \(K\) categories.

Assuming class probabilities \(p_1, \ldots, p_K\) the Shannon-Gibbs entropy is \[ H(P) = - \sum_{k=1}^{K } \log(p_k)\, p_k \]

As \(P\) is discrete \(H(P)\) is bounded below by 0. Furthermore, it is also bounded above by \(\log K\) (cf. Example 6.1). Hence for a categorical distribution \(P\) with \(K\) categories we have \[ 0 \leq H(P) \leq \log K \] The maximum (\(\log K\)) is achieved for the discrete uniform distribution (Example 3.8) and the minimum (\(0\)) for a concentrated categorical distribution (Example 3.9).

Example 3.8 Entropy of the discrete uniform distribution \(U_K\):

Let \(p_1=p_2= \ldots = p_K = \frac{1}{K}\). Then \[H(U_K) = - \sum_{k=1}^{K}\log\left(\frac{1}{K}\right)\, \frac{1}{K} = \log K\]

Note that \(\log K\) is the largest value the Shannon-Gibbs entropy can assume with \(K\) classes (cf. Example 6.1) and indicates maximum spread of probability mass.

Example 3.9 Entropy of a categorical distribution with concentrated probability mass:

Let \(p_1=1\) and \(p_2=p_3=\ldots=p_K=0\). Using \(0\times\log(0)=0\) we obtain for the Shannon-Gibbs entropy \[H(P) = -\left(\log(1)\times 1 + \log(0)\times 0 + \dots\right) = 0\]

Note that 0 is the smallest value that Shannon-Gibbs entropy can assume and that it corresponds to maximum concentration of probability mass.
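The bounds \(0 \leq H(P) \leq \log K\) from Example 3.7, together with the two extreme cases of Example 3.8 and Example 3.9, can be illustrated numerically (a minimal sketch with \(K=4\) and an arbitrary intermediate distribution):

```python
# Minimal sketch: the entropy of a categorical distribution with K classes lies
# between 0 (all mass in one class) and log K (uniform); K = 4 and the
# intermediate distribution are arbitrary examples.
import math

def categorical_entropy(probs):
    # convention: 0 * log(0) = 0, so zero-probability classes are skipped
    return -sum(p * math.log(p) for p in probs if p > 0)

K = 4
uniform = [1.0 / K] * K
concentrated = [1.0] + [0.0] * (K - 1)
intermediate = [0.7, 0.1, 0.1, 0.1]

print(categorical_entropy(uniform), math.log(K))  # both log(4), about 1.386
print(categorical_entropy(concentrated))          # 0.0
print(categorical_entropy(intermediate))          # about 0.940, in between
```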

Example 3.10 Differential entropy of the normal distribution:

The log density of the univariate normal \(N(\mu, \sigma^2)\) distribution is \(\log p(x |\mu, \sigma^2) = -\frac{1}{2} \left( \log(2\pi\sigma^2) + \frac{(x-\mu)^2}{\sigma^2} \right)\) with \(\sigma^2 > 0\). Using \(\text{E}\left((x-\mu)^2\right) = \sigma^2\) the corresponding differential entropy is \[ \begin{split} H(P) & = -\text{E}\left( \log p(x |\mu, \sigma^2) \right)\\ & = \frac{1}{2} \left( \log(2 \pi \sigma^2)+1\right) \,. \\ \end{split} \] Crucially, the entropy of a normal distribution depends only on its variance \(\sigma^2\), not on its mean \(\mu\). This is intuitively clear as the variance controls the concentration of the probability mass. A large variance means that the probability mass is more spread out and thus less concentrated around the mean.

For \(\sigma^2 < 1/(2 \pi e) \approx 0.0585\) the differential entropy is negative.
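A minimal sketch (with \(\mu\) and \(\sigma\) chosen arbitrarily) comparing the closed-form differential entropy with a Monte Carlo average of the negative log density:

```python
# Minimal sketch (mu, sigma chosen arbitrarily): the differential entropy of
# N(mu, sigma^2) is 0.5 * (log(2*pi*sigma^2) + 1) and depends only on sigma^2,
# checked here by a Monte Carlo average of the negative log density.
import math
import random

random.seed(42)
mu, sigma = 3.0, 0.5
n = 200_000

def neg_log_density(x):
    return 0.5 * (math.log(2 * math.pi * sigma**2) + (x - mu) ** 2 / sigma**2)

closed_form = 0.5 * (math.log(2 * math.pi * sigma**2) + 1)
mc_estimate = sum(neg_log_density(random.gauss(mu, sigma)) for _ in range(n)) / n

print(closed_form, mc_estimate)        # both about 0.726 nats
print(1 / (2 * math.pi * math.e))      # threshold 0.0585 below which H is negative
```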

Example 3.11 \(\color{Red} \blacktriangleright\) Entropy and the multinomial coefficient:

Let \(\hat{Q}\) be the empirical categorical distribution with observed frequencies \(\hat{q}_k = n_k/n\), where \(n_k\) is the number of counts in class \(k\) and \(n=\sum_{k=1}^K n_k\) is the total number of counts.

The number of possible permutations of \(n\) items of \(K\) distinct types is given by the multinomial coefficient \[ W = \binom{n}{n_1, \ldots, n_K} = \frac {n!}{n_1! \times n_2! \times\ldots \times n_K! } \]

It turns out that for large \(n\) these two quantities are directly linked: \[ H(\hat{Q}) \approx \frac{1}{n} \log W \]

Recall the de Moivre-Stirling formula, which for large \(n\) allows us to approximate the factorial by \[ \log n! \approx n \log n -n \] With this \[ \begin{split} \log W &= \log n! - \sum_{k=1}^K \log n_k!\\ & \approx n \log n -n - \sum_{k=1}^K (n_k \log n_k -n_k) \\ & = \sum_{k=1}^K n_k \log n - \sum_{k=1}^K n_k \log n_k\\ & = - n \sum_{k=1}^K \frac{n_k}{n} \log\left( \frac{n_k}{n} \right)\\ & = -n \sum_{k=1}^K \log (\hat{q}_k) \, \hat{q}_k \\ & = n H(\hat{Q}) \end{split} \]
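As a numerical check of the approximation \(H(\hat{Q}) \approx \frac{1}{n} \log W\), here is a minimal sketch with arbitrarily chosen counts, using the log-gamma function to compute the factorials exactly:

```python
# Minimal sketch (counts chosen arbitrarily): for large n the scaled log
# multinomial coefficient (1/n) log W approaches the entropy of the empirical
# distribution with frequencies q_k = n_k / n.
import math

counts = [500, 300, 150, 50]   # n_k, so n = 1000
n = sum(counts)

# log W = log n! - sum_k log n_k!, computed exactly via the log-gamma function
log_w = math.lgamma(n + 1) - sum(math.lgamma(nk + 1) for nk in counts)

# Shannon-Gibbs entropy of the empirical distribution
entropy = -sum((nk / n) * math.log(nk / n) for nk in counts)

print(log_w / n, entropy)  # about 1.13 vs 1.14, closer as n grows
```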

The above combinatorial derivation of entropy is one of the cornerstones of statistical mechanics and is credited to Boltzmann (1877) and Gibbs (1878). The number of elements \(n_1, \ldots, n_K\) in each of the \(K\) classes corresponds to the macrostate, and any of the \(W\) different allocations of the \(n\) elements to the \(K\) classes to an underlying microstate. The multinomial coefficient, and hence the entropy, is largest when there are only small differences (or none) among the \(n_k\), i.e. when the individual elements are equally spread across the \(K\) bins.

In statistics the above derivation of entropy was rediscovered by Wallis (1962).

Tip: A bit of history

The concept of entropy was first introduced in 1865 by Rudolf Clausius (1822–1888) in the context of thermodynamics. In physics entropy measures the dispersal of energy in a system. If energy is concentrated (and the capacity for work is high) then the entropy is low; conversely, if energy is spread out (and the capacity for work is low) the entropy is large. The total energy is conserved5 (first law of thermodynamics) but with time it will diffuse, and thus entropy will increase with time (and the capacity for work will decrease) (second law of thermodynamics).

The modern probabilistic definition of entropy was discovered in the 1870s by Ludwig Boltzmann (1844–1906) and Josiah W. Gibbs (1839–1903). In statistical mechanics entropy is proportional to the logarithm of the number of microstates (i.e. particular configurations of the system) compatible with the observed macrostate. Typically, in systems where the energy is spread out there are very large numbers of compatible configurations hence this corresponds to large entropy, and conversely, if the energy is concentrated there are only few such configurations, and thus it corresponds to low entropy.

In the 1940s–1950s the notion of entropy turned out to be central also in information theory, a field pioneered by mathematicians such as Ralph Hartley (1888–1970), Solomon Kullback (1907–1994), Alan Turing (1912–1954), Richard Leibler (1914–2003), Irving J. Good (1916–2009), Claude Shannon (1916–2001), and Edwin T. Jaynes (1922–1998), and later further explored by Shun’ichi Amari (1936–), Imre Csiszár (1938–), Bradley Efron (1938–), Philip Dawid (1946–) and many others.

Of the above, Turing and Good were affiliated with the University of Manchester in the 1940–50s.


  1. We follow the convention that scoring rules are loss functions and negatively oriented, so the goal is to minimise the score (cost, code length). However, some authors define \(S(x, P)\) with the opposite sign as a utility function with a positive orientation, in which case the aim is to maximise the score (reward).↩︎

  2. For other proper scoring rules it is called the generalised entropy or the minimum risk.↩︎

  3. Shannon, C. E. 1948. A mathematical theory of communication. Bell System Technical Journal 27:379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x↩︎

  4. For example, see: H. S. Leff. 2007. Entropy, its language, and interpretation. Found. Phys. 37: 1744–1766, D. S. Lemons. 2013. A student’s guide to entropy. CUP, and https://en.wikipedia.org/wiki/Entropy_(energy_dispersal)↩︎

  5. Energy conservation itself arises as a consequence of the time-translation symmetry of physical laws, see Noether’s theorem.↩︎