3  Entropy

Entropy, a fundamental concept that originated in physics, is also central to information theory and statistical learning. This chapter introduces information entropy through log-loss and combinatorics.

3.1 Code length and log-loss

Information storage units

When we record information on paper or on a computer, we need an alphabet, whose symbols act as units of stored information:

  • For an alphabet of size \(b=2\) (e.g. symbols \(0\) and \(1\)) the storage unit is called bit (binary digit). To represent \(K=256\) possible states requires \(\log_2 256 = 8\) bits (or 1 byte) of storage.

  • For an alphabet of size \(b=10\) (e.g. Arabic numerals) the unit is called dit (decimal digit). To represent \(K=100\) possible states requires \(\log_{10} 100 = 2\) dits.

Bit and dit are units used to quantify information content. One dit is equal to \(\log_2 10 \approx 3.32\) bits.

If the natural logarithm is used (\(b=e\)) the unit of information is called nat. This is the standard unit of information used in statistics. It is equal to \(\log_2 e \approx 1.44\) bits.

Constant code lengths

We now assume a discrete variable \(x\) with \(K\) possible outcomes \(\Omega = \{\omega_1, \ldots, \omega_K\}\). In order to record a realised state of \(x\) using an alphabet of size \(b\) we require \[ S = \log_b K \] units of information. \(S\) is called the code length or the cost to describe the state of \(x\). The base \(b\) can be larger or smaller than \(K\).

In the above we have tacitly assumed that storage size, code length, and cost requirements are fixed and identical for all possible \(K\) states. This can be made more explicit by writing \[ S = -\log_b \left( \frac{1}{K} \right) \] where \(1/K\) is the equal probability of each of the \(K\) states.
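As a quick numerical check, the following minimal Python sketch (assuming only the standard math module) computes the fixed code length \(S = \log_b K\) for several alphabet sizes and converts between the units introduced above:

```python
import math

K = 256                       # number of possible states

# fixed code length S = log_b(K) for different alphabet sizes b
bits = math.log2(K)           # b = 2  -> 8.0 bits
dits = math.log10(K)          # b = 10 -> approx. 2.41 dits
nats = math.log(K)            # b = e  -> approx. 5.55 nats

# unit conversions: 1 dit = log2(10) bits, 1 nat = log2(e) bits
print(bits)                       # 8.0
print(dits * math.log2(10))       # 8.0
print(nats * math.log2(math.e))   # 8.0
```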

Variable code lengths

In practice the \(K\) states are usually not equally likely. For example, \(x\) might represent letters in a text message, each occurring with different frequency. To account for that we use a distribution \(P\) with pmf \(p(x)\) and allow variable code lengths per state.

Generalising the case of constant code lengths, we use the negative logarithm to map the probability of a state \(x\) to a corresponding cost and code length: \[ S(x, P) = -\log p(x) \] This is called the log-loss or the logarithmic scoring rule. Under this rule, more probable states under model \(P\) get shorter codes (smaller loss/cost) and less probable states get longer codes (larger loss/cost).

As we will see in the examples (Example 3.3, Example 3.4 and Example 3.5) using variable code lengths lets the expected code length be much smaller than for a fixed-length encoding.
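The following minimal sketch (in Python, using bits as units and a hypothetical distribution over \(K=4\) states) illustrates this point: the expected code length under variable-length coding is smaller than the fixed code length \(\log_2 K\).

```python
import numpy as np

# hypothetical probabilities of K = 4 states, not equally likely
p = np.array([0.5, 0.25, 0.125, 0.125])

code_lengths = -np.log2(p)                  # per-state code lengths: 1, 2, 3, 3 bits
expected_length = np.sum(p * code_lengths)  # 0.5*1 + 0.25*2 + 2*0.125*3 = 1.75 bits
fixed_length = np.log2(len(p))              # 2 bits for a fixed-length encoding

print(expected_length, fixed_length)        # 1.75 < 2.0
```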

The log-loss underpins much of statistical estimation and inference through likelihood and Bayesian methods. It is the leading example of a proper scoring rule, a broader class of loss functions suitable for principled statistical analysis (Note 3.1).

Scoring rules are only defined up to an equivalence class \(S^{\text{equiv}}(x, P) = k S(x, P) + c(x)\), where \(k\) is a positive scaling factor. All equivalent scoring rules have the same minimiser for fixed \(x\).

In the equivalence class for the log-loss, the scaling factor \(k\) determines the base of the logarithm. In the following, we set \(b=e\) and use the natural logarithm and nats as units of information throughout. We will also apply the log-loss to both discrete and continuous variables \(x\) and corresponding models \(P\).

Note 3.1: \(\color{Red} \blacktriangleright\) Proper scoring rules

The logarithmic scoring rule \(S(x, P) = -\log p(x)\) is a particularly important loss function for evaluating probabilistic models. It is unique in being the only local strictly proper scoring rule, i.e. it depends solely on the value of the pdmf at the observed outcome and not on any other features of the distribution.

However, instead of the negative logarithm one may also employ other suitable loss functions as scoring rule, while still preserving the useful properties of the log-loss. This leads to the wider class of proper scoring rules. In machine learning these are related to Bregman divergences.

Using proper scoring rules other than the log-loss may be preferable, e.g., for computational convenience and simplicity, for numerical stability, for robustness to outliers, or for interpretability.

Example 3.1 Surprise and log-odds:

The negative logarithm of a probability or density is called the surprise or surprisal. The log-loss thus returns the surprise for the outcomes of a random variable.

For a discrete random variable, the surprise \(-\log p\) to observe an outcome with probability \(p=1\) is zero, and conversely the surprise to observe an outcome with probability \(p=0\) is infinite.

The log-odds \[ \log\left( \frac{p}{1-p} \right) = -\log(1-p) - ( -\log p) \] is the difference between the surprise of the complementary event (probability \(1-p\)) and the surprise of the event (probability \(p\)).
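A minimal numerical check of this identity (in Python, with a hypothetical probability \(p = 0.8\)):

```python
import numpy as np

p = 0.8                                   # hypothetical probability of the event

surprise_event = -np.log(p)               # approx. 0.223 nats
surprise_complement = -np.log(1 - p)      # approx. 1.609 nats
log_odds = np.log(p / (1 - p))            # approx. 1.386 nats

# log-odds = surprise of complementary event minus surprise of event
print(np.isclose(log_odds, surprise_complement - surprise_event))  # True
```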

Example 3.2 Log-loss for a normal model:

If we quote the normal distribution \(P = N(\mu, \sigma^2)\) with density \(p(x |\mu, \sigma^2)= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\) in the log-loss \(S(x, P) = -\log p(x)\) we get \[ S\left(x,P\right) = \frac{1}{2} \left( \log(2\pi\sigma^2) + \frac{(x-\mu)^2}{\sigma^2}\right) \] For a normal model with fixed variance the log-loss is thus equivalent to the squared distance between \(x\) and \(\mu\).
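A short sketch (assuming scipy is available, with hypothetical values for \(x\), \(\mu\) and \(\sigma^2\)) confirms that the closed form above agrees with the negative log-density computed directly:

```python
import numpy as np
from scipy.stats import norm

x, mu, sigma2 = 1.5, 0.0, 2.0             # hypothetical observation and parameters

# log-loss S(x, P) = -log p(x | mu, sigma^2), via scipy and via the closed form
log_loss_direct = -norm.logpdf(x, loc=mu, scale=np.sqrt(sigma2))
log_loss_formula = 0.5 * (np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)

print(np.isclose(log_loss_direct, log_loss_formula))  # True
```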

3.2 Information entropy

Entropy of a distribution

The expectation of the log-loss \[ \begin{split} H(P) &= \operatorname{E}_P\left( S(x, P) \right) \\ &= - \operatorname{E}_P\left(\log p(x)\right) \end{split} \] is the information entropy \(H(P)\) of the distribution \(P\).

The notation \(H(P)\) emphasises that entropy is a functional of the distribution \(P\). Note that in this expression \(P\) serves two distinct roles, first as the data-generating model (the expectation is taken with respect to \(P\)) and second as the model \(P\) quoted in the log-loss.

Information entropy is only defined up to an affine transformation with a positive scaling factor, as the underlying loss function is itself only defined up to an equivalence class \[ H(P)^{\text{equiv}}= - k \operatorname{E}_P\left(\log p(x)\right) + c \] The scaling factor \(k\) determines the units (or the base of the logarithm) and the additive constant \(c = \operatorname{E}_P( c(x))\) can be chosen to fix a reference value (e.g. typically to set either the maximum or the minimum entropy to zero).

Note 3.2: Other names for information entropy

Information entropy is also called Shannon entropy (for discrete distributions) and differential entropy (for continuous distributions).

In physics, the formula for information entropy is primarily known as Gibbs entropy, but it is also referred to as Boltzmann-Gibbs entropy (BG entropy) or Boltzmann-Gibbs-Shannon entropy (BGS entropy). Physical entropy also carries the Boltzmann constant \(k_B=1.380649 \times 10^{-23}\, \text{J/K}\) as scaling factor, giving physical entropy units of energy/temperature.1

We will mostly use the term information entropy.

Concavity

A key property of the entropy \(H(P)\) is that it is strictly concave in \(P\). This means that \[ H( P_{\lambda} ) > (1-\lambda) H(P_0) + \lambda H(P_1) \] for the mixture \(P_{\lambda} = (1-\lambda) P_0 + \lambda P_1\) with \(0 < \lambda < 1\) and \(P_0 \neq P_1\), and follows from the fact that \(-x \log(x)\) is a strictly concave function.

Strict concavity implies that mixing states (represented by \(P_0\) and \(P_1\)) raises the entropy. Further, it ensures that the maximum is unique (if a maximum exists).

Concavity is a result of the properness of the log-loss (Section 4.1.3). Specifically, the entropy is (strictly) concave because the underlying scoring rule is (strictly) proper.
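A small numerical illustration of strict concavity (a Python sketch with two hypothetical categorical distributions \(P_0\) and \(P_1\) and mixture weight \(\lambda = 0.4\)):

```python
import numpy as np

def entropy_nats(p):
    """Entropy -sum p log p in nats, using the convention 0 * log 0 = 0."""
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p0 = np.array([0.9, 0.05, 0.05])   # hypothetical categorical distribution P_0
p1 = np.array([0.2, 0.3, 0.5])     # hypothetical categorical distribution P_1
lam = 0.4

p_mix = (1 - lam) * p0 + lam * p1  # mixture P_lambda

# entropy of the mixture exceeds the mixture of the entropies
print(entropy_nats(p_mix) > (1 - lam) * entropy_nats(p0) + lam * entropy_nats(p1))  # True
```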

Information entropy for discrete distributions

The information entropy \(H(P)\) of a discrete probability distribution \(P\) with pmf \(p(x)\) with \(x \in \Omega\) is the Shannon entropy: \[ H(P) = - \sum_{x} p(x) \log p(x) \]

By construction, since all \(p(x) \in [0,1]\) and hence \(-\log p(x) \geq 0\), the Shannon entropy is non-negative.

Example 3.3 Entropy of the categorical distribution:

For \(P=\operatorname{Cat}(\boldsymbol p)\) with class probabilities \(\boldsymbol p=(p_1, \ldots, p_K)^T\) the information entropy of \(P\) is \[ H(P) = - \sum_{k=1}^{K } p_k \log p_k \] The maximum entropy (\(\log K\)) is achieved for the discrete uniform distribution (Example 3.4) and the minimum entropy (\(0\)) for a concentrated categorical distribution (Example 3.5). Hence for a categorical distribution \(P\) with \(K\) categories we have \[ 0 \leq H(P) \leq \log K \]
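A short sketch (assuming scipy is available) that evaluates the categorical entropy for the two extreme cases, plus a hypothetical intermediate distribution, and confirms the bounds \(0 \leq H(P) \leq \log K\):

```python
import numpy as np
from scipy.stats import entropy   # Shannon entropy, in nats by default

K = 4
p_uniform = np.full(K, 1 / K)                     # maximum entropy (Example 3.4)
p_concentrated = np.array([1.0, 0.0, 0.0, 0.0])   # minimum entropy (Example 3.5)
p_other = np.array([0.4, 0.3, 0.2, 0.1])          # hypothetical intermediate case

for p in (p_uniform, p_concentrated, p_other):
    H = entropy(p)
    print(H, 0 <= H <= np.log(K))   # log(4) approx. 1.386, 0.0, approx. 1.280; all True
```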

Example 3.4 Entropy of the discrete uniform distribution:

The discrete uniform distribution is \(U_K=\operatorname{Cat}(\boldsymbol p=\mathbf 1_K/K)\) so that \(p_1=p_2= \ldots = p_K = \frac{1}{K}\). Then \[ H(U_K) = - \sum_{k=1}^{K} \frac{1}{K} \log\left(\frac{1}{K}\right) = \log K \]

Note that \(\log K\) is the largest value the information entropy can assume with \(K\) classes (cf. Example 6.3) and indicates maximum spread of probability mass.

Example 3.5 Entropy of a categorical distribution with concentrated probability mass:

Let \(p_1=1\) and \(p_2=p_3=\ldots=p_K=0\). Using \(0\times\log 0=0\) we obtain for the information entropy of \(P=\operatorname{Cat}(\boldsymbol p)\) \[ H(P) = -\left( 1 \times \log 1 + 0 \times \log 0 + \dots \right) = 0 \]

Note that zero is the smallest value that the entropy can assume for a categorical distribution and that it corresponds to maximum concentration of probability mass.

Example 3.6 Entropy of the geometric distribution:

With \(P = \operatorname{Geom}(\theta)\) and \(\theta \in [0,1]\), support \(x \in \{1, 2, \ldots \}\) and pmf \(p(x|\theta) = \theta (1-\theta)^{x-1}\), with \(\operatorname{E}(x)= 1/\theta\) and \(\operatorname{Var}(x) = (1-\theta)/(\theta^2)\), the entropy is \[ \begin{split} H(P) &= - \operatorname{E}\left( \log \theta + (x-1) \log(1-\theta) \right)\\ &= -\log \theta - \left(\frac{1}{\theta}-1\right)\log(1-\theta)\\ &= -\frac{\theta \log \theta + (1-\theta) \log(1-\theta) }{\theta} \end{split} \] Using the identity \(0\times\log 0=0\) we see that the entropy of the geometric distribution for \(\theta = 1\) equals 0, i.e. it achieves the minimum possible information entropy for a discrete distribution. Conversely, as \(\theta \rightarrow 0\) it diverges to infinity.
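The closed form can be verified numerically (a sketch with a hypothetical \(\theta = 0.3\), truncating the infinite sum over the support):

```python
import numpy as np

theta = 0.3                                   # hypothetical success probability

# closed form: H = -(theta log theta + (1 - theta) log(1 - theta)) / theta
H_closed = -(theta * np.log(theta) + (1 - theta) * np.log(1 - theta)) / theta

# direct evaluation of -sum_x p(x) log p(x) over a truncated support
x = np.arange(1, 1000)
p = theta * (1 - theta) ** (x - 1)
H_direct = -np.sum(p * np.log(p))

print(np.isclose(H_closed, H_direct))         # True (truncation error is negligible)
```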

Information entropy for continuous distributions

The entropy \(H(P)\) of a continuous distribution \(P\) with pdf \(p(x)\) is the differential entropy: \[ H(P) = - \int_x p(x) \log p(x) \, dx \]

Differential entropy can be negative because the logarithm is taken of a density value, and unlike probabilities, densities are not constrained to be less than or equal to one.

Under a change of variables, such as from \(x\) to \(y\), the shape of the pdf typically changes, so that differential entropy changes as well and is not invariant: \(H(P_y) \neq H(P_x)\).

Example 3.7 Entropy of the uniform distribution:

Consider the uniform distribution \(P_x = \operatorname{Unif}(0, a)\) with \(a>0\), support \(x \in [0, a]\) and density \(p(x) = 1/a\). The corresponding entropy is \[ \begin{split} H( P_x ) &= - \int_0^a \frac{1}{a} \, \log\left(\frac{1}{a}\right) \, dx \\ &= \frac{1}{a} \, \log a \,\int_0^a dx \\ &= \log a \,. \end{split} \] The range of the entropy is unbounded for a positive \(a\), and for \(0 < a < 1\) the entropy assumes a negative value.

Example 3.8 Differential entropy under a change of variables:

Starting with the uniform distribution \(P_x = \operatorname{Unif}(0, a)\) from Example 3.7 the variable \(x\) is changed to \(y = x^2\) yielding the distribution \(P_y\) with support from \(0\) to \(a^2\) and density \(p(y) = 1/\left(2 a \sqrt{y}\right)\).

The corresponding entropy is \[ \begin{split} H( P_y ) &= \int_0^{a^2} 1/\left(2 a \sqrt{y}\right) \, \log \left(2 a \sqrt{y}\right) dy \\ &= \left[ \sqrt{y}/a \, \left(\log \left( 2 a \sqrt{y} \right)-1\right) \right]_{y=0}^{y=a^2} \\ &= \log \left(2 a^2\right) -1 \,. \end{split} \] The range of entropy is unbounded, and is negative for \(0 < a < \sqrt{e/2}\approx 1.1658\). As expected \(H( P_y ) \neq H( P_x )\) as differential entropy is not invariant under a change of variables.
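Both closed forms can be checked by numerical integration (a sketch assuming scipy is available, with a hypothetical value \(a = 2\)):

```python
import numpy as np
from scipy.integrate import quad

a = 2.0                                        # hypothetical upper limit of Unif(0, a)

# H(P_x) for p(x) = 1/a on (0, a); closed form: log(a)
H_x, _ = quad(lambda x: -(1 / a) * np.log(1 / a), 0, a)

# H(P_y) for y = x^2 with p(y) = 1/(2 a sqrt(y)) on (0, a^2); closed form: log(2 a^2) - 1
def p_y(y):
    return 1 / (2 * a * np.sqrt(y))

H_y, _ = quad(lambda y: -p_y(y) * np.log(p_y(y)), 0, a ** 2)

print(np.isclose(H_x, np.log(a)), np.isclose(H_y, np.log(2 * a ** 2) - 1))  # True True
```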

Example 3.9 Entropy of the normal distribution:

The log-density of the univariate normal distribution \(P = N(\mu, \sigma^2)\) is \(\log p(x |\mu, \sigma^2) = -\frac{1}{2} \left( \log(2\pi\sigma^2) + \frac{(x-\mu)^2}{\sigma^2} \right)\) with \(\sigma^2 > 0\). Using \(\operatorname{E}((x-\mu)^2) = \sigma^2\) the corresponding entropy is \[ \begin{split} H(P) & = -\operatorname{E}\left( \log p(x |\mu, \sigma^2) \right)\\ & = \frac{1}{2} \left( \log(2 \pi \sigma^2)+1\right) \,. \\ \end{split} \] Importantly, the entropy of a normal distribution depends only on its variance \(\sigma^2\), not on its mean \(\mu\). This is intuitively clear as the variance controls the concentration of probability mass. A large variance means that the probability mass is more spread out and thus less concentrated around the mean.

Since the variance \(\sigma^2\) ranges from 0 to infinity, the entropy of the normal distribution is unbounded, ranging from negative to positive infinity, and takes the value zero at \(\sigma^2 = 1/(2 \pi e) \approx 0.0585\).
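Since the entropy is the expected log-loss, it can also be approximated by Monte Carlo (a sketch with hypothetical parameters, assuming numpy and scipy are available):

```python
import numpy as np
from scipy.stats import norm

mu, sigma2 = 1.0, 2.0                          # hypothetical parameters
rng = np.random.default_rng(0)

# Monte Carlo estimate of H(P) = E_P(-log p(x)) from samples drawn from P
x = rng.normal(mu, np.sqrt(sigma2), size=100_000)
H_mc = -np.mean(norm.logpdf(x, loc=mu, scale=np.sqrt(sigma2)))

H_closed = 0.5 * (np.log(2 * np.pi * sigma2) + 1)   # closed-form entropy

print(H_mc, H_closed)                          # agree to about two decimal places
```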

Example 3.10 \(\color{Red} \blacktriangleright\) Entropy of an exponential family:

Assume \(P(\boldsymbol \eta)\) is an exponential family with canonical parameters \(\boldsymbol \eta\), canonical statistics \(\boldsymbol t(x)\) and log-partition function \(a(\boldsymbol \eta)\) with log-pdmf
\(\log p(x|\boldsymbol \eta) = \langle \boldsymbol \eta, \boldsymbol t(x)\rangle + \log h(x) - a(\boldsymbol \eta)\).

Then with \(\operatorname{E}_{ P(\boldsymbol \eta)}(\boldsymbol t(x))=\boldsymbol \mu_{\boldsymbol t}(\boldsymbol \eta)\) the entropy is \[ \begin{split} H(P(\boldsymbol \eta) ) &= -\operatorname{E}_{P(\boldsymbol \eta)}( \log p(x|\boldsymbol \eta) ) \\ &= a(\boldsymbol \eta) -\langle \boldsymbol \eta, \boldsymbol \mu_{\boldsymbol t}(\boldsymbol \eta)\rangle - \operatorname{E}_{P(\boldsymbol \eta)}( \log h(x) ) \end{split} \]
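For instance, the Bernoulli distribution with success probability \(p\) is an exponential family with (scalar) canonical parameter \(\eta = \log\frac{p}{1-p}\), canonical statistic \(t(x) = x\), \(h(x) = 1\) (so that \(\operatorname{E}(\log h(x)) = 0\)), log-partition function \(a(\eta) = \log(1+e^{\eta}) = -\log(1-p)\) and mean \(\mu_t(\eta) = p\). The general formula then recovers the familiar discrete entropy: \[ \begin{split} H(P(\eta)) &= a(\eta) - \eta \, \mu_t(\eta) \\ &= -\log(1-p) - p \log\left(\frac{p}{1-p}\right) \\ &= -p \log p - (1-p)\log(1-p) \,. \end{split} \]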

Interpretation of entropy

Information entropy \(H(P)\) can be interpreted as the expected cost or expected code length when the data are generated according to model \(P\) (“sender”, “encoder”) and the same model \(P\) is used to describe the data (“receiver”, “decoder”). As will be discussed in Chapter 4 in more detail, using different models for “sending” and “receiving” always increases the cost, hence the entropy is a lower bound on that cost.

Furthermore, information entropy \(H(P)\) quantifies the dispersal of probability mass within the distribution \(P\). When probability mass is concentrated, entropy is low (Example 3.5). Conversely, when the probability mass is more broadly distributed, the entropy is high (Example 3.4).

The notion of entropy as measure of spread is also supported in continuous settings, e.g. the entropy of the normal distribution (Example 3.9) is a monotonically increasing function of the variance.

A distribution \(P\) with low entropy and concentrated probability mass is highly informative. Conversely, a distribution with high entropy and dispersed probability mass is uninformative. This is the basis of the principle of maximum entropy (Chapter 6).

3.3 \(\color{Red} \blacktriangleright\) Entropy in physics

Entropy as macroscopic phenomenon

The concept of entropy was introduced in the mid-19th century within thermodynamics, a phenomenological theory that quantifies heat, work, energy, and the laws governing their transformations. Entropy determines how much of the internal energy is available to do work. Low entropy implies energy is concentrated and available for work, and high entropy means energy is spread out and not available for work. Systems at equilibrium are characterised by maximum entropy (subject to constraints).

In statistical mechanics, entropy and the laws of thermodynamics are explained as emergent, probabilistic consequences of the behaviour of many-particle systems. A microstate provides a complete specification of the current state of the system, which evolves by transitioning from one such state to another.

In the Gibbs approach, hypothetical ensembles of systems are considered, represented by a probability distribution over all microstates. If that distribution is concentrated on relatively few microstates the entropy is low. Conversely, if it is spread over many microstates the entropy is large. Once in equilibrium, the distribution and its entropy are stationary. Its information entropy corresponds to the conventional thermodynamic entropy.

In the Boltzmann approach, entropy arises from coarse graining and the introduction of macrostates (e.g., particle counts or energy levels). The entropy of a particular system (not of an ensemble!) is computed from the size of its current macrostate as determined by the number of underlying microstates compatible with it. This gives rise to both the standard formula for information entropy discussed in this chapter (see Example 3.11) and a further generalisation introduced in Section 6.1 (see Example 6.2).

Equilibrium corresponds to the typical macrostate with overwhelmingly many compatible microstates (maximum entropy) and is the state in which the system spends most of its time. Entropy fluctuates as the system evolves and continues to do so even in equilibrium. This explains macroscopic irreversibility as a statistical phenomenon.

Both the Gibbs and Boltzmann approaches support the interpretation of entropy as a measure of dispersal and use the same functional form of entropy.

A further common physical interpretation of entropy is as a measure of disorder, with low entropy corresponding to “order” and high entropy to “disorder”. However, this interpretation is now discouraged because those everyday terms are ambiguous. Many apparently ordered systems can have large entropy, while some disordered-looking systems can have low entropy.

Finally, in physics the concavity of entropy is a crucial property as it underpins the second law of thermodynamics.

\(\color{Red} \blacktriangleright\) Combinatorial derivation

Information entropy can also be obtained by combinatorial considerations. This was the route taken by Boltzmann in 1877 in the early days of statistical mechanics.

Example 3.11 \(\color{Red} \blacktriangleright\) Combinatorial derivation of information entropy:

Let \(D=\{n_1, \ldots, n_K\}\) be counts for \(K\) classes with \(n = \sum_{k=1}^K n_k\) and \(\hat{Q} = \operatorname{Cat}(\hat{\boldsymbol q})\) be the corresponding empirical categorical distribution with class frequencies \(\hat{q}_k = n_k/n\).

The entropy of \(\hat{Q}\) is \(H(\hat{Q}) = -\sum_{k=1}^K \hat{q}_k \log \hat{q}_k\) (Example 3.3).

The macrostate corresponds to the counts \(D\) or equivalently the distribution \(\hat{Q}\). Each particular allocation of the \(n\) elements to the \(K\) groups constitutes a microstate.

The number of allocations of \(n\) distinct items to \(K\) groups compatible with the given counts \(D\) is given by the multinomial coefficient \[ \begin{split} W_K &= \binom{n}{n_1, \ldots, n_K} \\ &= \frac {n!}{n_1! \, n_2! \, \ldots \, n_K! } \end{split} \] This represents the multiplicity of the macrostate.

Using the de Moivre-Stirling formula we can, for large \(n\), approximate the factorial according to \[ \log n! \approx n \log n -n \] and get \[ \begin{split} \log W_K &= \log n! - \sum_{k=1}^K \log n_k!\\ & \approx n \log n -n - \sum_{k=1}^K (n_k \log n_k -n_k) \\ & = \sum_{k=1}^K n_k \log n - \sum_{k=1}^K n_k \log n_k\\ & = - n \sum_{k=1}^K \frac{n_k}{n} \log\left( \frac{n_k}{n} \right)\\ & = -n \sum_{k=1}^K \hat{q}_k \log \hat{q}_k \\ & = n H(\hat{Q}) \end{split} \]

Therefore, information entropy and the multinomial coefficient are directly related \[ H(\hat{Q}) \approx \frac{1}{n} \log W_K \] with equality for \(n \rightarrow \infty\).
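This asymptotic relation is easy to verify numerically (a sketch with hypothetical counts for \(K = 3\) classes, using the log-gamma function to evaluate the log-factorials):

```python
import numpy as np
from math import lgamma

def log_multinomial(counts):
    """log of the multinomial coefficient n! / (n_1! ... n_K!) via log-gamma."""
    n = sum(counts)
    return lgamma(n + 1) - sum(lgamma(n_k + 1) for n_k in counts)

counts = [500, 300, 200]                      # hypothetical counts, n = 1000
n = sum(counts)
q_hat = np.array(counts) / n                  # empirical class frequencies

H_hat = -np.sum(q_hat * np.log(q_hat))        # entropy of the empirical distribution
H_approx = log_multinomial(counts) / n        # (1/n) log W_K

print(H_hat, H_approx)                        # close, and converging as n grows
```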

The number of allocations \(W_K\) is far greater when items are uniformly “spread out” over the \(K\) categories than when they are “concentrated”. Hence, the combinatorial perspective supports the interpretation of entropy as a measure of dispersal.

See also Example 6.2 for a related derivation of information entropy with regard to a prior measure (Section 6.1).

3.4 Further reading

For a justification of the logarithmic loss function see, e.g., Good (1952) and Bernardo (1979).

A brief overview of proper scoring rules is found in the supplementary Probability and Distribution Refresher notes.

Akaike (1985) provides an historical account from a statistical perspective of how Boltzmann arrived in 1877 at the formula for information entropy.

Lemons (2013) offers a friendly introduction to entropy in physics and in information theory. This book also discusses the combinatorial derivation of entropy (Example 3.11) as well as various interpretations and applications of entropy.

Tip: A bit of history

Entropy as a thermodynamic quantity was introduced in 1865 by Rudolf Clausius (1822–1888), along with the first and second laws of thermodynamics.

The probabilistic and combinatorial underpinnings of entropy were developed in the period from 1870 to 1900 by Josiah W. Gibbs (1839–1903) and Ludwig Boltzmann (1844–1906). The specific formula for information entropy was found by Boltzmann in 1877.

By the mid-20th century the notion of information entropy was established in statistics. Ralph V. L. Hartley (1888–1970) and Irving J. Good (1916–2009) argued for logarithmic measures of information and Claude Shannon (1916–2001) introduced entropy in information theory (Shannon 1948).


  1. The Boltzmann constant \(k_B\) is one of seven defining constants in the International System of Units.↩︎