7 Conditioning and Bayes' rule

In this chapter we review conditional probability, which is essential for Bayesian statistical modelling.

7.1 Conditional probability

Assume we have two random variables \(x\) and \(y\) with a joint density (or joint PMF) \(p(x,y)\). By definition \(\int_x \int_y p(x,y)\, dx\, dy = 1\).

The marginal densities of \(x\) and \(y\) are given by \(p(x) = \int_y p(x,y) dy\) and \(p(y) = \int_x p(x,y) dx\). Thus, when computing a marginal density a variable is removed from the joint density by integrating over all possible states of that variable. It follows also that \(\int_x p(x) dx = 1\) and \(\int_y p(y) dy = 1\), i.e. the marginal densities also integrate to 1.

As an alternative to integrating out a random variable in the joint density \(p(x,y)\) we may wish to keep it fixed at some value, say keep \(y\) fixed at \(y_0\). In this case \(p(x, y=y_0)\) is proportional to the conditional density (or PMF) given by the ratio \[ p(x | y=y_0) = \frac{p(x, y=y_0)}{p(y=y_0)} \] The denominator \(p(y=y_0) = \int_x p(x, y=y_0) dx\) is needed to ensure that \(\int_x p(x | y=y_0) dx = 1\); it thus renormalises \(p(x, y=y_0)\) so that the conditional density is a proper density.

To simplify notation, the specific value on which a variable is conditioned is often left out so we just write \(p(x | y)\).
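
The following sketch (assuming NumPy and a made-up joint PMF) illustrates marginalisation and conditioning for a discrete joint distribution: the conditional PMF is obtained by taking a slice of the joint PMF and renormalising it.

```python
# A minimal sketch of marginalisation and conditioning for a discrete
# joint PMF p(x, y), stored as a 2D array with rows indexing x and
# columns indexing y.  All numbers are hypothetical.
import numpy as np

p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.05, 0.25, 0.30]])   # hypothetical joint PMF, sums to 1
assert np.isclose(p_xy.sum(), 1.0)

# marginals: sum out the other variable
p_x = p_xy.sum(axis=1)                  # p(x) = sum_y p(x, y)
p_y = p_xy.sum(axis=0)                  # p(y) = sum_x p(x, y)

# conditioning: fix y = y0 and renormalise the corresponding slice
y0 = 1
p_x_given_y0 = p_xy[:, y0] / p_y[y0]    # p(x | y = y0)
print(p_x_given_y0, p_x_given_y0.sum()) # renormalised slice sums to 1
```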

7.2 Bayes’ theorem

Thomas Bayes (1701-1761) was the first to state Bayes’ theorem on conditional probabilities.

Using the definition of conditional probabilities we see that the joint density can be written as the product of marginal and conditional density in two different ways: \[ p(x,y) = p(x| y) p(y) = p(y | x) p(x) \]

This directly leads to Bayes’ theorem: \[ p(x | y) = p(y | x) \frac{ p(x) }{ p(y)} \] This rule relates the two possible conditional densities (or conditional probability mass functions) for two random variables \(x\) and \(y\). It thus allows us to reverse the ordering of conditioning.
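
As a brief illustration, the sketch below (NumPy assumed, all probabilities hypothetical) applies Bayes’ theorem to recover \(p(x|y)\) from \(p(y|x)\) and \(p(x)\), computing \(p(y)\) by marginalisation.

```python
# Reversing the conditioning with Bayes' theorem,
# p(x | y) = p(y | x) p(x) / p(y), for two discrete random variables.
import numpy as np

p_x = np.array([0.3, 0.7])                   # marginal p(x), hypothetical
p_y_given_x = np.array([[0.9, 0.1],          # p(y | x): rows index x,
                        [0.2, 0.8]])         # columns index y, rows sum to 1

# marginal p(y) = sum_x p(y | x) p(x)
p_y = p_x @ p_y_given_x

# Bayes' theorem: p(x | y) = p(y | x) p(x) / p(y)
p_x_given_y = (p_y_given_x * p_x[:, None]) / p_y[None, :]
print(p_x_given_y)
print(p_x_given_y.sum(axis=0))   # each column (one per value of y) sums to 1
```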

Bayes’ theorem was published only in 1763, after his death, by Richard Price (1723-1791).

Pierre-Simon Laplace independently published Bayes’ theorem in 1774 and was in fact the first to apply it routinely to statistical calculations.

7.3 Conditional mean and variance

The mean \(\text{E}(x| y)\) and variance \(\text{Var}(x|y)\) of the conditional distribution with density \(p(x|y)\) are called conditional mean and conditional variance.

The law of total expectation states that \[ \text{E}(x) = \text{E}( \text{E}(x| y) ) \] where the outer expectation is taken over \(y\) and the inner conditional expectation over \(x\) given \(y\).

The law of total variance states that \[ \text{Var}(x) = \text{Var}(\text{E}(x| y)) + \text{E}(\text{Var}(x|y)) \] The first term is the “explained” or “between-group” variance, and the second the “unexplained” or “mean within group” variance.

Example 7.1 Mean and variance of a mixture model:

Assume \(K\) groups indicated by a discrete variable \(y = 1, 2, \ldots, K\) with probability \(p(y) = \pi_y\). In each group the observations \(x\) follow a density \(p(x|y)\) with conditional mean \(\text{E}(x|y) = \mu_y\) and conditional variance \(\text{Var}(x| y)= \sigma^2_y\). The joint density for \(x\) and \(y\) is \(p(x, y) = \pi_y p(x|y)\). The marginal density for \(x\) is \(p(x) = \sum_{y=1}^K \pi_y p(x|y)\). This is called a mixture model.

By the law of total expectation the total mean \(\text{E}(x) = \mu_0\) is equal to \(\sum_{y=1}^K \pi_y \mu_y\).

By the law of total variance the total variance \(\text{Var}(x) = \sigma^2_0\) is equal to \[ \sum_{y=1}^K \pi_y (\mu_y - \mu_0)^2 + \sum_{y=1}^K \pi_y \sigma^2_y \] where the first sum is \(\text{Var}(\text{E}(x|y))\) and the second is \(\text{E}(\text{Var}(x|y))\).
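
The sketch below (NumPy assumed, parameter values hypothetical) checks the two mixture formulas against a Monte Carlo simulation of a normal mixture.

```python
# Verify the mixture formulas for the total mean and total variance
# by simulating a mixture of K = 3 normal components.
import numpy as np

pi    = np.array([0.2, 0.5, 0.3])      # mixture weights pi_y (hypothetical)
mu    = np.array([-1.0, 0.0, 2.0])     # conditional means mu_y
sigma = np.array([0.5, 1.0, 1.5])      # conditional standard deviations

mu0     = np.sum(pi * mu)                                      # law of total expectation
sigma20 = np.sum(pi * (mu - mu0)**2) + np.sum(pi * sigma**2)   # law of total variance

# simulate the mixture: first draw the group y, then draw x | y
rng = np.random.default_rng(0)
y = rng.choice(len(pi), size=200_000, p=pi)
x = rng.normal(mu[y], sigma[y])

print(mu0, x.mean())        # analytical vs simulated mean
print(sigma20, x.var())     # analytical vs simulated variance
```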

7.4 Conditional entropy and entropy chain rules

For the entropy of the joint distribution we find that \[ \begin{split} H( P_{x,y}) &= -\text{E}_{P_{x,y}} \log p(x, y) \\ &= -\text{E}_{P_x} \text{E}_{P_{y| x}} \left(\log p(x) + \log p(y| x)\right)\\ &= -\text{E}_{P_x} \log p(x) - \text{E}_{P_x} \text{E}_{P_{y| x}} \log p(y| x)\\ &= H(P_{x}) + H(P_{y| x} ) \\ \end{split} \] thus it decomposes into the entropy of the marginal distribution and the conditional entropy defined as \[ H(P_{y| x} ) = - \text{E}_{P_x} \text{E}_{P_{y| x}} \log p(y| x) \] Note that, to simplify notation, the expectation \(\text{E}_{P_{x}}\) over the conditioning variable \(x\) is by convention implicitly assumed when writing \(H(P_{y|x})\).
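
A minimal numerical check of the entropy chain rule \(H(P_{x,y}) = H(P_x) + H(P_{y|x})\) for a discrete joint PMF (NumPy assumed, probabilities hypothetical):

```python
# Entropy chain rule for a discrete joint distribution, natural logarithms.
import numpy as np

p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.05, 0.25, 0.30]])              # rows: x, columns: y
p_x = p_xy.sum(axis=1)
p_y_given_x = p_xy / p_x[:, None]

H_joint     = -np.sum(p_xy * np.log(p_xy))         # H(P_{x,y})
H_x         = -np.sum(p_x * np.log(p_x))           # H(P_x)
H_y_given_x = -np.sum(p_xy * np.log(p_y_given_x))  # H(P_{y|x}), E_{P_x} implicit
print(H_joint, H_x + H_y_given_x)                  # the two values agree
```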

Similarly, for the cross-entropy we get \[ \begin{split} H(Q_{x,y} , P_{x, y}) &= -\text{E}_{Q_{x,y}} \log p(x, y) \\ &= -\text{E}_{Q_x} \text{E}_{Q_{y| x}} \log \left(\, p(x)\, p(y| x)\, \right)\\ &= -\text{E}_{Q_x} \log p(x) -\text{E}_{Q_x} \text{E}_{Q_{y| x}} \log p(y| x) \\ &= H(Q_x, P_x) + H(Q_{y|x}, P_{y|x}) \end{split} \] where the conditional cross-entropy is defined as \[ H(Q_{y|x}, P_{y|x})= -\text{E}_{Q_x} \text{E}_{Q_{y| x}} \log p(y| x) \] Note again that the expectation \(\text{E}_{Q_x}\) over \(x\) is implicit in this notation.

The KL divergence between the joint distributions can be decomposed as follows: \[ \begin{split} D_{\text{KL}}(Q_{x,y} , P_{x, y}) &= \text{E}_{Q_{x,y}} \log \left(\frac{ q(x, y) }{ p(x, y) }\right)\\ &= \text{E}_{Q_x} \text{E}_{Q_{y| x}} \log \left(\frac{ q(x) q(y| x) }{ p(x) p(y| x) }\right)\\ &= \text{E}_{Q_x} \log \left(\frac{ q(x) }{ p(x) }\right) + \text{E}_{Q_x} \text{E}_{Q_{y| x}} \log \left(\frac{ q(y| x) }{ p(y| x) }\right) \\ &= D_{\text{KL}}(Q_{x} , P_{x}) + D_{\text{KL}}(Q_{y| x} , P_{y|x}) \\ \end{split} \] with the conditional KL divergence or conditional relative entropy defined as \[ D_{\text{KL}}(Q_{y| x} , P_{y|x}) = \text{E}_{Q_x} \text{E}_{Q_{y| x}} \log \left(\frac{ q(y| x) }{ p(y| x) }\right) \] (again the expectation \(\text{E}_{Q_{x}}\) is left implicit for convenience). The conditional relative entropy can also be computed from the conditional (cross-)entropies by \[ D_{\text{KL}}(Q_{y| x} , P_{y|x}) = H(Q_{y|x}, P_{y|x}) - H(Q_{y| x}) \]
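
The following sketch (NumPy assumed, both joint PMFs hypothetical) verifies the chain rules for the cross-entropy and the KL divergence on a pair of discrete joint distributions:

```python
# Chain rules for cross-entropy and KL divergence between two discrete
# joint distributions Q_{x,y} and P_{x,y}.
import numpy as np

q_xy = np.array([[0.15, 0.15, 0.10],
                 [0.10, 0.20, 0.30]])
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.05, 0.25, 0.30]])

q_x, p_x = q_xy.sum(axis=1), p_xy.sum(axis=1)
q_y_given_x, p_y_given_x = q_xy / q_x[:, None], p_xy / p_x[:, None]

# cross-entropy chain rule: H(Q_{x,y}, P_{x,y}) = H(Q_x, P_x) + H(Q_{y|x}, P_{y|x})
ce_joint = -np.sum(q_xy * np.log(p_xy))
ce_marg  = -np.sum(q_x * np.log(p_x))
ce_cond  = -np.sum(q_xy * np.log(p_y_given_x))              # E_{Q_x} implicit
print(ce_joint, ce_marg + ce_cond)

# KL chain rule: D_KL(Q_{x,y}, P_{x,y}) = D_KL(Q_x, P_x) + D_KL(Q_{y|x}, P_{y|x})
kl_joint    = np.sum(q_xy * np.log(q_xy / p_xy))
kl_marginal = np.sum(q_x * np.log(q_x / p_x))
kl_cond     = np.sum(q_xy * np.log(q_y_given_x / p_y_given_x))
print(kl_joint, kl_marginal + kl_cond)
```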

The above decompositions for the entropy, the cross-entropy and relative entropy are known as entropy chain rules.

7.5 Entropy bounds for the marginal variables

The chain rule for KL divergence directly shows that \[ \begin{split} \underbrace{D_{\text{KL}}(Q_{x,y} , P_{x, y})}_{\text{upper bound}} &= D_{\text{KL}}(Q_{x} , P_{x}) + \underbrace{ D_{\text{KL}}(Q_{y| x} , P_{y|x}) }_{\geq 0}\\ &\geq D_{\text{KL}}(Q_{x} , P_{x}) \end{split} \] This means that the KL divergence between the joint distributions forms an upper bound for the KL divergence between the marginal distributions, with the difference given by the conditional KL divergence \(D_{\text{KL}}(Q_{y| x} , P_{y|x})\).

Equivalently, we can state an upper bound for the marginal cross-entropy: \[ \begin{split} \underbrace{H(Q_{x,y} , P_{x, y}) - H(Q_{y| x} )}_{\text{upper bound}} &= H(Q_{x}, P_{x}) + \underbrace{ D_{\text{KL}}(Q_{y| x} , P_{y|x}) }_{\geq 0}\\ & \geq H(Q_{x}, P_{x}) \\ \end{split} \] Instead of an upper bound we may as well express this as a lower bound for the negative marginal cross-entropy \[ \begin{split} - H(Q_{x}, P_{x}) &= \underbrace{ - H(Q_{x} Q_{y| x} , P_{x, y}) + H(Q_{y| x} )}_{\text{lower bound}} + \underbrace{ D_{\text{KL}}(Q_{y| x} , P_{y|x})}_{\geq 0}\\ & \geq F\left( Q_{x}, Q_{y| x}, P_{x, y}\right)\\ \end{split} \] where \(F\left( Q_{x}, Q_{y| x}, P_{x, y}\right) = - H(Q_{x} Q_{y| x} , P_{x, y}) + H(Q_{y| x} )\) denotes the lower bound given by the underbraced term.
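
The sketch below (NumPy assumed, all distributions hypothetical) illustrates the lower bound for a discrete model \(P_{x,y}\): for an arbitrary \(Q_{y|x}\) the bound holds, and it becomes tight when \(Q_{y|x} = P_{y|x}\).

```python
# Lower bound on the negative marginal cross-entropy: for any Q_{y|x},
# -H(Q_x, P_x) >= -H(Q_x Q_{y|x}, P_{x,y}) + H(Q_{y|x}),
# with equality when Q_{y|x} = P_{y|x}.
import numpy as np

p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.05, 0.25, 0.30]])          # model P_{x,y} (hypothetical)
p_x  = p_xy.sum(axis=1)
p_y_given_x = p_xy / p_x[:, None]

q_x = np.array([0.4, 0.6])                     # Q_x (hypothetical)

def lower_bound(q_y_given_x):
    # F(Q_x, Q_{y|x}, P_{x,y}) = E_{Q_x} E_{Q_{y|x}} [log p(x,y) - log q(y|x)]
    q_xy = q_x[:, None] * q_y_given_x
    return np.sum(q_xy * (np.log(p_xy) - np.log(q_y_given_x)))

neg_cross_entropy = np.sum(q_x * np.log(p_x))  # -H(Q_x, P_x)

q1 = np.array([[0.3, 0.3, 0.4],                # some arbitrary Q_{y|x}
               [0.2, 0.5, 0.3]])
print(neg_cross_entropy, lower_bound(q1))           # bound holds
print(neg_cross_entropy, lower_bound(p_y_given_x))  # bound is tight
```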

Since entropy and KL divergence are closely linked with maximum likelihood, the above bounds play a major role in statistical learning of models with unobserved latent variables (here \(y\)). They form the basis of important methods such as the EM algorithm and variational Bayes.