13 Conditioning and Bayes' rule
In this chapter we review conditional probabilities. Conditional probability is essential for Bayesian statistical modelling.
13.1 Conditional probability
We consider two random variables \(x\) and \(y\) and assume a joint density (or joint PMF) \(p(x,y)\). By definition \(\int_{x,y} p(x,y) dx dy = 1\).
The marginal densities for the individual random variables \(x\) and \(y\) are given by \(p(x) = \int_y p(x,y) dy\) and \(p(y) = \int_x p(x,y) dx\), respectively. Thus, the marginal densities are obtained from the joint density by integrating over all possible states of the variable that is being excluded. As required for any density, the marginal densities also integrate to one, i.e. \(\int_x p(x) dx = 1\) and \(\int_y p(y) dy = 1\).
As an alternative to integrating out a random variable in the joint density \(p(x,y)\) we may wish to keep it fixed at some value. For instance, we may want to keep \(y\) fixed at \(y_0\). In this case \(p(x, y=y_0)\) is proportional to the conditional density (or PMF) given by the ratio \[ p(x | y=y_0) = \frac{p(x, y=y_0)}{p(y=y_0)} \] In this formula the denominator \(p(y=y_0) = \int_x p(x, y=y_0) dx\) ensures that \(\int_x p(x | y=y_0) dx = 1\), i.e. it renormalises \(p(x, y=y_0)\) so that it becomes a density integrating to one. To simplify notation, the particular value on which a variable is conditioned is often left out, and we simply write \(p(x | y)\).
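As a concrete illustration, below is a minimal numerical sketch in Python of conditioning in the discrete case; the joint PMF values are made up for this example.

```python
import numpy as np

# hypothetical joint PMF p(x, y) for x in {0, 1, 2} and y in {0, 1};
# rows index x, columns index y, and all entries sum to one
p_xy = np.array([[0.10, 0.20],
                 [0.30, 0.15],
                 [0.05, 0.20]])
assert np.isclose(p_xy.sum(), 1.0)

# marginal p(y): sum over x (the discrete analogue of integrating x out)
p_y = p_xy.sum(axis=0)

# condition on y = y0: take the slice p(x, y=y0) and renormalise by p(y=y0)
y0 = 1
p_x_given_y0 = p_xy[:, y0] / p_y[y0]

print(p_x_given_y0)        # conditional PMF p(x | y=y0)
print(p_x_given_y0.sum())  # 1.0 -- the slice is renormalised to a proper PMF
```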
13.2 Bayes’ theorem
Thomas Bayes (1701-1761) was the first to state Bayes’ theorem on conditional probabilities.
Using the definition of conditional probabilities we see that the joint density can be written as the product of marginal and conditional density in two different ways: \[ p(x,y) = p(x| y) p(y) = p(y | x) p(x) \]
This directly leads to Bayes’ theorem: \[ p(x | y) = p(y | x) \frac{ p(x) }{ p(y)} \] This rule relates the two possible conditional densities (or conditional probability mass functions) for the random variables \(x\) and \(y\), enabling the change of the order of conditioning.
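As a small worked example, the following Python sketch applies Bayes' theorem to a diagnostic-test setting; the prevalence, sensitivity and specificity values are hypothetical.

```python
# x = disease status (1 = present), y = test result (1 = positive)
prior_x = 0.01   # p(x=1), assumed prevalence
sens = 0.95      # p(y=1 | x=1), assumed sensitivity
spec = 0.90      # p(y=0 | x=0), assumed specificity

# marginal p(y=1) obtained by summing the joint over x
p_y1 = sens * prior_x + (1 - spec) * (1 - prior_x)

# Bayes' theorem: p(x=1 | y=1) = p(y=1 | x=1) p(x=1) / p(y=1)
posterior_x = sens * prior_x / p_y1
print(posterior_x)   # approximately 0.088 -- the order of conditioning is reversed
```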
Bayes' theorem was published only in 1763, after his death, by Richard Price (1723-1791).
Pierre-Simon Laplace independently published Bayes’ theorem in 1774 and he was in fact the first to routinely apply it to statistical calculations.
13.3 Conditional mean and variance
The conditional distribution \(P_{x|y}\) with density \(p(x|y)\) has mean \(\text{E}(x| y)\) and variance \(\text{Var}(x|y)\). These are called the conditional mean and conditional variance, respectively.
The conditional mean \(\text{E}(x| y)\) is also denoted by \(\text{E}\left( P_{x|y}\right)\) and \(\text{E}_{P_{x|y}}(x)\). It is obtained by calculating \[ \text{E}(x|y) = \text{E}\left( P_{x|y}\right) = \text{E}_{P_{x|y}}(x) = \begin{cases} \sum_{x} p(x|y) \, x & \text{discrete case} \\ \int_{x} p(x| y) \, x \, dx & \text{continuous case} \\ \end{cases} \]
The conditional variance \(\text{Var}(x| y)\) is also denoted by \(\text{Var}\left( P_{x|y}\right)\) and \(\text{Var}_{P_{x|y}}(x)\). It is given by
\[ \text{Var}(x| y) = \text{Var}\left( P_{x|y}\right) = \text{Var}_{P_{x|y}}(x) = \text{E}_{P_{x|y}}\left( (x-\text{E}_{P_{x|y}}(x))^2 \right) = \text{E}\left( (x-\text{E}(x|y))^2 \,|\, y \right) \]
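In the discrete case the conditional mean and variance are plain weighted sums; a short Python sketch, reusing the made-up joint PMF from above:

```python
import numpy as np

# values taken by x (rows) and the hypothetical joint PMF p(x, y) from before
x_vals = np.array([0.0, 1.0, 2.0])
p_xy = np.array([[0.10, 0.20],
                 [0.30, 0.15],
                 [0.05, 0.20]])

# conditional PMF p(x | y=y0)
y0 = 1
p_x_given_y0 = p_xy[:, y0] / p_xy[:, y0].sum()

# conditional mean E(x | y=y0) and conditional variance Var(x | y=y0)
cond_mean = np.sum(p_x_given_y0 * x_vals)
cond_var = np.sum(p_x_given_y0 * (x_vals - cond_mean) ** 2)
print(cond_mean, cond_var)
```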
The law of total expectation links the means of the marginal distribution and of the conditional distributions, stating that \[ \begin{split} \text{E}(x) &= \text{E}\left( \text{E}(x| y) \right) \\ &= \text{E}_{P_y} \text{E}_{P_{x| y}}(x) \\ &= \text{E}_{P_{x,y}} (x) \\ \end{split} \] Hence, the total mean (left side) is the weighted average (outer expectation) of the various conditional means (inner expectation).
Similarly, the law of total variance states that \[ \begin{split} \text{Var}(x) &= \text{Var}(\text{E}(x| y)) + \text{E}(\text{Var}(x|y)) \\ &= \text{Var}_{P_y} \text{E}_{P_{x| y}}(x) + \text{E}_{P_y} \text{Var}_{P_{x|y}}(x) \end{split} \] The total variance (left side) decomposes into the “explained” or “between-group” variance (first term on the right side) and the “unexplained” or mean “within-group” variance (second term on the right side). Again, the outer expectations are with regard to \(P_y\) and the inner expectations with regard to \(P_{x|y}\).
Example 13.1 Mean and variance of a mixture model:
Assume \(K\) groups indicated by a discrete variable \(y \in\{ 1, 2, \ldots, K\}\) with probability \(p(y) = \pi_y\). In each group the observations \(x\) follow a density \(p(x|y)\) with conditional mean \(\text{E}(x|y) = \mu_y\) and conditional variance \(\text{Var}(x| y)= \sigma^2_y\). The joint density for \(x\) and \(y\) is \(p(x, y) = \pi_y p(x|y)\) and the marginal density for \(x\) is \(p(x) = \sum_{y=1}^K \pi_y p(x|y)\). This is called a mixture model.
The total mean \(\text{E}(x) = \mu_0\) is equal to \[ \mu_0= \sum_{y=1}^K \pi_y \mu_y \]
The total variance \(\text{Var}(x) = \sigma^2_0\) is equal to \[ \sigma^2_0 = \sum_{y=1}^K \pi_y (\mu_y - \mu_0)^2 + \sum_{y=1}^K \pi_y \sigma^2_y \]
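The following Python sketch evaluates these two formulas for a hypothetical Gaussian mixture (the weights, means and standard deviations are made up) and checks the result against a Monte Carlo sample:

```python
import numpy as np

# hypothetical mixture with K = 3 groups
pi = np.array([0.5, 0.3, 0.2])      # group probabilities pi_y
mu = np.array([-1.0, 0.0, 2.0])     # conditional means mu_y
sigma = np.array([0.5, 1.0, 1.5])   # conditional standard deviations sigma_y

# total mean and total variance from the formulas above
mu0 = np.sum(pi * mu)
var0 = np.sum(pi * (mu - mu0) ** 2) + np.sum(pi * sigma ** 2)

# Monte Carlo check: draw y, then x | y, and compare sample moments
rng = np.random.default_rng(0)
y = rng.choice(len(pi), size=100_000, p=pi)
x = rng.normal(mu[y], sigma[y])
print(mu0, x.mean())    # should agree closely
print(var0, x.var())    # should agree closely
```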
13.4 Conditional entropy and entropy chain rules
Similar to the mean and variance one can also define conditional versions of entropies. These lead to decompositions of the joint entropy that are known as entropy chain rules.
Conditional entropy
For the entropy of the joint distribution we find that \[ \begin{split} H( P_{x,y}) &= -\text{E}_{P_{x,y}} \log p(x, y) \\ &= -\text{E}_{P_x} \text{E}_{P_{y| x}} (\log p(x) + \log p(y| x))\\ &= -\text{E}_{P_x} \log p(x) - \text{E}_{P_x} \text{E}_{P_{y| x}} \log p(y| x)\\ &= H(P_{x}) + H(P_{y| x} ) \\ \end{split} \] thus it decomposes into the entropy of the marginal distribution and the conditional entropy defined as \[ H(P_{y| x} ) = - \text{E}_{P_x} \text{E}_{P_{y| x}} \log p(y| x) \] Note that, to simplify notation, the expectation \(\text{E}_{P_{x}}\) over the conditioning variable \(x\) is by convention implicitly assumed in this notation.
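A quick numerical check of this chain rule for a small made-up discrete joint PMF (Python sketch):

```python
import numpy as np

# hypothetical joint PMF p(x, y); rows index x, columns index y
p_xy = np.array([[0.10, 0.20],
                 [0.30, 0.15],
                 [0.05, 0.20]])

# joint entropy H(P_{x,y}) and marginal entropy H(P_x)
H_joint = -np.sum(p_xy * np.log(p_xy))
p_x = p_xy.sum(axis=1)
H_x = -np.sum(p_x * np.log(p_x))

# conditional entropy H(P_{y|x}); weighting by p(x) is the outer expectation E_{P_x}
p_y_given_x = p_xy / p_x[:, None]
H_y_given_x = -np.sum(p_xy * np.log(p_y_given_x))

print(np.isclose(H_joint, H_x + H_y_given_x))   # True: H(P_{x,y}) = H(P_x) + H(P_{y|x})
```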
Conditional cross-entropy
Similarly, for the cross-entropy we get \[ \begin{split} H(Q_{x,y} , P_{x, y}) &= -\text{E}_{Q_{x,y}} \log p(x, y) \\ &= -\text{E}_{Q_x} \text{E}_{Q_{y| x}} \log \left(\, p(x)\, p(y| x)\, \right)\\ &= -\text{E}_{Q_x} \log p(x) -\text{E}_{Q_x} \text{E}_{Q_{y| x}} \log p(y| x) \\ &= H(Q_x, P_x) + H(Q_{y|x}, P_{y|x}) \end{split} \] where the conditional cross-entropy is defined as \[ H(Q_{y|x}, P_{y|x})= -\text{E}_{Q_x} \text{E}_{Q_{y| x}} \log p(y| x) \] Note again that the expectation \(\text{E}_{Q_x}\) over \(x\) is implicit in this notation.
Conditional KL divergence
The KL divergence between the joint distributions can be decomposed as follows: \[ \begin{split} D_{\text{KL}}(Q_{x,y} , P_{x, y}) &= \text{E}_{Q_{x,y}} \log \left(\frac{ q(x, y) }{ p(x, y) }\right)\\ &= \text{E}_{Q_x} \text{E}_{Q_{y| x}} \log \left(\frac{ q(x) q(y| x) }{ p(x) p(y| x) }\right)\\ &= \text{E}_{Q_x} \log \left(\frac{ q(x) }{ p(x) }\right) + \text{E}_{Q_x} \text{E}_{Q_{y| x}} \log \left(\frac{ q(y| x) }{ p(y| x) }\right) \\ &= D_{\text{KL}}(Q_{x} , P_{x}) + D_{\text{KL}}(Q_{y| x} , P_{y|x}) \\ \end{split} \] with the conditional KL divergence defined as \[ D_{\text{KL}}(Q_{y| x} , P_{y|x}) = \text{E}_{Q_x} \text{E}_{Q_{y| x}} \log \left(\frac{ q(y| x) }{ p(y| x) }\right) \] (again the expectation \(\text{E}_{Q_{x}}\) is implicit in the notation). The conditional KL divergence can also be computed from the conditional (cross-)entropies by the familiar relationship \[ D_{\text{KL}}(Q_{y| x} , P_{y|x}) = H(Q_{y|x}, P_{y|x}) - H(Q_{y| x}) \]
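The same kind of numerical check works for the conditional cross-entropy and the conditional KL divergence; in the Python sketch below the two joint PMFs \(q(x,y)\) and \(p(x,y)\) are made up for illustration:

```python
import numpy as np

# hypothetical joint PMFs q(x, y) and p(x, y); rows index x, columns index y
q_xy = np.array([[0.10, 0.20],
                 [0.30, 0.15],
                 [0.05, 0.20]])
p_xy = np.array([[0.15, 0.15],
                 [0.25, 0.20],
                 [0.10, 0.15]])

def marginal_and_conditional(joint):
    """Return the marginal over x and the conditional of y given x."""
    m = joint.sum(axis=1)
    return m, joint / m[:, None]

q_x, q_y_given_x = marginal_and_conditional(q_xy)
p_x, p_y_given_x = marginal_and_conditional(p_xy)

# cross-entropy chain rule: H(Q_{x,y}, P_{x,y}) = H(Q_x, P_x) + H(Q_{y|x}, P_{y|x})
ce_joint = -np.sum(q_xy * np.log(p_xy))
ce_x = -np.sum(q_x * np.log(p_x))
ce_cond = -np.sum(q_xy * np.log(p_y_given_x))   # weighting by q(x, y) includes E_{Q_x}
print(np.isclose(ce_joint, ce_x + ce_cond))     # True

# KL chain rule: D_KL(Q_{x,y}, P_{x,y}) = D_KL(Q_x, P_x) + D_KL(Q_{y|x}, P_{y|x})
kl_joint = np.sum(q_xy * np.log(q_xy / p_xy))
kl_x = np.sum(q_x * np.log(q_x / p_x))
kl_cond = np.sum(q_xy * np.log(q_y_given_x / p_y_given_x))
print(np.isclose(kl_joint, kl_x + kl_cond))     # True
```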
Conditional Boltzmann relative entropy
The Boltzmann relative entropy is the negative of the KL divergence.
Hence: \[ \begin{split} B(Q_{x,y} , P_{x, y}) &= B(Q_{x} , P_{x}) + B(Q_{y| x} , P_{y|x}) \\ &= B(Q_{y} , P_{y}) + B(Q_{x| y} , P_{x|y}) \\ \end{split} \]
13.5 Entropy bounds for the marginal variables
The chain rule for KL divergence directly shows that \[ \begin{split} \underbrace{D_{\text{KL}}(Q_{x,y} , P_{x, y})}_{\text{upper bound}} &= D_{\text{KL}}(Q_{x} , P_{x}) + \underbrace{ D_{\text{KL}}(Q_{y| x} , P_{y|x}) }_{\geq 0}\\ &\geq D_{\text{KL}}(Q_{x} , P_{x}) \end{split} \] This means that the KL divergence between the joint distributions forms an upper bound for the KL divergence between the marginal distributions, with the difference given by the conditional KL divergence \(D_{\text{KL}}(Q_{y| x} , P_{y|x})\).
Equivalently, we can state an upper bound for the marginal cross-entropy: \[ \begin{split} \underbrace{H(Q_{x,y} , P_{x, y}) - H(Q_{y| x} )}_{\text{upper bound}} &= H(Q_{x}, P_{x}) + \underbrace{ D_{\text{KL}}(Q_{y| x} , P_{y|x}) }_{\geq 0}\\ & \geq H(Q_{x}, P_{x}) \\ \end{split} \] Instead of an upper bound we may as well express this as a lower bound for the negative marginal cross-entropy: \[ \begin{split} - H(Q_{x}, P_{x}) &= \underbrace{ - H(Q_{x} Q_{y| x} , P_{x, y}) + H(Q_{y| x} )}_{\text{lower bound}} + \underbrace{ D_{\text{KL}}(Q_{y| x} , P_{y|x})}_{\geq 0}\\ & \geq F\left( Q_{x}, Q_{y| x}, P_{x, y}\right)\\ \end{split} \] where \(F\left( Q_{x}, Q_{y| x}, P_{x, y}\right) = - H(Q_{x} Q_{y| x} , P_{x, y}) + H(Q_{y| x} )\) denotes the lower bound (note that \(Q_{x} Q_{y| x} = Q_{x,y}\)).
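As a numerical illustration, the Python sketch below (reusing the made-up joint PMFs \(q(x,y)\) and \(p(x,y)\) from before) evaluates the lower bound and confirms that the gap is the non-negative conditional KL divergence:

```python
import numpy as np

# hypothetical joint PMFs q(x, y) and p(x, y) as before
q_xy = np.array([[0.10, 0.20],
                 [0.30, 0.15],
                 [0.05, 0.20]])
p_xy = np.array([[0.15, 0.15],
                 [0.25, 0.20],
                 [0.10, 0.15]])

q_x = q_xy.sum(axis=1)
p_x = p_xy.sum(axis=1)
q_y_given_x = q_xy / q_x[:, None]

neg_ce_x = np.sum(q_x * np.log(p_x))          # -H(Q_x, P_x)
ce_joint = -np.sum(q_xy * np.log(p_xy))       # H(Q_{x,y}, P_{x,y})
H_cond = -np.sum(q_xy * np.log(q_y_given_x))  # H(Q_{y|x})
F = -ce_joint + H_cond                        # lower bound F(Q_x, Q_{y|x}, P_{x,y})

print(neg_ce_x >= F)     # True: the negative marginal cross-entropy dominates F
print(neg_ce_x - F)      # equals D_KL(Q_{y|x}, P_{y|x}) >= 0
```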
Since entropy and KL divergence are closely linked with maximum likelihood, the above bounds play a major role in statistical learning of models with unobserved latent variables (here \(y\)). They form the basis of important methods such as the EM algorithm and variational Bayes.