8 Models with latent variables and missing data
8.1 Complete data log-likelihood versus observed data log-likelihood
It is frequently the case that we need to employ models where not all variables are observable and the corresponding data are missing.
For example, consider two random variables \(x\) and \(y\) with a joint density \[ p(x, y| \boldsymbol \theta) \] and parameters \(\boldsymbol \theta\). If we observe data \(D_x = \{ x_1, \ldots, x_n\}\) and \(D_y = \{ y_1, \ldots, y_n\}\) for \(n\) samples we can use the complete data log-likelihood \[ l_n(\boldsymbol \theta| D_x, D_y) = \sum_{i=1}^n \log p(x_i, y_i| \boldsymbol \theta) \] to estimate \(\boldsymbol \theta\). Recall that \[ l_n(\boldsymbol \theta| D_x, D_y) =-n H(\hat{Q}_{x,y}, P_{x, y|\boldsymbol \theta}) \] where \(\hat{Q}_{x,y}\) is the empirical joint distribution based on both \(D_x\) and \(D_y\) and \(P_{x, y|\boldsymbol \theta}\) is the joint model, so maximising the complete data log-likelihood minimises the cross-entropy \(H(\hat{Q}_{x,y}, P_{x, y|\boldsymbol \theta})\).
Now assume that \(y\) is not observable and hence is a so-called latent variable. Then we don’t have observations \(D_y\) and therefore cannot use the complete data likelihood. Instead, for maximum likelihood estimation with missing data we need to use the observed data log-likelihood.
From the joint density we obtain the marginal density for \(x\) by integrating out the unobserved variable \(y\): \[ p(x | \boldsymbol \theta) = \int_y p(x, y| \boldsymbol \theta) dy \] Using the marginal model we then compute the observed data log-likelihood \[ l_n(\boldsymbol \theta| D_x) = \sum_{i=1}^n \log p(x_i| \boldsymbol \theta) =\sum_{i=1}^n \log \int_y p(x_i, y| \boldsymbol \theta) dy \] Note that only the data \(D_x\) are used.
Maximum likelihood estimation based on the marginal model proceeds as usual by maximising the corresponding observed data log-likelihood, which can be written as \[ l_n(\boldsymbol \theta| D_x) = -n H(\hat{Q}_{x}, P_{x|\boldsymbol \theta}) \] where \(\hat{Q}_{x}\) is the empirical distribution based only on \(D_x\) and \(P_{x|\boldsymbol \theta}\) is the model family. Hence, maximising the observed data log-likelihood minimises the cross-entropy \(H(\hat{Q}_{x}, P_{x|\boldsymbol \theta})\).
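To make the marginalisation step concrete, here is a minimal Python sketch for a hypothetical toy model (not part of the notes) with a continuous latent variable \(y \sim N(0, 1)\) and \(x | y \sim N(y, \sigma^2)\): the observed data log-likelihood is obtained by numerically integrating \(y\) out of the joint density, and can be checked against the analytic marginal \(N(0, 1 + \sigma^2)\).

```python
import numpy as np
from scipy import integrate, stats

# Hypothetical toy model (illustration only):
# latent y ~ N(0, 1), observed x | y ~ N(y, sigma2).
# The marginal of x is then N(0, 1 + sigma2), which serves as a check.

def joint_density(x, y, sigma2):
    """Joint density p(x, y | sigma2) = p(y) p(x | y)."""
    return stats.norm.pdf(y, 0.0, 1.0) * stats.norm.pdf(x, y, np.sqrt(sigma2))

def observed_loglik(sigma2, data):
    """Observed data log-likelihood: integrate the latent y out numerically."""
    ll = 0.0
    for xi in data:
        marginal, _ = integrate.quad(lambda y: joint_density(xi, y, sigma2),
                                     -np.inf, np.inf)
        ll += np.log(marginal)
    return ll

rng = np.random.default_rng(0)
data = rng.normal(0.0, np.sqrt(2.0), size=50)  # simulated with sigma2 = 1

# numerical marginalisation agrees with the analytic marginal N(0, 1 + sigma2)
print(observed_loglik(1.0, data))
print(np.sum(stats.norm.logpdf(data, 0.0, np.sqrt(2.0))))
```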
Example 8.1 Two group normal mixture model:
Assume we have two groups labelled by \(y=1\) and \(y=2\) (thus the variable \(y\) is discrete). The data \(x\) observed in each group are normal with means \(\mu_1\) and \(\mu_2\) and variances \(\sigma^2_1\) and \(\sigma^2_2\), respectively. The probability of group \(1\) is \(\pi_1 = p\) and the probability of group \(2\) is \(\pi_2=1-p\). The density of the joint model for \(x\) and \(y\) is \[ p(x, y| \boldsymbol \theta) = \pi_y N(x| \mu_y, \sigma^2_y) \] The model parameters are \(\boldsymbol \theta= (p, \mu_1, \mu_2, \sigma^2_1, \sigma^2_2)^T\) and they can be inferred from the complete data, consisting of \(D_x = \{x_1, \ldots, x_n\}\) and the group allocations \(D_y=\{y_1, \ldots, y_n\}\) of each sample, using the complete data log-likelihood \[ l_n(\boldsymbol \theta| D_x, D_y ) =\sum_{i=1}^n \log \pi_{y_i} + \sum_{i=1}^n \log N(x_i| \mu_{y_i}, \sigma^2_{y_i}) \]
However, typically we do not know the class allocation \(y\) and thus we need to use the marginal model for \(x\) alone which has density \[ \begin{split} p(x| \boldsymbol \theta) &= \sum_{y=1}^2 \pi_y N(x| \mu_y, \sigma^2_y) \\ &= p N(x| \mu_1, \sigma^2_1) + (1-p) N(x | \mu_2, \sigma^2_2)\\ \end{split} \] This is an example of a two-component mixture model. The corresponding observed data log-likelihood is \[ l_n(\boldsymbol \theta| D_x ) = \sum_{i=1}^n \log \sum_{y=1}^2 \pi_y N(x_i |\mu_y, \sigma^2_y) \] Note that the form of the observed data log-likelihood is more complex than that of the complete data log-likelihood because it contains the logarithm of a sum that cannot be simplified. It is used to estimate the model parameters \(\boldsymbol \theta\) from \(D_x\) without requiring knowledge of the class allocations \(D_y\).
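As an illustrative Python sketch (simulated data and made-up starting values, not from the notes), the observed data log-likelihood of the two-component mixture can be maximised directly by numerical optimisation; note that only the observed \(D_x\) enters the computation.

```python
import numpy as np
from scipy import optimize, stats

def neg_observed_loglik(theta, x):
    """Negative observed data log-likelihood of the two-group normal mixture.
    theta = (p, mu1, mu2, log_var1, log_var2); log-variances keep the
    variance parameters positive without explicit constraints."""
    p, mu1, mu2, lv1, lv2 = theta
    dens = (p * stats.norm.pdf(x, mu1, np.exp(0.5 * lv1)) +
            (1 - p) * stats.norm.pdf(x, mu2, np.exp(0.5 * lv2)))
    return -np.sum(np.log(dens))

# simulated complete data; only the observed x enters the likelihood
rng = np.random.default_rng(1)
y = rng.choice([1, 2], size=200, p=[0.3, 0.7])
x = np.where(y == 1, rng.normal(-2.0, 1.0, 200), rng.normal(2.0, 1.5, 200))

start = np.array([0.5, -1.0, 1.0, 0.0, 0.0])  # made-up starting values
res = optimize.minimize(neg_observed_loglik, start, args=(x,),
                        bounds=[(1e-3, 1 - 1e-3)] + [(None, None)] * 4)
print(res.x)  # estimates of (p, mu1, mu2, log sigma1^2, log sigma2^2)
```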
Example 8.2 Alternative computation of the observed data likelihood:
An alternative way to arrive at the observed data likelihood is to marginalise the complete data likelihood. The complete data likelihood is \[ L_n(\boldsymbol \theta| D_x, D_y) = \prod_{i=1}^n p(x_i, y_i| \boldsymbol \theta) \] and integrating out the unobserved \(y_1, \ldots, y_n\) yields \[ L_n(\boldsymbol \theta| D_x) = \int_{y_1, \ldots, y_n} \prod_{i=1}^n p(x_i, y_i| \boldsymbol \theta) dy_1 \ldots dy_n \] The integration (or summation, for discrete \(y\)) and the multiplication can be interchanged as per the generalised distributive law, leading to \[ L_n(\boldsymbol \theta| D_x) = \prod_{i=1}^n \int_{y} p(x_i, y| \boldsymbol \theta) dy \] which is the same as constructing the likelihood from the marginal density.
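For a discrete latent variable, such as the group label in Example 8.1, the interchange reads \[ L_n(\boldsymbol \theta| D_x) = \sum_{y_1=1}^{2} \cdots \sum_{y_n=1}^{2} \prod_{i=1}^n p(x_i, y_i| \boldsymbol \theta) = \prod_{i=1}^n \sum_{y=1}^{2} p(x_i, y| \boldsymbol \theta) \] which holds because each factor \(p(x_i, y_i| \boldsymbol \theta)\) involves only a single summation variable.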
8.2 Estimation of the unobservable latent states using Bayes theorem
After estimating the marginal model it is straightforward to obtain a probabilistic prediction about the state of the latent variables \(y_1, \ldots, y_n\). Since \[ p(x, y | \boldsymbol \theta) = p( x|\boldsymbol \theta) \, p(y | x, \boldsymbol \theta) = p( y|\boldsymbol \theta) \, p(x | y, \boldsymbol \theta) \] given an estimate \(\hat{\boldsymbol \theta}\) we can compute for each observation \(x_i\) \[ p(y_i | x_i , \hat{\boldsymbol \theta}) = \frac{p(x_i, y_i | \hat{\boldsymbol \theta} ) }{p(x_i|\hat{\boldsymbol \theta})} =\frac{ p( y_i|\hat{\boldsymbol \theta}) \, p(x_i | y_i, \hat{\boldsymbol \theta}) }{p(x_i|\hat{\boldsymbol \theta})} \] the probabilities / densities of all states of \(y_i\) (note that this is an application of Bayes’ theorem).
Example 8.3 Latent states of two group normal mixture model:
Continuing from Example 8.1 above we assume the marginal model has been fitted with parameter values \(\hat{\boldsymbol \theta} = (\hat{p},\hat{\mu}_1, \hat{\mu}_2, \widehat{\sigma^2_1}, \widehat{\sigma^2_2} )^T\). Then for each sample \(x_i\) we can obtain a probabilistic prediction of its group association via \[ p(y_i | x_i, \hat{\boldsymbol \theta}) = \frac{\hat{\pi}_{y_i} N(x_i| \hat{\mu}_{y_i}, \widehat{\sigma^2_{y_i}})}{\hat{p} N(x_i| \hat{\mu}_1, \widehat{\sigma^2_1}) + (1-\hat{p}) N(x_i | \hat{\mu}_2, \widehat{\sigma^2_2})} \]
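As an illustrative continuation in Python (the parameter values below are placeholders standing in for the fitted estimates \(\hat{\boldsymbol \theta}\)), the soft group allocations follow directly from this formula:

```python
import numpy as np
from scipy import stats

def group_posterior(x, p, mu1, mu2, v1, v2):
    """Posterior probability p(y = 1 | x, theta-hat) for each observation,
    computed via Bayes' theorem; p(y = 2 | x, theta-hat) is its complement."""
    d1 = p * stats.norm.pdf(x, mu1, np.sqrt(v1))
    d2 = (1 - p) * stats.norm.pdf(x, mu2, np.sqrt(v2))
    return d1 / (d1 + d2)

# placeholder estimates standing in for the fitted theta-hat
x = np.array([-2.5, 0.1, 3.0])
print(group_posterior(x, p=0.3, mu1=-2.0, mu2=2.0, v1=1.0, v2=2.25))
```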
8.3 EM Algorithm
Computing and maximising the observed data log-likelihood can be difficult because of the integration over the unobserved variable (or summation in case of a discrete latent variable). In contrast, the complete data log-likelihood function may be easier to compute.
The widely used EM algorithm, formally described by Dempster and others (1977) but also in use before then, addresses this problem and maximises the observed data log-likelihood indirectly in an iterative procedure comprising two steps (a code sketch for the two-group normal mixture of Example 8.1 follows the list):
- First (“E” step), the missing data \(D_y\) are imputed using Bayes’ theorem. This provides probabilities (“soft allocations”) for each possible state of the latent variable.
- Subsequently (“M” step), the expected complete data log-likelihood function is computed, where the expectation is taken with regard to the distribution over the latent states, and it is maximised with regard to \(\boldsymbol \theta\) to estimate the model parameters.
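A minimal Python sketch of this iteration for the two-group normal mixture (illustrative only; the starting values and the fixed number of iterations are ad hoc choices):

```python
import numpy as np
from scipy import stats

def em_two_group(x, p, mu1, mu2, v1, v2, n_iter=100):
    """EM iterations for a two-component normal mixture (illustrative sketch)."""
    for _ in range(n_iter):
        # "E" step: soft allocations via Bayes' theorem, given current parameters
        d1 = p * stats.norm.pdf(x, mu1, np.sqrt(v1))
        d2 = (1 - p) * stats.norm.pdf(x, mu2, np.sqrt(v2))
        w = d1 / (d1 + d2)                      # p(y_i = 1 | x_i, theta)

        # "M" step: maximise the expected complete data log-likelihood,
        # which amounts to weighted parameter updates
        p = np.mean(w)
        mu1 = np.sum(w * x) / np.sum(w)
        mu2 = np.sum((1 - w) * x) / np.sum(1 - w)
        v1 = np.sum(w * (x - mu1) ** 2) / np.sum(w)
        v2 = np.sum((1 - w) * (x - mu2) ** 2) / np.sum(1 - w)
    return p, mu1, mu2, v1, v2

# ad hoc starting values; x is the observed data only
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2.0, 1.0, 60), rng.normal(2.0, 1.5, 140)])
print(em_two_group(x, p=0.5, mu1=-1.0, mu2=1.0, v1=1.0, v2=1.0))
```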
The EM algorithm leads to exactly the same estimates as maximising the observed data log-likelihood directly. It is therefore not an approximation, but simply a different way to find the MLEs.
The EM algorithm and application to clustering is discussed in more detail in the module MATH38161 Multivariate Statistics and Machine Learning.
In a nutshell, the justification for the EM algorithm follows from the entropy chain rules and the corresponding bounds, such as \(D_{\text{KL}}(Q_{x,y} , P_{x, y}) \geq D_{\text{KL}}(Q_{x} , P_{x})\) (see previous chapter). Given observed data for \(x\) we know the empirical distribution \(\hat{Q}_x\). Hence, by minimising \(D_{\text{KL}}( \hat{Q}_{x} Q_{y| x}, P_{x, y}^{\boldsymbol \theta})\) iteratively
- with regard to \(Q_{y| x}\) (“E” step) and
- with regard to the parameters \(\boldsymbol \theta\) of \(P_{x, y}^{\boldsymbol \theta}\) (“M” step)
one minimises \(D_{\text{KL}}(\hat{Q}_{x} , P_{x}^{\boldsymbol \theta})\) with regard to the parameters of \(P_{x}^{\boldsymbol \theta}\), as the identity below makes explicit.
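Specifically, the chain rule for the KL divergence (see previous chapter) gives \[ D_{\text{KL}}( \hat{Q}_{x} Q_{y| x}, P_{x, y}^{\boldsymbol \theta}) = D_{\text{KL}}(\hat{Q}_{x} , P_{x}^{\boldsymbol \theta}) + \text{E}_{\hat{Q}_{x}} \left( D_{\text{KL}}(Q_{y| x} , P_{y|x}^{\boldsymbol \theta}) \right) \geq D_{\text{KL}}(\hat{Q}_{x} , P_{x}^{\boldsymbol \theta}) \] In the “E” step the second term is set to zero by choosing \(Q_{y|x} = P_{y|x}^{\boldsymbol \theta}\) (Bayes’ theorem with the current parameters), so the bound becomes tight; in the “M” step the left-hand side is then decreased with regard to \(\boldsymbol \theta\), so that \(D_{\text{KL}}(\hat{Q}_{x} , P_{x}^{\boldsymbol \theta})\) cannot increase from one iteration to the next.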
Interestingly, in the “E” step the first argument of the KL divergence is optimised (“I” projection) and in the “M” step the second argument (“M” projection).
Alternatively, instead of bounding the marginal KL divergence one can either minimise an upper bound of the cross-entropy or maximise a lower bound of the negative cross-entropy. All three procedures yield the same EM algorithm.
Note that the optimisation of the entropy bound in the “E” step requires variational calculus since the argument is a distribution! The EM algorithm is therefore in fact a special case of a variational Bayes algorithm since it not only provides estimates of \(\boldsymbol \theta\) but also yields the distribution of the latent states by means of the calculus of variations.
Finally, in the above we have seen that we can learn about unobservable states by means of Bayes’ theorem. Extending the same principle to learning about parameters and models leads to Bayesian learning.