7 Principle of maximum likelihood
7.1 Overview
Outline of maximum likelihood estimation
Maximum likelihood is a very general method for fitting probabilistic models to data that generalises the method of least squares. It plays a very important role in statistics and was mainly developed by R.A. Fisher in the early 20th century. 1
In a nutshell, the starting points in a maximum likelihood analysis are
- the observed data \(D = \{x_1,\ldots,x_n\}\) with \(n\) independent and identically distributed (iid) samples, with the ordering irrelevant, and
- a model \(P_{\boldsymbol \theta}\) with corresponding probability density or probability mass function \(p(x|\boldsymbol \theta)\) and parameters \(\boldsymbol \theta\)
From model and data the likelihood function (note upper case “L”) is constructed as \[ L_n(\boldsymbol \theta|D)=\prod_{i=1}^{n} p(x_i|\boldsymbol \theta) \] Equivalently, the log-likelihood function (note lower case “l”) is \[ l_n(\boldsymbol \theta|D)=\sum_{i=1}^n \log p(x_i|\boldsymbol \theta) \] The likelihood is multiplicative and the log-likelihood additive over the samples \(x_i\).
The maximum likelihood estimate (MLE) \(\hat{\boldsymbol \theta}_{ML}\) is then found by maximising the (log-)likelihood function with regard to the parameters \(\boldsymbol \theta\) (see Figure 7.1): \[ \hat{\boldsymbol \theta}_{ML} = \underset{\boldsymbol \theta}{\arg \max}\,\, l_n(\boldsymbol \theta|D) \]
Hence, once the model is chosen and data are collected, finding the MLE and thus fitting the model to the data is an optimisation problem.
Depending on the complexity of the likelihood function and the number of parameters, finding the maximum likelihood estimate can be very difficult. On the other hand, for likelihood functions constructed from common distribution families, such as exponential families, maximum likelihood estimation is very straightforward and can even be done analytically (this is the case for most examples we encounter in this course).
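As a minimal worked sketch of such an analytically tractable case (the exponential model is used here purely for illustration), consider data \(x_i \geq 0\) with density \(p(x|\lambda) = \lambda e^{-\lambda x}\). The log-likelihood is \[ l_n(\lambda|D) = \sum_{i=1}^n \log\left( \lambda e^{-\lambda x_i} \right) = n \log \lambda - \lambda \sum_{i=1}^n x_i \,. \] Setting the derivative with regard to \(\lambda\) to zero yields the analytical solution \[ \frac{d l_n(\lambda|D)}{d \lambda} = \frac{n}{\lambda} - \sum_{i=1}^n x_i = 0 \quad \Longrightarrow \quad \hat{\lambda}_{ML} = \frac{1}{\bar{x}} \] with \(\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i\).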
In practice, for more complex models the optimisation required for a maximum likelihood analysis is carried out on the computer, typically on the log-likelihood rather than the likelihood function in order to avoid problems with the computer representation of small floating point numbers. Suitable optimisation algorithms may rely only on function values without requiring derivatives, or may in addition use gradient and possibly curvature information. In recent years there has been a lot of progress in high-dimensional optimisation using combined numerical and analytical approaches (e.g. automatic differentiation) and stochastic approximations (e.g. stochastic gradient descent).
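The following Python snippet is a minimal sketch of such a numerical maximum likelihood analysis, here for a normal model fitted to simulated data (the data, the model choice and the variable names are illustrative assumptions, not part of the text):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# simulated data (illustrative only)
rng = np.random.default_rng(42)
x = rng.normal(loc=2.0, scale=1.5, size=100)

def negative_log_likelihood(params, data):
    """Negative log-likelihood of a normal model with mean mu and standard deviation sigma."""
    mu, log_sigma = params              # optimise over log(sigma) to keep sigma positive
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(data, loc=mu, scale=sigma))

# numerical optimisation: minimise the negative log-likelihood
result = minimize(negative_log_likelihood, x0=np.array([0.0, 0.0]), args=(x,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)                # close to np.mean(x) and np.std(x)
```

Note that the optimisation works on the log-likelihood (as a negative log-likelihood, since most optimisers minimise) and over \(\log \sigma\) rather than \(\sigma\); the latter is an instance of the reparameterisation discussed in the invariance section below.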
Origin of the method of maximum likelihood
Historically, the likelihood has often been interpreted and justified as the probability of the data given the model. However, while this provides an intuitive understanding it is not strictly correct. First, this interpretation only applies to discrete random variables. Second, since the samples \(x_1, \ldots, x_n\) are typically exchangeable (i.e. permutation invariant), even in the discrete case one would still need to add a factor accounting for the multiplicity of the possible orderings of the samples to obtain the correct probability of the data. Third, the interpretation of the likelihood as the probability of the data breaks down completely for continuous random variables because then \(p(x |\boldsymbol \theta)\) is a density, not a probability.
Next, we will see that maximum likelihood estimation is a well-justified method that arises naturally from an entropy perspective. More specifically, the maximum likelihood estimate corresponds to the distribution \(P_{\boldsymbol \theta}\) that is closest in terms of KL divergence to the unknown true data generating model as represented by the observed data and the empirical distribution.
7.2 From entropy learning to maximum likelihood
The KL divergence between true model and approximating model
Assume we have observations \(D = \{x_1, \ldots, x_n\}\). The data are sampled from \(F\), the true but unknown data generating distribution. We also specify a family of distributions \(P_{\boldsymbol \theta}\) indexed by \(\boldsymbol \theta\) to approximate \(F\).
The KL divergence \(D_{\text{KL}}(F,P_{\boldsymbol \theta})\) measures the divergence of the approximation \(P_{\boldsymbol \theta}\) from the unknown true model \(F\). It can be written as \[ \begin{split} D_{\text{KL}}(F,P_{\boldsymbol \theta}) &= H(F,P_{\boldsymbol \theta}) - H(F) \\ &= \underbrace{- \text{E}_{F}\log p_{\boldsymbol \theta}(x)}_{\text{cross-entropy}} -(\underbrace{-\text{E}_{F}\log f(x)}_{\text{entropy of $F$, does not depend on $\boldsymbol \theta$}})\\ \end{split} \]
However, since we do not know \(F\) we cannot actually compute this divergence. Nonetheless, we may use the empirical distribution \(\hat{F}_n\) (a function of the observed data) as an approximation of \(F\), and in this way we arrive at an approximation of \(D_{\text{KL}}(F,P_{\boldsymbol \theta})\) that becomes more and more accurate with growing sample size.
Recall the “Law of Large Numbers”:
The empirical distribution \(\hat{F}_n\) based on observed data \(D=\{x_1, \ldots, x_n\}\) converges strongly (almost surely) to the true underlying distribution \(F\) as \(n \rightarrow \infty\): \[ \hat{F}_n\overset{a. s.}{\to} F \]
Correspondingly, for \(n \rightarrow \infty\) the average \(\text{E}_{\hat{F}_n}(h(x)) = \frac{1}{n} \sum_{i=1}^n h(x_i)\) converges to the expectation \(\text{E}_{F}(h(x))\).
Hence, for large sample size \(n\) we can approximate cross-entropy and as a result the KL divergence. The cross-entropy \(H(F, P_{\boldsymbol \theta})\) is approximated by the empirical cross-entropy where the expectation is taken with regard to \(\hat{F}_n\) rather than \(F\): \[ \begin{split} H(F, P_{\boldsymbol \theta}) & \approx H(\hat{F}_n, P_{\boldsymbol \theta}) \\ & = - \text{E}_{\hat{F}_n} (\log p(x|\boldsymbol \theta)) \\ & = -\frac{1}{n} \sum_{i=1}^n \log p(x_i | \boldsymbol \theta) \\ & = -\frac{1}{n} l_n ({\boldsymbol \theta}| D) \end{split} \] The empirical cross-entropy is equal to the negative log-likelihood standardised by the sample size \(n\). Conversely, the log-likelihood is the negative empirical cross-entropy multiplied by sample size \(n\).
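This identity is easy to verify numerically; the short Python check below (an added illustration using a normal model with arbitrary candidate parameters) compares the empirical cross-entropy with the scaled negative log-likelihood:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=50)   # simulated data

mu, sigma = 0.5, 1.8                          # arbitrary candidate parameter values

log_lik = np.sum(norm.logpdf(x, loc=mu, scale=sigma))            # l_n(theta | D)
empirical_cross_entropy = -np.mean(norm.logpdf(x, loc=mu, scale=sigma))

# H(F_n, P_theta) equals -l_n(theta | D) / n
print(np.isclose(empirical_cross_entropy, -log_lik / len(x)))    # True
```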
Minimum KL divergence and maximum likelihood
If we knew \(F\) we would simply minimise \(D_{\text{KL}}(F, P_{\boldsymbol \theta})\) to find the particular model \(P_{\boldsymbol \theta}\) that is closest to the true model, or equivalently, we would minimise the cross-entropy \(H(F, P_{\boldsymbol \theta})\). However, since we actually don’t know \(F\) this is not possible.
For large sample size \(n\), when the empirical distribution \(\hat{F}_n\) is a good approximation of \(F\), we can instead use the result from the previous section: rather than minimising the KL divergence \(D_{\text{KL}}(F, P_{\boldsymbol \theta})\) we simply minimise \(H(\hat{F}_n, P_{\boldsymbol \theta})\), which is the same as maximising the log-likelihood \(l_n ({\boldsymbol \theta}| D)\).
Conversely, this implies that maximising the likelihood with regard to \(\boldsymbol \theta\) is equivalent (asymptotically for large \(n\)!) to minimising the KL divergence between the approximating model and the unknown true model.
\[ \begin{split} \hat{\boldsymbol \theta}_{ML} &= \underset{\boldsymbol \theta}{\arg \max}\,\, l_n(\boldsymbol \theta| D) \\ &= \underset{\boldsymbol \theta}{\arg \min}\,\, H(\hat{F}_n, P_{\boldsymbol \theta}) \\ &\approx \underset{\boldsymbol \theta}{\arg \min}\,\, D_{\text{KL}}(F, P_{\boldsymbol \theta}) \\ \end{split} \]
Therefore, the reasoning behind the method of maximum likelihood is that it minimises a large sample approximation of the KL divergence of the candidate model \(P_{\boldsymbol \theta}\) from the unknown true model \(F\). In other words, maximum likelihood estimators are minimum empirical KL divergence estimators.
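A small grid-search sketch (an added illustration; the normal model with known variance is assumed purely for demonstration) shows that the parameter minimising the empirical cross-entropy is exactly the MLE:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
x = rng.normal(loc=1.0, scale=1.0, size=200)   # simulated data, true mean 1

mu_grid = np.linspace(-2.0, 4.0, 2001)
# empirical cross-entropy H(F_n, P_mu) for each candidate mean (sigma fixed at 1)
cross_entropy = [-np.mean(norm.logpdf(x, loc=mu, scale=1.0)) for mu in mu_grid]

mu_min_ce = mu_grid[np.argmin(cross_entropy)]
print(mu_min_ce, np.mean(x))   # agree up to the grid resolution: the MLE of the mean
```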
As the KL divergence is a functional of the true distribution \(F\), maximum likelihood provides empirical estimators for parametric models.
As a consequence of the close link between maximum likelihood and the KL divergence, maximum likelihood inherits for large \(n\) (and only then!) all the optimality properties of the KL divergence.
7.3 Properties of maximum likelihood estimation
Consistency of maximum likelihood estimates
One important property of the method of maximum likelihood is that in general it produces consistent estimates. This means that the estimates are well behaved: they become more accurate with more data and, in the limit of infinite data, converge to the true parameters.
Specifically, if the true underlying model \(F_{\text{true}}\) is contained in the set of specified candidate models \(P_{\boldsymbol \theta}\) \[ \underbrace{F_{\text{true}}}_{\text{true model}} \subset \underbrace{P_{\boldsymbol \theta}}_{\text{specified models}} \] so that there is a parameter \(\boldsymbol \theta_{\text{true}}\) for which \(F_{\text{true}} = P_{\boldsymbol \theta_{\text{true}}}\), then \[\hat{\boldsymbol \theta}_{ML} \overset{\text{large }n}{\longrightarrow} \boldsymbol \theta_{\text{true}} \]
This is a consequence of \(D_{\text{KL}}(F_{\text{true}},P_{\boldsymbol \theta})\rightarrow 0\) for \(P_{\boldsymbol \theta} \rightarrow F_{\text{true}}\), and of the fact that maximisation of the likelihood function is, for large \(n\), equivalent to minimising the KL divergence.
Thus, given sufficient data, the maximum likelihood estimate of the parameters of the model will converge to the true value of the parameters. Note that this also assumes that the model, and in particular the number of parameters, is fixed. As a consequence of consistency, maximum likelihood estimates are asymptotically unbiased. As we will see in the examples they can still be biased in finite samples.
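Consistency (and the finite-sample bias of the variance estimate) can be illustrated with a short simulation; the normal model and the true parameter values below are assumptions made only for this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true, sigma_true = 3.0, 2.0

for n in [10, 100, 1000, 10000, 100000]:
    x = rng.normal(loc=mu_true, scale=sigma_true, size=n)
    mu_hat = np.mean(x)                     # MLE of the mean
    sigma2_hat = np.mean((x - mu_hat)**2)   # MLE of the variance (uses 1/n, biased in finite samples)
    print(n, mu_hat, sigma2_hat)
# the estimates approach mu_true = 3 and sigma_true**2 = 4 as n grows
```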
Note that even if the candidate model family \(P_{\boldsymbol \theta}\) is misspecified (so that it does not contain the actual true model), the maximum likelihood estimate is still optimal in the sense that it will identify the model in the model family that is closest in terms of empirical KL divergence.
Finally, it is possible to find inconsistent maximum likelihood estimates, but this occurs only in situations where the MLE lies at a boundary or where there are singularities in the likelihood function (in both cases the models are not regular). Furthermore, maximum likelihood estimates are inconsistent by construction when the number of parameters grows with sample size (e.g. in the famous Neyman-Scott paradox), as the data available per parameter then does not increase.
Invariance property of the maximum likelihood
The maximum likelihood invariance principle states that the achieved maximum likelihood is invariant against reparameterisation of the model parameters. This property is shared by the KL divergence minimisation procedure, as the achieved minimum KL divergence is also invariant against a change of parameters.
Recall that the model parameter is just an arbitrary label to index a specific distribution within a distribution family, and changing that label does not affect the maximum (likelihood) or the minimum (KL divergence). For example, consider a function \(h_x(x)\) with a maximum at \(x_{\max} = \text{arg max } h_x(x)\). Now we relabel the argument using \(y = g(x)\) where \(g\) is an invertible function. Then the function in terms of \(y\) is \(h_y(y) = h_x( g^{-1}(y))\), and clearly this function has a maximum at \(y_{\max} = g(x_{\max})\) since \(h_y(y_{\max}) = h_x(g^{-1}(y_{\max} ) ) = h_x( x_{\max} )\). Furthermore, the achieved maximum value is the same.
In application to maximum likelihood, assume we transform a parameter \(\theta\) into another parameter \(\omega\) using some invertible function \(g()\) so that \(\omega= g(\theta)\). Then the maximum likelihood estimate \(\hat{\omega}_{ML}\) of the new parameter \(\omega\) is simply the transformation of the maximum likelihood estimate \(\hat{\theta}_{ML}\) of the original parameter \(\theta\) with \(\hat{\omega}_{ML}= g( \hat{\theta}_{ML})\). The achieved maximum likelihood is the same in both cases.
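The following Python sketch illustrates the invariance property for the variance and the standard deviation of a normal model with known mean (the setup is an assumption made for illustration, not taken from the text):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(7)
x = rng.normal(loc=0.0, scale=2.0, size=200)   # known mean 0, unknown spread

def negloglik_var(var):    # parameterised by the variance sigma^2
    return -np.sum(norm.logpdf(x, loc=0.0, scale=np.sqrt(var)))

def negloglik_sd(sd):      # parameterised by the standard deviation sigma = sqrt(sigma^2)
    return -np.sum(norm.logpdf(x, loc=0.0, scale=sd))

res_var = minimize_scalar(negloglik_var, bounds=(1e-6, 100.0), method="bounded")
res_sd = minimize_scalar(negloglik_sd, bounds=(1e-6, 100.0), method="bounded")

print(res_sd.x, np.sqrt(res_var.x))   # hat(sigma)_ML equals g(hat(sigma^2)_ML) with g = sqrt
print(res_var.fun, res_sd.fun)        # same minimum negative log-likelihood, i.e. same achieved
                                      # maximum likelihood (up to optimiser tolerance)
```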
The invariance property can be very useful in practice because it is often easier (and sometimes numerically more stable) to maximise the likelihood for a different set of parameters.
See Worksheet L1 for an example application of the invariance principle.
Sufficient statistics
Another important concept is that of sufficient statistics, which summarise the information available in the data about the parameters of a model.
A statistic \(\boldsymbol t(D)\) is called a sufficient statistic for the model parameters \(\boldsymbol \theta\) if the corresponding likelihood function can be written using only \(\boldsymbol t(D)\) in the terms that involve \(\boldsymbol \theta\) such that \[ L(\boldsymbol \theta| D) = h( \boldsymbol t(D) , \boldsymbol \theta) \, k(D) \,, \] where \(h()\) and \(k()\) are positive-valued functions. This is known as the Fisher-Neyman factorisation. Equivalently, on the log scale this becomes \[ l_n(\boldsymbol \theta| D) = \log h( \boldsymbol t(D) , \boldsymbol \theta) + \log k(D) \,. \]
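As a brief sketch (the Bernoulli model is used here purely as an illustration and may also feature among the examples referred to below): for \(n\) iid Bernoulli samples \(x_i \in \{0, 1\}\) with parameter \(\theta\) the likelihood factorises as \[ L_n(\theta|D) = \prod_{i=1}^n \theta^{x_i} (1-\theta)^{1-x_i} = \underbrace{\theta^{\,t(D)} (1-\theta)^{\,n-t(D)}}_{h(t(D), \theta)} \, \underbrace{1}_{k(D)} \] so that \(t(D) = \sum_{i=1}^n x_i\), the number of successes, is a sufficient statistic for \(\theta\).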
By construction, estimation and inference about \(\boldsymbol \theta\) based on the factorised likelihood \(L(\boldsymbol \theta|D)\) is mediated through the sufficient statistic \(\boldsymbol t(D)\) and does not require knowledge of the original data \(D\). Instead, the sufficient statistic \(\boldsymbol t(D)\) contains all the information in \(D\) required to learn about the parameter \(\boldsymbol \theta\).
Note that a sufficient statistic always exists since the data \(D\) are themselves a sufficient statistic, with \(\boldsymbol t(D) = D\). However, in practice one aims to find sufficient statistics that summarise the data \(D\) and hence provide data reduction. This will become clear in the examples below.
Furthermore, sufficient statistics are not unique since applying a one-to-one transformation to \(\boldsymbol t(D)\) yields another sufficient statistic.
Therefore, if the MLE \(\hat{\boldsymbol \theta}_{ML}\) of \(\boldsymbol \theta\) exists and is unique then the MLE is a unique function of the sufficient statistic \(\boldsymbol t(D)\). If the MLE is not unique then it can be chosen to be a function of \(\boldsymbol t(D)\).
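In the Bernoulli sketch above, for example, the MLE \(\hat{\theta}_{ML} = t(D)/n = \bar{x}\) depends on the data only through the sufficient statistic \(t(D)\).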
7.4 Maximum likelihood estimation for regular models
Regular models
A regular model is one that is well-behaved and well-suited for model fitting by optimisation. In particular this requires that:
- the support does not depend on the parameters,
- the model is identifiable (in particular the model is not overparameterised and has a minimal set of parameters),
- the density/probability mass function and hence the log-likelihood function is twice differentiable everywhere with regard to the parameters,
- the maximum (peak) of the likelihood function lies inside the parameter space and not at a boundary,
- the second derivative of the log-likelihood at the maximum is negative and not zero (for multiple parameters: the Hessian matrix at the maximum is negative definite and not singular).
Most models considered in this course are regular.
Maximum likelihood estimation in regular models
For a regular model maximum likelihood estimation and the necessary optimisation is greatly simplified by being able to use gradient and curvature information.
In order to maximise \(l_n(\boldsymbol \theta|D)\) one may use the score function \(\boldsymbol S_n(\boldsymbol \theta)\), which is the first derivative of the log-likelihood function with regard to the parameters 2
\[\begin{align*} S_n(\theta) &= \frac{d\, l_n(\theta|D)}{d \theta} && \text{scalar parameter $\theta$: first derivative of the log-likelihood function} \\ \boldsymbol S_n(\boldsymbol \theta) &= \nabla\, l_n(\boldsymbol \theta|D) && \text{gradient if $\boldsymbol \theta$ is a vector (i.e. if there is more than one parameter)} \end{align*}\]
In this case a necessary (but not sufficient) condition for the MLE is that \[ \boldsymbol S_n(\hat{\boldsymbol \theta}_{ML}) = 0 \]
To demonstrate that the log-likelihood function actually achieves a maximum at \(\hat{\boldsymbol \theta}_{ML}\), the curvature at the MLE must be negative, i.e. the log-likelihood must be locally concave at the MLE.
In the case of a single parameter (scalar \(\theta\)) this requires checking that the second derivative of the log-likelihood function with regard to the parameter is negative: \[ \frac{d^2 l_n(\hat{\theta}_{ML}| D)}{d \theta^2} <0 \] In the case of a parameter vector (multivariate \(\boldsymbol \theta\)) you need to compute the Hessian matrix (the matrix of second order derivatives) at the MLE: \[ \nabla \nabla^T l_n(\hat{\boldsymbol \theta}_{ML}| D) \] and then verify that this matrix is negative definite (i.e. all its eigenvalues must be negative). For a multivariate parameter vector \(\boldsymbol \theta\) of dimension \(d\) the Hessian is a matrix of size \(d \times d\).
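The following Python sketch (an added illustration; the normal model with known variance and the finite-difference checks are assumptions made for demonstration) verifies both conditions numerically for a single-parameter model:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
x = rng.normal(loc=1.0, scale=1.0, size=500)   # known sigma = 1, unknown mean

def loglik(mu):
    return np.sum(norm.logpdf(x, loc=mu, scale=1.0))

mu_hat = np.mean(x)   # analytical MLE of the mean

# finite-difference approximations of the score and the second derivative at the MLE
eps = 1e-4
score_at_mle = (loglik(mu_hat + eps) - loglik(mu_hat - eps)) / (2 * eps)
second_deriv = (loglik(mu_hat + eps) - 2 * loglik(mu_hat) + loglik(mu_hat - eps)) / eps**2

print(score_at_mle)   # approximately 0
print(second_deriv)   # approximately -n / sigma^2 = -500, i.e. negative
```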
Invariance of score function and second derivative of the log-likelihood
The score function \(\boldsymbol S_n(\boldsymbol \theta)\) is invariant against transformations of the sample space. Assume \(\boldsymbol x\) has log-density \(\log f_{\boldsymbol x}(\boldsymbol x| \boldsymbol \theta)\) and that \(\boldsymbol y\) is obtained from \(\boldsymbol x\) by an invertible transformation. Then the log-density for \(\boldsymbol y\) is \[ \log f_{\boldsymbol y}(\boldsymbol y| \boldsymbol \theta) = \log |\det\left( D\boldsymbol x(\boldsymbol y) \right)| + \log f_{\boldsymbol x}\left( \boldsymbol x(\boldsymbol y)| \boldsymbol \theta\right) \] where \(D\boldsymbol x(\boldsymbol y)\) is the Jacobian matrix of the inverse transformation \(\boldsymbol x(\boldsymbol y)\). When taking the derivative of the log-likelihood function with regard to the parameter \(\boldsymbol \theta\) the first term containing the Jacobian determinant vanishes. Hence the score function \(\boldsymbol S_n(\boldsymbol \theta)\) is not affected by a change of variables.
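Written out explicitly: since the Jacobian term does not depend on \(\boldsymbol \theta\), \[ \nabla_{\boldsymbol \theta} \log f_{\boldsymbol y}(\boldsymbol y|\boldsymbol \theta) = \nabla_{\boldsymbol \theta} \log f_{\boldsymbol x}\left( \boldsymbol x(\boldsymbol y)|\boldsymbol \theta\right) \,. \]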
As a consequence, the second derivative of the log-likelihood function with regard to \(\boldsymbol \theta\) is also invariant against transformations of the sample space.
Aldrich J. 1997. R. A. Fisher and the Making of Maximum Likelihood 1912–1922. Statist. Sci. 12:162–176. https://doi.org/10.1214/ss/1030037906↩︎
The score function \(\boldsymbol S_n(\boldsymbol \theta)\) as the gradient of the log-likelihood function must not be confused with the scoring rule \(S(x, P)\) mentioned in the introduction to entropy and KL divergence, cf. Example 3.4.↩︎