8  Maximum likelihood estimation in practice

Next, maximum likelihood estimation is illustrated on a number of examples. Among others, we discuss three basic problems, namely how to estimate a proportion, the mean and the variance in the likelihood framework. We also consider an example of a non-regular model (Example 8.4).

8.1 Likelihood estimation for a single parameter

In the following we illustrate likelihood estimation for models with a single parameter. In this case the score function and the second derivative of the log-likelihood are both scalar-valued, like the log-likelihood function itself.

Example 8.1 Maximum likelihood estimation for the Bernoulli model:

We aim to estimate the true proportion \(\theta\) in a Bernoulli experiment with binary outcomes, say the proportion of “successes” vs. “failures” or of “heads” vs. “tails” in a coin tossing experiment.

  • Bernoulli model \(\text{Ber}(\theta)\): \(\text{Pr}(\text{"success"}) = \theta\) and \(\text{Pr}(\text{"failure"}) = 1-\theta\).
  • The “success” is indicated by outcome \(x=1\) and the “failure” by \(x=0\).
  • We conduct \(n\) trials and record \(n_1\) successes and \(n-n_1\) failures.
  • Parameter: \(\theta\) probability of “success”.

What is the MLE of \(\theta\)?

  • the observations \(D=\{x_1, \ldots, x_n\}\) take on values 0 or 1.

  • the average of the data points is \(\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i = \frac{n_1}{n}\).

  • the probability mass function (PMF) of the Bernoulli distribution \(\text{Ber}(\theta)\) is: \[ p(x| \theta) = \theta^x (1-\theta)^{1-x} = \begin{cases} \theta & \text{if $x=1$ }\\ 1-\theta & \text{if $x=0$} \\ \end{cases} \]

  • log-PMF: \[ \log p(x| \theta) = x \log(\theta) + (1-x) \log(1 - \theta) \]

  • log-likelihood function: \[ \begin{split} l_n(\theta| D) & = \sum_{i=1}^n \log p(x_i| \theta) \\ & = n_1 \log \theta + (n-n_1) \log(1-\theta) \\ & = n \left( \bar{x} \log \theta + (1-\bar{x}) \log(1-\theta) \right) \\ \end{split} \] Note that the log-likelihood depends on the data only via \(\bar{x}\). Thus, \(t(D) = \bar{x}\) is a sufficient statistic for the parameter \(\theta\). In fact it is also a minimal sufficient statistic, as will be discussed in more detail later.

  • Score function: \[ S_n(\theta)= \frac{dl_n(\theta| D)}{d\theta}= n \left( \frac{\bar{x}}{\theta}-\frac{1-\bar{x}}{1-\theta} \right) \]

  • Maximum likelihood estimate: Setting \(S_n(\hat{\theta}_{ML})=0\) yields as solution \[ \hat{\theta}_{ML} = \bar{x} = \frac{n_1}{n} \]

    Since \(\frac{dS_n(\theta)}{d\theta} = -n \left( \frac{\bar{x}}{\theta^2} + \frac{1-\bar{x}}{(1-\theta)^2} \right) <0\) for any \(\theta\), and in particular for \(\hat{\theta}_{ML}\), the optimum indeed corresponds to the maximum of the (log-)likelihood function.

    The maximum likelihood estimator of \(\theta\) is therefore identical to the observed frequency of successes among all observations.

Note that to analyse the coin tossing experiment and to estimate \(\theta\) we may equally well use the binomial distribution \(\text{Bin}(n, \theta)\) as a model for the number of successes. This results in the same MLE for \(\theta\), but the likelihood function based on the binomial PMF includes the binomial coefficient. However, as the binomial coefficient does not depend on \(\theta\) it disappears in the score function and has no influence on the derivation of the MLE.
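
The closed-form result \(\hat{\theta}_{ML} = n_1/n\) can also be checked numerically. The following Python sketch (with a hypothetical simulated Bernoulli sample and an off-the-shelf optimiser from SciPy) maximises the Bernoulli log-likelihood and compares the numerical optimum with the analytical MLE:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.3, size=100)     # hypothetical Bernoulli(0.3) sample
n, n1 = len(x), x.sum()

def neg_log_lik(theta):
    # negative Bernoulli log-likelihood -l_n(theta | D)
    return -(n1 * np.log(theta) + (n - n1) * np.log(1 - theta))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, n1 / n)                   # numerical optimum vs analytical MLE n1/n
```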

Example 8.2 Maximum likelihood estimation for the normal distribution with unknown mean and known variance:

  • \(x \sim N(\mu,\sigma^2)\) with \(\text{E}(x)=\mu\) and \(\text{Var}(x) = \sigma^2\)
  • the parameter to be estimated is \(\mu\) whereas \(\sigma^2\) is known.

What’s the MLE of the parameter \(\mu\)?

  • the data \(D= \{x_1, \ldots, x_n\}\) are all real in the range \(x_i \in (-\infty, \infty)\).

  • the average \(\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i\) is real as well.

  • Density: \[ p(x| \mu)= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\]

  • Log-Density: \[\log p(x| \mu) =-\frac{1}{2} \log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}\]

  • Log-likelihood function: \[ \begin{split} l_n(\mu| D) &= \sum_{i=1}^n \log p(x_i| \mu)\\ &=-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2 \underbrace{-\frac{n}{2}\log(2 \pi \sigma^2) }_{\text{constant term, does not depend on } \mu \text{, can be removed}}\\ &=-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i^2 - 2 x_i \mu+\mu^2) + \text{ const.}\\ &=\frac{n}{\sigma^2} ( \bar{x} \mu - \frac{1}{2}\mu^2) \underbrace{ - \frac{1}{2\sigma^2}\sum_{i=1}^n x_i^2 }_{\text{another constant term}} + \text{ const.}\\ \end{split} \] Note how the non-constant terms of the log-likelihood depend on the data only through \(\bar{x}\). Hence \(t(D) =\bar{x}\) is a sufficient statistic for \(\mu\).

  • Score function: \[ S_n(\mu) = \frac{n}{\sigma^2} ( \bar{x}- \mu) \]

  • Maximum likelihood estimate: \[S_n(\hat{\mu}_{ML})=0 \Rightarrow \hat{\mu}_{ML} = \bar{x}\]

  • With \(\frac{dS_n(\mu)}{d\mu} = -\frac{n}{\sigma^2}<0\) the optimum is indeed the maximum.

The constant term in the log-likelihood function collects all terms that do not depend on the parameter of interest. After taking the first derivative with regard to the parameter the constant term disappears, and thus it has no influence on maximum likelihood estimation. Therefore, constant terms can be dropped from the log-likelihood function.
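
As a quick numerical illustration of Example 8.2, the sketch below (using hypothetical data and a hypothetical known variance \(\sigma^2 = 4\)) maximises the log-likelihood \(l_n(\mu|D)\) with a generic optimiser; the result should agree with \(\bar{x}\):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
sigma2 = 4.0                                  # known variance (hypothetical)
x = rng.normal(loc=1.5, scale=np.sqrt(sigma2), size=50)
n = len(x)

def neg_log_lik(mu):
    # negative log-likelihood; the constant term does not change the optimum
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum((x - mu) ** 2) / (2 * sigma2)

res = minimize_scalar(neg_log_lik)
print(res.x, x.mean())                        # numerical optimum vs x-bar
```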

Example 8.3 Maximum likelihood estimation for the normal distribution with known mean and unknown variance:

  • \(x \sim N(\mu,\sigma^2)\) with \(\text{E}(x)=\mu\) and \(\text{Var}(x) = \sigma^2\)
  • \(\sigma^2\) needs to be estimated whereas the mean \(\mu\) is known

What’s the MLE of \(\sigma^2\)?

  • the data \(D= \{x_1, \ldots, x_n\}\) are all real in the range \(x_i \in (-\infty, \infty)\).

  • the average of the squared centred data \(\overline{(x-\mu)^2} = \frac{1}{n} \sum_{i=1}^n (x_i-\mu)^2 \geq 0\) is non-negative.

  • Density: \[ p(x| \sigma^2)=(2\pi\sigma^2)^{-\frac{1}{2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\]

  • Log-Density: \[\log p(x | \sigma^2) =-\frac{1}{2} \log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}\]

  • Log-likelihood function: \[ \begin{split} l_n(\sigma^2 | D) & = \sum_{i=1}^n \log p(x_i| \sigma^2)\\ &= -\frac{n}{2}\log(\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2 \underbrace{-\frac{n}{2} \log(2 \pi) }_{\text{constant not depending on } \sigma^2}\\ &= -\frac{n}{2}\log(\sigma^2)-\frac{n}{2\sigma^2} \overline{(x-\mu)^2} + C\\ \end{split} \] Note how the log-likelihood function depends on the data only through \(\overline{(x-\mu)^2}\). Hence \(t(D) = \overline{(x-\mu)^2}\) is a sufficient statistic for \(\sigma^2\).

  • Score function: \[ S_n(\sigma^2) = -\frac{n}{2\sigma^2}+\frac{n}{2\sigma^4} \overline{(x-\mu)^2} \]

    Note that to obtain the score function the derivative needs to be taken with regard to the variance parameter \(\sigma^2\), not with regard to \(\sigma\)! As a trick, relabel \(\sigma^2 = v\) in the log-likelihood function, take the derivative with regard to \(v\), and then back-substitute \(v=\sigma^2\) in the final result.

  • Maximum likelihood estimate: \[ S_n(\widehat{\sigma^2}_{ML})=0 \Rightarrow \] \[ \widehat{\sigma^2}_{ML} =\overline{(x-\mu)^2} = \frac{1}{n}\sum_{i=1}^n (x_i-\mu)^2 \]

  • To confirm that we actually have a maximum we need to verify that the second derivative of the log-likelihood at the optimum is negative. With \(\frac{dS_n(\sigma^2)}{d\sigma^2} = -\frac{n}{2\sigma^4} \left(\frac{2}{\sigma^2} \overline{(x-\mu)^2} -1\right)\) and hence \(\frac{dS_n( \widehat{\sigma^2}_{ML} )}{d\sigma^2} = -\frac{n}{2} \left(\widehat{\sigma^2}_{ML} \right)^{-2}<0\) the optimum is indeed the maximum.
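
The derivation in Example 8.3 can also be verified numerically. The sketch below (with a hypothetical sample and a hypothetical known mean \(\mu = 1\)) maximises the log-likelihood over \(v = \sigma^2\), with the constant term dropped, and compares the optimum with \(\overline{(x-\mu)^2}\):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
mu = 1.0                                      # known mean (hypothetical)
x = rng.normal(loc=mu, scale=2.0, size=50)
n = len(x)

def neg_log_lik(v):
    # v plays the role of sigma^2; the additive constant (n/2) log(2 pi) is dropped
    return 0.5 * n * np.log(v) + np.sum((x - mu) ** 2) / (2 * v)

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 100.0), method="bounded")
print(res.x, np.mean((x - mu) ** 2))          # numerical optimum vs analytical MLE
```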

Example 8.4 Uniform distribution with upper bound \(\theta\):

This is an example of a non-regular model, as the parameter \(\theta\) determines the support of the model.

  • \(x \sim U(0,\theta)\) with \(\theta > 0\)
  • the data \(D= \{x_1, \ldots, x_n\}\) are all real in the range \(x_i \in [0, \theta]\).
  • by \(x_{[i]}\) we denote the ordered observations with \(0 \leq x_{[1]} < x_{[2]} < \ldots < x_{[n]} \leq \theta\) with \(x_{[n]} = \max(x_1,\dots,x_n)\).

We would like to obtain the maximum likelihood estimator \(\hat{\theta}_{ML}\).

  • The probability density function of \(U(0,\theta)\) is \[p(x|\theta) =\begin{cases} \frac{1}{\theta} &\text{if } x \in [0,\theta] \\ 0 & \text{otherwise.} \end{cases} \]

  • the corresponding log-density is \[ \log p(x|\theta) =\begin{cases} - \log \theta &\text{if } x \in [0,\theta] \\ - \infty & \text{otherwise.} \end{cases} \]

  • the log-likelihood function is \[ l_n(\theta| D) =\begin{cases} -n\log \theta &\text{for } \theta \geq x_{[n]}\\ - \infty & \text{otherwise} \end{cases} \] since all observed data \(D =\{x_1, \ldots, x_n\}\) lie in the interval \([0,\theta]\). Note that the log-likelihood is a function of \(x_{[n]}\) only so this single data point is the sufficient statistic \(t(D) =x_{[n]}\).

  • the log-likelihood function remains at \(-\infty\) until \(\theta = x_{[n]}\), where it jumps to \(-n\log x_{[n]}\), and then it decreases monotonically with increasing \(\theta > x_{[n]}\). Hence the log-likelihood function has a maximum at \(\hat{\theta}_{ML}=x_{[n]}\).

  • Due to the discontinuity at \(x_{[n]}\) the log-likelihood \(l_n(\theta| D)\) is not differentiable at \(\hat{\theta}_{ML}\), and hence the maximum cannot be found by setting the score function equal to zero as in a regular model.

  • In addition, there is no quadratic approximation around \(\hat{\theta}_{ML}\) and therefore the observed Fisher information cannot be computed either.
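
To illustrate the non-regular behaviour of Example 8.4 numerically, the following sketch (with a hypothetical uniform sample) evaluates the log-likelihood on a grid and confirms that it is maximised at the largest observation \(x_{[n]}\):

```python
import numpy as np

rng = np.random.default_rng(4)
theta_true = 2.5                              # hypothetical true upper bound
x = rng.uniform(0.0, theta_true, size=20)
n = len(x)

def log_lik(theta):
    # log-likelihood of U(0, theta): -inf unless theta >= max(x)
    return -n * np.log(theta) if theta >= x.max() else -np.inf

grid = np.linspace(0.01, 5.0, 2000)
ll = np.array([log_lik(t) for t in grid])
print(grid[np.argmax(ll)], x.max())           # the grid maximiser is (close to) x_[n]
```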

8.2 Likelihood estimation for multiple parameters

If there are several parameters, likelihood estimation is conceptually no different from the case of a single parameter. However, the score function is now vector-valued and the second derivative of the log-likelihood is a matrix-valued function.

Example 8.5 Normal distribution with mean and variance both unknown:

  • \(x \sim N(\mu,\sigma^2)\) with \(\text{E}(x)=\mu\) and \(\text{Var}(x) = \sigma^2\)
  • both \(\mu\) and \(\sigma^2\) need to be estimated.

What’s the MLE of the parameter vector \(\boldsymbol \theta= (\mu,\sigma^2)^T\)?

  • the data \(D= \{x_1, \ldots, x_n\}\) are all real in the range \(x_i \in (-\infty, \infty)\).

  • the average \(\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i\) is real as well.

  • the average of the squared data \(\overline{x^2} = \frac{1}{n} \sum_{i=1}^n x_i^2 \geq 0\) is non-negative.

  • Density: \[ f(x| \mu, \sigma^2)=(2\pi\sigma^2)^{-\frac{1}{2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\]

  • Log-Density: \[\log f(x | \mu, \sigma^2) =-\frac{1}{2} \log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}\]

  • Log-likelihood function: \[ \begin{split} l_n(\boldsymbol \theta| D) & = l_n(\mu, \sigma^2 | D) = \sum_{i=1}^n \log f(x_i| \mu, \sigma^2)\\ &= -\frac{n}{2}\log(\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2 \underbrace{-\frac{n}{2} \log(2 \pi) }_{\text{constant not depending on }\mu \text{ or } \sigma^2}\\ &= -\frac{n}{2}\log(\sigma^2)-\frac{n}{2\sigma^2} ( \overline{x^2} -2 \bar{x} \mu + \mu^2) + C\\ \end{split} \] Note how the log-likelihood function depends on the data only through \(\bar{x}\) and \(\overline{x^2}\). Hence, \(\boldsymbol t(D) = (\bar{x}, \overline{x^2})^T\) are sufficient statistics for \(\boldsymbol \theta\).

  • Score function \(\boldsymbol S_n\), gradient of \(l_n(\boldsymbol \theta| D)\): \[ \begin{split} \boldsymbol S_n(\boldsymbol \theta) &= \boldsymbol S_n(\mu,\sigma^2) \\ & =\nabla l_n(\mu, \sigma^2| D) \\ &= \begin{pmatrix} \frac{n}{\sigma^2} (\bar{x}-\mu) \\ -\frac{n}{2\sigma^2}+\frac{n}{2\sigma^4} \left( \overline{x^2} - 2\bar{x} \mu +\mu^2 \right) \\ \end{pmatrix}\\ \end{split} \]

  • Maximum likelihood estimate: \[ \boldsymbol S_n(\hat{\boldsymbol \theta}_{ML})=0 \Rightarrow \] \[ \hat{\boldsymbol \theta}_{ML}= \begin{pmatrix} \hat{\mu}_{ML} \\ \widehat{\sigma^2}_{ML} \\ \end{pmatrix} = \begin{pmatrix} \bar{x} \\ \overline{x^2} -\bar{x}^2\\ \end{pmatrix} \] The ML estimate of the variance can also be written \(\widehat{\sigma^2}_{ML} = \overline{x^2} -\bar{x}^2 =\overline{(x-\bar{x})^2} = \frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})^2\).

  • To confirm that we actually have a maximum we need to verify that the eigenvalues of the Hessian matrix at the optimum are all negative. This is indeed the case, for details see Example 9.4.
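
As a numerical check of Example 8.5, the sketch below (with a hypothetical normal sample) minimises the negative log-likelihood jointly over \((\mu, \sigma^2)\) and compares the result with the analytical MLEs \(\bar{x}\) and \(\overline{(x-\bar{x})^2}\):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
x = rng.normal(loc=1.0, scale=2.0, size=100)  # hypothetical sample
n = len(x)

def neg_log_lik(theta):
    mu, v = theta                             # theta = (mu, sigma^2), constants dropped
    return 0.5 * n * np.log(v) + np.sum((x - mu) ** 2) / (2 * v)

res = minimize(neg_log_lik, x0=[0.0, 1.0], bounds=[(None, None), (1e-6, None)])
print(res.x)                                  # numerical optimum
print(x.mean(), x.var(ddof=0))                # analytical MLEs: x-bar and 1/n variance
```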

Example 8.6 \(\color{Red} \blacktriangleright\) Maximum likelihood estimates of the parameters of the multivariate normal distribution:

The results from Example 8.5 can be generalised to the multivariate normal distribution:

  • \(\boldsymbol x\sim N(\boldsymbol \mu,\boldsymbol \Sigma)\) with \(\text{E}(\boldsymbol x)=\boldsymbol \mu\) and \(\text{Var}(\boldsymbol x) = \boldsymbol \Sigma\)
  • both \(\boldsymbol \mu\) and \(\boldsymbol \Sigma\) need to be estimated.

With

  • the data \(D= \{\boldsymbol x_1, \ldots, \boldsymbol x_n\}\) containing real vector-valued observations,

the maximum likelihood estimates can be written as follows:

MLE for the mean: \[ \hat{\boldsymbol \mu}_{ML} = \frac{1}{n}\sum^{n}_{k=1} \boldsymbol x_k = \bar{\boldsymbol x} \]

MLE for the covariance: \[ \underbrace{\widehat{\boldsymbol \Sigma}_{ML}}_{d \times d} = \frac{1}{n}\sum^{n}_{k=1} \underbrace{\left(\boldsymbol x_k-\bar{\boldsymbol x}\right)}_{d \times 1} \; \underbrace{\left(\boldsymbol x_k-\bar{\boldsymbol x}\right)^T}_{1 \times d}\] Note the factor \(\frac{1}{n}\) in the estimator of the covariance matrix.

With \(\overline{\boldsymbol x\boldsymbol x^T} = \frac{1}{n}\sum^{n}_{k=1} \boldsymbol x_k \boldsymbol x_k^T\) we can also write \[ \widehat{\boldsymbol \Sigma}_{ML} = \overline{\boldsymbol x\boldsymbol x^T} - \bar{\boldsymbol x} \bar{\boldsymbol x}^T \]

Hence, the MLEs correspond to the well-known empirical estimates.
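
These identities are easy to check with a few lines of code. The sketch below (using a hypothetical bivariate normal sample) computes \(\widehat{\boldsymbol \Sigma}_{ML}\) both directly and via \(\overline{\boldsymbol x\boldsymbol x^T} - \bar{\boldsymbol x}\bar{\boldsymbol x}^T\), and compares with the \(1/n\)-normalised empirical covariance:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.multivariate_normal(mean=[0.0, 1.0],
                            cov=[[2.0, 0.5], [0.5, 1.0]], size=200)  # hypothetical n x d data

xbar = X.mean(axis=0)
Sigma_ml = (X - xbar).T @ (X - xbar) / len(X)        # MLE with factor 1/n

# the same estimate via the identity Sigma_ml = mean(x x^T) - xbar xbar^T
Sigma_alt = X.T @ X / len(X) - np.outer(xbar, xbar)

print(np.allclose(Sigma_ml, Sigma_alt))              # True
print(np.allclose(Sigma_ml, np.cov(X, rowvar=False, bias=True)))  # True (1/n normalisation)
```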

The derivation of the MLEs is discussed in more detail in the module MATH38161 Multivariate Statistics and Machine Learning.

Example 8.7 \(\color{Red} \blacktriangleright\) Maximum likelihood estimation of the parameters of the categorical distribution:

Maximum likelihood estimation of the parameters of \(\text{Cat}(\boldsymbol \pi)\) at first seems a trivial extension of the Bernoulli model (cf. Example 8.1), but it is a bit more complicated because of the constraint on the allowed values of \(\boldsymbol \pi\): there are only \(K-1\) free parameters, not \(K\). Hence we either need to optimise with regard to a specific set of \(K-1\) parameters (which is what we do below) or use a constrained optimisation procedure to enforce \(\sum_{k=1}^K \pi_k = 1\) (e.g. using Lagrange multipliers).

  • The data: We observe \(n\) samples \(\boldsymbol x_1, \ldots, \boldsymbol x_n\). The data matrix of dimension \(n \times K\) is \(\boldsymbol X= (\boldsymbol x_1, \ldots, \boldsymbol x_n)^T = (x_{ik})\), whose \(i\)-th row is \(\boldsymbol x_i^T\) with \(\boldsymbol x_i = (x_{i1}, \ldots, x_{iK})^T\). The corresponding summary (minimal sufficient) statistics are \(\boldsymbol t(D) = \bar{\boldsymbol x} = \frac{1}{n} \sum_{i=1}^n \boldsymbol x_i = (\bar{x}_1, \ldots, \bar{x}_K)^T\) with \(\bar{x}_k = \frac{1}{n} \sum_{i=1}^n x_{ik}\). We can also write \(\bar{x}_{K} = 1 - \sum_{k=1}^{K-1} \bar{x}_{k}\). The number of samples for class \(k\) is \(n_k = n \bar{x}_k\) with \(\sum_{k=1}^K n_k = n\).

  • The log-likelihood is \[ \begin{split} l_n(\pi_1, \ldots, \pi_{K-1}) & = \sum_{i=1}^n \log f(\boldsymbol x_i) \\ & =\sum_{i=1}^n \left( \sum_{k=1}^{K-1} x_{ik} \log \pi_k + \left( 1 - \sum_{k=1}^{K-1} x_{ik} \right) \log \left( 1 - \sum_{k=1}^{K-1} \pi_k \right) \right)\\ & = n \left( \sum_{k=1}^{K-1} \bar{x}_k \log \pi_k + \left( 1 - \sum_{k=1}^{K-1} \bar{x}_{k} \right) \log\left(1 - \sum_{k=1}^{K-1} \pi_k\right) \right) \\ & = n \left( \sum_{k=1}^{K-1} \bar{x}_k \log \pi_k + \bar{x}_K \log \pi_K \right) \\ \end{split} \]

  • Score function (gradient) \[ \begin{split} \boldsymbol S_n(\pi_1, \ldots, \pi_{K-1}) &= \nabla l_n(\pi_1, \ldots, \pi_{K-1} ) \\ & = \begin{pmatrix} \frac{\partial}{\partial \pi_1} l_n(\pi_1, \ldots, \pi_{K-1} ) \\ \vdots\\ \frac{\partial}{\partial \pi_{K-1}} l_n(\pi_1, \ldots, \pi_{K-1} ) \\ \end{pmatrix}\\ & = n \begin{pmatrix} \frac{\bar{x}_1}{\pi_1}-\frac{\bar{x}_K}{\pi_K} \\ \vdots\\ \frac{\bar{x}_{K-1}}{\pi_{K-1}}-\frac{\bar{x}_K}{\pi_K} \\ \end{pmatrix}\\ \end{split} \] Note in particular the need for the second term that arises because \(\pi_K\) depends on all the \(\pi_1, \ldots, \pi_{K-1}\).

  • Maximum likelihood estimate: Setting \(\boldsymbol S_n(\hat{\pi}_1^{ML}, \ldots, \hat{\pi}_{K-1}^{ML})=0\) yields \(K-1\) equations \[ \bar{x}_i \left(1 - \sum_{k=1}^{K-1} \hat{\pi}_k^{ML}\right) = \hat{\pi}_i^{ML} \left( 1 - \sum_{k=1}^{K-1} \bar{x}_{k} \right) \] for \(i=1, \ldots, K-1\) and with solution \[ \hat{\pi}_i^{ML} = \bar{x}_i \] It also follows that \[ \hat{\pi}_K^{ML} = 1 - \sum_{k=1}^{K-1} \hat{\pi}_k^{ML} = 1 - \sum_{k=1}^{K-1} \bar{x}_{k} = \bar{x}_K \] The maximum likelihood estimator is therefore the frequency of the occurrence of a class among the \(n\) samples.
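
As a numerical cross-check of Example 8.7, the sketch below (with \(K=3\) hypothetical classes and simulated indicator data) maximises the log-likelihood over the \(K-1\) free parameters with a derivative-free optimiser and compares the result with the class frequencies \(\bar{x}_k\):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
pi_true = np.array([0.5, 0.3, 0.2])           # hypothetical class probabilities, K = 3
n, K = 200, 3
labels = rng.choice(K, size=n, p=pi_true)
X = np.eye(K)[labels]                         # n x K matrix of one-hot indicator vectors
xbar = X.mean(axis=0)                         # sufficient statistic (class frequencies)

def neg_log_lik(free):                        # free = (pi_1, ..., pi_{K-1})
    pi_K = 1.0 - free.sum()
    if pi_K <= 0 or np.any(free <= 0):
        return 1e10                           # crude penalty keeps the search in the simplex
    pi = np.append(free, pi_K)
    return -n * np.sum(xbar * np.log(pi))

res = minimize(neg_log_lik, x0=np.full(K - 1, 1.0 / K), method="Nelder-Mead")
print(np.append(res.x, 1.0 - res.x.sum()))    # numerical optimum
print(xbar)                                   # analytical MLE: the class frequencies
```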

8.3 Further properties of ML

Relationship of maximum likelihood with least squares estimation

In Example 8.2 the non-constant part of the log-likelihood function is proportional to the negative sum of squared differences. Maximising \(l_n(\mu| D) =-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2\) is equivalent to minimising \(\sum_{i=1}^n(x_i-\mu)^2\). Hence, finding the mean by maximum likelihood assuming a normal model is equivalent to least-squares estimation!
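
This equivalence can be demonstrated in a few lines: minimising the sum of squared differences numerically (for a hypothetical data set) recovers the sample mean, i.e. the MLE of \(\mu\) under the normal model:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(8)
x = rng.normal(loc=3.0, scale=1.0, size=30)   # hypothetical sample

# minimising the sum of squared differences ...
res = minimize_scalar(lambda mu: np.sum((x - mu) ** 2))
# ... recovers the sample mean, i.e. the least-squares and ML estimate coincide
print(res.x, x.mean())
```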

Note that least-squares estimation has been in use at least since the early 1800s 1 and thus predates maximum likelihood (1922). Due to its simplicity it is still very popular, in particular in regression, and the link with maximum likelihood and normality helps to explain why it usually works well.

See also Example 3.3 and Example 4.5 for further links of the normal distribution with squared error.

Bias of maximum likelihood estimates

Example 8.5 is interesting because it shows that maximum likelihood can result in both biased and unbiased estimators.

Recall that \(x \sim N(\mu, \sigma^2)\). As a result \[\hat{\mu}_{ML}=\bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\] with \(\text{E}( \hat{\mu}_{ML} ) = \mu\) and \[ \widehat{\sigma^2}_{\text{ML}} \sim W_1\left(s^2 = \frac{\sigma^2}{n}, k=n-1\right) \] (see Section A.8) with mean \(\text{E}(\widehat{\sigma^2}_{ML}) = \frac{n-1}{n} \, \sigma^2\).

Therefore, the MLE of \(\mu\) is unbiased as
\[ \text{Bias}(\hat{\mu}_{ML}) = \text{E}( \hat{\mu}_{ML} ) - \mu = 0 \] In contrast, however, the MLE of \(\sigma^2\) is negatively biased because \[ \text{Bias}(\widehat{\sigma^2}_{ML}) = \text{E}( \widehat{\sigma^2}_{ML} ) - \sigma^2 = -\frac{1}{n} \, \sigma^2 \]

Thus, in the case of the variance parameter of the normal distribution the MLE does not recover the well-known unbiased estimator of the variance
\[ \widehat{\sigma^2}_{UB} = \frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})^2 = \frac{n}{n-1} \widehat{\sigma^2}_{ML} \] In other words, the unbiased variance estimate is not a maximum likelihood estimate!

Therefore it is worth keeping in mind that maximum likelihood can result in biased estimates for finite \(n\). For large \(n\), however, the bias disappears as MLEs are consistent.
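
The bias of \(\widehat{\sigma^2}_{ML}\) is easy to see in a small Monte Carlo experiment. The sketch below (with hypothetical settings \(\sigma^2 = 4\) and \(n = 10\)) compares the averages of the \(1/n\) and \(1/(n-1)\) variance estimates over repeated samples:

```python
import numpy as np

rng = np.random.default_rng(9)
mu, sigma2, n, reps = 0.0, 4.0, 10, 50000     # hypothetical settings

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
var_ml = x.var(axis=1, ddof=0)                # 1/n estimator (the MLE)
var_ub = x.var(axis=1, ddof=1)                # 1/(n-1) unbiased estimator

print(var_ml.mean(), (n - 1) / n * sigma2)    # both approx. 3.6: bias of -sigma^2/n
print(var_ub.mean(), sigma2)                  # both approx. 4.0: unbiased
```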

\(\color{Red} \blacktriangleright\) Minimal sufficient statistics and maximal data reduction

In all the examples discussed above the sufficient statistic was typically \(\bar{x}\) or \(\overline{x^2}\) (or both). This is not a coincidence: apart from the non-regular uniform model of Example 8.4, these examples are all exponential families with canonical statistics \(x\) and \(x^2\), and in exponential families a sufficient statistic can be obtained as the average of the canonical statistics.

Crucially, in the above examples the identified sufficient statistics are also minimal sufficient statistics, for which the dimension of the sufficient statistic is equal to the dimension of the parameter vector and hence as low as possible. Minimal sufficient statistics provide maximal data reduction, as will be discussed later.


  1. Stigler, S. M. 1981. Gauss and the invention of least squares. Ann. Statist. 9:465–474. https://doi.org/10.1214/aos/1176345451↩︎