2 Multivariate estimation
2.1 Overview
In practical applications of the multivariate normal model we need to learn its parameters, the mean $\boldsymbol\mu$ and the covariance matrix $\boldsymbol\Sigma$, from observed data points $\boldsymbol x_1, \ldots, \boldsymbol x_n$.
We first consider the case when there are many measurements available (large sample size $n$ relative to the dimension $d$), and later (Section 2.5) the case when the sample size is small.
In a previous course in year 2 (see MATH27720 Statistics 2) the method of maximum likelihood as well as the essentials of Bayesian statistics were introduced. Below we apply these approaches to the problem of estimating the parameters of the multivariate normal distribution and also show how the main Bayesian modelling strategies extend to the multivariate case.
2.2 Empirical estimates
General principle
For large sample size $n$ the empirical distribution $\hat{F}_n$ constructed from the observed data $\boldsymbol x_1, \ldots, \boldsymbol x_n$ closely approximates the unknown true distribution $F$.

We now would like to estimate some quantity of interest of $F$, for example the mean $E(\boldsymbol x)$ or the variance $\text{Var}(\boldsymbol x)$, from the observations.

The empirical estimate is obtained by replacing the unknown true distribution $F$ by the empirical distribution $\hat{F}_n$.

For example, the expectation of a random variable is approximated/estimated as the average over the observations:
$$ \hat{E}(x) = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x} . $$

Simple recipe to obtain an empirical estimator: replace the expectation operator by the sample average.

Why does this work: the empirical distribution $\hat{F}_n$ converges to the true distribution $F$ as $n$ grows.

Note: the approximation of $F$ by $\hat{F}_n$, and hence the quality of the empirical estimates, is only good when the sample size $n$ is sufficiently large.
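For illustration, a minimal R sketch of this recipe (with simulated data, so the particular numbers and names are our own and not from the notes), estimating $E(x)$ and $E(x^2)$ by sample averages:

```r
set.seed(1)
x <- rnorm(1000, mean = 2, sd = 1)  # simulated sample from the "true" distribution F

mean(x)    # empirical estimate of E(x); true value is 2
mean(x^2)  # empirical estimate of E(x^2); true value is 2^2 + 1 = 5
```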
Empirical estimates of mean and covariance
Recall the definitions of the mean and covariance of a random vector $\boldsymbol x$:
$$ E(\boldsymbol x) = \boldsymbol\mu \quad\text{and}\quad \text{Var}(\boldsymbol x) = \boldsymbol\Sigma = E\!\left( (\boldsymbol x - \boldsymbol\mu)(\boldsymbol x - \boldsymbol\mu)^T \right) . $$
For the empirical estimate we replace the expectations by the corresponding sample averages.
The resulting estimators can be written in three different ways:
Vector notation:
$$ \hat{\boldsymbol\mu} = \frac{1}{n} \sum_{k=1}^{n} \boldsymbol x_k \quad\text{and}\quad \hat{\boldsymbol\Sigma} = \frac{1}{n} \sum_{k=1}^{n} (\boldsymbol x_k - \hat{\boldsymbol\mu})(\boldsymbol x_k - \hat{\boldsymbol\mu})^T . $$
Component notation: writing $x_{ki}$ for the $i$-th component of the $k$-th observation $\boldsymbol x_k$, the mean estimate is
$$ \hat\mu_i = \frac{1}{n} \sum_{k=1}^{n} x_{ki} . $$
The corresponding component notation for the covariance estimate is
$$ \hat\sigma_{ij} = \frac{1}{n} \sum_{k=1}^{n} (x_{ki} - \hat\mu_i)(x_{kj} - \hat\mu_j) . $$
Variance estimate:
$$ \hat\sigma_i^2 = \hat\sigma_{ii} = \frac{1}{n} \sum_{k=1}^{n} (x_{ki} - \hat\mu_i)^2 . $$
Data matrix notation:

The empirical mean and covariance can also be written in terms of the data matrix $\boldsymbol X$ collecting all $n$ observations.

If the data matrix follows the statistics convention, i.e. $\boldsymbol X$ has dimension $n \times d$ with the observations $\boldsymbol x_1^T, \ldots, \boldsymbol x_n^T$ as rows, then
$$ \hat{\boldsymbol\mu} = \frac{1}{n} \boldsymbol X^T \boldsymbol 1_n \quad\text{and}\quad \hat{\boldsymbol\Sigma} = \frac{1}{n} \left(\boldsymbol X - \boldsymbol 1_n \hat{\boldsymbol\mu}^T\right)^T \left(\boldsymbol X - \boldsymbol 1_n \hat{\boldsymbol\mu}^T\right) , $$
where $\boldsymbol 1_n$ is the vector of ones of length $n$.

On the other hand, if $\boldsymbol X$ is stored in the transposed $d \times n$ convention (observations as columns, common in machine learning), the corresponding formulas are obtained by first transposing $\boldsymbol X$.

To avoid confusion when using matrix or component notation you need to always state which convention is used! In these notes we exclusively follow the statistics convention.
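As a concrete illustration of these formulas, here is a short R sketch (simulated data; all variable names and settings are our own) computing the empirical mean and covariance from a data matrix in the statistics convention (rows = samples, columns = variables):

```r
set.seed(1)
n <- 50; d <- 3
X <- matrix(rnorm(n * d), nrow = n, ncol = d)  # statistics convention: n x d

mu.hat    <- colMeans(X)            # empirical mean vector
Xc        <- sweep(X, 2, mu.hat)    # centre each column
Sigma.hat <- crossprod(Xc) / n      # empirical covariance (factor 1/n)

# cross-check: R's cov() uses the unbiased factor 1/(n-1)
all.equal(Sigma.hat, cov(X) * (n - 1) / n)
```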
2.3 Maximum likelihood estimation
General principle
Maximum likelihood goes back to R.A. Fisher (1922)²: it provides model-based estimators, using the density or probability mass function $f(\boldsymbol x \,|\, \boldsymbol\theta)$ of the assumed data-generating model.
Log-likelihood function: observing data $D = \{\boldsymbol x_1, \ldots, \boldsymbol x_n\}$, assumed to be independent and identically distributed according to the model $f(\boldsymbol x \,|\, \boldsymbol\theta)$, the log-likelihood function is
$$ l_n(\boldsymbol\theta \,|\, D) = \sum_{i=1}^{n} \log f(\boldsymbol x_i \,|\, \boldsymbol\theta) . $$
Maximum likelihood estimate:
$$ \hat{\boldsymbol\theta}_{ML} = \arg\max_{\boldsymbol\theta} \; l_n(\boldsymbol\theta \,|\, D) . $$
Maximum likelihood (ML) finds the parameters that make the observed data most likely (it does not find the most probable model!)
Recall from MATH27720 Statistics 2 that maximum likelihood is closely linked to minimising the Kullback-Leibler (KL) divergence between the unknown true data-generating distribution and the assumed model family.
Correspondingly, the great appeal of maximum likelihood estimates (MLEs) is that they are optimal for large sample size $n$, i.e. when the number of observations is large relative to the number of model parameters.
Maximum likelihood estimates of the parameters of the multivariate normal distribution
We now derive the MLE of the parameters $\boldsymbol\mu$ and $\boldsymbol\Sigma$ of the multivariate normal distribution $N_d(\boldsymbol\mu, \boldsymbol\Sigma)$. The log-likelihood based on the observations $\boldsymbol x_1, \ldots, \boldsymbol x_n$ is
$$ l_n(\boldsymbol\mu, \boldsymbol\Sigma) = -\frac{n}{2} \log \det(2\pi\boldsymbol\Sigma) - \frac{1}{2} \sum_{k=1}^{n} (\boldsymbol x_k - \boldsymbol\mu)^T \boldsymbol\Sigma^{-1} (\boldsymbol x_k - \boldsymbol\mu) . $$
Setting the gradient with respect to $\boldsymbol\mu$ to zero yields
$$ \hat{\boldsymbol\mu}_{ML} = \frac{1}{n} \sum_{k=1}^{n} \boldsymbol x_k . $$
Next, to obtain the MLE for $\boldsymbol\Sigma$ we set the gradient of the log-likelihood with respect to $\boldsymbol\Sigma$ (equivalently, with respect to $\boldsymbol\Sigma^{-1}$) to zero, which gives
$$ \hat{\boldsymbol\Sigma}_{ML} = \frac{1}{n} \sum_{k=1}^{n} (\boldsymbol x_k - \hat{\boldsymbol\mu}_{ML})(\boldsymbol x_k - \hat{\boldsymbol\mu}_{ML})^T . $$
See the supplementary Matrix Refresher notes for the relevant formulas in vector and matrix calculus.
Therefore, the MLEs are identical to the empirical estimates.
Note the factor $\frac{1}{n}$ rather than $\frac{1}{n-1}$ in the ML estimate of the covariance matrix; as discussed below, this means the ML estimate is biased.
2.4 Sampling distribution of the empirical / maximum likelihood estimates
With observations $\boldsymbol x_1, \ldots, \boldsymbol x_n$ drawn independently from $N_d(\boldsymbol\mu, \boldsymbol\Sigma)$, the sampling distributions of the above estimates are known exactly:
1. Distribution of the estimate of the mean:
The empirical estimate of the mean is normally distributed:
$$ \hat{\boldsymbol\mu} \sim N_d\!\left( \boldsymbol\mu, \tfrac{1}{n}\boldsymbol\Sigma \right) . $$
2. Distribution of the covariance estimate:
The empirical and the unbiased estimates of the covariance matrix are Wishart distributed:
$$ n\, \hat{\boldsymbol\Sigma}_{ML} = \sum_{k=1}^{n} (\boldsymbol x_k - \hat{\boldsymbol\mu})(\boldsymbol x_k - \hat{\boldsymbol\mu})^T \sim W_d(\boldsymbol\Sigma, n-1) . $$
Case of unknown mean
Since $E(\hat{\boldsymbol\Sigma}_{ML}) = \frac{n-1}{n}\boldsymbol\Sigma$, the ML (= empirical) estimate of the covariance is biased.

Easy to make unbiased: rescale by the factor $\frac{n}{n-1}$, giving
$$ \hat{\boldsymbol\Sigma}_{UB} = \frac{n}{n-1}\, \hat{\boldsymbol\Sigma}_{ML} = \frac{1}{n-1} \sum_{k=1}^{n} (\boldsymbol x_k - \hat{\boldsymbol\mu})(\boldsymbol x_k - \hat{\boldsymbol\mu})^T . $$
Hence $E(\hat{\boldsymbol\Sigma}_{UB}) = \boldsymbol\Sigma$.
But unbiasedness of an estimator is not a very relevant criterion in multivariate statistics, especially when the number of samples is small compared to the dimension (see further below).
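A small Monte Carlo simulation in R (with settings chosen arbitrarily for illustration) confirms the bias of the ML estimate and the unbiasedness of the rescaled estimate:

```r
set.seed(1)
n <- 10; d <- 2; Sigma <- diag(d)         # true covariance (identity, for simplicity)
B <- 5000                                 # number of Monte Carlo repetitions

ml.sum <- ub.sum <- matrix(0, d, d)
for (b in 1:B) {
  X    <- matrix(rnorm(n * d), n, d)      # sample of size n from N_d(0, I)
  Xc   <- sweep(X, 2, colMeans(X))
  S.ml <- crossprod(Xc) / n               # ML estimate (factor 1/n)
  ml.sum <- ml.sum + S.ml
  ub.sum <- ub.sum + S.ml * n / (n - 1)   # unbiased estimate (factor 1/(n-1))
}
ml.sum / B   # approximately ((n-1)/n) * Sigma = 0.9 * I
ub.sum / B   # approximately Sigma = I
```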
Covariance estimator for known mean

If the mean $\boldsymbol\mu$ is known, the corresponding estimator
$$ \hat{\boldsymbol\Sigma} = \frac{1}{n} \sum_{k=1}^{n} (\boldsymbol x_k - \boldsymbol\mu)(\boldsymbol x_k - \boldsymbol\mu)^T $$
is already unbiased, and $n\hat{\boldsymbol\Sigma} \sim W_d(\boldsymbol\Sigma, n)$ with $n$ (rather than $n-1$) degrees of freedom.
2.5 Small sample estimation
Problems with maximum likelihood in small sample settings and high dimensions
Modern data is high dimensional!
Data sets with many more variables than observations ($d \gg n$) are now very common, because
- the number of measured variables is increasing quickly with technological advances (e.g. genomics),
- but the number of samples cannot be similarly increased (for cost and ethical reasons).
General problems of MLEs:
- ML estimators are optimal only if the sample size is large compared to the number of parameters. This optimality no longer holds if the sample size is moderate or smaller than the number of parameters.
- If there is not enough data the ML estimate overfits: it fits the current data very well, but the resulting model does not generalise, i.e. it will perform poorly in prediction.
- If there is a choice between different models with different complexity ML will always select the model with the largest number of parameters.
-> for high-dimensional data with small sample size maximum likelihood estimation does not work!!!
History of Statistics:
Much of modern statistics (from 1960 onwards) is devoted to the development of inference and estimation techniques that work with complex, high-dimensional data (cf. Figure 2.1).
- Maximum likelihood is a method from classical statistics (time up to about 1960).
- From 1960 modern (computational) statistics emerges, starting with the “Stein paradox” (1956): Charles Stein showed that in a multivariate setting ML estimators are dominated by (i.e. are always worse than) shrinkage estimators!
- For example, there is a shrinkage estimator for the mean that is better (in terms of MSE) than the average (which is the MLE)!
Modern statistics has developed many different (but related) methods for use in high-dimensional, small sample settings:
- regularised estimators
- shrinkage estimators
- penalised maximum likelihood estimators
- Bayesian estimators
- Empirical Bayes estimators
- KL / entropy-based estimators
Most of this is out of scope for our class, but will be covered in advanced statistical courses.
Next, we describe a simple regularised estimator for the estimation of the covariance that we will use later (i.e. in classification).
Estimation of covariance matrix in small sample settings
Problems with the ML estimate of $\boldsymbol\Sigma$ in small samples:
- $\boldsymbol\Sigma$ has $O(d^2)$ parameters (precisely $\frac{d(d+1)}{2}$), so estimating it requires a lot of data!
- If the sample size $n$ is smaller than the dimension $d$ then $\hat{\boldsymbol\Sigma}_{ML}$ is only positive semi-definite (even if the true $\boldsymbol\Sigma$ is positive definite!).
- $\hat{\boldsymbol\Sigma}_{ML}$ will then have vanishing eigenvalues (some $\lambda_i = 0$) and thus cannot be inverted, i.e. it is singular!
Note that in many expressions in multivariate statistics we actually need the inverse of the covariance matrix, e.g. in the density of the multivariate normal distribution, so it is essential to obtain a non-singular, invertible estimate of the covariance matrix.
Making the ML estimate of the covariance matrix invertible

There is a simple numerical trick, credited to A. N. Tikhonov, to make $\hat{\boldsymbol\Sigma}_{ML}$ invertible: add a small positive constant $\varepsilon > 0$ to the diagonal, i.e. use $\hat{\boldsymbol\Sigma}_{ML} + \varepsilon \boldsymbol I_d$.

The resulting matrix is positive definite and therefore invertible.

However, while this simple regularisation results in an invertible matrix, the estimator itself has not improved over the MLE, and the matrix $\hat{\boldsymbol\Sigma}_{ML} + \varepsilon \boldsymbol I_d$ remains a poor estimate of $\boldsymbol\Sigma$ when the sample size is small.
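The following R sketch (simulated data with $d > n$ and an arbitrarily chosen $\varepsilon$) illustrates that the ML estimate is singular in this setting and that the diagonal correction makes it invertible:

```r
set.seed(1)
n <- 10; d <- 20                           # more variables than samples
X    <- matrix(rnorm(n * d), n, d)
Xc   <- sweep(X, 2, colMeans(X))
S.ml <- crossprod(Xc) / n                  # ML estimate of the covariance

min(eigen(S.ml, symmetric = TRUE)$values)  # numerically zero: S.ml is singular

eps   <- 0.1
S.tik <- S.ml + eps * diag(d)              # Tikhonov-style diagonal regularisation
min(eigen(S.tik, symmetric = TRUE)$values) # approximately eps: positive definite
S.inv <- solve(S.tik)                      # inversion now works
```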
Simple regularised estimate of the covariance matrix

Regularised estimator:
$$ \hat{\boldsymbol\Sigma}_{\text{reg}} = \lambda\, \boldsymbol T + (1-\lambda)\, \hat{\boldsymbol\Sigma}_{ML} . $$
Regularisation: $\boldsymbol T$ is a prespecified target matrix, for example the diagonal matrix $\boldsymbol T = \operatorname{diag}(\hat\sigma_{11}, \ldots, \hat\sigma_{dd})$ containing the empirical variances, and $\lambda \in [0, 1]$ is the shrinkage intensity.

Idea: choose $\lambda$ to balance the unrestricted but noisy ML estimate against the restricted but stable target; for any $\lambda > 0$ (and positive definite target) the regularised estimate is positive definite and hence invertible.
This form of estimator corresponds to computing the mean of a Bayesian posterior, directly shrinking the MLE towards a prior mean (the target):
- Prior information helps to infer $\boldsymbol\Sigma$ even in small samples.
- It is also called a shrinkage estimator since (for a diagonal target) the off-diagonal entries are shrunk towards zero.
- this type of linear shrinkage/regularisation is natural for exponential family models (Diaconis and Ylvisaker, 1979).
- Instead of a diagonal target other options are possible, e.g. block-diagonal or banded covariances.
- If $\lambda$ is not prespecified but learned from the data (see below) then the resulting estimate is an empirical Bayes estimator.
- The resulting estimate will typically be biased, since mixing in the target introduces bias (in exchange for reduced variance).
How to find the optimal shrinkage / regularisation parameter
One way to do this is to choose $\lambda$ such that the resulting estimate $\hat{\boldsymbol\Sigma}_{\text{reg}}$ has minimum mean squared error (MSE) with respect to the true covariance matrix $\boldsymbol\Sigma$.

Bias-variance trade-off: $\text{MSE} = \text{bias}^2 + \text{variance}$; increasing $\lambda$ increases the bias but decreases the variance, and the optimal $\lambda$ minimises the combined error.

Challenge: since we don't know the true $\boldsymbol\Sigma$ the MSE cannot be computed directly, so the optimal $\lambda$ has to be estimated from the data, for example
- by cross-validation (=resampling procedure)
- by using some analytic approximation (e.g. Stein’s formula)
In Worksheet 3 the empirical estimator of the covariance is compared with the regularised covariance estimator implemented in the R package “corpcor”. This uses a regularisation similar to the one above (but applied to the correlation rather than the covariance matrix) and it employs an analytic, data-adaptive estimate of the shrinkage intensity $\lambda$.
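As a rough sketch (not the exact corpcor procedure), the following R code builds the regularised estimate with a diagonal target and a fixed, hand-picked shrinkage intensity; the commented lines indicate how the data-adaptive corpcor estimator from Worksheet 3 would be called instead:

```r
set.seed(1)
n <- 10; d <- 20
X    <- matrix(rnorm(n * d), n, d)
Xc   <- sweep(X, 2, colMeans(X))
S.ml <- crossprod(Xc) / n                  # ML estimate, singular since d > n

Target <- diag(diag(S.ml))                 # diagonal target: keep variances, zero covariances
lambda <- 0.3                              # shrinkage intensity, fixed by hand for illustration
S.reg  <- lambda * Target + (1 - lambda) * S.ml

min(eigen(S.reg, symmetric = TRUE)$values) # positive: S.reg is invertible

# Data-adaptive alternative used in Worksheet 3 (assumes the corpcor package is installed):
# library(corpcor)
# S.cs <- cov.shrink(X)   # shrinkage estimate with lambda estimated from the data
```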
2.6 Full Bayesian multivariate modelling
See also the section about multivariate distributions in the Probability and Distribution refresher for details about the distributions used below.
Three main scenarios
As discussed in MATH27720 Statistics 2 there are three main Bayesian models in the univariate case that cover a large range of applications:
- the beta-binomial model to estimate proportions
- the normal-normal model to estimate means
- the inverse gamma-normal model to estimate variances
Below we briefly sketch the extensions of these three elementary models to the multivariate case.
Dirichlet-multinomial model
This generalises the univariate beta-binomial model.
The Dirichlet distribution is useful as conjugate prior and posterior distribution for the parameters of a categorical distribution.
Data: $n$ observations of a categorical variable with $K$ possible classes, summarised by the class counts $n_1, \ldots, n_K$ with $\sum_{k=1}^{K} n_k = n$.

MLE: $\hat\pi_k^{ML} = \frac{n_k}{n}$.

Prior parameters (Dirichlet in mean parametrisation): prior mean $\boldsymbol\pi_0$ and prior concentration $k_0$, i.e. $\boldsymbol\pi \sim \text{Dir}(\boldsymbol\alpha_0)$ with $\boldsymbol\alpha_0 = k_0\, \boldsymbol\pi_0$.

Posterior parameters: posterior mean $\boldsymbol\pi_1 = \lambda\, \boldsymbol\pi_0 + (1-\lambda)\, \hat{\boldsymbol\pi}^{ML}$ and posterior concentration $k_1 = k_0 + n$, with $\lambda = \frac{k_0}{k_0 + n}$.

Equivalent update rule (for the Dirichlet with $\boldsymbol\alpha$ parameter): $\boldsymbol\alpha_1 = \boldsymbol\alpha_0 + (n_1, \ldots, n_K)$.
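A minimal R sketch of this update (the counts and prior settings below are invented for illustration):

```r
K  <- 3
nk <- c(12, 5, 3)      # observed counts per category, n = 20
n  <- sum(nk)

pi.ml <- nk / n        # MLE of the class probabilities

# prior in mean parametrisation: prior mean pi0 and concentration k0
pi0 <- rep(1 / K, K)
k0  <- 6
alpha0 <- k0 * pi0     # equivalent Dirichlet alpha parameter

alpha1 <- alpha0 + nk  # posterior Dirichlet parameter
k1     <- k0 + n       # posterior concentration
pi1    <- alpha1 / k1  # posterior mean

lambda <- k0 / (k0 + n)
all.equal(pi1, lambda * pi0 + (1 - lambda) * pi.ml)  # TRUE: shrinkage of MLE towards pi0
```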
Multivariate normal-normal model
This generalises the univariate normal-normal model.
The multivariate normal distribution is useful as conjugate prior and posterior distribution of the mean:
Data: $\boldsymbol x_1, \ldots, \boldsymbol x_n \sim N_d(\boldsymbol\mu, \boldsymbol\Sigma)$ with known covariance $\boldsymbol\Sigma$.

MLE: $\hat{\boldsymbol\mu}_{ML} = \bar{\boldsymbol x} = \frac{1}{n}\sum_{k=1}^{n} \boldsymbol x_k$.

Prior parameters: prior mean $\boldsymbol\mu_0$ and prior concentration $k_0$, i.e. $\boldsymbol\mu \sim N_d\!\left(\boldsymbol\mu_0, \tfrac{1}{k_0}\boldsymbol\Sigma\right)$.

Posterior parameters: $\boldsymbol\mu \,|\, D \sim N_d\!\left(\boldsymbol\mu_1, \tfrac{1}{k_1}\boldsymbol\Sigma\right)$ with $k_1 = k_0 + n$ and $\boldsymbol\mu_1 = \lambda\, \boldsymbol\mu_0 + (1-\lambda)\, \hat{\boldsymbol\mu}_{ML}$, where $\lambda = \frac{k_0}{k_0+n}$.
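A corresponding R sketch for the posterior update of the mean (simulated data; the prior mean, the concentration $k_0$ and the known $\boldsymbol\Sigma$ are all chosen arbitrarily, and the prior covariance is taken to be $\boldsymbol\Sigma/k_0$ as above):

```r
set.seed(1)
d <- 2; n <- 20
Sigma <- diag(d)                           # known covariance (identity here)
X <- matrix(rnorm(n * d, mean = 1), n, d)  # simulated data with true mean (1, 1)

mu.ml <- colMeans(X)                       # MLE of the mean

mu0 <- rep(0, d)                           # prior mean
k0  <- 5                                   # prior concentration
k1  <- k0 + n                              # posterior concentration
lambda <- k0 / k1

mu1    <- lambda * mu0 + (1 - lambda) * mu.ml  # posterior mean: MLE shrunk towards mu0
Sigma1 <- Sigma / k1                           # posterior covariance of mu
```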
Inverse Wishart-normal model
This generalises the univariate inverse gamma-normal model for the variance.
The inverse Wishart distribution is useful as conjugate prior and posterior distribution of the covariance:
Data: $\boldsymbol x_1, \ldots, \boldsymbol x_n \sim N_d(\boldsymbol\mu, \boldsymbol\Sigma)$ with known mean $\boldsymbol\mu$.

MLE: $\hat{\boldsymbol\Sigma}_{ML} = \frac{1}{n}\sum_{k=1}^{n} (\boldsymbol x_k - \boldsymbol\mu)(\boldsymbol x_k - \boldsymbol\mu)^T$.

Prior parameters: scale matrix $\boldsymbol\Psi_0$ and degrees of freedom $\nu_0$, i.e. $\boldsymbol\Sigma \sim W_d^{-1}(\boldsymbol\Psi_0, \nu_0)$, with prior mean $E(\boldsymbol\Sigma) = \frac{\boldsymbol\Psi_0}{\nu_0 - d - 1}$ (for $\nu_0 > d+1$).

Posterior parameters: $\boldsymbol\Sigma \,|\, D \sim W_d^{-1}(\boldsymbol\Psi_1, \nu_1)$ with $\nu_1 = \nu_0 + n$ and $\boldsymbol\Psi_1 = \boldsymbol\Psi_0 + n\, \hat{\boldsymbol\Sigma}_{ML}$.

Equivalent update rule: the posterior mean is a linear shrinkage of the MLE towards the prior mean,
$$ E(\boldsymbol\Sigma \,|\, D) = \frac{\boldsymbol\Psi_1}{\nu_1 - d - 1} = \lambda\, E(\boldsymbol\Sigma) + (1-\lambda)\, \hat{\boldsymbol\Sigma}_{ML} , \quad\text{with } \lambda = \frac{\nu_0 - d - 1}{\nu_0 + n - d - 1} . $$
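Similarly, a short R sketch of the covariance update (simulated data, with made-up prior scale matrix and degrees of freedom):

```r
set.seed(1)
d <- 2; n <- 30
mu <- rep(0, d)                          # known mean
X  <- matrix(rnorm(n * d), n, d)

S.ml <- crossprod(sweep(X, 2, mu)) / n   # MLE of Sigma for known mean

Psi0 <- diag(d); nu0 <- d + 3            # prior scale matrix and degrees of freedom
Psi1 <- Psi0 + n * S.ml                  # posterior scale matrix
nu1  <- nu0 + n                          # posterior degrees of freedom

Sigma.post.mean <- Psi1 / (nu1 - d - 1)  # posterior mean of Sigma

lambda     <- (nu0 - d - 1) / (nu0 + n - d - 1)
prior.mean <- Psi0 / (nu0 - d - 1)
all.equal(Sigma.post.mean, lambda * prior.mean + (1 - lambda) * S.ml)  # TRUE
```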
2.7 Conclusion
- Multivariate models are often high-dimensional with large number of parameters but often only a small number of samples are available. In this instance it is useful (and often necessary) to introduce additional information (via priors or by regularisation).
- Unbiased estimation, a highly valued property in classical univariate statistics when sample size is large and number of parameters is small, is typically not a good idea in multivariate settings and often leads to poor estimators.
- Regularisation introduces bias and reduces variance, and with a well-chosen regularisation parameter this lowers the overall MSE. Likewise, Bayesian estimators introduce bias and regularise (via the prior) and thus are also useful in multivariate settings.
Efron, B. 1979. Bootstrap methods: Another look at the jackknife. The Annals of Statistics 7:1–26. https://doi.org/10.1214/aos/1176344552↩︎
Fisher, R. A. 1922. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society A 222:309–368. https://doi.org/10.1098/rsta.1922.0009↩︎