Appendix A — Statistics refresher
Below you will find a brief overview of some relevant concepts in statistics that you should be familiar with from earlier modules.
A.1 Data and statistics as functions of data
Broadly, by “data” we refer to quantitative observations and measurements collected in experiments.
We denote the observed data by \(D =\{x_1, \ldots, x_n\}\) where \(n\) denotes the number of data points (the sample size). Each data point can be a scalar or a multivariate quantity.
Generally, a statistic \(t(D)\) is a function of the observed data \(D\). The statistic \(t(D)\) can be of any type and value (scalar, vector, matrix, etc., or even a function). \(t(D)\) is called a summary statistic if it describes important aspects of the data such as location (e.g. the average \(\text{avg}(D) =\bar{x}\), the median) or scale (e.g. standard deviation, interquartile range).
A.2 Statistical learning
The aim in statistics, data science and machine learning is to use data to learn about and understand the world using models. In statistics we employ probabilistic models.
We denote data models by \(p(x| \theta)\) where \(\theta\) represents the parameters of the model. Often (but not always) \(\theta\) can be interpreted as, or is associated with, some manifest property of the model. If there is only a single parameter we write \(\theta\) (scalar parameter). If we wish to highlight that there are multiple parameters we write \(\symbfit \theta\) (in bold type).
Specifically, our aim is to identify the best model(s) for the data in order to both
- explain the current data, and
- to enable good prediction of future data.
Note that it is generally easy to find one or several models that explain the current data, but such models often do not predict well.
Therefore, one would like to avoid overfitting the data and identify models that are appropriate for the data at hand (i.e. not too simple but also not too complex).
Typically, we restrict the analysis to a specific model family with some parameter \(\theta\).
An estimator for \(\theta\) is a function \(\hat{\theta}(D)\) of the data that maps the data (input) to an informed guess (output) about \(\theta\).
- A point estimator provides a single number for each parameter.
- An interval estimator provides a set of possible values for each parameter.
Interval estimators can be linked to the concept of testing specified values for a parameter. Specifically, a confidence interval contains all parameter values that are not significantly different from the point estimate.
A.3 Sampling properties of a point estimator
A point estimator \(\hat\theta\) depends on the data, hence it exhibits sampling variation, i.e. the estimate will differ for a new set of observations.
Thus \(\hat\theta\) can be seen as a random variable, and its distribution (across different experiments) is called the sampling distribution.
Properties of this distribution can be used to evaluate how far the estimator deviates (on average across different experiments) from the true value:
\[\begin{align*}
\text{Bias:} & & \text{Bias}(\hat{\theta}) &= \text{E}(\hat{\theta})-\theta \\
\text{Variance:} & & \text{Var}(\hat{\theta}) &= \text{E}\left((\hat{\theta}-\text{E}(\hat{\theta}))^2\right) \\
\text{Mean squared error:} & & \text{MSE}(\hat{\theta}) &= \text{E}\left((\hat{\theta}-\theta)^2\right) \\
& & &= \text{Var}(\hat{\theta})+\text{Bias}(\hat{\theta})^2
\end{align*}\]
The last identity about MSE follows from \(\text{E}(x^2)=\text{Var}(x)+\text{E}(x)^2\).
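In detail, applying this identity to \(\hat{\theta}-\theta\) (where \(\theta\) is a fixed constant, so that \(\text{Var}(\hat{\theta}-\theta)=\text{Var}(\hat{\theta})\)) gives \[ \text{MSE}(\hat{\theta}) = \text{E}\left((\hat{\theta}-\theta)^2\right) = \text{Var}(\hat{\theta}-\theta) + \text{E}(\hat{\theta}-\theta)^2 = \text{Var}(\hat{\theta}) + \text{Bias}(\hat{\theta})^2 \]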
At first sight it seems desirable to focus on unbiased (for finite \(n\)) estimators. However, requiring strict unbiasedness is not always a good idea. In many situations it is better to allow for some small bias in order to achieve a smaller variance and an overall smaller MSE. This is called the bias-variance tradeoff, as more bias is traded for smaller variance (or, conversely, less bias is traded for higher variance).
A.4 Efficiency and consistency of an estimator
Typically, \(\text{Bias}\), \(\text{Var}\) and \(\text{MSE}\) all decrease with increasing sample size so that with more data \(n \to \infty\) the errors become smaller and smaller.
Efficiency: An estimator \(\hat\theta_A\) is said to be more efficient than an estimator \(\hat\theta_B\) if for the same sample size \(n\) it has a smaller error (e.g. MSE) than the competing estimator.
The typical rate of decrease in variance of a good estimator is \(\frac{1}{n}\) and the rate of decrease in the standard deviation is \(\frac{1}{\sqrt{n}}\). Note that this implies that to get one digit more accuracy in an estimate (standard deviation decreasing by factor of 10) we need 100 times more data!
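As a small illustration, this \(1/\sqrt{n}\) rate can be checked by simulation in R; the sample sizes, the number of replications and the standard normal data below are arbitrary choices for this sketch:

```r
# Sketch: check that the standard deviation of the sample mean
# decreases at rate 1/sqrt(n) (simulation settings are arbitrary)
set.seed(1)
B <- 10000                                   # simulated data sets per sample size
for (n in c(10, 100, 1000)) {
  xbar <- replicate(B, mean(rnorm(n)))       # sample means from standard normal data
  cat("n =", n,
      " empirical sd of mean =", round(sd(xbar), 4),
      " theoretical 1/sqrt(n) =", round(1/sqrt(n), 4), "\n")
}
```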
Consistency: \(\hat{\theta}\) is called consistent if \[ \text{MSE}(\hat{\theta}) \longrightarrow 0 \text{ with $n\rightarrow \infty$ } \]
Consistency is an essential but rather weak requirement for any reasonable estimator. Of all consistent estimators we typically select the estimators that are most efficient (i.e. with the fastest decrease in MSE) and that therefore have the smallest variance and/or MSE for given finite \(n\).
Consistency implies we recover the true model in the limit of infinite data if the model class contains the true data generating model. If the model class does not contain the true model then strict consistency cannot be achieved but we still wish to get as close as possible to the true model when choosing model parameters.
A.5 Empirical distribution function
Suppose we observe data \(D=\{x_1, \ldots, x_n\}\) with each \(x_i \sim F\) sampled independently and identically. The empirical cumulative distribution function \(\hat{F}_n(x)\) based on data \(D\) is then given by \[ \hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^n 1_{x_i \leq x} \]
The empirical distribution function is monotonically non-decreasing from 0 to 1 in discrete steps.
In R the empirical distribution function is computed by the function ecdf().
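For example, a minimal sketch using simulated standard normal data:

```r
# Sketch: empirical cumulative distribution function of a simulated sample
set.seed(1)
x <- rnorm(20)                    # 20 standard normal observations (made up for illustration)
Fn <- ecdf(x)                     # step function that jumps by 1/n at each observation
Fn(0)                             # proportion of observations <= 0
plot(Fn)                          # non-decreasing step function from 0 to 1
curve(pnorm(x), add = TRUE, col = "red")   # true distribution F for comparison
```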
Crucially, the empirical distribution \(\hat{F}_n\) converges strongly (almost surely) to the underlying distribution \(F\) as \(n \rightarrow \infty\): \[ \hat{F}_n\overset{a. s.}{\to} F \] The Glivenko–Cantelli theorem additionally asserts that the convergence is uniform.
Note this is effectively a variant of the law of large numbers applied to the whole distribution, rather than just the mean (see below).
As a result, we may use the empirical distribution \(\hat{F}_n\) based on data \(D\) as an estimate of the underlying unknown true distribution \(F\). From the convergence theorems we know that \(\hat{F}_n\) is consistent.
However, for \(\hat{F}_n\) to work well as an estimate of \(F\) the number of observations \(n\) must be sufficiently large so that the approximation provided by \(\hat{F}_n\) is adequate.
A.6 Empirical estimators
The fact that for large sample size \(n\) the empirical distribution \(\hat{F}_n\) may be used as a substitute for the unknown \(F\) allows us to easily construct empirical estimators.
Specifically, parameters of a model can typically be expressed as a functional of the distribution \(\theta = g(F)\). An empirical estimator \(\hat{\theta}\) is constructed by substituting the true distribution by the empirical distribution \(\hat{\theta}= g( \hat{F}_n )\).
An example is the mean \(\text{E}_F(x)\) with regard to \(F\). The empirical mean is the expectation with regard to the empirical distribution which equals the average of the samples: \[ \hat{\text{E}}(x) = \hat{\mu} = \text{E}_{\hat{F}_n}(x) = \frac{1}{n} \sum_{i=1}^n x_i = \bar{x} \]
Similarly, other empirical estimators can be constructed simply by replacing the expectation in the definition of the quantity of interest by the sample average. For example, the empirical variance with unknown mean is given by \[ \widehat{\text{Var}}(x) = \widehat{\sigma^2} = \text{E}_{\hat{F}_n}((x - \hat{\mu})^2) = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2 \] Note the factor \(1/n\) before the summation sign. We can also write the empirical variance in terms of \(\overline{x^2} =\frac{1}{n}\sum^{n}_{k=1} x_k^2\) as \[ \widehat{\text{Var}}(x) = \overline{x^2} - \bar{x}^2 \]
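A minimal sketch in R, using simulated data (the sample size and distribution are arbitrary), showing that the two expressions for the empirical variance agree:

```r
# Sketch: empirical mean and empirical variance (factor 1/n) computed directly
set.seed(1)
x <- rnorm(50, mean = 2, sd = 3)          # simulated data for illustration
n <- length(x)
mu.hat   <- mean(x)                       # empirical mean
var.hat  <- mean((x - mu.hat)^2)          # empirical variance, factor 1/n
var.hat2 <- mean(x^2) - mean(x)^2         # same quantity via mean of squares minus squared mean
c(var.hat, var.hat2)                      # both agree; note that var(x) uses 1/(n-1) instead
```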
By construction, as a result of the strong convergence of \(\hat{F}_n\) to \(F\) empirical estimators are consistent, with their MSE, variance and bias all decreasing to zero with large sample size \(n\). However, for finite sample size they do have a finite variance and may also be biased.
For example, the empirical variance given above is biased with \(\text{Bias}(\widehat{\sigma^2}) = -\sigma^2/n\). Note this bias decreases with \(n\). An unbiased estimator can be obtained by rescaling the empirical estimator by the factor \(n/(n-1)\): \[ \widehat{\sigma^2}_{\text{UB}} = \frac{1}{n-1} \sum_{i=1}^n (x_i -\bar{x})^2 \]
The empirical estimators for the mean and variance can also be obtained for random vectors \(\symbfit x\). In this case the data \(D = \{\symbfit x_1, \ldots, \symbfit x_n \}\) is comprised of \(n\) vector-valued observations.
For the mean we get \[ \hat{\symbfit \mu} = \frac{1}{n}\sum^{n}_{k=1} \symbfit x_k = \bar{\symbfit x} \] and for the covariance \[\widehat{\symbfit \Sigma} = \frac{1}{n}\sum^{n}_{k=1} \left(\symbfit x_k-\bar{\symbfit x}\right) \; \left(\symbfit x_k-\bar{\symbfit x}\right)^T\] Note the factor \(\frac{1}{n}\) in the estimator of the covariance matrix.
With \(\overline{\symbfit x\symbfit x^T} = \frac{1}{n}\sum^{n}_{k=1} \symbfit x_k \symbfit x_k^T\) we can also write \[ \widehat{\symbfit \Sigma} = \overline{\symbfit x\symbfit x^T} - \bar{\symbfit x} \bar{\symbfit x}^T \]
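A corresponding sketch in R for vector-valued data (the dimensions and the simulated data are arbitrary); note that R's built-in cov() uses the factor \(1/(n-1)\) rather than \(1/n\):

```r
# Sketch: empirical mean vector and covariance matrix with factor 1/n
set.seed(1)
X <- matrix(rnorm(100 * 3), nrow = 100, ncol = 3)   # n = 100 observations of a 3-vector
n <- nrow(X)
mu.hat <- colMeans(X)                               # empirical mean vector
Xc <- sweep(X, 2, mu.hat)                           # centre each column
Sigma.hat <- crossprod(Xc) / n                      # empirical covariance, factor 1/n
all.equal(Sigma.hat, cov(X) * (n - 1) / n)          # TRUE: cov() rescaled by (n-1)/n
```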
A.7 Law of large numbers
The law of large numbers was discovered by Jacob Bernoulli (1655-1705) and states that the sample average converges to the mean (expectation) as the sample size increases.
Since \(\hat{F}_n\) converges strongly to \(F\) as \(n \rightarrow \infty\) there is a corresponding convergence of the average \(\text{E}_{\hat{F}_n}(h(x)) = \frac{1}{n} \sum_{i=1}^n h(x_i)\) to the expectation \(\text{E}_{F}(h(x))\).
In other words, if the mean exists then for sufficiently large \(n\) it can be replaced by the empirical mean.
Likewise, the law of large numbers can be applied to empirical estimators to show that they will converge to the corresponding true quantities for sufficiently large \(n\).
Furthermore, one may use the law of large numbers as a justification to interpret large-sample limits of frequencies as probabilities. However, the converse, namely requiring that all probabilities must have a frequentist interpretation, does not follow from the law of large numbers or from the axioms of probability.
Finally, it is worth pointing out that the law of large numbers says nothing about the finite sample properties of estimators.
A.8 Sampling distribution of mean and variance estimators for normal data
If the underlying distribution family of \(D = \{x_1, \ldots, x_n\}\) is known we can often obtain the exact distribution of an estimator.
For example, assuming a normal distribution \(N(\mu, \sigma^2)\) we can derive the sampling distributions of the empirical mean and variance:
The empirical estimator of the mean parameter \(\mu\) is given by \(\hat{\mu} = \frac{1}{n} \sum_{i=1}^n x_i\). Under the normal assumption the distribution of \(\hat{\mu}\) is \[ \hat{\mu} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \] Thus \(\text{E}(\hat{\mu}) = \mu\) and \(\text{Var}(\hat{\mu}) = \frac{\sigma^2}{n}\). The estimate \(\hat{\mu}\) is unbiased as \(\text{E}(\hat{\mu})-\mu = 0\). The mean squared error of \(\hat{\mu}\) is \(\text{MSE}(\hat{\mu}) = \frac{\sigma^2}{n}\).
The empirical (ML) variance estimate \(\widehat{\sigma^2}_{\text{ML}} = \frac{1}{n} \sum_{i=1}^n (x_i -\bar{x})^2\) for normal data follows a one-dimensional Wishart distribution \[ \widehat{\sigma^2}_{\text{ML}} \sim W_1\left(s^2 = \frac{\sigma^2}{n}, k=n-1\right) \] Thus, \(\text{E}( \widehat{\sigma^2}_{\text{ML}} ) = \frac{n-1}{n}\sigma^2\) and \(\text{Var}( \widehat{\sigma^2}_{\text{ML}} ) = \frac{2(n-1)}{n^2}\sigma^4\). The estimate \(\widehat{\sigma^2}_{\text{ML}}\) is biased since \(\text{E}( \widehat{\sigma^2}_{\text{ML}} )-\sigma^2 = -\frac{1}{n}\sigma^2\). The mean squared error is \(\text{MSE}( \widehat{\sigma^2}_{\text{ML}}) = \frac{2(n-1)}{n^2}\sigma^4 +\frac{1}{n^2}\sigma^4 =\frac{2 n-1}{n^2}\sigma^4\).
The unbiased variance estimate \(\widehat{\sigma^2}_{\text{UB}} = \frac{1}{n-1} \sum_{i=1}^n (x_i -\bar{x})^2\) for normal data follows a one-dimensional Wishart distribution \[ \widehat{\sigma^2}_{\text{UB}} \sim W_1\left(s^2 = \frac{\sigma^2}{n-1}, k = n-1 \right) \] Thus, \(\text{E}( \widehat{\sigma^2}_{\text{UB}} ) = \sigma^2\) and \(\text{Var}( \widehat{\sigma^2}_{\text{UB}} ) = \frac{2}{n-1}\sigma^4\). The estimate \(\widehat{\sigma^2}_{\text{UB}}\) is unbiased since \(\text{E}( \widehat{\sigma^2}_{\text{UB}} )-\sigma^2 =0\). The mean squared error is \(\text{MSE}( \widehat{\sigma^2}_{\text{UB}} ) =\frac{2}{n-1}\sigma^4\).
Interestingly, for any \(n>1\) we find that \(\text{Var}( \widehat{\sigma^2}_{\text{UB}} ) > \text{Var}( \widehat{\sigma^2}_{\text{ML}} )\) and \(\text{MSE}( \widehat{\sigma^2}_{\text{UB}} ) > \text{MSE}( \widehat{\sigma^2}_{\text{ML}} )\) so that the biased empirical estimator has both lower variance and lower mean squared error than the unbiased estimator.
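These sampling properties can be verified by a simple Monte Carlo simulation in R; the sample size, variance and number of replications below are arbitrary choices for this sketch:

```r
# Sketch: Monte Carlo comparison of the biased (1/n) and unbiased (1/(n-1))
# variance estimators for normal data
set.seed(1)
n <- 10; sigma2 <- 4; B <- 100000
est <- replicate(B, {
  x <- rnorm(n, mean = 0, sd = sqrt(sigma2))
  c(ML = mean((x - mean(x))^2),    # empirical (ML) estimate, factor 1/n
    UB = var(x))                   # unbiased estimate, factor 1/(n-1)
})
rowMeans(est) - sigma2             # bias: ML close to -sigma2/n, UB close to 0
apply(est, 1, var)                 # variance: ML smaller than UB
rowMeans((est - sigma2)^2)         # MSE: ML smaller than UB
```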
A.9 \(t\)-statistics
One sample \(t\)-statistic
Suppose we observe \(n\) independent data points \(x_1, \ldots, x_n \sim N(\mu, \sigma^2)\). Then the average \(\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i\) is distributed as \(\bar{x} \sim N(\mu, \sigma^2/n)\) and correspondingly \[ z = \frac{\bar{x}-\mu}{\sqrt{\sigma^2/n}} \sim N(0, 1) \]
Note that \(z\) uses the known variance \(\sigma^2\).
If the variance is unknown and is estimated by the unbiased \(s^2_{\text{UB}} = \frac{1}{n-1} \sum_{i=1}^n (x_i -\bar{x})^2\) then one arrives at the one sample \(t\)-statistic \[ t_{\text{UB}} = \frac{\bar{x}-\mu}{\sqrt{s^2_{\text{UB}}/n}} \sim t_{n-1} \,. \] It is distributed according to a Student’s \(t\)-distribution with \(n-1\) degrees of freedom, with mean 0 for \(n>2\) and variance \((n-1)/(n-3)\) for \(n>3\).
If instead of the unbiased estimate the empirical (ML) estimate of the variance \(s^2_{\text{ML}} = \frac{1}{n} \sum_{i=1}^n (x_i -\bar{x})^2 = \frac{n-1}{n} s^2_{\text{UB}}\) is used then this leads to a slightly different statistic \[ t_{\text{ML}} = \frac{\bar{x}-\mu}{ \sqrt{ s^2_{\text{ML}}/n}} = \sqrt{\frac{n}{n-1}} t_{\text{UB}} \] with \[ t_{\text{ML}} \sim \text{lst}\left(0, \tau^2=\frac{n}{n-1}, n-1\right) \] Thus, \(t_{\text{ML}}\) follows a location-scale \(t\)-distribution, with mean 0 for \(n>2\) and variance \(n/(n-3)\) for \(n>3\).
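A minimal sketch in R with simulated data (the sample size, mean and null value \(\mu_0 = 0\) are arbitrary), confirming that \(t_{\text{UB}}\) matches the statistic reported by t.test() and that \(t_{\text{ML}} = \sqrt{n/(n-1)}\, t_{\text{UB}}\):

```r
# Sketch: one sample t-statistic with unbiased and ML variance estimates
set.seed(1)
x <- rnorm(12, mean = 0.5, sd = 1)    # simulated data for illustration
n <- length(x); mu0 <- 0              # hypothesised mean (arbitrary choice)
t.ub <- (mean(x) - mu0) / sqrt(var(x) / n)   # uses the 1/(n-1) variance estimate
t.ml <- sqrt(n / (n - 1)) * t.ub             # equivalent to using the 1/n estimate
c(t.ub, t.test(x, mu = mu0)$statistic)       # t.ub agrees with R's t.test
```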
Two sample \(t\)-statistic with common variance
Now suppose we observe normal data \(D = \{x_1, \ldots, x_n\}\) from two groups with sample size \(n_1\) and \(n_2\) (and \(n=n_1+n_2\)) with two different means \(\mu_1\) and \(\mu_2\) and common variance \(\sigma^2\): \[x_1,\dots,x_{n_1} \sim N(\mu_1, \sigma^2)\] and \[x_{n_1+1},\dots,x_{n} \sim N(\mu_2, \sigma^2)\] Then \(\hat{\mu}_1 = \frac{1}{n_1}\sum^{n_1}_{i=1}x_i\) and \(\hat{\mu}_2 = \frac{1}{n_2}\sum^{n}_{i=n_1+1}x_i\) are the sample averages within each group.
The common variance \(\sigma^2\) may be estimated either by the unbiased estimate \(s^2_{\text{UB}} = \frac{1}{n-2} \left(\sum^{n_1}_{i=1}(x_i-\hat{\mu}_1)^2+\sum^n_{i=n_1+1}(x_i-\hat{\mu}_2)^2\right)\) (note the factor \(n-2\)) or by the empirical (ML) estimate \(s^2_{\text{ML}} = \frac{1}{n} \left(\sum^{n_1}_{i=1}(x_i-\hat{\mu}_1)^2+\sum^n_{i=n_1+1}(x_i-\hat{\mu}_2)^2\right) =\frac{n-2}{n} s^2_{\text{UB}}\). The estimator for the common variance is often referred to as a pooled variance estimate, as information is pooled from both groups to obtain the estimate.
Using the unbiased pooled variance estimate the two sample \(t\)-statistic is given by \[ t_{\text{UB}} = \frac{\hat{\mu}_1-\hat{\mu}_2}{ \sqrt{ \left(\frac{1}{n_1}+\frac{1}{n_2}\right) s^2_{\text{UB}}} } = \frac{\hat{\mu}_1-\hat{\mu}_2}{ \sqrt{ \left(\frac{n}{n_1 n_2} \right) s^2_{\text{UB}} } } \] In terms of empirical frequencies \(\hat{\pi}_1 = \frac{n_1}{n}\) and \(\hat{\pi}_2 = \frac{n_2}{n}\) it can also be written as \[ t_{\text{UB}} = \sqrt{n} \frac{\hat{\mu}_1-\hat{\mu}_2}{ \sqrt{ \left(\frac{1}{\hat{\pi}_1}+\frac{1}{\hat{\pi}_2}\right) s^2_{\text{UB}} }} = \sqrt{n\hat{\pi}_1 \hat{\pi}_2} \frac{\hat{\mu}_1-\hat{\mu}_2}{ \sqrt{ s^2_{\text{UB}}}} \] The two sample \(t\)-statistic is distributed as \[ t_{\text{UB}} \sim t_{n-2} \] i.e. according to a Student’s \(t\)-distribution with \(n-2\) degrees of freedom, with mean 0 for \(n>3\) and variance \((n-2)/(n-4)\) for \(n>4\). Large values of the two sample \(t\)-statistic indicate that there are indeed two groups rather than just one.
The two sample \(t\)-statistic using the empirical (ML) pooled estimate of the variance is \[ \begin{split} t_{\text{ML}} & = \frac{\hat{\mu}_1-\hat{\mu}_2}{ \sqrt{ \left(\frac{1}{n_1}+\frac{1}{n_2}\right) s^2_{\text{ML}} } } = \frac{\hat{\mu}_1-\hat{\mu}_2}{ \sqrt{ \left(\frac{n}{n_1 n_2}\right) s^2_{\text{ML}} } }\\ & =\sqrt{n} \frac{\hat{\mu}_1-\hat{\mu}_2}{ \sqrt{ \left(\frac{1}{\hat{\pi}_1}+\frac{1}{\hat{\pi}_2}\right) s^2_{\text{ML}} }} = \sqrt{n \hat{\pi}_1 \hat{\pi}_2 } \frac{\hat{\mu}_1-\hat{\mu}_2}{ \sqrt{ s^2_{\text{ML}}}}\\ & = \sqrt{\frac{n}{n-2}} t_{\text{UB}} \end{split} \] with \[ t_{\text{ML}} \sim \text{lst}\left(0, \tau^2=\frac{n}{n-2}, n-2\right) \] Thus, \(t_{\text{ML}}\) follows a location-scale \(t\)-distribution, with mean 0 for \(n>3\) and variance \(n/(n-4)\) for \(n>4\).
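Similarly, a sketch in R with two simulated groups (the group sizes and means are arbitrary), confirming that the pooled two sample \(t_{\text{UB}}\) matches t.test() with var.equal = TRUE:

```r
# Sketch: two sample t-statistic with unbiased pooled variance estimate
set.seed(1)
x1 <- rnorm(8, mean = 1); x2 <- rnorm(12, mean = 0)   # simulated groups
n1 <- length(x1); n2 <- length(x2); n <- n1 + n2
s2.ub <- (sum((x1 - mean(x1))^2) + sum((x2 - mean(x2))^2)) / (n - 2)  # pooled variance
t.ub  <- (mean(x1) - mean(x2)) / sqrt((1/n1 + 1/n2) * s2.ub)
c(t.ub, t.test(x1, x2, var.equal = TRUE)$statistic)   # both agree
```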
A.10 Confidence intervals
General concept
A confidence interval (CI) is an interval estimate with a frequentist interpretation.
The coverage \(\kappa\) of a confidence interval describes how often (in repeated identical experiments) the estimated confidence interval overlaps the true parameter value \(\theta\) (see Figure A.1). For example, coverage \(\kappa=0.95\) (95%) means that in 95 out of 100 cases the estimated confidence interval will contain the (unknown) true value (i.e. it will “cover” \(\theta\)).
Note that a confidence interval is actually an estimate \(\widehat{\text{CI}}(x_1, \ldots, x_n)\): it depends on the data and thus exhibits random (sampling) variation.
A good confidence interval has high coverage and is compact.
Note: the coverage probability is not the probability that the true value is contained in a given estimated interval (that would be the Bayesian credible interval).
Furthermore, there is a direct relationship between confidence intervals and statistical testing procedures. Specifically, a confidence interval with coverage \(\kappa\) can be interpreted as the set of parameter values that cannot be rejected at significance level \(1-\kappa\).
Symmetric normal confidence interval
For a normally distributed univariate random variable it is straightforward to construct a symmetric two-sided CI with a given desired coverage \(\kappa\) (Figure A.2).
For a normal random variable \(x \sim N(\mu, \sigma^2)\) with mean \(\mu\) and variance \(\sigma^2\) and density function \(p(x|\mu, \sigma^2)\) we can compute the probability \[ \text{Pr}(x \leq \mu + c \sigma) = \int_{-\infty}^{\mu+c\sigma} p(x|\mu, \sigma^2) dx = \Phi (c) = \frac{1+\kappa}{2} \] where \(\Phi(c)\) is the cumulative distribution function (CDF) of the standard normal \(N(0,1)\) distribution. The critical value \(c\) is obtained by inversion of \(\Phi\): \[ c=\Phi^{-1}\left(\frac{1+\kappa}{2}\right) \]
| Coverage \(\kappa\) | Critical value \(c\) |
|---|---|
| 0.9 | 1.64 |
| 0.95 | 1.96 |
| 0.99 | 2.58 |
Table A.1 lists the critical values \(c\) for the three most commonly used values of \(\kappa\). It is useful to memorise these values as they will be used frequently.
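In R these critical values are obtained from the quantile function of the standard normal distribution:

```r
# Critical values for the symmetric normal CI (reproduces Table A.1)
kappa <- c(0.9, 0.95, 0.99)
qnorm((1 + kappa) / 2)        # 1.644854 1.959964 2.575829
```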
A symmetric standard normal CI with nominal coverage \(\kappa\) for
- a scalar parameter \(\theta\)
- with normally distributed estimate \(\hat{\theta}\) and
- with estimated standard deviation \(\hat{\text{SD}}(\hat{\theta}) = \hat{\sigma}\)
is given by \[ \widehat{\text{CI}}=[\hat{\theta} \pm c \hat{\sigma}] \] where \(c\) is chosen for desired coverage level \(\kappa\).
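A minimal sketch in R computing such a CI for a mean, with simulated data and 95% coverage as arbitrary choices:

```r
# Sketch: symmetric normal CI for a mean
set.seed(1)
x <- rnorm(30, mean = 2, sd = 3)       # simulated data for illustration
n <- length(x)
theta.hat <- mean(x)                   # point estimate
sd.hat    <- sqrt(var(x) / n)          # estimated standard deviation of the estimate
crit <- qnorm((1 + 0.95) / 2)          # critical value c for coverage 0.95
theta.hat + c(-1, 1) * crit * sd.hat   # the interval [theta.hat +/- c * sd.hat]
```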
Confidence interval based on the chi-squared distribution
For the chi-squared distribution we use a one-sided interval to compute the critical values (see Figure A.3) using \[ \text{Pr}(x \leq c) = \kappa \] As before we get \(c\) by the quantile function, i.e. by inverting the CDF of the chi-squared distribution.
| Coverage \(\kappa\) | Critical value \(c\) |
|---|---|
| 0.9 | 2.71 |
| 0.95 | 3.84 |
| 0.99 | 6.63 |
Table A.2 lists the critical values for the three most commonly used values of \(\kappa\) for one degree of freedom. Note that these critical values are by construction the squares of the corresponding values in Table A.1 (up to rounding).
A one-sided CI with nominal coverage \(\kappa\) is then given by the interval \([0, c ]\).
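In R the chi-squared critical values with one degree of freedom are obtained from the corresponding quantile function, and they indeed equal the squared normal critical values:

```r
# Critical values for the one-sided chi-squared CI, df = 1 (reproduces Table A.2)
kappa <- c(0.9, 0.95, 0.99)
qchisq(kappa, df = 1)          # 2.705543 3.841459 6.634897
qnorm((1 + kappa) / 2)^2       # identical: squares of the Table A.1 values
```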