Appendix A — Statistics refresher

Below you will find a brief overview of some relevant concepts in statistics that you should be familiar with from earlier modules.

A.1 Data and statistics as functions of data

Broadly, by “data” we refer to quantitative observations and measurements collected in experiments.

We denote the observed data by \(D =\{x_1, \ldots, x_n\}\) where \(n\) denotes the number of data points (the sample size). Each data point can be a scalar or a multivariate quantity.

Generally, a statistic \(t(D)\) is a function of the observed data \(D\). The statistic \(t(D)\) can be of any type and value (scalar, vector, matrix etc., even a function). \(t(D)\) is called a summary statistic if it describes important aspects of the data such as location (e.g. the average \(\text{avg}(D) =\bar{x}\), the median) or scale (e.g. the standard deviation, the interquartile range).
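For illustration, a minimal sketch in R computing a few common summary statistics (the small data vector is made up for this example):

D <- c(4.1, 3.7, 5.2, 4.8, 6.0, 3.9, 5.5)  # observed data x_1, ..., x_n (hypothetical)
length(D)   # sample size n
mean(D)     # location: the average avg(D) = x-bar
median(D)   # location: the median
sd(D)       # scale: the standard deviation (note: uses the factor 1/(n-1))
IQR(D)      # scale: the interquartile range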

A.2 Statistical learning

The aim in statistics, data science and machine learning is to use data to learn about and understand the world using models. In statistics we employ probabilistic models.

We denote data models by \(p(x| \theta)\) where \(\theta\) represents the parameters of the model. Often (but not always) \(\theta\) can be interpreted as, or is associated with, some manifest property of the model. If there is only a single parameter we write \(\theta\) (scalar parameter). If we wish to highlight that there are multiple parameters we write \(\boldsymbol \theta\) (in bold type).

Specifically, our aim is to identify the best model(s) for the data in order to both

  • to explain the current data, and
  • to enable good prediction of future data.

By choosing a sufficiently complex model the first aim (to explain the observed data) is often easily achieved. However, if the model is too complex it can fail to address the second aim (to predict unseen data well). Thus, when choosing a model we would like to avoid both underfitting (i.e. choosing an overly simplistic model) and overfitting (i.e. choosing an overly complex model). Striking this balance will also help with interpreting the fitted model.

Typically, we restrict the analysis to a specific model family with some parameter \(\theta\).
An estimator for \(\theta\) is a function \(\hat{\theta}(D)\) of the data that maps the data (input) to an informed guess (output) about \(\theta\).

  • A point estimator provides a single number for each parameter.
  • An interval estimator provides a set of possible values for each parameter.

Interval estimators can be linked to the concept of testing specified values of a parameter. Specifically, a confidence interval contains all parameter values that are not significantly different from the best parameter estimate.

A.3 Sampling properties of a point estimator

A point estimator \(\hat\theta\) depends on the data and hence exhibits sampling variation, i.e. the estimate will differ for a new set of observations.

Thus \(\hat\theta\) can be seen as a random variable, and its distribution (across different experiments) is called the sampling distribution.

Properties of this distribution can be used to evaluate how far the estimator deviates (on average across different experiments) from the true value:

\[\begin{align*}
\text{Bias:} && \text{Bias}(\hat{\theta}) &= \text{E}(\hat{\theta})-\theta \\
\text{Variance:} && \text{Var}(\hat{\theta}) &= \text{E}\left((\hat{\theta}-\text{E}(\hat{\theta}))^2\right) \\
\text{Mean squared error:} && \text{MSE}(\hat{\theta}) &= \text{E}\left((\hat{\theta}-\theta)^2\right) \\
&& &= \text{Var}(\hat{\theta})+\text{Bias}(\hat{\theta})^2
\end{align*}\]

The last identity for the MSE follows from \(\text{E}(x^2)=\text{Var}(x)+\text{E}(x)^2\) applied to \(x = \hat{\theta}-\theta\).

At first sight it seems desirable to focus on estimators that are unbiased for finite sample size \(n\). However, requiring strict unbiasedness is not always a good idea. In many situations it is better to accept some bias in an estimator in order to achieve a smaller variance and an overall smaller MSE. This is called the bias-variance tradeoff, as more bias is traded for smaller variance (or, conversely, less bias is traded for higher variance). It is also related to the above mentioned problems of underfitting (large bias) and overfitting (large variance).

A.4 Efficiency and consistency of an estimator

Typically, \(\text{Bias}\), \(\text{Var}\) and \(\text{MSE}\) all decrease with increasing sample size, so that the errors become smaller and smaller as \(n \to \infty\).

Efficiency: An estimator \(\hat\theta_A\) is said to be more efficient than an estimator \(\hat\theta_B\) if, for the same sample size \(n\), it has a smaller error (e.g. MSE) than the competing estimator.

The typical rate of decrease in variance of a good estimator is \(\frac{1}{n}\) and the rate of decrease in the standard deviation is \(\frac{1}{\sqrt{n}}\). Note that this implies that to get one digit more accuracy in an estimate (standard deviation decreasing by a factor of 10) we need 100 times more data!
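The following small R simulation illustrates this scaling (the normal sampling model, seed and number of repetitions are chosen purely for illustration): the standard deviation of the sample average shrinks by roughly a factor of 10 when the sample size grows by a factor of 100.

set.seed(1)
sd_of_avg <- function(n, reps = 5000) {
  # standard deviation of the sample average across repeated experiments
  sd(replicate(reps, mean(rnorm(n, mean = 0, sd = 1))))
}
sd_of_avg(100)    # roughly 1/sqrt(100)   = 0.1
sd_of_avg(10000)  # roughly 1/sqrt(10000) = 0.01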

Consistency: \(\hat{\theta}\) is called consistent if \[ \text{MSE}(\hat{\theta}) \longrightarrow 0 \text{ with $n\rightarrow \infty$ } \]

Consistency is an essential but rather weak requirement for any reasonable estimator. Of all consistent estimators we typically select the ones that are most efficient (i.e. with the fastest decrease in MSE) and that therefore have the smallest variance and/or MSE for a given finite \(n\).

Consistency implies that the true model is recovered in the limit of infinite data, assuming that the model class contains the true data generating model. If the model class does not contain the true model then strict consistency cannot be achieved. However, we still aim to select a model that approximates the true model as closely as possible.

A.5 Law of large numbers

The law of large numbers, discovered by Jacob Bernoulli (1655-1705), asserts that, if the mean exists, the sample average will converge to the mean as the sample size \(n\) becomes large. Therefore, when the mean is defined, it can be approximated by the empirical mean for sufficiently large values of \(n\).

A variant of the law of large numbers is that the empirical distribution \(\hat{F}_n\) converges strongly to \(F\) (Section A.6).

As a result, there is also convergence of the average of functions of samples to the corresponding expectation as \(n \rightarrow \infty\): \[ \text{E}_{\hat{F}_n}(h(x)) = \frac{1}{n} \sum_{i=1}^n h(x_i) \to \text{E}_{F}(h(x)) \] Hence, the law of large numbers also ensures that empirical estimators (Section A.7) will converge to the corresponding true values for sufficiently large \(n\).
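As a quick illustration, the following R sketch (using simulated exponential data, an assumption made for this example only) shows the sample average, and the average of a function of the samples, approaching the corresponding expectations:

set.seed(1)
x <- rexp(10000, rate = 0.5)           # true mean E(x) = 1/rate = 2
running_avg <- cumsum(x) / seq_along(x)
running_avg[c(10, 100, 1000, 10000)]   # drifts towards 2 as n grows
mean(x^2)                              # approaches E(x^2) = Var(x) + E(x)^2 = 8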

Moreover, the law of large numbers provides a justification for interpreting large-sample limits of frequencies as probabilities. However, the converse assumption that all probabilities can be interpreted in a frequentist manner does not follow from the law of large numbers or from the axioms of probability.

Finally, it is worth pointing out that the law of large numbers does not say anything about the finite-sample properties of an estimator; it is only concerned with the asymptotic behaviour as \(n\) goes to infinity.

A.6 Empirical distribution function

Suppose we observe data \(D=\{x_1, \ldots, x_n\}\) with each \(x_i \sim F\) sampled independently and identically. The empirical cumulative distribution function \(\hat{F}_n(x)\) based on data \(D\) is then given by \[ \hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^n 1_{x_i \leq x} \]

The empirical distribution function is monotonically non-decreasing from 0 to 1 in discrete steps.

In R the empirical distribution function is computed by ecdf().
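A brief usage sketch (the simulated data are assumed only for illustration):

set.seed(1)
x <- rnorm(50)
Fhat <- ecdf(x)  # returns the step function F-hat_n
Fhat(0)          # proportion of observations with x_i <= 0
plot(Fhat)       # monotone step function increasing from 0 to 1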

Crucially, the empirical distribution \(\hat{F}_n\) converges strongly (almost surely) to the underlying distribution \(F\) as \(n \rightarrow \infty\): \[ \hat{F}_n\overset{a. s.}{\to} F \] The Glivenko–Cantelli theorem additionally asserts that the convergence is uniform.

This theorem is a variant of the law of large numbers (Section A.5) applied to the whole distribution, rather than just to the mean.

As a result, we may use the empirical distribution \(\hat{F}_n\) based on data \(D\) as an estimate of the underlying unknown true distribution \(F\). From the convergence theorems we know that \(\hat{F}_n\) is consistent.

However, for \(\hat{F}_n\) to work well as an estimate of \(F\) the number of observations \(n\) must be sufficiently large so that the approximation provided by \(\hat{F}_n\) is adequate.

A.7 Empirical estimators

The fact that for large sample size \(n\) the empirical distribution \(\hat{F}_n\) may be used as a substitute for the unknown \(F\) allows us to easily construct empirical estimators.

Specifically, parameters of a model can typically be expressed as a functional of the distribution \(\theta = g(F)\). An empirical estimator \(\hat{\theta}\) is constructed by substituting the empirical distribution for the true distribution: \(\hat{\theta}= g( \hat{F}_n )\).

An example is the mean \(\text{E}_F(x)\) with regard to \(F\). The empirical mean is the expectation with regard to the empirical distribution which equals the average of the samples: \[ \hat{\text{E}}(x) = \hat{\mu} = \text{E}_{\hat{F}_n}(x) = \frac{1}{n} \sum_{i=1}^n x_i = \bar{x} \]

Similarly, other empirical estimators can be constructed simply by replacing the expectation in the definition of the quantity of interest by the sample average. For example, the empirical variance with unknown mean is given by \[ \widehat{\text{Var}}(x) = \widehat{\sigma^2} = \text{E}_{\hat{F}_n}((x - \hat{\mu})^2) = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2 \] Note the factor \(1/n\) before the summation sign. We can also write the empirical variance in terms of \(\overline{x^2} =\frac{1}{n}\sum^{n}_{k=1} x_k^2\) as \[ \widehat{\text{Var}}(x) = \overline{x^2} - \bar{x}^2 \]

By construction, as a result of the strong convergence of \(\hat{F}_n\) to \(F\), empirical estimators are consistent, with their MSE, variance and bias all decreasing to zero with large sample size \(n\). However, for finite sample size they have non-zero variance and may also be biased.

For example, the empirical variance given above is biased with \(\text{Bias}(\widehat{\sigma^2}) = -\sigma^2/n\). Note this bias decreases with \(n\). An unbiased estimator can be obtained by rescaling the empirical estimator by the factor \(n/(n-1)\): \[ \widehat{\sigma^2}_{\text{UB}} = \frac{1}{n-1} \sum_{i=1}^n (x_i -\bar{x})^2 \]
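In R this can be checked directly; the sketch below (with made-up data) contrasts the empirical variance (factor \(1/n\)) with var(), which implements the unbiased estimator (factor \(1/(n-1)\)):

set.seed(1)
x <- rnorm(20, mean = 5, sd = 2)
n <- length(x)
var_emp <- mean((x - mean(x))^2)  # empirical variance, factor 1/n
mean(x^2) - mean(x)^2             # the same, via x2-bar minus x-bar squared
var(x)                            # unbiased estimate, factor 1/(n-1)
var_emp * n / (n - 1)             # rescaling the empirical estimate recovers var(x)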

The empirical estimators for the mean and variance can also be obtained for random vectors \(\boldsymbol x\). In this case the data \(D = \{\boldsymbol x_1, \ldots, \boldsymbol x_n \}\) is comprised of \(n\) vector-valued observations.

For the mean we get \[ \hat{\boldsymbol \mu} = \frac{1}{n}\sum^{n}_{k=1} \boldsymbol x_k = \bar{\boldsymbol x} \] and for the covariance \[\widehat{\boldsymbol \Sigma} = \frac{1}{n}\sum^{n}_{k=1} \left(\boldsymbol x_k-\bar{\boldsymbol x}\right) \; \left(\boldsymbol x_k-\bar{\boldsymbol x}\right)^T\] Note the factor \(\frac{1}{n}\) in the estimator of the covariance matrix.

With \(\overline{\boldsymbol x\boldsymbol x^T} = \frac{1}{n}\sum^{n}_{k=1} \boldsymbol x_k \boldsymbol x_k^T\) we can also write \[ \widehat{\boldsymbol \Sigma} = \overline{\boldsymbol x\boldsymbol x^T} - \bar{\boldsymbol x} \bar{\boldsymbol x}^T \]
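A small R sketch for vector-valued data (the bivariate standard normal data are assumed purely for illustration); note that cov() uses the factor \(1/(n-1)\), so it is rescaled to obtain the empirical (factor \(1/n\)) estimate:

set.seed(1)
X <- cbind(rnorm(100), rnorm(100))     # n = 100 bivariate observations (rows are x_k)
n <- nrow(X)
mu_hat <- colMeans(X)                  # empirical mean vector
Sigma_hat <- cov(X) * (n - 1) / n      # empirical covariance matrix, factor 1/n
crossprod(X) / n - tcrossprod(mu_hat)  # same result via xx^T-bar minus x-bar x-bar^T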

A.8 Sampling distribution of mean and variance estimators for normal data

If the underlying distribution family of \(D = \{x_1, \ldots, x_n\}\) is known we can often obtain the exact distribution of an estimator.

For example, assuming a normal distribution \(x_i \sim N(\mu, \sigma^2)\) we can derive the sampling distribution of the empirical mean and variance:

  • The empirical estimator of the mean parameter \(\mu\) is given by \(\hat{\mu} = \frac{1}{n} \sum_{i=1}^n x_i\). Under the normal assumption the distribution of \(\hat{\mu}\) is \[ \hat{\mu} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \] Thus \(\text{E}(\hat{\mu}) = \mu\) and \(\text{Var}(\hat{\mu}) = \frac{\sigma^2}{n}\). The estimate \(\hat{\mu}\) is unbiased as \(\text{E}(\hat{\mu})-\mu = 0\). The mean squared error of \(\hat{\mu}\) is \(\text{MSE}(\hat{\mu}) = \frac{\sigma^2}{n}\).

  • The empirical variance \(\widehat{\sigma^2}_{\text{ML}} = \frac{1}{n} \sum_{i=1}^n (x_i -\bar{x})^2\) (i.e. the maximum likelihood estimate, ML) for normal data follows a one-dimensional Wishart distribution \[ \widehat{\sigma^2}_{\text{ML}} \sim W\left(s^2 = \frac{\sigma^2}{n}, k=n-1\right) \] Thus, \(\text{E}( \widehat{\sigma^2}_{\text{ML}} ) = \frac{n-1}{n}\sigma^2\) and \(\text{Var}( \widehat{\sigma^2}_{\text{ML}} ) = \frac{2(n-1)}{n^2}\sigma^4\). The estimate \(\widehat{\sigma^2}_{\text{ML}}\) is biased since \(\text{E}( \widehat{\sigma^2}_{\text{ML}} )-\sigma^2 = -\frac{1}{n}\sigma^2\). The mean squared error is \(\text{MSE}( \widehat{\sigma^2}_{\text{ML}}) = \frac{2(n-1)}{n^2}\sigma^4 +\frac{1}{n^2}\sigma^4 =\frac{2 n-1}{n^2}\sigma^4\).

  • The unbiased variance estimate \(\widehat{\sigma^2}_{\text{UB}} = \frac{1}{n-1} \sum_{i=1}^n (x_i -\bar{x})^2\) for normal data follows a one-dimensional Wishart distribution \[ \widehat{\sigma^2}_{\text{UB}} \sim W\left(s^2 = \frac{\sigma^2}{n-1}, k = n-1 \right) \] Thus, \(\text{E}( \widehat{\sigma^2}_{\text{UB}} ) = \sigma^2\) and \(\text{Var}( \widehat{\sigma^2}_{\text{UB}} ) = \frac{2}{n-1}\sigma^4\). The estimate \(\widehat{\sigma^2}_{\text{UB}}\) is unbiased since \(\text{E}( \widehat{\sigma^2}_{\text{UB}} )-\sigma^2 =0\). The mean squared error is \(\text{MSE}( \widehat{\sigma^2}_{\text{UB}} ) =\frac{2}{n-1}\sigma^4\).

    Interestingly, for any \(n>1\) we find that \(\text{Var}\left( \widehat{\sigma^2}_{\text{UB}} \right) > \text{Var}\left( \widehat{\sigma^2}_{\text{ML}} \right)\) and \(\text{MSE}\left( \widehat{\sigma^2}_{\text{UB}} \right) > \text{MSE}\left( \widehat{\sigma^2}_{\text{ML}} \right)\) so that the biased empirical estimator has both lower variance and lower mean squared error than the unbiased estimator.
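These sampling properties can be verified numerically. The following R simulation sketch (with \(\sigma^2 = 4\), \(n = 10\) and the number of repetitions chosen arbitrarily for illustration) approximates the bias, variance and MSE of both estimators:

set.seed(1)
n <- 10; sigma2 <- 4
est <- replicate(20000, {
  x <- rnorm(n, mean = 0, sd = sqrt(sigma2))
  c(ML = mean((x - mean(x))^2), UB = var(x))  # empirical (1/n) and unbiased (1/(n-1)) estimates
})
rowMeans(est) - sigma2      # biases: approx -sigma2/n for ML, approx 0 for UB
apply(est, 1, var)          # variances: approx 2(n-1)/n^2 sigma2^2 and 2/(n-1) sigma2^2
rowMeans((est - sigma2)^2)  # MSEs: the biased ML estimator has the smaller MSE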

A.9 \(t\)-statistics

One sample \(t\)-statistic

Suppose we observe \(n\) independent data points \(x_1, \ldots, x_n \sim N(\mu, \sigma^2)\). Then the average \(\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i\) is distributed as \(\bar{x} \sim N(\mu, \sigma^2/n)\) and correspondingly \[ z = \frac{\bar{x}-\mu}{\sqrt{\sigma^2/n}} \sim N(0, 1) \]

Note that \(z\) uses the known variance \(\sigma^2\).

If the variance is unknown and is estimated by the unbiased variance
\[ s^2_{\text{UB}} = \frac{1}{n-1} \sum_{i=1}^n (x_i -\bar{x})^2 \] then one arrives at the one sample \(t\)-statistic \[ t_{\text{UB}} = \frac{\bar{x}-\mu}{\sqrt{s^2_{\text{UB}}/n}} \sim \text{$t_{n-1}$} \,. \] It is distributed according to a Student’s \(t\)-distribution with \(n-1\) degrees of freedom, with mean 0 for \(n>2\) and variance \((n-1)/(n-3)\) for \(n>3\).

If instead of the unbiased estimate the empirical variance (i.e. the maximum likelihood estimate, ML) \[ s^2_{\text{ML}} = \frac{1}{n} \sum_{i=1}^n (x_i -\bar{x})^2 = \frac{n-1}{n} s^2_{\text{UB}} \] is used then this leads to a slightly different statistic \[ t_{\text{ML}} = \frac{\bar{x}-\mu}{ \sqrt{ s^2_{\text{ML}}/n}} = \sqrt{\frac{n}{n-1}} t_{\text{UB}} \] with \[ t_{\text{ML}} \sim \text{$t_{n-1}$}\left(0, \tau^2=\frac{n}{n-1}\right) \] Thus, \(t_{\text{ML}}\) follows a location-scale \(t\)-distribution, with mean 0 for \(n>2\) and variance \(n/(n-3)\) for \(n>3\).
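The following R sketch (with simulated data, assumed only for illustration) computes the one sample \(t\)-statistic by hand and confirms that the version based on the unbiased variance matches the statistic reported by t.test():

set.seed(1)
x <- rnorm(12, mean = 1, sd = 2)
n <- length(x); mu0 <- 0
t_ub <- (mean(x) - mu0) / sqrt(var(x) / n)  # var() is the unbiased variance estimate
t.test(x, mu = mu0)$statistic               # identical to t_ub
s2_ml <- (n - 1) / n * var(x)               # empirical (ML) variance
(mean(x) - mu0) / sqrt(s2_ml / n)           # equals sqrt(n/(n-1)) * t_ub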

Two sample \(t\)-statistic with common variance

Now suppose we observe normal data \(D = \{x_1, \ldots, x_n\}\) from two groups with sample size \(n_1\) and \(n_2\) (and \(n=n_1+n_2\)) with two different means \(\mu_1\) and \(\mu_2\) and common variance \(\sigma^2\): \[x_1,\dots,x_{n_1} \sim N(\mu_1, \sigma^2)\] and \[x_{n_1+1},\dots,x_{n} \sim N(\mu_2, \sigma^2)\] Then \(\hat{\mu}_1 = \frac{1}{n_1}\sum^{n_1}_{i=1}x_i\) and \(\hat{\mu}_2 = \frac{1}{n_2}\sum^{n}_{i=n_1+1}x_i\) are the sample averages within each group.

The common variance \(\sigma^2\) may be estimated either by the unbiased estimate \[ s^2_{\text{UB}} = \frac{1}{n-2} \left(\sum^{n_1}_{i=1}(x_i-\hat{\mu}_1)^2+ \sum^n_{i=n_1+1}(x_i-\hat{\mu}_2)^2\right) \] (note the factor \(n-2\)) or by the empirical estimate (ML)
\[ s^2_{\text{ML}} = \frac{1}{n} \left(\sum^{n_1}_{i=1}(x_i-\hat{\mu}_1)^2+\sum^n_{i=n_1+1}(x_i-\hat{\mu}_2)^2\right) =\frac{n-2}{n} s^2_{\text{UB}} \] The estimator of the common variance is often referred to as a pooled variance estimate, as information from the two groups is pooled to obtain the estimate.

Using the unbiased pooled variance estimate the two sample \(t\)-statistic is given by \[ t_{\text{UB}} = \frac{\hat{\mu}_1-\hat{\mu}_2}{ \sqrt{ \left(\frac{1}{n_1}+\frac{1}{n_2}\right) s^2_{\text{UB}}} } = \frac{\hat{\mu}_1-\hat{\mu}_2}{ \sqrt{ \left(\frac{n}{n_1 n_2} \right) s^2_{\text{UB}} } } \] In terms of empirical frequencies \(\hat{\pi}_1 = \frac{n_1}{n}\) and \(\hat{\pi}_2 = \frac{n_2}{n}\) it can also be written as \[ t_{\text{UB}} = \sqrt{n} \frac{\hat{\mu}_1-\hat{\mu}_2}{ \sqrt{ \left(\frac{1}{\hat{\pi}_1}+\frac{1}{\hat{\pi}_2}\right) s^2_{\text{UB}} }} = \sqrt{n\hat{\pi}_1 \hat{\pi}_2} \frac{\hat{\mu}_1-\hat{\mu}_2}{ \sqrt{ s^2_{\text{UB}}}} \] The two sample \(t\)-statistic is distributed as \[ t_{\text{UB}} \sim \text{$t_{n-2}$} \] i.e. according to a Student’s \(t\)-distribution with \(n-2\) degrees of freedom, with mean 0 for \(n>3\) and variance \((n-2)/(n-4)\) for \(n>4\). Large values of the two sample \(t\)-statistic indicate that there are indeed two groups rather than just one.

The two sample \(t\)-statistic using the empirical (ML) pooled estimate of the variance is \[ \begin{split} t_{\text{ML}} & = \frac{\hat{\mu}_1-\hat{\mu}_2}{ \sqrt{ \left(\frac{1}{n_1}+\frac{1}{n_2}\right) s^2_{\text{ML}} } } = \frac{\hat{\mu}_1-\hat{\mu}_2}{ \sqrt{ \left(\frac{n}{n_1 n_2}\right) s^2_{\text{ML}} } }\\ & =\sqrt{n} \frac{\hat{\mu}_1-\hat{\mu}_2}{ \sqrt{ \left(\frac{1}{\hat{\pi}_1}+\frac{1}{\hat{\pi}_2}\right) s^2_{\text{ML}} }} = \sqrt{n \hat{\pi}_1 \hat{\pi}_2 } \frac{\hat{\mu}_1-\hat{\mu}_2}{ \sqrt{ s^2_{\text{ML}}}}\\ & = \sqrt{\frac{n}{n-2}} t_{\text{UB}} \end{split} \] with \[ t_{\text{ML}} \sim \text{$t_{n-2}$}\left(0, \tau^2=\frac{n}{n-2}\right) \] Thus, \(t_{\text{ML}}\) follows a location-scale \(t\)-distribution, with mean 0 for \(n>3\) and variance \(n/(n-4)\) for \(n>4\).
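As a check, the R sketch below (with simulated two-group data, assumed only for illustration) computes the pooled two sample \(t\)-statistic by hand and compares it with t.test() under the equal-variance assumption:

set.seed(1)
x1 <- rnorm(8,  mean = 0, sd = 1)
x2 <- rnorm(12, mean = 1, sd = 1)
n1 <- length(x1); n2 <- length(x2); n <- n1 + n2
s2_ub <- (sum((x1 - mean(x1))^2) + sum((x2 - mean(x2))^2)) / (n - 2)  # pooled, unbiased
t_ub <- (mean(x1) - mean(x2)) / sqrt((1/n1 + 1/n2) * s2_ub)
t.test(x1, x2, var.equal = TRUE)$statistic           # identical to t_ub
s2_ml <- (n - 2) / n * s2_ub                         # pooled empirical (ML) variance
(mean(x1) - mean(x2)) / sqrt((1/n1 + 1/n2) * s2_ml)  # equals sqrt(n/(n-2)) * t_ub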

A.10 Confidence intervals

General concept

A confidence interval (CI) is an interval estimate \(\widehat{\text{CI}}(x_1, \ldots, x_n)\) that depends on the data, and hence exhibits random (sampling) variation, and that has a frequentist interpretation.

Figure A.1: Coverage property of confidence intervals.

Specifically, the coverage \(\kappa\) of a confidence interval describes how often, in repeated identical experiments, it contains the true parameter value \(\theta\), i.e. how often it will “cover” \(\theta\) (see Figure A.1). For example, coverage \(\kappa=0.95\) (95%) means that in 95 out of 100 cases the estimated confidence interval will contain the (unknown) true value.
It is trivial to create a confidence interval with high coverage simply by choosing a very wide interval. Therefore, a useful confidence interval must be compact and have high coverage.

Note: the coverage probability is not the probability that the true value is contained in the confidence interval (that interpretation only applies to Bayesian credible intervals).

Furthermore, there is a direct relationship between confidence intervals and statistical testing procedures. Specifically, a confidence interval can be interpreted as the set of parameter values that cannot be rejected.

Symmetric normal confidence interval

Figure A.2: Construction of a symmetric two-sided normal confidence interval.

For a normally distributed univariate random variable it is straightforward to construct a symmetric two-sided CI with a given desired coverage \(\kappa\) (Figure A.2).

For a normal random variable \(x \sim N(\mu, \sigma^2)\) with mean \(\mu\) and variance \(\sigma^2\) and density function \(p(x|\mu, \sigma^2)\) we can compute the probability \[ \text{Pr}(x \leq \mu + c \sigma) = \int_{-\infty}^{\mu+c\sigma} p(x|\mu, \sigma^2) dx = \Phi (c) = \frac{1+\kappa}{2} \] where \(\Phi(c)\) is the cumulative distribution function (CDF) of the standard normal \(N(0,1)\) distribution. The critical point \(c\) is obtained by inverting \(\Phi\): \[ c=\Phi^{-1}\left(\frac{1+\kappa}{2}\right) \]

Table A.1: Critical values for the normal distribution.
Coverage \(\kappa\) Critical value \(c\)
0.9 1.64
0.95 1.96
0.99 2.58

Table A.1 lists the critical values \(c\) for the three most commonly used values of \(\kappa\). It is useful to memorise these values as they will be used frequently.
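In R these critical values are obtained with the quantile function qnorm(), for example:

kappa <- c(0.9, 0.95, 0.99)
qnorm((1 + kappa) / 2)  # 1.64, 1.96, 2.58 as in Table A.1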

A symmetric standard normal CI with nominal coverage \(\kappa\) for

  • a scalar parameter \(\theta\)
  • with normally distributed estimate \(\hat{\theta}\) and
  • with estimated standard deviation \(\hat{\text{SD}}(\hat{\theta}) = \hat{\sigma}\)

is given by \[ \widehat{\text{CI}}=[\hat{\theta} \pm c \hat{\sigma}] \] where \(c\) is chosen for desired coverage level \(\kappa\).
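A minimal R sketch of this construction (the data and the use of the sample mean as \(\hat{\theta}\) are assumptions made for illustration):

set.seed(1)
x <- rnorm(30, mean = 10, sd = 3)
kappa <- 0.95
c_crit <- qnorm((1 + kappa) / 2)        # critical value, 1.96 for kappa = 0.95
theta_hat <- mean(x)                    # point estimate
sd_hat <- sqrt(var(x) / length(x))      # estimated standard deviation of theta_hat
theta_hat + c(-1, 1) * c_crit * sd_hat  # CI = [theta_hat -/+ c * sd_hat]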

Confidence interval based on the chi-squared distribution

Figure A.3: Construction of a one-sided confidence interval based on a chi-squared distribution with one degree of freedom.

For the chi-squared distribution we use a one-sided interval to compute the critical values (see Figure A.3) using \[ \text{Pr}(x \leq c) = \kappa \] As before we get \(c\) by the quantile function, i.e. by inverting the CDF of the chi-squared distribution.

Table A.2: Critical values for the chi-squared distribution with one degree of freedom.
Coverage \(\kappa\) Critical value \(c\)
0.9 2.71
0.95 3.84
0.99 6.63

Table A.2 lists the critical values for the three most common choices of \(\kappa\) and one degree of freedom. Note that these critical values are by construction the squares of the corresponding values in Table A.1 (up to rounding).

A one-sided CI with nominal coverage \(\kappa\) is then given by the interval \([0, c ]\).
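The corresponding critical values are computed in R with qchisq(); as noted above, they are the squares of the normal critical values:

kappa <- c(0.9, 0.95, 0.99)
qchisq(kappa, df = 1)     # 2.71, 3.84, 6.63 as in Table A.2
qnorm((1 + kappa) / 2)^2  # identical values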