1  Overview of statistical learning

1.1 How to learn from data?

A fundamental question is how to extract information from data in an optimal way, and to make predictions based on this information.

For this purpose, a number of competing theories of information have been developed. Statistics is the oldest science of information and is concerned with offering principled ways to learn from data and to extract and process information using probabilistic models. However, there are other theories of information (e.g. the Vapnik-Chervonenkis theory of learning, computational learning) that are more algorithmic than analytic and sometimes not even based on probability theory.

Furthermore, there are other disciplines, such as computer science and machine learning, that are closely linked with and have substantial overlap with statistics. The field of “data science” today comprises both statistics and machine learning and brings together mathematics, statistics and computer science. The growing field of so-called “artificial intelligence” also makes substantial use of statistical and machine learning techniques.

The recent popular science book “The Master Algorithm” by Domingos (2015) provides an accessible, informal overview of the various schools of information science. It discusses the main algorithms used in machine learning and statistics:

  • The Bayesian school of learning started as early as 1763; it later turned out to be closely linked with likelihood inference, established in 1922 by Ronald A. Fisher (1890–1962), and was generalised in 1951 to entropy learning by Kullback and Leibler.

  • It was also in the 1950s that the concept of the artificial neural network arose, essentially a nonlinear input-output map with no underlying probabilistic modelling. This field saw another leap in the 1980s and progressed further from 2010 onwards with the development of deep learning. It is now one of the most popular (and most effective) methods for analysing imaging data. Even your mobile phone most likely now has a dedicated computer chip with special neural network hardware. Despite their non-probabilistic origins, modern interpretations of neural networks view them as high-dimensional nonlinear statistical models.

  • Further advanced theories of information were developed in the 1960s under the term computational learning, most notably the Vapnik-Chervonenkis theory, with the most prominent example being the “support vector machine” (another non-probabilistic model) devised in the 1990s. Other important advances include “ensemble learning” and corresponding algorithmic approaches to classification such as “random forests”.

  • With the advent of large-scale genomic and other high-dimensional data there has been a surge of new and exciting developments in the fields of high-dimensional data (large dimension) and big data (large dimension and large sample size), both in statistics and in machine learning.

The connections between the various fields of information science are still not perfectly understood, but it is clear that an overarching theory will need to be based on probabilistic learning.

1.2 Probability theory versus statistical learning

When you study statistics (or any other theory of information) you need to be aware that there is a fundamental difference between probability theory and statistics, and this difference relates to the distinction between “randomness” and “uncertainty”.

Probability theory studies randomness, by developing mathematical models for randomness (such as probability distributions), and studying corresponding mathematical properties such as asymptotic behaviour. Probability theory can be viewed as a branch of measure theory, and as such it belongs to the domain of pure mathematics.

On the other hand, statistics, and the related areas of machine learning and data science, are not at all concerned with randomness. Instead the focus is on learning from data using mathematical models that represent our understanding of the world. Statistics uses probability as a tool to describe uncertainty. But this uncertainty (e.g. about events, predictions, outcomes, model parameters) is mostly due to our ignorance and lack of knowledge of the true underlying processes, not because the underlying process is actually random. As soon as new data or information becomes available, the state of knowledge and the uncertainty change, and hence uncertainty is an epistemological property. The enormous success of statistical methods is indeed due to the fact that they provide optimal procedures for learning from data and at the same time allow us to model and update the uncertainty representing our ignorance.

In short, statistics is about describing the state of knowledge of the world, which may be uncertain and incomplete, and about making decisions and predictions in the face of uncertainty. This uncertainty sometimes derives from randomness but more often from our ignorance (and sometimes this ignorance even helps to create a simple yet effective model).

1.3 Cartoon of statistical learning

The aim of statistical learning is to use observed data in an optimal way to learn about the underlying mechanism of the data-generating process. Since data is typically finite but models can in principle be arbitrarily complex, there may be issues of overfitting (insufficient data for the complexity of the model) but also underfitting (the model is too simplistic).

We observe data \(D = \{x_1, \ldots, x_n\}\) assumed to result from an underlying true data-generating model \(F_{\text{true}}\), the distribution for \(x\).

To explain the observed data, and also to predict future data, we will make hypotheses in the form of candidate models \(P_{1}, P_{2}, \ldots\). Often these candidate models form a model family \(P_{\symbfit \theta}\) indexed by a parameter vector \(\symbfit \theta\), with specific values for each model so that we can also write \(P_{\symbfit \theta_1}, P_{\symbfit \theta_2}, \ldots\) for the various models.

Frequently parameters are chosen such that they allow some interpretation, such as moments or other properties of the distribution. However, intrinsically parameters are just labels and may be changed by any one-to-one transformation. For statistical learning it is necessary that models are identifiable within a family, i.e. each distinct model is identified by a unique parameter so that \(P_{\symbfit \theta_1} = P_{\symbfit \theta_2}\) implies \(\symbfit \theta_1 = \symbfit \theta_2\), and conversely if \(P_{\symbfit \theta_1} \neq P_{\symbfit \theta_2}\) then \(\symbfit \theta_1 \neq \symbfit \theta_2\).
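For illustration (a simple example introduced here, not a model we use later): the Bernoulli model \(\text{Ber}(\theta)\) with \(\theta \in (0,1)\) may equally well be labelled by the log-odds \(\eta = \log\frac{\theta}{1-\theta}\); since the map \(\theta \mapsto \eta\) is one-to-one, the family remains identifiable after this relabelling. In contrast, a model such as \(x \sim N(\mu_1 + \mu_2, 1)\) with parameter vector \((\mu_1, \mu_2)\) is not identifiable, because \[ N(\mu_1 + \mu_2, 1) = N((\mu_1 + c) + (\mu_2 - c), 1) \quad \text{for any constant } c, \] so two distinct parameter values can label exactly the same distribution.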

The true model underlying the data-generating process is unknown and cannot be observed directly. However, we can observe data \(D\) from the true model \(F_{\text{true}}\) by measuring properties of interest (our observations from experiments). Sometimes we can also perturb the data-generating process and see what the effect is (an interventional study).

The various candidate models \(P_1, P_2, \ldots\) in the model world will at best be good approximations to the true underlying data generating model \(F_{\text{true}}\). In some cases the true model will be part of the model family, i.e. there exists a parameter \(\symbfit \theta_{\text{true}}\) so that \(F_{\text{true}} = P_{\symbfit \theta_{\text{true}}}\). However, more typically we cannot assume that the true underlying model is contained in the family. Nonetheless, even an imperfect candidate model will often provide a useful mathematical approximation and capture some important characteristics of the true model and thus will help to interpret the observed data.

\[ \begin{array}{c} \textbf{Hypothesis} \\ \text{How the world works} \\ \end{array} \longrightarrow \begin{array}{c} \textbf{Model world} \\ P_1, \symbfit \theta_1 \\ P_2, \symbfit \theta_2 \\ \vdots \\ \end{array} \longrightarrow \begin{array}{c} \textbf{Real world,} \\ \textbf{unknown true model} \\ F_{\text{true}}, \symbfit \theta_{\text{true}} \\ \end{array} \longrightarrow \textbf{Data } x_1, \ldots, x_n \]

The aim of statistical learning is to identify the model(s) that explain the current data and also predict future data (i.e. predict the outcomes of experiments that have not yet been conducted).

Thus a good model provides a good fit to the current data (i.e. it explains current observations well) and also to the future data (i.e. it generalises well).

A large proportion of statistical theory is devoted to finding these “good” models that avoid both overfitting (models that are too complex and do not generalise well) and underfitting (models that are too simplistic and hence also do not predict well).

Typically the aim is to find a model whose model complexity is well matched with the complexity of the unknown true model and also with the complexity of the observed data.
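The following short simulation is a minimal sketch (in Python; the quadratic “true model”, the noise level and the sample sizes are arbitrary choices made only for illustration) of this trade-off: a polynomial of too low a degree underfits, while a polynomial of too high a degree fits the training data closely but generalises poorly to new data.

```python
# Minimal sketch of over- and underfitting: simulate noisy data from a
# quadratic "true model", fit polynomials of increasing degree, and compare
# the fit on training data with the fit on independent test data.
import numpy as np

rng = np.random.default_rng(1)

def simulate(n):
    x = rng.uniform(-2, 2, size=n)
    y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(scale=0.5, size=n)  # true model
    return x, y

x_train, y_train = simulate(30)
x_test, y_test = simulate(1000)

for degree in (1, 2, 10):
    coef = np.polyfit(x_train, y_train, deg=degree)          # fitted candidate model
    mse_train = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
    mse_test = np.mean((np.polyval(coef, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {mse_train:.3f}, test MSE {mse_test:.3f}")

# Typically: degree 1 underfits (both errors large), degree 2 matches the
# complexity of the true model, and degree 10 overfits (small training error,
# larger test error).
```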

1.4 Common distributions used in statistical models

Models employed in statistical analysis are typically multivariate, comprising many random variables. As such, these models can be very complex, with hierarchical or network-like structures linking observed and latent variables, and possibly exhibiting nonlinear functional relationships.

Nonetheless, even the most complex models will normally be composed of more elementary building blocks. For example, the following parametric distributions frequently occur in statistical analysis:

  • Bernoulli distribution \(\text{Ber}(\theta)\) and categorical distribution \(\text{Cat}(\symbfit \pi)\): used to model frequencies (on the support \([0,1]\)). Repeated application yields the binomial distribution \(\text{Bin}(n, \theta)\) and multinomial distribution \(\text{Mult}(n, \symbfit \pi)\).

  • Normal distribution in both the univariate \(N(\mu, \sigma^2)\) and multivariate \(N(\symbfit \mu, \symbfit \Sigma)\) version: commonly used to model mean values (on the support \(]-\infty, \infty[\)).

  • Gamma distribution \(\text{Gam}(\alpha, \theta)\): used to model scale factors (on the support \([0, \infty[\)). It is also known (with different parameterisation) as univariate Wishart distribution \(W_1\left(s^2, k \right)\) or as scaled chi-squared distribution \(s^2 \text{$\chi^2_{k}$}\). Special cases include the chi-squared distribution \(\text{$\chi^2_{k}$}\) and the exponential distribution \(\text{Exp}(\theta)\).

The above distribution families are so-called exponential families. As such they can all be written in the same structural form and share many useful properties that make them ideal models.
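For reference, the common structural form can be sketched as follows (the symbols \(h\), \(\symbfit \eta\), \(\symbfit T\) and \(A\) are generic notation introduced here; their precise definitions are not needed at this point). A distribution belongs to an exponential family if its density or mass function can be written as \[ p(x | \symbfit \theta) = h(x) \exp\left( \symbfit \eta(\symbfit \theta)^T \symbfit T(x) - A(\symbfit \theta) \right), \] where \(\symbfit T(x)\) is a vector of sufficient statistics, \(\symbfit \eta(\symbfit \theta)\) the natural parameter, \(A(\symbfit \theta)\) the log-normalising constant and \(h(x)\) a base measure. The Bernoulli, categorical, normal and gamma families above all fit this template.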

Another commonly used parametric model is a generalisation of the normal distribution but with more probability mass in the tails:

  • Location-scale \(t\)-distribution \(\text{lst}(\mu, \tau^2, \nu)\): similar to the normal distribution \(N(\mu, \sigma^2)\) but with heavier tails (illustrated numerically in the sketch below). It emerges as the sampling distribution for the \(t\)-statistic and as a compound distribution in Bayesian learning. Special cases include the Student’s \(t_\nu\) distribution and the Cauchy distribution \(\text{Cau}(\mu, \tau)\). Due to its heavy tails, and depending on the choice of the degrees of freedom \(\nu\), not all moments of the distribution may exist. Furthermore, it is not an exponential family.
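The heavier tails are easy to verify numerically. The following is a minimal sketch (assuming the scipy library is available; the choices \(\mu = 0\), \(\tau = 1\) and \(\nu = 3\) are arbitrary) comparing upper tail probabilities of the normal and the location-scale \(t\)-distribution:

```python
# Minimal sketch: compare upper tail probabilities of N(mu, sigma^2) with a
# location-scale t-distribution lst(mu, tau^2, nu) sharing the same location
# and scale but with nu = 3 degrees of freedom (heavier tails).
from scipy import stats

mu, scale, nu = 0.0, 1.0, 3

normal = stats.norm(loc=mu, scale=scale)
lst = stats.t(df=nu, loc=mu, scale=scale)   # location-scale t-distribution

for k in (2, 3, 4):
    x = mu + k * scale
    print(f"P(X > {x:.0f}):  normal {normal.sf(x):.5f}   lst(nu=3) {lst.sf(x):.5f}")

# The t-distribution puts substantially more mass in the tails: for example
# P(X > 3) is about 0.00135 under the normal but roughly 0.029 for nu = 3.
```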

Finally, nonparametric models are also often used to describe and analyse the observed data. Rather than specifying a parametric model for \(F\) one focuses on using the whole distribution to define meaningful statistical functionals \(\theta = g(F)\), such as the mean and the variance.

In the second part of this module (Bayesian statistics) we will encounter further distributions such as the beta distribution or the inverse gamma distribution (both exponential families). In later years of study, subsequent modules will introduce more complex models related to temporal and spatial modelling, generalised linear models (regression), and multivariate statistics and machine learning.

1.5 Finding the best models

A core task in statistical learning is to identify those distributions that explain the existing data well and that also generalise well to future yet unseen observations.

In a nonparametric setting we may simply rely on the law of large numbers, which implies that the empirical distribution \(\hat{F}_n\) constructed from the observed data \(D\) converges to the true distribution \(F\) as the sample size grows. We can therefore obtain an empirical estimator \(\hat{\theta}\) of the functional \(\theta = g(F)\) by \(\hat{\theta}= g( \hat{F}_n )\), i.e. by substituting the true distribution with the empirical distribution. This allows us, e.g., to get the empirical estimate of the mean \[ \hat{\text{E}}(x) = \hat{\mu} = \text{E}_{\hat{F}_n}(x) = \frac{1}{n} \sum_{i=1}^n x_i = \bar{x} \] and of the variance \[ \widehat{\text{Var}}(x) = \widehat{\sigma^2} = \text{E}_{\hat{F}_n}((x - \hat{\mu})^2) = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2 \] simply by replacing the expectation with the sample average.
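As a concrete sketch (in Python; the data values below are made up purely for illustration), these plug-in estimates are just sample averages:

```python
# Minimal sketch of plug-in (empirical) estimation: expectations under the
# unknown true distribution F are replaced by averages under the empirical
# distribution F_n, i.e. by sample averages over the observed data.
import numpy as np

x = np.array([4.2, 3.9, 5.1, 4.4, 4.8, 3.7])    # hypothetical observations x_1, ..., x_n
n = len(x)

mu_hat = x.sum() / n                             # empirical mean E_{F_n}(x)
var_hat = ((x - mu_hat) ** 2).sum() / n          # empirical variance (note the factor 1/n)

print(mu_hat, var_hat)
# Shortcuts: np.mean(x) and np.var(x) give the same results
# (np.var uses the 1/n factor by default, not 1/(n-1)).
```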

For parametric models we need to find estimates of the parameters that correspond to the distributions that best approximate the unknown true data generating model. One such approach is provided by the method of maximum likelihood. More precisely, given a probability distribution \(P_{\symbfit \theta}\) with density or mass function \(p(x|\symbfit \theta)\) where \(\symbfit \theta\) is a parameter vector, and \(D = \{x_1,\dots,x_n\}\) are the observed iid data (i.e. independent and identically distributed), the likelihood function is then defined as \[ L_n(\symbfit \theta| D ) =\prod_{i=1}^{n} p(x_i|\symbfit \theta) \] The parameter that maximises the likelihood is the maximum likelihood estimate.
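In practice the likelihood is usually maximised on the log scale, and often numerically. The following minimal sketch (in Python, assuming scipy and reusing the hypothetical data from the sketch above) does this for a normal model \(N(\mu, \sigma^2)\), where the result can be checked against the well-known closed-form maximum likelihood estimates (the sample mean and the variance with factor \(1/n\)):

```python
# Minimal sketch of maximum likelihood for an assumed normal model N(mu, sigma^2):
# minimise the negative log-likelihood numerically and compare with the
# closed-form MLEs (sample mean, and variance with the 1/n factor).
import numpy as np
from scipy import optimize, stats

x = np.array([4.2, 3.9, 5.1, 4.4, 4.8, 3.7])    # hypothetical iid observations

def negative_log_likelihood(params):
    mu, log_sigma = params                       # optimise log(sigma) to keep sigma > 0
    return -np.sum(stats.norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

result = optimize.minimize(negative_log_likelihood, x0=np.array([0.0, 0.0]))
mu_mle, sigma2_mle = result.x[0], np.exp(result.x[1]) ** 2

print(mu_mle, sigma2_mle)
# Agrees (up to numerical accuracy) with np.mean(x) and np.var(x).
```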

The first part of this module is devoted to exploring the method of maximum likelihood, both practically and theoretically. We start by considering the justification of the method of maximum likelihood. Historically, the likelihood function was introduced (and still often is interpreted) as the probability of observing the data given the model with specified parameters \(\symbfit \theta\). However, this view is incorrect: it breaks down for continuous random variables due to the use of densities, and even for discrete random variables an additional factor accounting for the possible permutations of the samples is needed to obtain the actual probability of the data. Instead, the true foundation of maximum likelihood lies in information theory, specifically in its close link with the KL divergence and the cross-entropy between the unknown true distribution \(F\) and the model \(P_{\symbfit \theta}\). As a result we will see that maximum likelihood extends empirical estimation to parametric models. This insight sheds light both on the optimality properties and on the limitations of maximum likelihood inference.
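The link can be sketched in a single line (a heuristic preview; the cross-entropy \(H(F, P_{\symbfit \theta})\), the KL divergence \(D_{\text{KL}}\) and the entropy \(H(F)\) are written here in generic notation and will be defined properly later in the module). By the law of large numbers the average log-likelihood converges to the negative cross-entropy, \[ \frac{1}{n} \log L_n(\symbfit \theta | D) = \frac{1}{n} \sum_{i=1}^{n} \log p(x_i | \symbfit \theta) \xrightarrow{n \to \infty} \text{E}_{F}\left( \log p(x | \symbfit \theta) \right) = - H(F, P_{\symbfit \theta}) = - D_{\text{KL}}(F, P_{\symbfit \theta}) - H(F), \] so maximising the likelihood asymptotically minimises the KL divergence between \(F\) and \(P_{\symbfit \theta}\), since \(H(F)\) does not depend on \(\symbfit \theta\).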

In the second part we then introduce the Bayesian approach to statistical estimation and inference that can be viewed as a natural extension of likelihood-based statistical analysis that overcomes some of the limitations of maximum likelihood.

The aim of this module is therefore

  1. to provide a principled introduction to maximum likelihood and Bayesian statistical analysis and
  2. to demonstrate that statistics offers a well-founded and coherent theory of information, rather than just a seemingly unrelated collection of “recipes” for data analysis (a still widespread but mistaken perception of statistics).