1 Statistical learning
Learning from data using probabilistic models lies at the heart of statistical learning. This chapter provides an overview of key foundational concepts.
1.1 What is statistics?
The following fundamental questions typically arise in any scientific data analysis:
Optimality: How do we extract information from data as efficiently and accurately as possible?
Model fit: How can we build models that accurately reflect the observed data?
Interpretability: How can we construct models that reveal underlying mechanisms and remain understandable?
Prediction: How do we use these models and information to make the best possible predictions?
Statistics is a mathematical science for reasoning about data and uncertainty. It employs probabilistic models to address the questions above, offering a principled framework for learning from data and for extracting and processing information under uncertainty in an optimal way. This includes model selection, approximation and inference.
Machine learning overlaps substantially with statistics. Rooted in computer science, it frequently adopts an engineering-centric perspective and often emphasises algorithmic approaches. Some methods are non-probabilistic, while many modern approaches adopt a statistical perspective.
Data science today comprises elements of both statistics and machine learning and brings together mathematics, computer science and domain-specific expertise (e.g. biomedical data science).
Artificial intelligence (AI) is a branch of computer science that makes substantial use of statistical and machine learning techniques, along with other methods (e.g. natural language processing and symbolic reasoning) to create systems that perceive, reason, and act.
1.2 Probabilistic underpinnings
Randomness, probability and uncertainty
Random and randomness refer to unpredictable, non-deterministic outcomes or events. Equivalent, more technical terms are stochastic and stochasticity.
The degree of randomness (or uncertainty, see below) is quantified by the probability, or equivalently, by the chance of particular outcomes.
An interesting question is the source of the randomness. On a fundamental level, some phenomena are intrinsically random (e.g. radioactive decay, measurement outcomes in quantum theory). However, much apparent randomness arises from our ignorance of the underlying mechanisms. The process may be deterministic in principle but we treat it as random for convenience. For example, a coin flip is often considered random. However, in reality the outcome of a coin flip is fully determined by classical physics.
Randomness that is not intrinsic but stems from a lack of knowledge or understanding is called uncertainty, and corresponding events are uncertain. Uncertainty generally decreases as more data and information, or a better model, become available.
Probability theory versus statistics
It is important to recognise the distinct domains of probability theory and statistics.
On the one hand, probability theory provides the mathematical underpinnings of probability and chance (e.g. probability axioms, measure theory) and corresponding models for randomness and uncertainty (e.g. probability distributions, stochastic processes). Crucially, it is neutral about the sources of randomness and interpretations of probability, and may simply be viewed as pure mathematics.
On the other hand, statistics uses probabilistic approaches to learn from observations, linking real-world phenomena with mathematical models; it is therefore a branch of applied mathematics. Importantly, statistics is concerned with uncertainty (e.g. about events, predictions, outcomes, model parameters), and with making decisions and predictions under that uncertainty, without assuming that the underlying process is actually random.
The link of statistics with the real world also leads to different interpretations of probability, with the two most common being the frequentist interpretation (ontological, “every probability is a long-run frequency and exists independently of an observer”) and the Bayesian interpretation (epistemological, “probability is a degree of belief and represents a state of knowledge”), both to be discussed later.
1.3 Model-based learning
Sketch of statistical learning
The aim of statistical learning is to use observed data in an optimal way to learn about the underlying mechanism of the data-generating process. Since data are typically finite but models can in principle be arbitrarily complex, there may be issues of over-fitting (insufficient data for the complexity of the model) as well as under-fitting (the model is too simplistic).
We observe data \(D = \{x_1, \ldots, x_n\}\) assumed to result from an underlying probabilistic model \(F\), the distribution for \(x\):
\[ \begin{array}{cc} \textbf{Real world} \\ \text{True model (unknown)} \\ F \\ \end{array} \longrightarrow \begin{array}{cc} \textbf{Data}\\ \text{Samples from true model} \\ D = \{x_1, \ldots, x_n\}\\ x_i \sim F\\ \end{array} \] The true model underlying the data-generating process is unknown and cannot be observed. However, what we can observe is data \(D\) arising from the true model \(F\) by measuring properties of interest (our observations from experiments). Sometimes we can also perturb the model and see what the effect is (interventional study).
To explain the observed data \(D\), and also to predict future data, we will make hypotheses in the form of candidate models \(P_{1}, P_{2}, \ldots\). Often these candidate models form a model family \(P_{\boldsymbol \theta}\) indexed by a parameter vector \(\boldsymbol \theta\), with specific values for each model so that we can also write \(P_{\boldsymbol \theta_1}, P_{\boldsymbol \theta_2}, \ldots\) for the various models.
Frequently parameters are chosen such that they allow some interpretation, such as moments or other properties of the distribution. However, intrinsically parameters are just labels and may be changed by any one-to-one transformation. For statistical learning it is necessary that models are identifiable within a family, i.e. each distinct model is identified by a unique parameter, so that \(P_{\boldsymbol \theta_1} = P_{\boldsymbol \theta_2}\) implies \(\boldsymbol \theta_1 = \boldsymbol \theta_2\) or, equivalently, \(\boldsymbol \theta_1 \neq \boldsymbol \theta_2\) implies \(P_{\boldsymbol \theta_1} \neq P_{\boldsymbol \theta_2}\).
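As a simple illustration of (non-)identifiability (a standard textbook example, not part of the exposition above), consider a Bernoulli model parametrised by \(\theta \in [-1, 1]\) via \[ P_{\theta}(x = 1) = \theta^2 . \] Here \(P_{\theta} = P_{-\theta}\) although \(\theta \neq -\theta\) for \(\theta \neq 0\), so the family is not identifiable; restricting the parameter space to \(\theta \in [0, 1]\) restores identifiability.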
The various candidate models \(P_{\boldsymbol \theta}\) in the model world will at best be good approximations to the true underlying data-generating model \(F\). In some cases the true model will be part of the model family, i.e. there exists a parameter \(\boldsymbol \theta_{\text{true}}\) so that \(F = P_{\boldsymbol \theta_{\text{true}}}\). However, more typically we cannot assume that the true underlying model is contained in the family. Nonetheless, even an imperfect candidate model will often provide a useful mathematical approximation and capture some important characteristics of the true model and thus will help to interpret the observed data.
\[ \begin{array}{cc} \textbf{Statistical Learning}\\ \text{Find model(s) and parameters} \\ \text{approximating the true model} \\ \text{and best explaining both} \\ \text{observed and future data} \\ F \approx P_{\hat{\boldsymbol \theta}}\\ \end{array} \longleftarrow \begin{array}{cc} \textbf{Model world} \\ \text{Hypotheses about}\\ \text{data-generating process:} \\ \text{Model } P_{\boldsymbol \theta}\\ \text{with parameter(s) } \boldsymbol \theta\\ \text{Ideally, true model is } \\ F = P_{\boldsymbol \theta_{\text{true}}}\\ \end{array} \]
The aim of statistical learning is to identify the model(s) that explain the current data and also predict future data (i.e. predict the outcomes of experiments that have not yet been conducted).
Thus a good model provides a good fit to the current data (i.e. it explains current observations well) and also to the future data (i.e. it generalises well).
A large proportion of statistical theory is devoted to finding these “good” models that avoid both over-fitting (models being too complex and not generalising well) and under-fitting (models being too simplistic and hence also not predicting well).
Typically the aim is to find an approximating model whose model complexity is well matched with the complexity of the unknown true model and also with the complexity of the observed data.
Finding the best models
A core task in statistical learning is to identify those distributions that explain the existing data well and that also generalise well to future yet unseen observations.
In a non-parametric setting we may simply rely on the law of large numbers, which implies that the empirical distribution \(\hat{F}_n\) constructed from the observed data \(D\) converges to the true distribution \(F\) as the sample size grows. We can therefore obtain an empirical estimator \(\hat{\theta}\) of the functional \(\theta = g(F)\) by \(\hat{\theta}= g( \hat{F}_n )\), i.e. by substituting the true distribution with the empirical distribution. This allows us, for example, to obtain the empirical estimate of the mean \(\text{E}_{F}(x) = \mu\) by \[ \hat{\text{E}}(x) = \hat{\mu} = \text{E}_{\hat{F}_n}(x) = \frac{1}{n} \sum_{i=1}^n x_i = \bar{x} \] and of the variance \(\text{Var}(x) = \sigma^2 = \text{E}_{F}((x - \mu)^2)\) by \[ \widehat{\text{Var}}(x) = \widehat{\sigma^2} = \text{E}_{\hat{F}_n}((x - \hat{\mu})^2) = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2 \] simply by replacing the expectation under \(F\) with the sample average under \(\hat{F}_n\).
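As a small illustration of these plug-in estimates, here is a minimal Python sketch (the data are simulated purely for this example; the variable names are arbitrary):

```python
import numpy as np

# Simulated observations standing in for the data D (values are arbitrary)
rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=2.0, size=100)

n = len(x)
mu_hat = x.sum() / n                        # empirical mean: expectation under F_n-hat
var_hat = ((x - mu_hat) ** 2).sum() / n     # empirical variance (divisor n)

print(mu_hat, var_hat)
```

Note that the variance uses the divisor \(n\), as dictated by the expectation under \(\hat{F}_n\), rather than the divisor \(n-1\) of the unbiased variance estimator.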
For parametric models we need to find estimates of the parameters that correspond to the distributions that best approximate the unknown true data-generating model. One such approach is provided by the method of maximum likelihood. More precisely, given a probability distribution \(P_{\boldsymbol \theta}\) with density or mass function \(p(x|\boldsymbol \theta)\) where \(\boldsymbol \theta\) is a parameter vector, and \(D = \{x_1,\dots,x_n\}\) are the observed iid data (i.e. independent and identically distributed), the likelihood function is defined as \[ L_n(\boldsymbol \theta) = L(\boldsymbol \theta| D ) =\prod_{i=1}^{n} p(x_i|\boldsymbol \theta) \] The parameter value \(\hat{\boldsymbol \theta}_{ML}\) that maximises the likelihood function for fixed data \(D\) is the maximum likelihood estimate: \[ \hat{\boldsymbol \theta}_{ML} = \underset{\boldsymbol \theta}{\arg \max}\, L_n(\boldsymbol \theta) \]
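To make the maximisation concrete, the following minimal sketch (assuming a normal model \(N(\mu, \sigma^2)\) and using numerical optimisation via scipy; not part of the text above) computes the maximum likelihood estimate by minimising the negative log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize

# Simulated iid data (values are arbitrary); in practice D is observed
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=100)

def neg_log_lik(params):
    """Negative log-likelihood of a normal model N(mu, sigma^2)."""
    mu, log_sigma = params              # optimise log(sigma) so that sigma > 0
    sigma = np.exp(log_sigma)
    log_density = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)
    return -np.sum(log_density)

res = minimize(neg_log_lik, x0=[0.0, 0.0])  # maximise L_n by minimising -log L_n
mu_ml, sigma2_ml = res.x[0], np.exp(res.x[1])**2

# For the normal model these agree with the empirical (plug-in) estimates:
print(mu_ml, x.mean())
print(sigma2_ml, ((x - x.mean())**2).mean())
```

For the normal model the numerical optimum agrees with the closed-form maximum likelihood estimates, which in this case coincide with the empirical estimates above.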
Historically, the likelihood was introduced as the probability of observing the data given the model with specified parameters \(\boldsymbol \theta\). However, this interpretation is not strictly correct: it breaks down for continuous random variables, where the likelihood is built from densities rather than probabilities. Furthermore, even for discrete random variables an additional factor accounting for the possible permutations of the samples is needed to obtain the actual probability of the data. Instead, as will soon become evident, the method of maximum likelihood is fundamentally linked to entropy.
Specifically, we will see that the likelihood is closely linked to the cross-entropy between the unknown true distribution \(F\) and the model \(P_{\boldsymbol \theta}\). As a consequence, the method of maximum likelihood extends empirical estimation to parametric models¹. This insight illuminates both the optimality properties and the limitations of the maximum likelihood approach to statistical learning.
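As a brief preview of this connection (a sketch only, writing \(H(F, P_{\boldsymbol \theta})\) for the cross-entropy; the full argument follows in later chapters), note that the log-likelihood scaled by \(1/n\) is the expected log-density under the empirical distribution and thus, for large \(n\), approximates the negative cross-entropy: \[ \frac{1}{n} \log L_n(\boldsymbol \theta) = \text{E}_{\hat{F}_n}\left(\log p(x|\boldsymbol \theta)\right) \approx \text{E}_{F}\left(\log p(x|\boldsymbol \theta)\right) = -H(F, P_{\boldsymbol \theta}) \] Hence maximising the likelihood approximately minimises the cross-entropy between the unknown true distribution \(F\) and the model \(P_{\boldsymbol \theta}\).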
Models and decomposition of uncertainty
When constructing statistical models we will often choose to explain some aspects of the uncertainty while intentionally ignoring others, to create simple yet effective models.
As a result, for any given model, uncertainty decomposes into the sum of
- reducible uncertainty (epistemic uncertainty): explained by model, and
- irreducible uncertainty (aleatoric / residual / intrinsic uncertainty): unexplained by model.
Crucially, even with arbitrarily large amounts of data, the total uncertainty cannot always be eliminated fully, since the residual uncertainty depends on the employed model. Consequently, to reduce the unexplained uncertainty the model itself must be changed, but whether this is at all desirable is a different matter (taking into account model complexity, interpretability, etc.).
For example, in linear regression or classification, the decomposition of uncertainty is expressed by the law of total variance, which decomposes the total variance into explained variance (“signal”, between-group variance) and unexplained variance (“noise”, within-group variance). Additional data improves the accuracy of the estimates of the explained and residual variance but does not eliminate the model’s unexplained variance. To reduce the residual error the model itself must be changed, e.g. by adding further covariates.
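The following minimal simulation (a sketch with made-up data; the coefficient and noise level are arbitrary) illustrates this decomposition for a simple linear regression:

```python
import numpy as np

# Simulated regression data (coefficient and noise level are arbitrary)
rng = np.random.default_rng(1)
n = 10_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=1.5, size=n)     # linear signal plus noise

# Ordinary least-squares fit of y on x (with intercept)
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

total_var = y.var()                # total variance of the response
explained_var = y_hat.var()        # variance explained by the fitted model
residual_var = (y - y_hat).var()   # unexplained (residual) variance

# Law of total variance: total = explained + residual (up to numerical error)
print(total_var, explained_var + residual_var)
```

Increasing \(n\) makes the estimates of the explained and residual variance more precise, but the residual variance itself stays close to \(1.5^2 = 2.25\) unless the model is changed, e.g. by adding further covariates.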
Importantly, intrinsic uncertainty should not be confused with intrinsic randomness: the latter is a claim about nature (fundamental non-determinism), whereas the former is a property of the model (residual uncertainty). Hence, unexplained uncertainty does not imply intrinsic randomness.
1.4 Further reading
The book “Ten Great Ideas About Chance” by Diaconis and Skyrms (2018) offers a gentle introduction to the history and philosophical foundations of probability as well as to the frequentist and Bayesian interpretations.
The book “Probability Theory: The Logic of Science” by Jaynes (2003) advocates that probability theory, as an extension of logic, is the natural framework for scientific reasoning.
The popular science book “The Theory That Would Not Die” by McGrayne (2011) focuses on the history of Bayes’ theorem and its importance in statistics. Similarly, “The Master Algorithm” by Domingos (2015) provides an informal overview of the various schools of information science.
For a quick recap of essential statistical concepts introduced in earlier statistical modules in year 1 and 2 see Appendix A.
Some important milestones in the development of learning from data are highlighted below:
Bayesian statistics dates back to Thomas Bayes’s 1763 essay and was further developed by Laplace in the early 19th century.
Maximum likelihood was developed by R. A. Fisher in the early 20th century, with a seminal paper published in 1922.
Links of statistical learning with entropy were established in the 1940s with roots going back to discoveries in statistical physics in the 1870s. The close link of physics and statistical learning has recently been underlined by the 2024 Nobel Prize in Physics awarded to J. J. Hopfield and G. Hinton for advances in artificial neural networks.
The first artificial neural network model appeared in the 1950s as a nonlinear input-output mapping without probabilistic foundations. Progress continued through the 1980s and accelerated after 2010 with deep learning. Today neural networks are widely used for complex data analysis and underpin many generative AI systems. Modern views treat them as high-dimensional nonlinear statistical models.
In the 1960s the theoretical foundations of algorithmic or computational learning were developed under the umbrella of Vapnik–Chervonenkis (VC) theory, with its most prominent example, the “support vector machine” (another non-probabilistic model), devised in the 1990s. Other important advances include “ensemble learning” and corresponding algorithmic approaches to classification such as “random forests”.
Classical statistics focused on settings with few variables and large sample sizes. Since about 2000, large‑scale genomics and other high‑dimensional datasets have driven rapid advances in statistics and machine learning to develop new methods to handle high‑dimensional data (many variables with moderate sample sizes) and big data (many variables and large sample sizes).
¹ Conversely, empirical estimators are, in fact, also likelihood estimators, based on an empirical likelihood function constructed from the empirical distribution.