1 Statistical learning
Learning from data using probabilistic models lies at the heart of statistical learning. This chapter provides an overview of key foundational concepts.
1.1 What is statistics?
The following fundamental questions typically arise in any scientific data analysis:
Optimality: How do we extract information from data as efficiently and accurately as possible?
Model fit: How can we build models that accurately reflect the observed data?
Interpretability: How can we construct models that reveal underlying mechanisms and remain understandable?
Prediction: How do we use these models and information to make the best possible predictions?
Statistics is a mathematical science for reasoning about data and uncertainty. It employs probabilistic models to address the questions above, offering a principled framework for learning from data and for extracting and processing information under uncertainty in an optimal way. This includes model selection, approximation and inference.
Machine learning overlaps substantially with statistics. Rooted in computer science, it frequently adopts an engineering-centric perspective and often emphasises algorithmic approaches. Some methods are non-probabilistic, while many modern approaches adopt a statistical perspective.
Data science today comprises elements of both statistics and machine learning and brings together mathematics, computer science and domain-specific expertise (e.g. biomedical data science).
Artificial intelligence (AI) is a branch of computer science that makes substantial use of statistical and machine learning techniques, along with other methods (e.g. natural language processing and symbolic reasoning) to create systems that perceive, reason, and act.
1.2 Probabilistic underpinnings
Randomness, probability and uncertainty
Random and randomness refer to unpredictable, non-deterministic outcomes or events. Equivalent, more technical terms are stochastic and stochasticity.
The degree of randomness (or uncertainty, see below) is quantified by the probability, or equivalently, by the chance of particular outcomes.
An interesting question is the source of the randomness. On a fundamental level, some phenomena are intrinsically random (e.g. radioactive decay, measurement outcomes in quantum theory). However, much apparent randomness arises from our ignorance of the underlying mechanisms. The process may be deterministic in principle but we treat it as random for convenience. For example, a coin flip is often considered random. However, in reality the outcome of a coin flip is fully determined by classical physics.
Randomness that is not intrinsic but stems from a lack of knowledge or understanding is called uncertainty, and corresponding events are uncertain. Uncertainty generally decreases as more data and information, or a better model, become available.
Probability theory versus statistics
It is important to recognise the distinct domains of probability theory and statistics.
On the one hand, probability theory provides the mathematical underpinnings of probability and chance (e.g. probability axioms, measure theory) and corresponding models for randomness and uncertainty (e.g. probability distributions, stochastic processes). Crucially, it is neutral about the sources of randomness and interpretations of probability, and may simply be viewed as pure mathematics.
On the other hand, statistics uses probabilistic approaches to learn from observations, linking real-world phenomena with mathematical models; it is therefore a branch of applied mathematics. Importantly, statistics is concerned with uncertainty (e.g. about events, predictions, outcomes, model parameters), and with making decisions and predictions under that uncertainty, without assuming that the underlying process is actually random.
The link of statistics with the real world also leads to different interpretations of probability, with the two most common being the frequentist interpretation (ontological, “every probability is a long-run frequency and exists independently of an observer”) and the Bayesian interpretation (epistemological, “probability is a degree of belief and represents a state of knowledge”), both to be discussed later.
1.3 Model-based learning
Sketch of statistical learning
The aim of statistical learning is to use observed data in an optimal way to learn about the underlying mechanism of the data-generating process. Since data are typically finite but models can in principle be arbitrarily complex, there may be issues of over-fitting (insufficient data for the complexity of the model) as well as under-fitting (the model is too simplistic).
We observe data \(D = \{x_1, \ldots, x_n\}\) assumed to result from an underlying probabilistic model \(F\), the distribution for \(x\):
\[ \begin{array}{cc} \textbf{Real world} \\ \text{True model (unknown)} \\ F \\ \end{array} \longrightarrow \begin{array}{cc} \textbf{Data}\\ \text{Samples from true model} \\ D = \{x_1, \ldots, x_n\}\\ x_i \sim F\\ \end{array} \] The true model underlying the data-generating process is unknown and cannot be observed. However, what we can observe is data \(D\) arising from the true model \(F\) by measuring properties of interest (our observations from experiments). Sometimes we can also perturb the model and see what the effect is (interventional study).
To explain the observed data \(D\), and also to predict future data, we will make hypotheses in the form of candidate models \(P_{1}, P_{2}, \ldots\). Often these candidate models form a model family \(P_{\boldsymbol \theta}\) indexed by a parameter vector \(\boldsymbol \theta\), with specific values for each model so that we can also write \(P_{\boldsymbol \theta_1}, P_{\boldsymbol \theta_2}, \ldots\) for the various models.
Frequently parameters are chosen such that they allow some interpretation, such as moments or other properties of the distribution. However, intrinsically parameters are just labels and may be changed by any one-to-one transformation. For statistical learning it is necessary that models are identifiable within a family, i.e. each distinct model is identified by a unique parameter, so that \(P_{\boldsymbol \theta_1} = P_{\boldsymbol \theta_2}\) implies \(\boldsymbol \theta_1 = \boldsymbol \theta_2\) or, equivalently, \(\boldsymbol \theta_1 \neq \boldsymbol \theta_2\) implies \(P_{\boldsymbol \theta_1} \neq P_{\boldsymbol \theta_2}\).
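As a simple illustration of (non-)identifiability (a standard textbook example, not part of the exposition above), consider a Bernoulli model parametrised by \(\theta \in [-1, 1]\) via \[ P_{\theta}(x = 1) = \theta^2 . \] Here \(P_{\theta} = P_{-\theta}\) although \(\theta \neq -\theta\) for \(\theta \neq 0\), so the family is not identifiable; restricting the parameter space to \(\theta \in [0, 1]\) restores identifiability.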
The various candidate models \(P_{\boldsymbol \theta}\) in the model world will at best be good approximations to the true underlying data-generating model \(F\). In some cases the true model will be part of the model family, i.e. there exists a parameter \(\boldsymbol \theta_{\text{true}}\) so that \(F = P_{\boldsymbol \theta_{\text{true}}}\). However, more typically we cannot assume that the true underlying model is contained in the family. Nonetheless, even an imperfect candidate model will often provide a useful mathematical approximation and capture some important characteristics of the true model and thus will help to interpret the observed data.
\[ \begin{array}{cc} \textbf{Statistical Learning}\\ \text{Find model(s) and parameters} \\ \text{approximating the true model} \\ \text{and best explaining both} \\ \text{observed and future data} \\ F \approx P_{\hat{\boldsymbol \theta}}\\ \end{array} \longleftarrow \begin{array}{cc} \textbf{Model world} \\ \text{Hypotheses about}\\ \text{data-generating process:} \\ \text{Model } P_{\boldsymbol \theta}\\ \text{with parameter(s) } \boldsymbol \theta\\ \text{Ideally, true model is } \\ F = P_{\boldsymbol \theta_{\text{true}}}\\ \end{array} \]
The aim of statistical learning is to identify the model(s) that explain the current data and also predict future data (i.e. predict the outcomes of experiments that have not yet been conducted).
Thus a good model provides a good fit to the current data (i.e. it explains current observations well) and also to the future data (i.e. it generalises well).
A large proportion of statistical theory is devoted to finding these “good” models that avoid both over-fitting (models being too complex and not generalising well) and under-fitting (models being too simplistic and hence also not predicting well).
Typically the aim is to find an approximating model whose model complexity is well matched with the complexity of the unknown true model and also with the complexity of the observed data.
Finding the best models
A core task in statistical learning is to identify those distributions that explain the existing data well and that also generalise well to future yet unseen observations.
In a non-parametric setting we may simply rely on the law of large numbers, which implies that the empirical distribution \(\hat{F}_n\) constructed from the observed data \(D\) converges to the true distribution \(F\) as the sample size grows. We can therefore obtain an empirical estimator \(\hat{\theta}\) of the functional \(\theta = g(F)\) by \(\hat{\theta}= g( \hat{F}_n )\), i.e. by substituting the true distribution with the empirical distribution. This allows us, for example, to obtain the empirical estimate of the mean \(\text{E}_{F}(x) = \mu\) by \[ \hat{\text{E}}(x) = \hat{\mu} = \text{E}_{\hat{F}_n}(x) = \frac{1}{n} \sum_{i=1}^n x_i = \bar{x} \] and of the variance \(\text{Var}(x) = \sigma^2 = \text{E}_{F}((x - \mu)^2)\) by \[ \widehat{\text{Var}}(x) = \widehat{\sigma^2} = \text{E}_{\hat{F}_n}((x - \hat{\mu})^2) = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2 \] simply by replacing the expectation under \(F\) with the sample average under \(\hat{F}_n\).
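As a small illustration of these plug-in estimates, here is a minimal Python sketch (the data are simulated purely for this example; the variable names are arbitrary):

```python
import numpy as np

# Simulated observations standing in for the data D (values are arbitrary)
rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=2.0, size=100)

n = len(x)
mu_hat = x.sum() / n                        # empirical mean: expectation under F_n-hat
var_hat = ((x - mu_hat) ** 2).sum() / n     # empirical variance (divisor n)

print(mu_hat, var_hat)
```

Note that the variance uses the divisor \(n\), as dictated by the expectation under \(\hat{F}_n\), rather than the divisor \(n-1\) of the unbiased variance estimator.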
For parametric models we need to find estimates of the parameters that correspond to the distributions that best approximate the unknown true data-generating model. One such approach is provided by the method of maximum likelihood. More precisely, given a probability distribution \(P_{\boldsymbol \theta}\) with density or mass function \(p(x|\boldsymbol \theta)\) where \(\boldsymbol \theta\) is a parameter vector, and \(D = \{x_1,\dots,x_n\}\) are the observed iid data (i.e. independent and identically distributed), the likelihood function is defined as \[ L_n(\boldsymbol \theta) = L(\boldsymbol \theta| D ) =\prod_{i=1}^{n} p(x_i|\boldsymbol \theta) \] The parameter value \(\hat{\boldsymbol \theta}_{ML}\) that maximises the likelihood function for fixed data \(D\) is the maximum likelihood estimate: \[ \hat{\boldsymbol \theta}_{ML} = \underset{\boldsymbol \theta}{\arg \max}\, L_n(\boldsymbol \theta) \]
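To make the maximisation concrete, the following minimal sketch (assuming a normal model \(N(\mu, \sigma^2)\) and using numerical optimisation via scipy; not part of the text above) computes the maximum likelihood estimate by minimising the negative log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize

# Simulated iid data (values are arbitrary); in practice D is observed
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=100)

def neg_log_lik(params):
    """Negative log-likelihood of a normal model N(mu, sigma^2)."""
    mu, log_sigma = params              # optimise log(sigma) so that sigma > 0
    sigma = np.exp(log_sigma)
    log_density = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)
    return -np.sum(log_density)

res = minimize(neg_log_lik, x0=[0.0, 0.0])  # maximise L_n by minimising -log L_n
mu_ml, sigma2_ml = res.x[0], np.exp(res.x[1])**2

# For the normal model these agree with the empirical (plug-in) estimates:
print(mu_ml, x.mean())
print(sigma2_ml, ((x - x.mean())**2).mean())
```

For the normal model the numerical optimum agrees with the closed-form maximum likelihood estimates, which in this case coincide with the empirical estimates above.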
Historically, the likelihood was introduced as the probability of observing the data given the model with specified parameters \(\boldsymbol \theta\). However, this interpretation is not strictly correct: it breaks down for continuous random variables, where the likelihood is built from densities rather than probabilities. Furthermore, even for discrete random variables an additional factor accounting for the possible permutations of the samples is needed to obtain the actual probability of the data. Instead, as will soon become evident, the method of maximum likelihood is fundamentally linked to entropy.
Specifically, we will see that the likelihood is closely linked to the cross-entropy between the unknown true distribution \(F\) and the model \(P_{\boldsymbol \theta}\). As a consequence, the method of maximum likelihood extends empirical estimation to parametric models¹. This insight illuminates both the optimality properties and the limitations of the maximum likelihood approach to statistical learning.
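As a brief preview of this connection (a sketch only, writing \(H(F, P_{\boldsymbol \theta})\) for the cross-entropy; the full argument follows in later chapters), note that the log-likelihood scaled by \(1/n\) is the expected log-density under the empirical distribution and thus, for large \(n\), approximates the negative cross-entropy: \[ \frac{1}{n} \log L_n(\boldsymbol \theta) = \text{E}_{\hat{F}_n}\left(\log p(x|\boldsymbol \theta)\right) \approx \text{E}_{F}\left(\log p(x|\boldsymbol \theta)\right) = -H(F, P_{\boldsymbol \theta}) \] Hence maximising the likelihood approximately minimises the cross-entropy between the unknown true distribution \(F\) and the model \(P_{\boldsymbol \theta}\).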
Models and decomposition of uncertainty
When constructing statistical models we will often choose to explain some aspects of the uncertainty while intentionally ignoring others, to create simple yet effective models.
As a result, for any given model, uncertainty decomposes into the sum of
- reducible uncertainty (epistemic uncertainty): explained by model, and
- irreducible uncertainty (aleatoric / residual / intrinsic uncertainty): unexplained by model.
Crucially, even with arbitrarily large amounts of data, the total uncertainty cannot always be eliminated fully, since the residual uncertainty depends on the employed model. Consequently, to reduce the unexplained uncertainty the model itself must be changed, but whether this is at all desirable is a different matter (taking into account model complexity, interpretability, etc.).
For example, in linear regression or classification, the decomposition of uncertainty is expressed by the law of total variance, which decomposes the total variance into explained variance (“signal”, between-group variance) and unexplained variance (“noise”, within-group variance). Additional data improves the accuracy of the estimates of the explained and residual variance but does not eliminate the model’s unexplained variance. To reduce the residual error the model itself must be changed, e.g. by adding further covariates.
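The following minimal simulation (a sketch with made-up data; the coefficient and noise level are arbitrary) illustrates this decomposition for a simple linear regression:

```python
import numpy as np

# Simulated regression data (coefficient and noise level are arbitrary)
rng = np.random.default_rng(1)
n = 10_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=1.5, size=n)     # linear signal plus noise

# Ordinary least-squares fit of y on x (with intercept)
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

total_var = y.var()                # total variance of the response
explained_var = y_hat.var()        # variance explained by the fitted model
residual_var = (y - y_hat).var()   # unexplained (residual) variance

# Law of total variance: total = explained + residual (up to numerical error)
print(total_var, explained_var + residual_var)
```

Increasing \(n\) makes the estimates of the explained and residual variance more precise, but the residual variance itself stays close to \(1.5^2 = 2.25\) unless the model is changed, e.g. by adding further covariates.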
Importantly, intrinsic uncertainty should not be confused with intrinsic randomness: the latter is a claim about nature (fundamental non-determinism), whereas the former is a property of the model (residual uncertainty). Hence, unexplained uncertainty does not imply intrinsic randomness.
1.4 Further reading
The book “Ten Great Ideas About Chance” by Diaconis and Skyrms (2018) offers a gentle introduction to the history and philosophical foundations of probability as well as to the frequentist and Bayesian interpretations.
The book “Probability Theory: The Logic of Science” by Jaynes (2003) advocates that probability theory, as an extension of logic, is the natural framework for scientific reasoning.
The popular science book “The Theory That Would Not Die” by McGrayne (2011) focuses on the history of Bayes’ theorem and its importance in statistics. Similarly, “The Master Algorithm” by Domingos (2015) provides an informal overview of the various schools of information science.
For a quick recap of essential statistical concepts introduced in earlier statistical modules in year 1 and 2 see Appendix A.
Some important milestones in the development of learning from data are highlighted below:
Bayesian statistics dates back to Thomas Bayes’s 1763 essay and was further developed by Laplace in the early 19th century.
Maximum likelihood was developed by R. A. Fisher in the early 20th century, with a seminal paper published in 1922.
Links of statistical learning with entropy were established in the 1940s with roots going back to discoveries in statistical physics in the 1870s. The close link of physics and statistical learning has recently been underlined by the 2024 Nobel Prize in Physics awarded to J. J. Hopfield and G. Hinton for advances in artificial neural networks.
The first artificial neural network model appeared in the 1950s as a nonlinear input-output mapping without probabilistic foundations. Progress continued through the 1980s and accelerated after 2010 with deep learning. Today neural networks are widely used for complex data analysis and underpin many generative AI systems. Modern views treat them as high-dimensional nonlinear statistical models.
In the 1960s the theoretical foundations of algorithmic or computational learning were developed under the umbrella of Vapnik–Chervonenkis (VC) theory, with its most prominent example, the “support vector machine” (another non-probabilistic model), devised in the 1990s. Other important advances include “ensemble learning” and corresponding algorithmic approaches to classification such as “random forests”.
Classical statistics focused on settings with few variables and large sample sizes. Since about 2000, large‑scale genomics and other high‑dimensional datasets have driven rapid advances in statistics and machine learning to develop new methods to handle high‑dimensional data (many variables with moderate sample sizes) and big data (many variables and large sample sizes).
¹ Conversely, empirical estimators are, in fact, also likelihood estimators, based on an empirical likelihood function constructed from the empirical distribution.