1 Overview of statistical learning

1.1 How to learn from data?

A fundamental question is how to extract information from data in an optimal way, and to make predictions based on this information.

For this purpose, a number of competing theories of information have been developed. Statistics is the oldest science of information and is concerned with offering principled ways to learn from data and to extract and process information using probabilistic models. However, there are other theories of information (e.g. the Vapnik-Chervonenkis theory of learning, computational learning) that are more algorithmic than analytic and sometimes not even based on probability theory.

Furthermore, there are other disciplines, such as computer science and machine learning, that are closely linked with and have substantial overlap with statistics. The field of “data science” today comprises both statistics and machine learning and brings together mathematics, statistics and computer science. The growing field of so-called “artificial intelligence” also makes substantial use of statistical and machine learning techniques.

The recent popular science book “The Master Algorithm” by Domingos (2015) provides an accessible, informal overview of the various schools of the science of information. It discusses the main algorithms used in machine learning and statistics:

  • The Bayesian school of learning started as early as 1763; it later turned out to be closely linked with likelihood inference, established in 1922 by R.A. Fisher (1890–1962), and was generalised in 1951 to entropy learning by Kullback and Leibler.

  • It was also in the 1950s that the concept of the artificial neural network arose, essentially a nonlinear input-output map that works in a non-probabilistic way. The field saw another leap in the 1980s and progressed further from 2010 onwards with the development of deep learning. It is now one of the most popular (and most effective) approaches for analysing imaging data; your mobile phone, for example, most likely has a dedicated computer chip with special neural network hardware.

  • Further theories of information were developed in the 1960s under the term computational learning, most notably the Vapnik-Chervonenkis theory, with the “support vector machine” (another non-probabilistic model) as its most prominent example.

  • With the advent of large-scale genomic and other high-dimensional data there has been a surge of new and exciting developments in the fields of high-dimensional data (large dimension) and big data (large dimension and large sample size), both in statistics and in machine learning.

The connections between the various fields of information science are still not perfectly understood, but it is clear that an overarching theory will need to be based on probabilistic learning.

1.2 Probability theory versus statistical learning

When you study statistics (or any other theory of information) you need to be aware that there is a fundamental difference between probability theory and statistics, one that relates to the distinction between “randomness” and “uncertainty”.

Probability theory studies randomness, by developing mathematical models for randomness (such as probability distributions), and studying corresponding mathematical properties (including asymptotics etc). Probability theory may in fact be viewed as a branch of measure theory, and thus it belongs to the domain of pure mathematics.

Probability theory provides probabilistic generative models for data, for simulating data or for use in learning from data, i.e. inference about the model from observations. Methods and theory for how best to learn from data are the domain of applied mathematics, specifically statistics and the related areas of machine learning and data science.

Note that statistics, in contrast to probability, is in fact not at all concerned with randomness. Instead, the focus is on measuring and elucidating the uncertainty of events, predictions, outcomes and parameters, and this uncertainty measures the state of knowledge. Note that if new data or information becomes available, the state of knowledge and thus the uncertainty changes! Thus, uncertainty is an epistemological property.

The uncertainty is most often due to our ignorance of the true underlying processes (deliberate or not), not because the underlying process is actually random. The success of statistics is based on the fact that we can mathematically model the uncertainty without knowing any specifics of the underlying processes, and that we still have procedures for optimal inference despite the uncertainty.

In short, statistics is about describing the state of knowledge of the world, which may be uncertain and incomplete, and about making decisions and predictions in the face of uncertainty. This uncertainty sometimes derives from randomness but most often from our ignorance (and sometimes this ignorance even helps to create a simple yet effective model)!

1.3 Cartoon of statistical learning

We observe data \(D = \{x_1, \ldots, x_n\}\) that are assumed to have been generated by an underlying true model \(M_{\text{true}}\) with true parameters \(\boldsymbol \theta_{\text{true}}\).

To explain the data, and make predictions, we make hypotheses in the form of candidate models \(M_{1}, M_{2}, \ldots\) and corresponding parameters \(\boldsymbol \theta_1, \boldsymbol \theta_2, \ldots\). The true model itself is unknown and cannot be observed. However, what we can observe is data \(D\) from the true model, by measuring properties of objects of interest (our observations from experiments). Sometimes we can also perturb the model and see what the effect is (an interventional study).

The various candidate models \(M_1, M_2, \ldots\) in the model world will never be perfect or correct, as the true model \(M_{\text{true}}\) will be among the candidate models only in an idealised situation. However, even an imperfect candidate model will often provide a useful mathematical approximation, capture some important characteristics of the true model and thus help to interpret the observed data.

\[ \begin{array}{cc} \textbf{Hypothesis} \\ \text{How the world works} \\ \end{array} \longrightarrow \begin{array}{cc} \textbf{Model world} \\ M_1, \boldsymbol \theta_1 \\ M_2, \boldsymbol \theta_2 \\ \vdots\\ \end{array} \longrightarrow \begin{array}{cc} \textbf{Real world,} \\ \textbf{unknown true model} \\ M_{\text{true}}, \boldsymbol \theta_{\text{true}} \\ \end{array} \longrightarrow \textbf{Data } x_1, \ldots, x_n \]

The aim of statistical learning is to identify the model(s) that explain the current data and also predict future data (i.e. predict the outcome of experiments that have not yet been conducted).

Thus a good model provides a good fit to the current data (i.e. it explains current observations well) and also to future data (i.e. it generalises well).

A large proportion of statistical theory is devoted to finding these “good” models, which avoid both overfitting (models that are too complex and do not generalise well) and underfitting (models that are too simplistic and hence also do not predict well).

Typically the aim is to find a model whose complexity matches the complexity of the unknown true model and also the complexity of the data observed from it.
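To make this concrete, here is a minimal sketch (a hypothetical Python example, not part of the module material): data are simulated from an assumed “true” quadratic model, and candidate polynomial models of increasing degree are fitted by least squares. The overly simple model fits poorly, while the overly complex model fits the observed data most closely but generalises worst to fresh data from the same true model.

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical "true" model: quadratic trend plus Gaussian noise
n = 30
x = np.linspace(-2, 2, n)
y = 1.0 + 0.5 * x - 1.5 * x**2 + rng.normal(scale=1.0, size=n)

# fresh data from the same true model, used to assess generalisation
x_new = np.linspace(-2, 2, 200)
y_new = 1.0 + 0.5 * x_new - 1.5 * x_new**2 + rng.normal(scale=1.0, size=200)

for degree in (1, 2, 10):  # under-parameterised, well-matched, over-parameterised
    coef = np.polyfit(x, y, deg=degree)                        # least-squares fit
    fit_mse = np.mean((np.polyval(coef, x) - y) ** 2)          # fit to current data
    gen_mse = np.mean((np.polyval(coef, x_new) - y_new) ** 2)  # fit to future data
    print(f"degree {degree:2d}: current-data MSE {fit_mse:.2f}, new-data MSE {gen_mse:.2f}")
```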

1.4 Likelihood

In statistics and machine learning most models that are being used are probabilistic to take account of both randomness and uncertainty. A core task in statistical learning is to identify those models that explain the existing data well and that also generalise well to unseen data.

For this we need, among other things, a measure of how well a candidate model approximates the (typically unknown) true data-generating model, and an approach for choosing the best model(s). One such approach is provided by the method of maximum likelihood, which enables us to estimate the parameters of models and to find the particular model that best fits the data.

Given a probability distribution \(P_{\boldsymbol \theta}\) with density or mass function \(p(x|\boldsymbol \theta)\), where \(\boldsymbol \theta\) is a parameter vector, and \(D = \{x_1,\ldots,x_n\}\) are the observed iid data (i.e. independent and identically distributed), the likelihood function is defined as \[ L_n(\boldsymbol \theta| D ) =\prod_{i=1}^{n} p(x_i|\boldsymbol \theta) \] Typically, instead of the likelihood, one uses the log-likelihood function: \[ l_n(\boldsymbol \theta| D) = \log L_n(\boldsymbol \theta| D) = \sum_{i=1}^n \log p(x_i|\boldsymbol \theta) \] Reasons for preferring the log-likelihood (rather than the likelihood) include that

  • the log-density is in fact the more “natural” and relevant quantity (this will become clear in the upcoming chapters) and that
  • addition is numerically more stable than multiplication on a computer.
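
As an illustration of these definitions, the following sketch (a hypothetical example; the data, parameter values and grid are invented for illustration) evaluates the log-likelihood \(l_n(\boldsymbol \theta| D)\) of iid normal data over a grid of candidate means, with the standard deviation treated as known. The grid maximiser agrees with the analytical maximum likelihood estimate, the sample mean.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
D = rng.normal(loc=5.0, scale=2.0, size=100)   # observed iid data x_1, ..., x_n

def log_lik(mu, sigma, data):
    """Log-likelihood l_n(theta | D) = sum_i log p(x_i | theta) for the normal model."""
    return np.sum(norm.logpdf(data, loc=mu, scale=sigma))

# evaluate the log-likelihood on a grid of candidate means (sigma treated as known)
mu_grid = np.linspace(3.0, 7.0, 401)
ll = np.array([log_lik(mu, 2.0, D) for mu in mu_grid])

mu_hat = mu_grid[np.argmax(ll)]                # numerical maximiser of the log-likelihood
print(f"grid maximiser: {mu_hat:.3f}, sample mean: {D.mean():.3f}")
```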

For discrete random variables, for which \(p(x |\boldsymbol \theta)\) is a probability mass function, the likelihood is often interpreted as the probability of observing the data given the model with specified parameters \(\boldsymbol \theta\). In fact, this was indeed how the likelihood was historically introduced. However, this view is not strictly correct. First, given that the samples are iid and thus the ordering of the \(x_i\) is not important, an additional factor accounting for the possible permutations is needed in the likelihood to obtain the actual probability of the data. Moreover, for continuous random variables this interpretation breaks down because the likelihood is built from densities rather than probability mass functions. Thus, the view that the likelihood is the probability of the data is in fact too simplistic.
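The continuous case is easy to check numerically: in the sketch below (a hypothetical example with invented values), the normal density evaluated at data tightly clustered around zero exceeds 1 at every observation, and so does the product of these values, so the likelihood cannot itself be a probability.

```python
import numpy as np
from scipy.stats import norm

x = np.array([0.01, -0.02, 0.00])         # three iid observations
dens = norm.pdf(x, loc=0.0, scale=0.05)   # density values, each greater than 1 here
print(dens)                               # roughly [7.8, 7.4, 8.0]
print(np.prod(dens))                      # the likelihood far exceeds 1, so it is not a probability
```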

In the next chapter we will see that the justification for using likelihood rather stems from its close link to the Kullback-Leibler information and cross-entropy. This also helps to understand why using likelihood for estimation is only optimal in the limit of large sample size.

In the first part of the MATH28082 “Statistical Methods” module we will study likelihood estimation and inference in much detail. We will provide links to related methods of inference and discuss its information-theoretic foundations. We will also discuss the optimality properties as well as the limitations of likelihood inference. Extensions of likelihood analysis, in particular Bayesian learning, will be discussed in the second part of the module. In the third part of the module we will apply statistical learning to linear regression.