3 Unsupervised learning and clustering

3.1 Challenges in supervised learning

3.1.1 Objective

We observe data \(\boldsymbol x_1, \ldots, \boldsymbol x_n\) for \(n\) objects (or subjects). Each sample \(\boldsymbol x_i\) is a vector of dimension \(d\). Thus, for each of the \(n\) objects / subjects we have measurements on \(d\) variables. The aim of unsupervised learning is to identify patters relating the objects/subjects based on the information available in \(\boldsymbol x_i\). Note that unsupervised learning only uses this information and nothing else.

For illustration consider the first two principal components of the Iris flower data (see e.g. Worksheet 4):

Clearly there is a group structure among the samples that is linked to particular patterns in the first two principal components.

Note that in this plot we have used additional information, the class labels (setosa, versicolor, virginica), to highlighting the true underlying structure (the three flower species).

In unsupervised learning the class labels are (assumed to be) unknown, and the aim is to infer the clustering and thus the classes labels.8

There are many methods for clustering and unsupervise learning, both purely algorithmic as well as probabilistic. In this chapter we will study a few of the most commonly used approaches.

3.1.2 Questions and problems

In order to implement unsupervised learning we need to address a number of questions:

  • how do we define clusters?
  • how do we learn / infer clusters?
  • how many clusters are there? (this is surprisingly difficult!)
  • how can we assess the uncertainty of clusters?

Once we know the clusters we are also interested in:

  • which features define / separate each cluster?

(note this is a feature / variable selection problem, discussed in in supervised learning).

Many of these problems and questions are highly specific to the data at hand. Correspondingly, there are many different types of methods and models for clustering and unsupervised learning.

In terms of representing the data, unsupervised learning tries to balance between the following two extremes:

  1. all objects are grouped into a single cluster (low complexity model)
  2. all objects are put into their own cluster (high complexity model)

In practise, the aim is to find a compromise, i.e. a model that captures the structure in the data with appropriate complexity — not too low and not too complex.

3.1.3 Why is clustering difficult?

Partioning problem (combinatorics): How many partitions of \(n\) objects (say flowers) into \(K\) groups (say species) exists?


\[ S(n,K) = \left\{\begin{array}{l} n \\ K \end{array} \right\} \] this is the “Sterling number of the second type”.

For large n: \[ S(n,K) \approx \frac{K^n }{ K!} \] Example:

\(n\) \(K\) Number of possible partitions
15 3 \(\approx\) 2.4 million (\(10^6\))
20 4 \(\approx\) 2.4 billion (\(10^9\))
100 5 \(\approx 6.6 \times 10^{76}\)

These are enormously big numbers even for relatively small problems!

\(\Longrightarrow\) Clustering / partitioning / structure discovery is not easy!

\(\Longrightarrow\) We cannot expect perfect answers or a single “true” clustering

In fact, as a model of the data many differnt clusterings may fit the data equally well.

\(\Longrightarrow\) We need to assesse the uncertainty of the clustering

This can be done as part of probabilistic modelling or by resampling (e.g., bootstrap).

3.1.4 Common types of clustering methods

There are very many different clustering algorithms!

We consider the following two broad types of methods:

  1. Algorithmic clustering methods (these are not explicitly based on a probabilistic model)
  • \(K\)-means
  • PAM
  • hierarchical clustering (distance or similarity-based, divise and agglomerative)

pros: fast, effective algorithms to find at least some grouping cons: no probabilistic interpretation, blackbox methods

  1. Model-based clustering (based on a probabilistic model)
  • mixture models (e.g. Gaussian mixture models, GMMs, non-hierarchical)
  • graphical models (e.g. Bayesian networks, Gaussian graphical models GGM, trees and networks)

pros: full probabilistic model with all corresponding advantages cons: computationally very expensive, sometimes impossible to compute exactly.

3.2 Hierarchical clustering

3.2.1 Tree-like structures

Often, categorisations of objects are nested, i.e. there sub-categories of categories etc. These can be naturally represented by tree-like hierarchical structures.

In many branches of science hierarchical clusterings are widely employed, for example in evolutionary biology: see e.g. 

  • Tree of Life explaining the biodiversity of life
  • phylogenetic trees among species (e.g. vertebrata)
  • population genetic trees to describe human evolution
  • taxonomic trees for plant species
  • etc.

Note that when visualising hierarchical structures typically the corresponding tree is depicted facing downwards, i.e. the root of the tree is shown on the top, and the tips/leaves of the tree are shown at the bottom!

In order to obtain such a hierarchical clustering from data two opposing strategies are commonly used:

  1. divisive or recursive partitioning algorithms
    • grow the tree from the root downwards
    • first determine the main two clusters, then recursively refine the clusters further.
  2. agglomerative algorithms
    • grow the tree from the leafs upwards
    • successively form partitions by first joining individual object together, then recursively join groups of items together, until all is merged.

In the following we discuss a number of popular hierarchical agglomerative clustering algorithms that are based on the pairwise distances / similarities (a \(n \times n\) matrix) among all data points.

3.2.2 Agglomerative hierarchical clustering algorithms

A general algorithm for agglomerative construction of a hierarchical clustering works as follows:


Compute a dissimilarity / distance matrix between all pairs of objects where “objects” are single data points at this stage but later are also be sets of data points.

Iterative procedure:

  1. identify the pair of objects with the smallest distance. These two objects are then merged together into a common set. Create an internal node in the tree to describe this coalescent event.

  2. update the distance matrix by computing the distances between the new set and all other objects. If the new set contains all data points the procedure terminates.

For actual implementation of this algorithm two key ingredients are needed:

  1. a distance measure \(d(\boldsymbol a, \boldsymbol b)\) between two individual elementary data points \(\boldsymbol a\) and \(\boldsymbol b\).

This is typically one of the following:

  • Euclidean distance \(d(\boldsymbol a, \boldsymbol b) = \sqrt{\sum_{i=1}^d ( a_i-b_i )^2} = \sqrt{(\boldsymbol a-\boldsymbol b)^T (\boldsymbol a-\boldsymbol b)}\)
  • Manhattan distance \(d(\boldsymbol a, \boldsymbol b) = \sum_{i=1}^d | a_i-b_i |\)
  • Maximum norm \(d(\boldsymbol a, \boldsymbol b) = \underset{i \in \{1, \ldots, d\}}{\max} | a_i-b_i |\)

In the end, making the correct choice of distance will require subject knowledge about the data!

  1. a distance measure \(d(A, B)\) between two sets of objects \(A=\{\boldsymbol a_1, \boldsymbol a_2, \ldots, \boldsymbol a_{n_A} \}\) and \(B=\{\boldsymbol b_1, \boldsymbol b_2, \ldots, \boldsymbol b_{n_B}\}\) of size \(n_A\) and \(n_B\), respectively.

To determine the distance \(d(A, B)\) between these two sets the following measures are often employed:

  • complete linkage (max. distance): \(d(A, B) = \underset{\boldsymbol a_i \in A, \boldsymbol b_i \in B}{\max} d(\boldsymbol a_i, \boldsymbol b_i)\)
  • single linkage (min. distance): \(d(A, B) = \underset{\boldsymbol a_i \in A, \boldsymbol b_i \in B}{\min} d(\boldsymbol a_i, \boldsymbol b_i)\)
  • average linkage (avg. distance): \(d(A, B) = \frac{1}{n_A n_B} \sum_{\boldsymbol a_i \in A} \sum_{\boldsymbol b_i \in B} d(\boldsymbol a_i, \boldsymbol b_i)\)

3.2.3 Ward’s clustering method

Another agglomerative hierarchical procedure is Ward’s minimum variance approach. In this approach in each iteration the two sets \(A\) and \(B\) are merged that lead to the smallest increase in within-group variation, or equivalenty, 5h3 total within-group sum of squares (cf. \(K\)-means). The centroids of the two sets is given by \(\boldsymbol \mu_A = \frac{1}{n_A} \sum_{\boldsymbol a_i \in A} \boldsymbol a_i\) and \(\boldsymbol \mu_B = \frac{1}{n_B} \sum_{\boldsymbol b_i \in B} \boldsymbol b_i\).

The within-group sum of squares for group \(A\) is \[ w_A = \sum_{\boldsymbol a_i \in A} (\boldsymbol a_i -\boldsymbol \mu_A)^T (\boldsymbol a_i -\boldsymbol \mu_A) \] and is computed here on the basis of the difference of the observations \(\boldsymbol a_i\) relative to their mean \(\boldsymbol \mu_A\). However, it is also possible to compute it from the pairwise differences between the observations using \[ w_A = \frac{1}{n_A} \sum_{\boldsymbol a_i, \boldsymbol a_j \in A, i < j} (\boldsymbol a_i -\boldsymbol a_j)^T (\boldsymbol a_i -\boldsymbol a_j) \] This trick is used in Ward’s clustering method by constructing a distance measure between to sets \(A\) and \(B\) as \[ d(A, B) = w_{A \cup B} - w_A -w_B \, . \] Correspondingly, the distance between two elementary data points \(\boldsymbol a\) and \(\boldsymbol b\) is the squared Euclidean distance \[ d(\boldsymbol a, \boldsymbol b) = (\boldsymbol a- \boldsymbol b)^T (\boldsymbol a- \boldsymbol b) \, . \]

3.2.4 Application to Swiss banknote data set

This data set is reports 6 pysical measurements on 200 Swiss bank notes. Of the 200 notes 100 are genuine and 100 are counterfeit. The measurements are: length, left width, right width, bottom margin, top margin, diagonal length of the bank notes.

Plotting the first to PCAs of this data shows that there are indeed two well defined groups, and that these groups correspond precisely to the genuine and counterfeit banknotes:

We now compare the hierarchical clusterings of the Swiss bank note data using four different methods using Euclidean distance.

An interactive R Shiny web app of this analysis (which also allows to explore further distance measures) is available online at https://minerva.it.manchester.ac.uk/shiny/strimmer/hclust/ .

Ward.D2 (=Ward’s method):

Average linkage:

Complete linkage:

Single linkage:


  • All four trees / hierarchical clusterings are quite different!
  • The Ward.D2 method is the only one that finds the correct grouping (except for a single error).

3.2.5 Assessment of the uncertainty of hierarchical clusterings

In practical application of hierarchical clustering methods is is essential to evaluate the stability and uncertainty of the obtained groupings. This is often done as follows using the “bootstrap”:

  • Sampling with replacement is used to generate a number of so-called bootstrap data sets (say \(B=200\)) similar to the original one. Specifically, we create new data matrices by repeately randomly selecting columns (variables) from the original data matrix for inclusion in the bootstrap data matrix. Note that we sample columns as our aim is to cluster the samples.
  • Subsequently, a hierarchical clustering is computed for each of the bootstrap data sets. As a result, we now have an “ensemble” of \(B\) bootstrap trees.
  • Finally, analysis of the clusters (bipartions) shown in all the bootstrap trees allows to count the clusters that appear frequently, and also those that appear less frequently. These counts provide a measure of the stability of the clusterings appearing in the original tree.
  • Additionally, from the bootstrap tree we can also compute a consensus tree containing the most stable clusters. This an be viewed as an “ensemble average” of all the bootstrap trees.

A disadvantage of this procedure is that bootstrapping trees is computationally very very expensive, as the original procedure is already time consuming but now needs to be repeated a large number of times.

3.3 \(K\)-means clustering

3.3.1 General aims

  • Partition the data into \(K\) groups, with \(K\) given in advance
  • The groups are non-overlapping, so each of the \(n\) data points / objects \(\boldsymbol x_i\) is assigned to exactly one of the \(K\) groups
  • maximise the homogeneity with each group (i.e. each group should contain similar objects)
  • maximise the heterogeneity among the different groups (i.e each group should differ from the other groups)

3.3.2 Algorithm

For each group \(k \in \{1, \ldots, K\}\) we assume a group mean \(\boldsymbol \mu_k\).
After running \(K\)-means we will get estimates of \(\hat{\boldsymbol \mu}_k\) of the group means, as well as an allocation of each data point to one of the classes.


At the start of the algorithm the \(n\) observations \(\boldsymbol x_1, \ldots, \boldsymbol x_n\) are randomly allocated with equal probability to one of the \(K\) groups. The resulting assignment is given by the function \(C(\boldsymbol x_i) \in \{1, \ldots, K\}\). With \(G_k = \{ i | C(\boldsymbol x_i) = k\}\) we denote the set of indices of the data points in cluster \(k\), and with \(n_k = | G_k |\) the number of samples in cluster \(k\).

Iterative refinement:

  1. estimate the group means by \[ \hat{\boldsymbol \mu}_k = \frac{1}{n_k} \sum_{i \in G_k} \boldsymbol x_i \]
  2. update group assignment: each data point \(\boldsymbol x_i\) is (re)assigned to the group \(k\) with the nearest \(\hat{\boldsymbol \mu}_k\) (in terms of the Euclidean norm). Specifically, the assignment \(C(\boldsymbol x_i)\) is updated to \[ \begin{split} C(\boldsymbol x_i) & = \underset{k}{\arg \min} \, | \boldsymbol x_i-\hat{\boldsymbol \mu}_k |_2 \\ & = \underset{k}{\arg \min} \, (\boldsymbol x_i-\hat{\boldsymbol \mu}_k)^T (\boldsymbol x_i-\hat{\boldsymbol \mu}_k) \\ \end{split} \] Steps 1 and 2 are repeated until the algorithm converges (or an upper limit of repeats is reached).

3.3.3 Properties

Despite its simplicity \(K\)-means is a surprisingly effective clustering algorithms.

The final clustering depends on the initialisation so it is often useful to run \(K\)-means several times with different starting allocations of the data points. Furthermore, non-random or non-uniform initialisations can lead to improved and faster convergence, see e.g.  the K-means++ algorithm.

As a result of the way the clusters are assigned in \(K\)-means the corresponding cluster boundaries form a Voronoi tesselation (cf. https://en.wikipedia.org/wiki/Voronoi_diagram ) around the cluster means.

Later we will also discuss the connection of \(K\)-means with probabilistic clustering using Gaussian mixture models.

3.3.4 Choosing the number of clusters

Once the \(K\)-means clustering has been obtained it is insightful to compute:

  1. the total within-group sum of squares \(SSW\) (tot.withinss), or total unexplained sum of squares: \[ SSW = \sum_{k=1}^K \, \sum_{i \in G_k} (\boldsymbol x_i -\hat{\boldsymbol \mu}_k)^T (\boldsymbol x_i -\hat{\boldsymbol \mu}_k) \] This quantity decreases with \(K\) and is zero for \(K=n\). The \(K\)-means algorithm tries to minimise this quantity but it will typically only find a local minimum rather than the global one.

  2. the between-group sum of squares \(SSB\) (betweenss), or explained sum of squares: \[ SSB = \sum_{k=1}^K n_k (\hat{\boldsymbol \mu}_k - \hat{\boldsymbol \mu}_0)^T (\hat{\boldsymbol \mu}_k - \hat{\boldsymbol \mu}_0) \] where \(\hat{\boldsymbol \mu}_0 = \frac{1}{n} \sum_{i=1}^n \boldsymbol x_i = \frac{1}{n} \sum_{k=1}^K n_k \hat{\boldsymbol \mu}_k\) is the global mean of the samples. \(SSB\) increases with the number of clusters \(K\) until for \(K=n\) it becomes equal to

  3. the total sum of squares \[ SST = \sum_{i=1}^n (\boldsymbol x_i - \hat{\boldsymbol \mu}_0)^T (\boldsymbol x_i - \hat{\boldsymbol \mu}_0) \, . \] By construction \(SST = SSB + SSW\) for any \(K\) (i.e. \(SST\) is a constant independent of \(K\)).

If divide the sum of squares by the sample size \(n\) we get \(T = \frac{SST}{n}\) as the total variation, \(B = \frac{SSW}{n}\) as the explained variation and \(W = \frac{SSW}{n}\) as the total unexplained variation , with \(T = B + W\).

For deciding on the optimal number of clusters we can run \(K\)-means for various settings of \(K\) and then choose the smallest \(K\) for which the explained variation \(B\) is not significantly worse compared to a model with substantially larger number of clusters (see example below).

3.3.5 \(K\)-medoids aka PAM

A closely related clustering method is \(K\)-medoids or PAM (“Partitioning Around Medoids”).

This works exactly like \(K\)-means, only that

  • instead of the estimated group means \(\hat{\boldsymbol \mu}_k\) one member of each group is selected as its representative (the socalled “medoid”)
  • instead of squared Euclidean distance other dissimilarity measures are also allowed.

3.3.6 Application of \(K\)-means to Iris data

Scatter plots of Iris data:

The R output from a \(K\)-means analysis with true number of clusters specified (\(K=3\)) is:

## K-means clustering with 3 clusters of sizes 53, 47, 50
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1  -0.05005221 -0.88042696    0.3465767   0.2805873
## 2   1.13217737  0.08812645    0.9928284   1.0141287
## 3  -1.01119138  0.85041372   -1.3006301  -1.2507035
## Clustering vector:
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 1 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1
##  [75] 1 2 2 2 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 2 2 2 2
## [112] 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 2 2 2 2 2 2 1 1 2 2 2 1 2 2 2 1 2 2 2 1 2
## [149] 2 1
## Within cluster sum of squares by cluster:
## [1] 44.08754 47.45019 47.35062
##  (between_SS / total_SS =  76.7 %)
## Available components:
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

The corresponding total within-group sum of squares (\(SSW\), tot.withinss) is

## [1] 138.8884

and the between-group sum of squares (\(SSB\), betweenss) is

## [1] 457.1116

By comparing with the known class assignments we can assess the accuracy of \(K\)-means clustering:

table(L.iris, kmeans.out$cluster)
## L.iris        1  2  3
##   setosa      0  0 50
##   versicolor 39 11  0
##   virginica  14 36  0

For choosing \(K\) we run \(K\)-means several times and compute within and between cluster variation in dependence of \(K\):

Thus, \(K=3\) clusters seem appropriate since the the explained variation does not significantly improve (and the unexplained variation does not significantly decrease) with a further increase of the number of clusters.

3.3.7 Arbitrariness of cluster labels and label switching

It is important to realise that in unsupervised learning and clustering the labels of each group are assigned in an arbitrary fashion. Specifically, for \(K\) cluster there are \(K!\) possibilities to attach the labels, corresponding to the number of permutations of \(K\) groups.

Thus, a rerun of a clustering algorithm such as \(K\)-means may return the same clustering (groupings of samples) but with different labels. This phenomenon is called “label switching”.

Therefore when comparing clusterings obtained with an algorithm you cannot just rely on the group label, you need to compare the actual members of the clusters. Likewise, if you are interested in the properties of a particular group you cannot rely on the label to identify that group.

In order to resolve the problem of label switching one may wish to relabel the clusters using additional information, such as requiring that some samples are in specfic groups (e.g.: sample 1 is always in group labeled “1”), and/or linking labels to orderings or constraints on the group characteristics (e.g.: the group with label “1” has always a smaller mean that group with label “2”).

3.4 Mixture models

3.4.1 Finite mixture model

  • \(K\) groups / classes / categories, with the number \(K\) specified and finite
  • each class \(k \in \{1, \ldots, K\}\) is modeled by its own distribution \(F_k\) with own parameters \(\boldsymbol \theta_k\).
  • density in each class: \(f_k(\boldsymbol x) = f(\boldsymbol x| k)\) with \(k \in 1, \ldots, K\)
  • mixing weight of each class: \(\text{Pr}(k) = \pi_k\) with \(\sum_{k=1}^K \pi_k = 1\)
  • joint density \(f(\boldsymbol x, k) = f(\boldsymbol x| k) \text{Pr}(k) = f_k(\boldsymbol x) \pi_k\)

This results in the mixture density \[ f_{\text{mix}}(\boldsymbol x) = \sum_{k=1}^K \pi_k f_k(\boldsymbol x) \] This is also called marginal density as it arised from the joint density \(f(\boldsymbol x, k)\) by marginalising \(k\).

Very often one uses multivariate normal components \(f_k(\boldsymbol x) = N(\boldsymbol x| \boldsymbol \mu_k, \boldsymbol \Sigma_k)\) \(\\ \Longrightarrow\) Gaussian mixture model (GMM)

Mixture models are fundamental not just in clustering but for many other applications (e.g. classification).

Note: don’t confuse mixture model with mixed model (=random effects regression model)

3.4.2 Total variance and variation of mixture model

The conditional means and variances for each class are \(\text{E}(\boldsymbol x| k) = \boldsymbol \mu_k\) and \(\text{Var}(\boldsymbol x| k) = \boldsymbol \Sigma_k\), and the probability of class \(k\) is given by \(\text{Pr}(k)=\pi_k\). Using the law of total expectation we can therefore obtain the mean of the mixture density as follows: \[ \begin{split} \text{E}(\boldsymbol x) & = \text{E}(\text{E}(\boldsymbol x| k)) \\ & = \sum_{k=1}^K \pi_k \boldsymbol \mu_k \\ &= \boldsymbol \mu_0 \\ \end{split} \] Similarly, using the law of total variance we compute the marginal variance: \[ \begin{split} \underbrace{\text{Var}(\boldsymbol x)}_{\text{total}} & = \underbrace{ \text{Var}( \text{E}(\boldsymbol x| k ) )}_{\text{explained / between-group}} + \underbrace{\text{E}(\text{Var}(\boldsymbol x|k))}_{\text{unexplained / within-group}} \\ \boldsymbol \Sigma_0 & = \sum_{k=1}^K \pi_k (\boldsymbol \mu_k - \boldsymbol \mu_0) (\boldsymbol \mu_k - \boldsymbol \mu_0)^T + \sum_{k=1}^K \pi_k \boldsymbol \Sigma_k \\ \end{split} \] Thus, just like in linear regression (see MATH20802 Statistical Methods) you can decompose the total variance into an explained (between group) part and an unexplained (within group) part.

The total variation is given by the trace of the covariance matrix, so the above decomposition turns into \[ \begin{split} \text{Tr}(\boldsymbol \Sigma_0) & = \sum_{k=1}^K \pi_k \text{Tr}((\boldsymbol \mu_k - \boldsymbol \mu_0) (\boldsymbol \mu_k - \boldsymbol \mu_0)^T) + \sum_{k=1}^K \pi_k \text{Tr}(\boldsymbol \Sigma_k) \\ & = \sum_{k=1}^K \pi_k (\boldsymbol \mu_k - \boldsymbol \mu_0)^T (\boldsymbol \mu_k - \boldsymbol \mu_0) + \sum_{k=1}^K \pi_k \text{Tr}(\boldsymbol \Sigma_k)\\ \end{split} \] If the covariances are replaced by their empirical estimates we obtain the \(T=B+W\) decomposition of total variation familiar from \(K\)-means: \[T = \text{Tr}\left( \hat{\boldsymbol \Sigma}_0 \right) = \frac{1}{n} \sum_{i=1}^n (\boldsymbol x_i - \hat{\boldsymbol \mu}_0)^T (\boldsymbol x_i - \hat{\boldsymbol \mu}_0)\] \[B = \frac{1}{n} \sum_{k=1}^K n_k (\hat{\boldsymbol \mu}_k - \hat{\boldsymbol \mu}_0)^T (\hat{\boldsymbol \mu}_k - \hat{\boldsymbol \mu}_0)\] \[W = \frac{1}{n} \sum_{k=1}^K \, \sum_{i \in G_k} (\boldsymbol x_i -\hat{\boldsymbol \mu}_k)^T (\boldsymbol x_i -\hat{\boldsymbol \mu}_k) \]

For a univariate mixture (\(d=1\)) with \(K=2\) components we get \[ \mu_0 = \pi_1 \mu_1+ \pi_2 \mu_2 \, , \] \[ \sigma^2_{\text{within}} = \pi_1 \sigma^2_1 + \pi_2 \sigma^2_2 = \sigma^2_{\text{pooled}}\,, \] also know as pooled variance, and \[ \begin{split} \sigma^2_{\text{between}} &= \pi_1 (\mu_1 - \mu_0)^2 + \pi_2 (\mu_2 - \mu_0)^2 \\ & =\pi_1 \pi_2^2 (\mu_1 - \mu_2)^2 + \pi_2 \pi_1^2 (\mu_1 - \mu_2)^2\\ & = \pi_1 \pi_2 (\mu_1 - \mu_2)^2 \\ & = \left( \frac{1}{\pi_1} + \frac{1}{\pi_2} \right)^{-1} (\mu_1 - \mu_2)^2 \\ \end{split} \,. \] The ratio of the between-group variance and the within-group variance is proportional (by factor of \(n\)) to the squared pooled-variance \(t\)-score: \[ \frac{\sigma^2_{\text{between}}}{\sigma^2_{\text{within}}} = \frac{ (\mu_1 - \mu_2)^2}{ \left(\frac{1}{\pi_1} + \frac{1}{\pi_2} \right) \sigma^2_{\text{pooled}} }= \frac{t_{\text{pooled}}^2}{n} \] If you are familiar with ANOVA (e.g. linear models course) you will recognise this ratio as the \(F\)-score.

3.4.3 Example of mixtures

Mixtures can take on many different shapes and forms, so it is instructive to study a few examples.

In first plot we show the density of a mixture distribution consisting of two normals with \(\pi_1=0.7\), \(\mu_1=-1\), \(\mu_2=2\) and the two variances equal to 1 (\(\sigma^2_1 = 1\) and \(\sigma^2_2 = 1\)). Because the two components are well-separated there are two clear modes. The plot also shows the density of a normal distribution with the same total mean (\(\mu_0=-0.1\)) and variance (\(\sigma_0^2=2.89\)) as the mixture distribution. Clearly the total normal and the mixture density are very different.

However, mixtures can also look very different. For example, if the mean of the second component is adjusted to \(\mu_2=0\) then there is only a single mode and the total normal density with \(\mu_0=-0.7\) and \(\sigma_0^2=1.21\) is now almost inistinguishable in form from the mixture density. Thus, in this case it will be very hard (or even impossible) to identify the two peaks from data.

An interactive version of the above two normal component mixture is available online as R Shiny web app at https://minerva.it.manchester.ac.uk/shiny/strimmer/mixture/ .

In general, learning mixture models from data can be challenging due to various issues. First, because of permutation symmetries due to the arbitrariness of group labels the group specific parameters are not identiable without additional restrictions. Second, further identifiability issues can arise if — as in the above example — two neighboring components of the mixture model are largely overlapping and thus are too close to each other to be discriminated as two different modes. Furthermore, likelihood estimation is challenging if there are singularities in the likelihood function, for example due to singular estimated covariance matrices. However, this can be easily fixed by regularising and/or requiring sufficient sample size per group.

Mixture models need not to be univariate, in fact most mixtures we consider in this course are multivariate. For illustration, here is a plot of a mixture of two bivariate normals, with \(\pi_1=0.7\), \(\boldsymbol \mu_1 = \begin{pmatrix}-1 \\1 \\ \end{pmatrix}\), \(\boldsymbol \Sigma_1 = \begin{pmatrix} 1 & 0.7 \\ 0.7 & 1 \\ \end{pmatrix}\), \(\boldsymbol \mu_2 = \begin{pmatrix}2.5 \\0.5 \\ \end{pmatrix}\) and \(\boldsymbol \Sigma_2 = \begin{pmatrix} 1 & -0.7 \\ -0.7 & 1 \\ \end{pmatrix}\):

3.4.4 Generative view: sampling from a mixture model

Assuming we know how to sample from the component densities \(f_k(\boldsymbol x)\) of the mixture model it is straightforward to set up a procedure for sampling from the mixture \(f_{\text{mix}}(\boldsymbol x) = \sum_{k=1}^K \pi_k f_k(\boldsymbol x)\).

This is done in a two-step generative process:

  1. draw from categorical distribution with parameters \(\boldsymbol \pi=(\pi_1, \ldots, \pi_K)^T\): \[\boldsymbol z\sim \text{Cat}(\boldsymbol \pi)\] the vector \(\boldsymbol z= (z_1, \ldots, z_K)^T\) indicating the group allocation. The group index \(k\) is given by \(\{k : z_k=1\}\).

  2. Subsequently, sample from the component \(k\) selected in step 1: \[ \boldsymbol x\sim F_k \]

This two-stage approach is also called latent allocation variable formulation of a mixture model, with \(\boldsymbol z\) (or equivalently \(k\)) being the latent variable.

The two-step process needs to repeated for each sample drawn from the mixture (i.e. every time a new latent variable \(\boldsymbol z\) is generated).

In probabilistic clustering the aim is to infer the state of \(\boldsymbol z\) for all observed samples.

3.4.5 Predicting the group allocation of a given sample

If we know the mixture model and its components we can predict the probability that an observation \(\boldsymbol x\) falls in group \(k\) via application of Bayes theorem: \[ z_k = \text{Pr}(k | \boldsymbol x) = \frac{\pi_k f_k(\boldsymbol x) }{ f(\boldsymbol x)} \]

In the above mentioned interactive Shiny app for the normal component mixture (available online at https://minerva.it.manchester.ac.uk/shiny/strimmer/mixture/ ) you can also explore the probabilities of each class.

Thus, by calculating the class probabilities using a fitted mixture model we can perform probabilistic clustering by assigning each sample to the class with the largest probability. Unlike in algorithmic clustering, we also get an assessment of the uncertainty of the class assignment, since for each sample \(\boldsymbol x\) we obtain the vector \[ \boldsymbol z= (z_1, \ldots, z_K)^T \] and thus can see if there are several classes with similar assignment probability. This will be the case, e.g., if \(\boldsymbol x\) lies near the boundary between two classes. Note that \(\sum_{k=1}^K z_k=1\).

3.4.6 Variation 1: Infinite mixture model

It is possible to construct mixture models with infinitely many components!

Most commonly known example is Dirichlet process mixture model (DPM):

\[ f_{\text{mix}}(\boldsymbol x) = \sum_{k=1}^\infty \pi_k f_k(\boldsymbol x) \] with \(\sum_{k=1}^\infty\pi_k =1\) and where the weight \(\pi_k\) are taken from a infinitely dimensional Dirichlet distribution (=Dirichlet process).

DPMs are useful for clustering since with them it is not necessary to determine the number of clusters a priori (since it by definition has infinitely many!). Instead, the number of clusters is a by-product of the fit of the model to observed data.

Related: “Chinese restaurant process” - https://en.wikipedia.org/wiki/Chinese_restaurant_process

This describes an algorithm for the allocation process of samples (“persons”) to the groups (“restaurant tables”) in a DPM.

See also “stick-breaking process”: https://en.wikipedia.org/wiki/Dirichlet_process#The_stick-breaking_process

3.4.7 Variation 2: Semiparametric mixture model with two classes

A very common model is the following two-component univariate mixture model

\[f_{\text{mix}}(x) = \pi_0 f_0(x) + (1-\pi_0) f_A(\boldsymbol x)\]

  • \(f_0\): null model, typically parametric such as normal distribution
  • \(f_A\): alternative model, typically nonparametric
  • \(\pi_0\): prior probability of null model

Using Bayes theorem this allows to compute probability that an observation \(x\) belongs to the null model: \[\text{Pr}(\text{Null} | x ) = \frac{\pi_0 f_0(x ) }{ f(x) }\] This is called the local false discovery rate.

The semi-parametric mixture model is the foundation for statistical testing which is based on defining decision thresholds to separate null model (“not significant”) from alternative model (“significant”):

See MATH20802 Statistical Methods for more details.

3.5 Fitting mixture models to data

3.5.1 Direct estimation of mixture model parameters

Given data matrix \(\boldsymbol X= (\boldsymbol x_1, \ldots, \boldsymbol x_n)^T\) with observations from \(n\) samples we would like to fit the mixture model \(f_{\text{mix}}(\boldsymbol x) = \sum_{k=1}^K \pi_k f_k(\boldsymbol x)\) and learn its parameters \(\boldsymbol \theta\), for example by maximising the corresponding marginal log-likelihood function with regard to \(\boldsymbol \theta\): \[ \log L(\boldsymbol \theta| \boldsymbol X) = \sum_{i=1}^n \log \left( \sum_{k=1}^K \pi_k f_k(\boldsymbol x_i) \right) \] For a Gaussian mixture model the parameters are \(\boldsymbol \theta= \{\boldsymbol \pi, \boldsymbol \mu_1, \ldots, \boldsymbol \mu_K, \boldsymbol \Sigma_1, \ldots, \boldsymbol \Sigma_K\}\).

However, in practise evaluation and optimisation of this likelihood function may be difficult due a number of reasons:

  • The form of the log-likelihood function prevents analytic simplifications (note the sum inside the logarithm).
  • Because of the symmetries due to exchangeability of cluster labels the likelihood function is multimodal (note this is also linked to the general problem of label switching and non-identifiability of cluster labels in mixtures).
  • Furthermore, the likelihood in Gaussian mixture models can become singular if one of the fitted covariance matrices becomes singular. Note this can be adressed by using regularisation (Bayes, penalised ML, etc.).

The above log-likelihood function is also called the observed data log-likelihood, or the incomplete data log-likelihood, in contrast to the complete data log-likelihood described further below.

3.5.2 Estimating mixture model parameters using the EM algorithm

The mixture model may be viewed as an incomplete or missing data problem: here the missing data are the group allocations \(\boldsymbol k= (k_1, \ldots, k_n)^T\) belonging to each sample \(\boldsymbol x_1, \ldots, \boldsymbol x_n\).

If we would know which sample comes from which group the estimation of the parameters \(\boldsymbol \theta\) would indeed be straightforward using the so-called complete data log-likelihood based on the joint distribution \(f(\boldsymbol x, k) = \pi_k f_k(\boldsymbol x)\) \[ \log L(\boldsymbol \theta| \boldsymbol X, \boldsymbol k) = \sum_{i=1}^n \log \left(\pi_{k_i} f_{k_i}(\boldsymbol x_i) \right) \]

The idea of the EM algorithm (Dempster et al. 1977) is to exploit the simplicity of the complete data likelihood and to obtain estimates of \(\boldsymbol \theta\) by first finding the probability distribution \(z_{ik}\) of the latent variable \(k_i\), and then using this distribution to compute and optimise the corresponding expected complete-data log-likelihood. Specifically, the \(z_{ik}\) contain the probabilities of each class for each sample \(i\) and thus provide a soft assignment of classes rather that a 0/1 hard assignment (as in the \(K\)-means algorithm or in the generative latent variable view of mixture models).

In the EM algorithm we iterate between the

  1. estimation the probabilistic distribution \(z_{ik}\) for the group allocation latent parameters using the current estimate of the parameters \(\boldsymbol \theta\) (obtained in step 2)
  2. maximisation of the expected complete data log-likelihood to estimate the parameters \(\boldsymbol \theta\). The expectation is taken with regard to the distribution \(z_{ik}\) (obtained in step 1).

Specifically, the EM algorithm applied to model-based clustering proceeds as follows:

  1. Initialisation: Start with a guess of the parameters \(\boldsymbol \theta^{(1)}\), then continue with “E” Step, Part A. Alternatively, start with a guess of \(z_{ik}^{(1)}\), then continue with “E” Step, Part B. The initialisation may be derived from some prior information, e.g., from running \(K\)-means, or simply be at random.

  2. E “expectation” step — Part A: Use Bayes’ theorem to compute new probabilities of allocation for all the samples \(\boldsymbol x_i\): \[ z_{ik}^{(b+1)} \leftarrow \frac{ \pi_k f_k(\boldsymbol x_i) }{ f(\boldsymbol x_i) } \] Note that to obtain \(z_{ik}^{(b+1)}\) the current value \(\boldsymbol \theta^{(b)}\) of the parameters is required.
    — Part B: Construct the expected complete data log-likelihood function using the weights \(z_{ik}^{(b+1)}\): \[ Q^{(b+1)}(\boldsymbol \theta| \boldsymbol X) = \sum_{i=1}^n \sum_{k=1}^K z_{ik}^{(b+1)} \log \left( \pi_k f_k(\boldsymbol x_i) \right) \]

  3. M “maximisation” step — Maximise the expected complete data log-likelihood to update the mixture model parameters \(\boldsymbol \theta\): \[ \boldsymbol \theta^{(b+1)} \leftarrow \arg \max_{\boldsymbol \theta} Q^{(b+1)}(\boldsymbol \theta| \boldsymbol X) \]

  4. Repeat with “E” Step until convergence of parameters \(\boldsymbol \theta^{(b)}\) of the mixture model.

It can be shown that the output \(\boldsymbol \theta^{(1)}, \boldsymbol \theta^{(2)}, \boldsymbol \theta^{(3)}, \ldots\) of the EM algorithm converges to the estimate \(\hat{\boldsymbol \theta}\) found when maximising the marginal log-likelihood. Since maximisation of the expected complete data log-likelihood is often much easier (and analytically tractable) than maximisation of the observed data log-likelihood function the EM algorithm is the preferred approach in this case.

To avoid singularities in the expected log-likelihood function we may wish to adopt a Bayesian approach (or use regularised/penalised ML) for estimating the parameters in the M-step.

3.5.3 EM algorithm for multivariate normal mixture model

For a GMM the EM algorithm can be written down analytically:


\[ z_{ik} = \frac{ \hat{\pi}_k N(\boldsymbol x_i | \hat{\boldsymbol \mu}_k, \hat{\boldsymbol \Sigma}_k) }{ f(\boldsymbol x_i) } \]


\[ \hat{n}_k = \sum_{i=1}^n z_{ik} \] \[ \hat{\pi}_k = \frac{\hat{n}_k}{n} \]

\[ \hat{\boldsymbol \mu}_k = \frac{1}{\hat{n}_k} \sum_{i=1}^n z_{ik} \boldsymbol x_i \] \[ \hat{\boldsymbol \Sigma}_k = \frac{1}{\hat{n}_k} \sum_{i=1}^n z_{ik} ( \boldsymbol x_i -\boldsymbol \mu_k) ( \boldsymbol x_i -\boldsymbol \mu_k)^T \]

Note that the estimators \(\hat{\boldsymbol \mu}_k\) and \(\hat{\boldsymbol \Sigma}_k\) are weighted versions of the usual empirical estimators (with weights \(z_{ik}\) being the soft assignment of classes resulting from the Bayesian updating).

In Worksheet 7 you can find a simple R implementation of the EM algorithm for univariate normal mixtures.

3.5.4 Connection with \(K\)-means clustering method

The \(K\)-means algorithm is very closely related to probabilistic clustering with a Gaussian mixture models.

Specifically, the class assigment in \(K\)-means is \[C(\boldsymbol x_i) = \underset{k}{\arg \min} \, (\boldsymbol x_i-\hat{\boldsymbol \mu}_k)^T (\boldsymbol x_i-\hat{\boldsymbol \mu}_k)\]

If in a Gaussian mixture model the probabilities \(\pi_k\) of all classes are asssumed to be identical (i.e. \(\pi_k=\frac{1}{K}\)) and the covariances \(\boldsymbol \Sigma_k\) are all of the same spherical form \(\sigma^2 \boldsymbol I\), i.e. no dependence on groups, no correlation and identical variance for all variables, then the soft assignment for the class allocation becomes \[\log( z_{ik} ) = -\frac{1}{2 \sigma^2} (\boldsymbol x_i-\hat{\boldsymbol \mu}_k)^T (\boldsymbol x_i-\hat{\boldsymbol \mu}_k) + C \] where \(C\) is a constant depending on \(\boldsymbol x_i\) but not on \(k\). In order to select a hard class allocation based on \(z_{ik}\) we may then use the rule \[ \begin{split} C(\boldsymbol x_i) &= \underset{k}{\arg \max} \log( z_{ik} ) \\ & = \underset{k}{\arg \min} (\boldsymbol x_i-\hat{\boldsymbol \mu}_k)^T (\boldsymbol x_i-\hat{\boldsymbol \mu}_k)\\ \end{split} \] Thus, \(K\)-means can be viewed as an algorithm to provide hard classifications for a simple restricted Gaussian mixture model.

3.5.5 Choosing the number of classes

Since GMMs operate in a likelihood framework we can use penalised likelihood model selection criteria to choose among different models (i.e. GMMs with different numbers of classes).

The most popular choices are AIC (Akaike Information Criterion) and BIC (Bayesian Information criterion) defined as follows: \[\text{AIC}= -2 \log L + 2 K \] \[\text{BIC}= - 2 \log L +K \log(n)\]

Instead of maximising the log-likehood we minimise \(\text{AIC}\) and \(\text{BIC}\).

Note that in both criteria more complex models with more parameters (in this case groups) are penalised over simpler models in order to prevent overfitting.

\(\Longrightarrow\) find optimal number of groups \(K\).

Another way of choosing optimal numbers of clusters is by cross-validation (see later chapter on supervised learning).

3.5.6 Application of GMMs to Iris flower data

We now explore the application of Gaussian mixture models to the Iris flower data set we also investigated with PCA and K-means.

First, we fit a GMM with 3 clusters, using the R software “mclust.”9

gmm3 = Mclust(X.iris, G=3, verbose=FALSE)
plot(gmm3, what="classification")

The “mclust” software has used the following model when fitting the mixture:

## [1] "VVV"

Here “VVV” is the name used by the “mclust” software for a model allowing for an individual unrestricted covariance matrix \(\boldsymbol \Sigma_k\) for each class \(k\).

This GMM has a substantially lower misclassification error compared to \(K\)-means with the same number of clusters:

table(gmm3$classification, L.iris)
##    L.iris
##     setosa versicolor virginica
##   1     50          0         0
##   2      0         45         0
##   3      0          5        50

Note that in “mclust” the BIC criterion is defined with the opposite sign (\(\text{BIC}_{\text{mclust}} = 2 \log L -K \log(n)\)), thus we need to find the maximum value rather than the smallest value.

If we compute BIC for various numbers of groups we find that the model with the best \(\text{BIC}_{\text{mclust}}\) is a model with 2 clusters but the model with 3 cluster has nearly as good a BIC: