Update: the 2020/21 version of this course is available now (MATH38161 lecture notes).

Academic Year 2018/19, Term 1
School of Mathematics, The University of Manchester

Lecturer: Korbinian Strimmer
Tutors: Beatriz Costa Gomes, Jack Mckenzie

Overview and syllabus:

For an outline of this course unit see MATH38161: Multivariate statistics and machine learning, or download the course description as a PDF.

Dates and location:

The course starts 26th September 2018 and runs until 14th December 2018. The first computer lab is on 5th October 2018 and the first tutorial on 12th October 2018. All lectures, tutorials and computer labs are held in the Alan Turing Building (ATB).

The sessions take place at the following times and locations:

Session | Time slot (location) | Term week
Lectures | Wednesday 11am-12 noon (ATB G207) and Friday 12 noon-1pm (ATB G107) | 1-5 and 7-12
Computer labs | Friday 4pm-6pm (ATB G105) | 2, 4, 7, 8, 10
Tutorials | Friday 4pm-5pm (ATB G205) | 3, 5, 9, 11
Office hour | Friday 3pm-4pm (ATB 2.221) | 1-5, 7-11

Coursework and exam:

The two pieces of coursework are each worth 25%. Each requires data analysis and simulation in R and a corresponding report written in R Markdown. The written exam (1.5 hours) is worth the remaining 50% and covers theory and methods.

Assessment | Date | Term week
Coursework #1 (25%) | announced Tuesday 23 October 2018; submission Tuesday 6 November 2018, 12 noon | 5 and 7
Coursework #2 (25%) | announced Tuesday 11 December 2018; submission Tuesday 8 January 2019, 12 noon | 12 and CVAC 04
Written exam (50%) | Thursday 24 January 2019, 9:45 (1.5 hours) | exam period

Statistical computing:

This course puts strong emphasis on computation. All methods introduced and discussed in the lectures will be tried and tested on the computer.

Log in to R Studio on the minerva computational statistics server.

In the bi-weekly computer labs we will work in R Studio, using R for statistical data analysis and R Markdown for project reporting. Students are strongly encouraged to install R and the R Studio software on their own computers. Course participants will also get an account on the School of Mathematics computational statistics cloud server minerva to access R Studio in a web browser (for use in the computer labs and to facilitate the coursework).

Prerequisites:

Suggested readings to refresh knowledge in statistics, matrices and R:
a) Dekking et al. 2005. A modern introduction to probability and statistics: understanding why and how. Springer.
b) Petersen and Pedersen. 2012. The matrix cookbook. TU Denmark.
c) R Core Team. 2018. An introduction to R. The R Foundation.
d) Peng. 2016. R programming for data science. Leanpub.

This course assumes students are familiar with the foundations of probability, statistical learning (e.g. maximum likelihood) and matrix theory (e.g. matrix notation and algebra, eigenvalues, singular values, spectral decomposition, rank, condition number etc.). Basic experience in statistical programming and data analysis using R is also expected.

Course material:

This course uses material from several textbooks, all of which can be downloaded freely from within the University network:

  1. Härdle and Simar. 2015. Applied multivariate statistical analysis. 4th edition. Springer.
  2. Hastie, Tibshirani and Friedman. 2009. The elements of statistical learning: data mining, inference, and prediction. Springer.
  3. James, Witten, Hastie and Tibshirani. 2013. An introduction to statistical learning with applications in R. Springer.

For learning R Markdown please study the following references:

  1. The R Markdown homepage.
  2. R Studio. 2014. R Markdown reference guide.
  3. Shalizi. 2016. Using R Markdown for class reports.
  4. Xie, Allaire and Grolemund. 2018. R Markdown: the definitive guide.

The timetable below will be updated at the end of each week linking the presented material to specific chapters in these books. In addition, the scanned material (visualiser) from the lectures will be available on Blackboard. Furthermore, the automated lecture capture system is active for this module so lectures can be revisited online.

Lecture timetable and contents:

The lectures are divided into six parts, each dealing with a different area in multivariate statistics and machine learning:

Term week | Lecture (Date) | Content | Reading material
1, 2 | 1-4 (26 Sept to 5 Oct) | Multivariate random variables and distributions: basic multivariate statistics, multivariate normal distribution and properties, further multivariate distributions (categorical, multinomial, Dirichlet, Wishart); Estimation in large sample and small sample settings: estimation of covariance using likelihood and regularised/shrinkage estimation. | See lecture notes (Part 1 and 2) on Blackboard
3, 4 | 5-8 (10 Oct to 19 Oct) | Transformations and dimension reduction: variable transformations, location-scale transformation, corresponding transformation of mean, variance and probability density, coloring transformation, Mahalanobis transformation, whitening transformations (ZCA, PCA, Cholesky and variations), Principal Component Analysis (PCA), Canonical Correlation Analysis (CCA). | See lecture notes (Part 3) on Blackboard
5, 7 | 9-12 (24 Oct to 9 Nov) | Unsupervised learning / structure discovery: Algorithmic / heuristic approaches to clustering: K-means, PAM, hierarchical clustering, measuring uncertainty, model-based clustering: Gaussian mixture models, EM algorithm, graphical models. | See lecture notes (Part 4) on Blackboard
8, 9 | 13-18 (14 Nov to 23 Nov) | Supervised learning / prediction and classification: Diagonal, Linear, and Quadratic Discriminant Analysis (DDA, LDA, QDA) and regularised versions for high-dimensional data analysis, cross-validation, feature selection and variable importance, linear prediction. | See lecture notes (Part 5) on Blackboard
10, 11 | 19-21 (30 Nov to 7 Dec) | Nonlinear and nonparametric models / machine learning models: Anscombe data sets, nonlinear regression (polynomial, splines, loess), decision trees, random forest, overview of neural networks. | See lecture notes (Part 6) on Blackboard
12 | 22 (12 Dec to 14 Dec) | Exam revision (Wednesday) and Q & A (Friday) |

Computer labs timetable and contents:

Term week | Lab (Date) | Topic | Work material
2 | 1 (5 Oct) | Introduction to the minerva computer system, overview of R Studio (server), introduction to R Markdown, exploring the multivariate normal density and estimation of covariances. | You can find the material for Lab 1 on Blackboard.
4 | 2 (19 Oct) | Simulation of multivariate normal data, comparison of whitening procedures, PCA and dimension reduction. | You can find the material for Lab 2 on Blackboard.
7 | 3 (9 Nov) | Unsupervised learning using K-means, Gaussian mixture model and hierarchical clustering methods. | You can find the material for Lab 3 on Blackboard.
8 | 4 (16 Nov) | Supervised learning / classification with QDA and LDA and shrinkage LDA / DDA, cross-validation, comparison with GGMs / hierarchical clustering, constructing an efficient high-dimensional classifier, feature selection, conditional independence graph. | You can find the material for Lab 4 on Blackboard.
10 | 5 (30 Nov) | DatasauRus dozen data sets, nonlinear regression, random forest, feature selection for wine data. | You can find the material for Lab 5 on Blackboard.

Tutorials timetable and contents:

Term week | Tutorial (Date) | Example sheets
3 | 1 (12 Oct) | You can find the material for Sheet 1 on Blackboard.
5 | 2 (26 Oct) | You can find the material for Sheet 2 on Blackboard.
9 | 3 (23 Nov) | You can find the material for Sheet 3 on Blackboard.
11 | 4 (7 Dec) | You can find the material for Sheet 4 on Blackboard.

Coursework timetable and contents:

Term week | Coursework | Submission date | Task
7 | 1 | 6 Nov, 12 noon | Task 1: PCA of the UCI Wine data set
CVAC 4 | 2 | 8 Jan, 12 noon | Task 2: Classification analysis