**Update: the 2020/21 version of this course is available now
(MATH38161 lecture notes).**

Academic Year 2018/19, Term 1

School of Mathematics, The University of Manchester

Lecturer: Korbinian Strimmer

Tutors: Beatriz Costa Gomes, Jack Mckenzie

**Overview and syllabus:**

For an outline of this course unit, see MATH38161: Multivariate statistics and machine learning, or download the course description as a PDF.

**Dates and location:**

The course **starts 26th September 2018** and runs until 14th December 2018. The **first computer lab is on 5th October 2018** and the **first tutorial on 12th October 2018**. All lectures, tutorials and computer labs are held in the Alan Turing Building (ATB).

The course takes place at the following dates and locations:

Session | Time slot (location) | Term week |
---|---|---|
Lectures: | Wednesday 11am-12 noon (ATB G207) and Friday 12 noon-1pm (ATB G107) | 1-5 and 7-12 |
Computer labs: | Friday 4pm-6pm (ATB G105) | 2, 4, 7, 8, 10 |
Tutorials: | Friday 4pm-5pm (ATB G205) | 3, 5, 9, 11 |
Office hour: | Friday 3pm-4pm (ATB 2.221) | 1-5, 7-11 |

**Course works and exam:**

The two pieces of coursework are each worth 25%. Each requires data analysis and simulation in R, together with a corresponding report written in R Markdown. The written exam (1.5 hours) is worth the remaining 50% and covers theory and methods.
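As a worked illustration of the weighting above, here is a sketch under one assumption that this page does not state: each component is marked on a common 0-100 scale and the components are combined linearly. The symbols $C_1$, $C_2$ (coursework marks), $E$ (exam mark) and $M$ (overall mark) are introduced here purely for illustration.

$$
M = 0.25\,C_1 + 0.25\,C_2 + 0.50\,E
$$

For hypothetical marks $C_1 = 60$, $C_2 = 70$ and $E = 65$, this would give $M = 15 + 17.5 + 32.5 = 65$.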

Assessment | Date | Term week |
---|---|---|
Course work #1 (25%): | announced Tuesday 23 October 2018; submission Tuesday 6 November 2018, 12 noon | 5 and 7 |
Course work #2 (25%): | announced Tuesday 11 December 2018; submission Tuesday 8 January 2019, 12 noon | 12 and CVAC 04 |
Written exam (50%): | Thursday 24 January 2019, 9:45 (1.5 hours) | exam period |

**Statistical computing:**

In this course **strong emphasis is put on computation**. All methods
introduced and discussed
in the lectures will be tried and tested on the computer.


In the bi-weekly computer labs we will work in R Studio, **using
R for statistical data analysis and
R Markdown for project reporting**. Students are strongly
encouraged to install R and the R Studio software on their own computers. Course participants will
also get an account on the School of Mathematics computational statistics
cloud server **minerva**
to access R Studio in a web browser (for use in the computer labs and to facilitate the coursework).

**Prerequisites:**

**Readings to refresh knowledge** in statistics, matrices and R:

a) Dekking et al. 2005. A modern introduction to probability and statistics: understanding why and how. Springer.

b) Petersen and Pedersen. 2012. The matrix cookbook. TU Denmark.

c) R Core Team. 2018. An introduction to R. The R Foundation.

d) Peng. 2016. R programming for data science. Leanpub.

This course assumes **students are familiar** with the foundations of **probability**, **statistical learning** (e.g. maximum likelihood) and **matrix theory** (e.g.
matrix notation and algebra, eigenvalues, singular values, spectral decomposition, rank,
condition etc.). Furthermore, basic
experience in **statistical programming and data analysis using R** is expected.

**Course material:**

This course uses material from several textbooks, all of which can be downloaded freely from within the University network:

- Härdle and Simar. 2015. Applied multivariate statistical analysis. 4th edition. Springer.
- Hastie, Tibshirani and Friedman. 2009. The elements of statistical learning: data mining, inference, and prediction. Springer.
- James, Witten, Hastie and Tibshirani. 2013. An introduction to statistical learning with applications in R. Springer.

For learning R Markdown, please study the following references:

- The R Markdown homepage.
- R Studio. 2014. R Markdown reference guide.
- Shalizi. 2016. Using R Markdown for class reports.
- Xie, Allaire and Grolemund. 2018. R Markdown: the definitive guide.

The timetable below will be updated at the end of each week, linking the presented material to specific chapters in these books. In addition, the scanned material (visualiser) from the lectures will be available on Blackboard. The automated lecture capture system is also active for this module, so lectures can be revisited online.

**Lecture timetable and contents:**

The lectures are divided into six parts, each dealing with a different area in multivariate statistics and machine learning:

Term week | Lecture (Date) | Content | Reading material |
---|---|---|---|
1, 2 | 1-4 (26 Sept to 5 Oct) | Multivariate random variables and distributions: basic multivariate statistics, multivariate normal distribution and properties, further multivariate distributions (categorical, multinomial, Dirichlet, Wishart); Estimation in large sample and small sample settings: estimation of covariance using likelihood and regularised/shrinkage estimation. | See lecture notes (Part 1 and 2) on Blackboard |
3, 4 | 5-8 (10 Oct to 19 Oct) | Transformations and dimension reduction: variable transformations, location-scale transformation, corresponding transformation of mean, variance and probability density, coloring transformation, Mahalanobis transformation, whitening transformations (ZCA, PCA, Cholesky and variations), Principal Component Analysis (PCA), Canonical Correlation Analysis (CCA). | See lecture notes (Part 3) on Blackboard |
5, 7 | 9-12 (24 Oct to 9 Nov) | Unsupervised learning / structure discovery: algorithmic / heuristic approaches to clustering: K-means, PAM, hierarchical clustering, measuring uncertainty, model-based clustering: Gaussian mixture models, EM algorithm, graphical models. | See lecture notes (Part 4) on Blackboard |
8, 9 | 13-18 (14 Nov to 23 Nov) | Supervised learning / prediction and classification: Diagonal, Linear, and Quadratic Discriminant Analysis (DDA, LDA, QDA) and regularised versions for high-dimensional data analysis, cross-validation, feature selection and variable importance, linear prediction. | See lecture notes (Part 5) on Blackboard |
10, 11 | 19-21 (30 Nov to 7 Dec) | Nonlinear and nonparametric models / machine learning models: Anscombe data sets, nonlinear regression (polynomial, splines, loess), decision trees, random forest, overview of neural networks. | See lecture notes (Part 6) on Blackboard |
12 | 22 (12 Dec to 14 Dec) | Exam revision (Wednesday) and Q & A (Friday) | |

**Computer labs timetable and contents:**

Term week | Lab (Date) | Topic | Work material |
---|---|---|---|
2 | 1 (5 Oct) | Introduction to the minerva computer system, overview of R Studio (server), introduction to R Markdown, exploring the multivariate normal density and estimation of covariances. | The material for Lab 1 is on Blackboard. |
4 | 2 (19 Oct) | Simulation of multivariate normal data, comparison of whitening procedures, PCA and dimension reduction. | The material for Lab 2 is on Blackboard. |
7 | 3 (9 Nov) | Unsupervised learning using K-means, Gaussian mixture model and hierarchical clustering methods. | The material for Lab 3 is on Blackboard. |
8 | 4 (16 Nov) | Supervised learning / classification with QDA and LDA and shrinkage LDA / DDA, cross-validation, comparison with GGMs / hierarchical clustering, constructing an efficient high-dimensional classifier, feature selection, conditional independence graph. | The material for Lab 4 is on Blackboard. |
10 | 5 (30 Nov) | DatasauRus dozen data sets, nonlinear regression, random forest, feature selection for wine data. | The material for Lab 5 is on Blackboard. |

**Tutorials timetable and contents:**

Term week | Tutorial (Date) | Example sheets |
---|---|---|
3 | 1 (12 Oct) | You find the material for Sheet 1 on Blackboard. |
5 | 2 (26 Oct) | You find the material for Sheet 2 on Blackboard. |
9 | 4 (23 Nov) | You find the material for Sheet 3 on Blackboard. |
11 | 5 (7 Dec) | You find the material for Sheet 4 on Blackboard. |

**Coursework timetable and contents:**

Term week | Coursework | Submission date | Task |
---|---|---|---|
7 | 1 | 6 Nov 12 noon | Task 1: PCA Analysis of UCI Wine data set |
CVAC 4 | 2 | 8 Jan 12 noon | Task 2: Classification analysis |