STAT 541: Statistics for Learning
Welcome to the course website for STAT 541!
Information and resources for the course can be found on this page. Click on the section headings to expand them. For assignment submission and grades please see Canvas.
Announcements:
Assignment 5 has now been posted. It is due April 9th at 11:59 PM.
The final exam is scheduled to take place on April 14th at 1:00 PM in BUS 1-05. Note that this is not our usual classroom, so please allocate some extra time to find the room.
The exam is three hours long and will consist of long-answer questions. The coverage is cumulative, but a larger emphasis (more than 50%) will be placed on material from the last third of the course. You will be allowed three double-sided sheets of notes. You shouldn't need one, but you may bring a non-programmable calculator.
Practice problems for the final exam can be found here.
I recommend that you review the previous homework assignments and quizzes, as well as their solutions posted on Canvas. The references listed in the weekly summaries below are also worth reviewing. Finally, I will hold extra office hours on April 10th and 11th from 2:00 - 3:00 PM.
Week 1
Concepts: Types of learning, supervised learning problem setup, regression, classification, loss, risk, oracle (Bayes) risk and predictors, conditional expectation, expected risk.
References: A more in-depth treatment of the topics covered this week can be found here in Chapter 2.
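As a small illustration of this week's decision-theoretic concepts, here is a short R sketch; the model and numbers are invented for illustration and are not from the lectures. Under squared-error loss the Bayes predictor is the conditional mean E[Y | X = x], and its risk is the irreducible noise variance; the simulation below estimates both risks by Monte Carlo.

# Simulate from Y = sin(X) + eps with eps ~ N(0, 0.5^2), so the Bayes risk is 0.25.
set.seed(541)
n <- 1e5
x <- runif(n, 0, 2 * pi)
y <- sin(x) + rnorm(n, sd = 0.5)
bayes_pred <- sin(x)          # the oracle predictor E[Y | X = x]
const_pred <- mean(y)         # a constant predictor, for comparison
mean((y - bayes_pred)^2)      # approximately 0.25, the Bayes risk
mean((y - const_pred)^2)      # strictly larger estimated risk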
Week 2
Concepts: Empirical risk minimization, bias-variance decomposition, how to fit linear models and make predictions in R, manipulation of means and covariances of random vectors, inner products, angles between vectors, orthogonal vectors, orthogonal matrices, the singular value decomposition, the spectral decomposition.
References: For the bias-variance decomposition see Introduction to Statistical Learning with R (ISLR) Section 2.2 and Elements of Statistical Learning (ESL) Section 7.3. In particular, equation (7.9) of ESL is essentially the bias-variance decomposition we derived in class, except that we also took an expectation over x_0. Section 3.3.1 of Izenman's Modern Multivariate Statistical Techniques (MMST) discusses means and covariances of random vectors; equations (3.92) and (3.93) are very important. MMST also discusses the full SVD in 3.2.6 for a short and wide matrix (take a transpose to get the full SVD for a tall and narrow matrix). The full SVD is almost the same as the thin SVD discussed in class except that the full SVD includes extra, unnecessary rows/columns in the constituent matrix factors.
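A short R sketch of two of this week's topics: fitting a linear model and making predictions with lm() and predict(), and computing a thin SVD with svd(). The simulated data and variable names are invented for illustration only.

set.seed(1)
train <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
train$y <- 1 + 2 * train$x1 - train$x2 + rnorm(50)
fit <- lm(y ~ x1 + x2, data = train)                     # ordinary least squares fit
summary(fit)                                             # coefficient estimates and standard errors
predict(fit, newdata = data.frame(x1 = 0.5, x2 = -1))    # prediction at a new feature vector

A <- matrix(rnorm(20), nrow = 5, ncol = 4)               # a tall and narrow matrix
s <- svd(A)                                              # thin SVD: A = U D V^T
all.equal(A, s$u %*% diag(s$d) %*% t(s$v))               # reconstructs A up to rounding error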
Week 3
Concepts: The linear regression model, ordinary least squares estimates of regression coefficients, bias and variance of the OLS beta along with the bias and variance of predictions, prediction interval based on the t-distribution, existence of the OLS estimator, feature transformations, regression with categorical predictors, interaction effects, overfitting and issues with including too many features, AIC and BIC as variable selection criteria.
References: ISLR: 3.1-3.3, 7.1-7.3, ESL: 3.1-3.2.
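Here is a short, illustrative R sketch of categorical predictors, interaction effects, t-based prediction intervals, and AIC/BIC with lm(); the data frame and variable names are made up for this example.

set.seed(2)
d <- data.frame(x = rnorm(100),
                group = factor(sample(c("A", "B", "C"), 100, replace = TRUE)))
d$y <- 1 + 2 * d$x + 0.5 * (d$group == "B") + rnorm(100)
fit <- lm(y ~ x * group, data = d)        # main effects plus the x:group interaction
summary(fit)                              # the categorical predictor enters through indicator variables
AIC(fit); BIC(fit)                        # information criteria for comparing models
predict(fit, newdata = data.frame(x = 0, group = "A"),
        interval = "prediction")          # t-based prediction interval at a new point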
Week 4
Concepts: Optimizing AIC and BIC via forward or backward selection, properties and the various optimization problem formulations of ridge regression and the LASSO, data splitting, K-fold cross-validation.
References: ISLR: 6.1-6.2, 5.1, ESL: 3.3-3.4, 7.10.
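As a rough sketch of this week's methods, the code below fits ridge and LASSO with 10-fold cross-validation and runs forward selection by AIC. It assumes the glmnet package is installed (glmnet is a common choice, not a course requirement), and the simulated data are for illustration only.

library(glmnet)
set.seed(3)
x <- matrix(rnorm(100 * 10), 100, 10)
y <- x[, 1] - 2 * x[, 2] + rnorm(100)
ridge <- cv.glmnet(x, y, alpha = 0, nfolds = 10)   # alpha = 0 gives the ridge penalty
lasso <- cv.glmnet(x, y, alpha = 1, nfolds = 10)   # alpha = 1 gives the LASSO penalty
c(ridge$lambda.min, lasso$lambda.min)              # tuning parameters chosen by 10-fold CV
coef(lasso, s = "lambda.min")                      # many LASSO coefficients are exactly zero

d    <- data.frame(y = y, x)
null <- lm(y ~ 1, data = d)
full <- lm(y ~ ., data = d)
step(null, scope = formula(full), direction = "forward")   # add variables while AIC improves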
Week 5
Concepts: See the following document for a motivation of logistic regression via a model for the joint density of (y, x). Leave-one-out cross-validation. Introduction to the logistic function and logistic regression. How to make predictions using a logistic regression model. Finding the beta coefficients in logistic regression via maximum likelihood estimation. An introduction to the gradient descent algorithm.
References: ISLR: 5.1.1-5.1.4, 4.1-4.3. Some discussion on gradient descent and the Newton-Raphson algorithm (to be discussed next week) can be found here.
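A short R sketch of logistic regression, using simulated data with invented variable names. The glm() call is base R; the gradient-ascent loop afterwards only illustrates the algorithm with an ad hoc step size and is not the course's official implementation.

set.seed(4)
x <- rnorm(200)
p <- 1 / (1 + exp(-(-1 + 2 * x)))            # logistic function of the linear predictor
y <- rbinom(200, 1, p)
fit <- glm(y ~ x, family = binomial)         # coefficients found by maximum likelihood
coef(fit)
predict(fit, newdata = data.frame(x = 0.5), type = "response")  # predicted probability that y = 1

X    <- cbind(1, x)
beta <- c(0, 0)
for (i in 1:2000) {
  p_hat <- as.vector(1 / (1 + exp(-X %*% beta)))
  beta  <- beta + 0.01 * as.vector(t(X) %*% (y - p_hat))  # step along the log-likelihood gradient
}
beta                                         # close to coef(fit)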
Week 6
Concepts: The Newton-Raphson optimization algorithm. Linear discriminant analysis (LDA), quadratic discriminant analysis (QDA) and naive Bayes.
References: ISLR: 4.4-4.5. See the note from last week for Newton-Raphson.
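Two brief, illustrative sketches for this week: a toy Newton-Raphson iteration for a one-dimensional minimization (the function is invented for the example), and LDA/QDA fits using lda() and qda() from the MASS package (which ships with R); the built-in iris data set is used only for convenience.

x <- 0
for (i in 1:6) x <- x - (exp(x) - 2) / exp(x)   # Newton step x - f'(x)/f''(x) for f(x) = exp(x) - 2x
c(x, log(2))                                    # the minimizer is log(2)

library(MASS)
fit_lda <- lda(Species ~ ., data = iris)
fit_qda <- qda(Species ~ ., data = iris)
mean(predict(fit_lda, iris)$class == iris$Species)   # training accuracy of LDA
head(predict(fit_qda, iris)$posterior)               # estimated class posterior probabilities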
Week 7 - Reading Week
Week 8
Concepts: Classification boundaries for multi-class problems. Basis functions. Regression splines as piecewise polynomials with smoothness and continuity constraints. Counting the number of basis functions for splines. The smoothing spline optimization problem, whose solution is a regularized natural cubic spline with knots at every training feature value. The smoothing matrix and effective degrees of freedom as an approximate parameter count. Computational issues with smoothing splines for large n. Kernel smoothers as a weighted average of nearby responses, with a measure of similarity provided by the kernel. Boxcar, Gaussian, and Epanechnikov kernels. The effect of the bandwidth parameter on the flexibility of the kernel smoothed fit.
References: ISLR: 7.3-7.5. ESL: 5.2, 5.4, and 6.1-6.2; have a look at equation (5.30) to see how smoothing splines can be applied to logistic regression. (Note that ISLR does not have any material on kernel smoothing.)
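A short R sketch of regression splines, smoothing splines, and kernel smoothing using base R (the splines package ships with R); the simulated data are for illustration only.

library(splines)
set.seed(8)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)
rs <- lm(y ~ bs(x, df = 6))                    # cubic regression spline with 6 basis functions
ss <- smooth.spline(x, y, cv = FALSE)          # penalty parameter chosen by generalized CV
ss$df                                          # effective degrees of freedom (trace of the smoother matrix)
ks_wide   <- ksmooth(x, y, kernel = "normal", bandwidth = 3)    # Gaussian kernel, large bandwidth
ks_narrow <- ksmooth(x, y, kernel = "normal", bandwidth = 0.3)  # small bandwidth gives a wigglier fit
plot(x, y, col = "grey")
lines(ss, lwd = 2)
lines(ks_wide, col = "red"); lines(ks_narrow, col = "blue")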
Week 9
Concepts: Local polynomial regression. Bias of KNN and kernel smoothing near boundaries and near maxima or minima. How to use these smoothing methods for classification. Smoothing methods for higher dimensional features by changing the distance or kernel. The importance of standardizing features and choosing features that are relevant in KNN or kernel smoothing. The curse of dimensionality. Generalized additive models and the backfitting algorithm.
References: ISLR: 7.6-7.7, ESL: 6.1.1-6.1.2, 6.3, 9.1.
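The sketch below illustrates local (linear) regression with loess() and a generalized additive model. It assumes the mgcv package is installed (one common GAM implementation, not a course requirement), and the simulated data are invented for the example.

set.seed(9)
d <- data.frame(x1 = runif(300), x2 = runif(300))
d$y <- sin(2 * pi * d$x1) + d$x2^2 + rnorm(300, sd = 0.2)
lo <- loess(y ~ x1, data = d, span = 0.3, degree = 1)   # local linear fit using 30% of the data per window
head(predict(lo))

library(mgcv)
gfit <- gam(y ~ s(x1) + s(x2), data = d)   # additive smooth terms, fit by penalized likelihood
summary(gfit)
plot(gfit, pages = 1)                      # the two estimated component functions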
Week 10
Concepts: Classification and regression trees (CART). How they are fit by minimizing impurity measures within rectangles. How to choose how large a tree to grow (usually involves cross-validation!). Ensemble methods including model averaging, bagging, and random forests.
References: ISLR: 8.1, 8.2.1, 8.2.2. ESL: 8.7-8.8, 9.2
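A rough sketch of a classification tree and a random forest, assuming the rpart and randomForest packages are installed (other implementations would work just as well); iris is used purely as a convenient built-in data set.

library(rpart)
library(randomForest)
tree <- rpart(Species ~ ., data = iris, method = "class")  # grow a classification tree
printcp(tree)                       # cross-validated error for each subtree size
pruned <- prune(tree, cp = 0.1)     # prune back using a complexity-parameter threshold
set.seed(10)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)  # bagged trees with random feature subsets
rf$confusion                        # out-of-bag confusion matrix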
Week 11 - Guest Lecture
Concepts: The material in ISLR sections 10.1, 10.2 and 10.7 was the focus of this week. Material related to these sections may appear on the final exam. The study of neural networks, known as deep learning, is an enormous field. If you are interested in learning more, one canonical reference is this book. See also the other sections in chapter 10 of ISLR and chapter 11 of ESL.
Week 12
Concepts: The idea behind clustering. The k-means algorithm: derivation of the iterations, convergence of the algorithm, the importance of scaling your data, how to choose k. Hierarchical clustering: types of dissimilarity measures between clusters including complete, average, and single linkage, the dendrogram and how to interpret it, a brief discussion of divisive clustering and how clustering can be extended to more exotic objects like DNA sequences. An introductory discussion of Gaussian mixture models.
References: ISLR: 12.4. ESL: 14.3.1-14.3.6, 14.3.12.
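A short R sketch of k-means and hierarchical clustering with base R functions; the iris measurements are used only as a convenient built-in example, and the scaling choices are illustrative.

X <- scale(iris[, 1:4])                     # standardize the features before clustering
set.seed(12)
km <- kmeans(X, centers = 3, nstart = 25)   # several random starts to avoid poor local optima
table(km$cluster, iris$Species)
hc <- hclust(dist(X), method = "complete")  # complete linkage on Euclidean distances
plot(hc)                                    # the dendrogram
table(cutree(hc, k = 3), iris$Species)      # cut the tree to obtain 3 clusters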
Week 13
Concepts: The EM algorithm updates for Gaussian mixture models (GMMs) and their interpretation. The (tautological) observation that generative models are able to generate new, never-before-seen data. Brief mention of the flexibility of GMMs and the fact that they can fit any continuous probability distribution for large enough k. Two perspectives on PCA: variance maximization or distance minimization of projections. The PCA solution for the optimal projection of points onto an affine subspace. The principal component directions of the affine subspace and the principal component scores of the projected points. How to use the PC scores in compression, visualization, and as input for supervised learning algorithms. A discussion of principal components regression, which uses the PC scores as input features in a linear regression. The link between the eigenvectors of the sample covariance matrix and the PC directions, as well as between the eigenvalues and the reconstruction error of PCA. The choice of the dimension k of the affine subspace via scree/elbow plots of the eigenvalues of the sample covariance.
References: ISLR: 6.3.1, 12.2. ESL: 8.5.1, 14.3.7, 14.5.1. Neither ISLR nor ESL has an extensive discussion of GMMs; for this I recommend Section 9.2 of Bishop.
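A brief sketch of PCA using base R's prcomp(), again with iris purely as a convenient example; the final comment names the mclust package as one possible (not required) implementation of Gaussian mixtures.

X  <- scale(iris[, 1:4], center = TRUE, scale = FALSE)   # centre the features
pc <- prcomp(X)
pc$rotation[, 1:2]              # first two principal component directions (loadings)
head(pc$x[, 1:2])               # PC scores: coordinates of the projected points
summary(pc)                     # proportion of variance explained
screeplot(pc, type = "lines")   # scree/elbow plot of the eigenvalues
eigen(cov(X))$values            # eigenvalues of the sample covariance equal pc$sdev^2
# For Gaussian mixture models, one option is Mclust() from the mclust package, which fits GMMs by EM.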
Week 14 (Last partial week of class)
Concepts: (Classical) Multidimensional scaling as a method for obtaining low dimensional embeddings of data into R^d using only a distance matrix. How to derive the embedding when the distance is assumed to be the standard Euclidean distance in R^p. Relationship between the MDS Gram matrix and the principal component scores. Scree plot of the eigenvalues of the Gram matrix for finding a reasonable dimension d of the embedding space. Curves, surfaces, and higher-dimensional manifolds, and the difference between intrinsic (shortest-path) and extrinsic distances on such manifolds. The k-NN graph as a way to approximate the manifold that a point cloud of data lies close to. Distances on the k-NN graph approximate the intrinsic distances of the underlying manifold. Effects of the choice of k in the k-NN graph: if k is too large we lose the manifold structure, while if it is too small we may end up with a disconnected graph. The Isomap algorithm, which first constructs a k-NN graph to obtain shortest-path distances within the graph and then runs MDS on these distances to obtain an embedding into R^d.
References: ESL has a little bit of material in Chapter 14, but it is not extensive. Instead, I recommend looking at Izenman Section 13.6 for MDS; there are some nice illustrative examples in this section. For Isomap, see Izenman Section 16.6.3.
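A short sketch of classical MDS with base R's cmdscale(), starting from a Euclidean distance matrix; iris is used only as a convenient example. The final comment points to one possible Isomap implementation (the vegan package), which is not a course requirement.

X <- scale(iris[, 1:4])
D <- dist(X)                           # pairwise Euclidean distances
mds <- cmdscale(D, k = 2, eig = TRUE)  # 2-dimensional embedding
head(mds$points)                       # embedded coordinates (up to sign, the PC scores when D is Euclidean)
plot(mds$eig, type = "b")              # scree plot of the Gram-matrix eigenvalues
# Isomap replaces D with shortest-path distances on a k-NN graph and then runs the same
# MDS step; one implementation is isomap() in the vegan package.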
Grading Scheme Update: If you do better on quiz 2 than on quiz 1 your quiz 2 grade will replace your quiz 1 grade. Importantly, if you do better on quiz 1 your quiz 1 grade will not replace your quiz 2 grade.
Course Description: The course focuses on statistical learning techniques, in particular those of supervised classification, both from statistical (logistic regression, discriminant analysis, nearest neighbours, and others) and machine learning background (tree-based methods, neural networks, support vector machines), with the emphasis on decision-theoretic underpinnings and other statistical aspects, flexible model building (regularization with penalties), and algorithmic solutions. Selected methods of unsupervised classification (clustering) and some related regression methods are covered as well.
Prerequisites: It is expected that you are very comfortable with regression techniques at the level of at least STAT 378. Proficiency with multivariable calculus, linear algebra and standard probabilistic computations (computing expectations, covariances, conditional probability distributions, manipulation of multivariate normal distributions etc.) is also required.
Grading:
Grade breakdown
5 assignments for 35% of the total grade. The lowest assignment grade is dropped.
2 quizzes, each worth 15%.
The final exam is worth 35%.
Assignments: All assignments are to be submitted on Canvas. You may scan handwritten solutions or write up solutions in LaTeX (preferred). If you choose to write up your solutions by hand, please make sure that they are legible. For coding questions, please submit the relevant code chunks and output as part of your solution, while also including your raw code in a separate file. Assignments are meant to be completed individually, without assistance from your peers or from generative AI models.
Late policy: 25% is subtracted from the grade of a given assignment for every day that this assignment is late. Assignments are due at 11:59 PM MST on the day indicated in the syllabus.
Resources:
There is no required textbook for this course. However, we will loosely be following
The Elements of Statistical Learning by Hastie, Tibshirani and Friedman (2009). This book can be downloaded here.
Three other books that may be useful are:
An Introduction to Statistical Learning with Applications in R by James, Witten, Hastie, and Tibshirani (2013). This book can be downloaded here.
Pattern Recognition and Machine Learning by Bishop (2006). This book can be downloaded here.
Modern Multivariate Statistical Techniques by Izenman (2008). This book can be downloaded here.
Software: We will be using R throughout this course. If you wish to use Python for your assignments you are free to do so.
Class Time, Office Hours, and Contact Information:
Class time: Monday, Wednesday and Friday, 11:00-11:50 AM, CAB 281.
Office hours: Monday and Wednesday, 10:00-10:50 AM, CAB 475.
My email is: mccorma2[AT]ualberta[DOT]ca
Tentative Outline:
Overview of different types of learning.
Introduction to supervised learning.
Measures of performance of a learning algorithm.
Bias-variance tradeoff.
Review of linear regression.
Variable selection and penalized regression methods.
Cross-validation.
Introduction to classification: logistic and multinomial regression.
Iterative optimization techniques: gradient descent and Newton-Raphson.
LDA, QDA, and naive Bayes.
Non-parametric regression – splines.
Non-parametric regression – local regression methods.
Generalized additive models.
Classification and Regression trees.
The bootstrap, bagging, and random forests.
Feedforward neural networks.
Convolutional and equivariant neural networks.
Support vector machines.
Unsupervised learning: K-means and hierarchical clustering.
Principal components analysis and kernel PCA.
Multidimensional scaling.
Isomap and manifold learning.
Factor analysis.
Directed graphical models.
Undirected graphical models.
Autoencoders and restricted Boltzmann machines.
Note: Many of the above topics will take more than one lecture to discuss. While I hope to cover most of the topics listed above, due to time constraints some topics, particularly those in the latter half of the course, will likely be omitted.
The first quiz is on February the 5th in class. You are allowed one double-sided sheet of course notes for the quiz. No calculators or other electronic devices are permitted. The quiz consists of a true or false section and a long answer section.
Topics covered on the quiz include:
Decision theory
Manipulation of means and covariances of multivariate data
Linear regression
Variable selection and regularized linear regression methods such as ridge regression and the LASSO. Cross-validation will not appear on the quiz.
Recall that if you do better on quiz 2 than on quiz 1, your mark for quiz 2 will replace your mark for quiz 1. Importantly, your mark for quiz 1 will not replace your mark for quiz 2 if you did better on quiz 1.
Some practice questions for quiz 2 can be found here. There will be no questions about GAMs on the quiz so you may ignore practice problem 8.
Similar to quiz 1, you will be allowed one double-sided sheet of notes. You shouldn't need one, but you may bring a non-programmable calculator. The quiz will be held in class on March 12th and will last 50 minutes. It will consist of long-answer questions.
Coverage:
Cross-validation.
Logistic and multinomial regression.
Gradient descent and Newton-Raphson.
Generative modelling techniques including LDA, QDA, and Naive Bayes.
Univariate regression spline models and smoothing spline models.
Kernel smoothing, local regression, and KNN regression/classification.