Instructor: Avishek Ghosh, Assistant Professor, Dept. of CSE, IIT Bombay
Contact: Room KR-116, Kresit Building; Email: avishek@cse.iitb.ac.in
Timing: Tuesday/Friday 3:30 - 4:55 pm
Classroom: KR-225 (Kresit Building) (Changed from LC 001, Lecture Hall Complex-2)
TAs: Deeksha Koul (email: 24d0365@iitb.ac.in)
Ravi Sah (email: 24m0854@iitb.ac.in)
Harekrushna Sahu (email: 24D1594@iitb.ac.in)
Soumen Mondal (email: soumenkm@iitb.ac.in)
Scribe Format: Here
Scribe Schedule: Here
About the course: The content is roughly divided into 7 parts:
1. Empirical Process Theory: Empirical Risk Minimization, Concentration of Measure, Hoeffding's, McDiarmid's, Azuma's, and Bernstein's inequalities, Generalization (a sample inequality is stated after this list).
2. Uniform Law of Large Numbers: Glivenko-Cantelli, Rademacher Averages, Covering, Packing, Metric Entropy, VC Dimension, Chaining, Dudley's Entropy Integral.
3. M-Estimation: Rates of convergence for M-estimators, Consistency, Localization, Lipschitz regression.
4. Sparse Linear Regression: Gaussian sequence model, Convex penalized regression, Hard- and soft-thresholding estimators, LASSO, Rates of convergence for the LASSO.
5. Principal Component Analysis: Applications, Compression, Eigenvalue and eigenvector perturbations, Spiked ensemble model.
6. Reproducing Kernel Hilbert Spaces (RKHS): Hilbert spaces, Kernels, Reproducing property, Linear functionals, Interpolation and fitting in RKHS.
7. Minimax Lower Bounds: Decision-theoretic framework, Bayes approach, Multi-hypothesis formulation, Fano's lemma, Applications.
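As a taste of Part 1's non-asymptotic style, here is a minimal sketch of Hoeffding's inequality in generic notation (the symbols below are illustrative and not taken from the course's scribed notes):

```latex
% Hoeffding's inequality: X_1, ..., X_n independent, with X_i \in [a_i, b_i] almost surely.
% The centered sum has sub-Gaussian tails:
\Pr\left( \Bigl| \sum_{i=1}^{n} \bigl( X_i - \mathbb{E}[X_i] \bigr) \Bigr| \ge t \right)
  \le 2 \exp\left( - \frac{2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right),
  \qquad t > 0.
```

Setting a_i = 0, b_i = 1, and t = n\epsilon recovers the familiar 2\exp(-2n\epsilon^2) deviation bound for the sample mean of bounded random variables.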
Grading: HWs (20%), one midterm (30%), Final (30%), Scribe (15%), and Class participation (5%)
References:
1. High-Dimensional Statistics: A Non-Asymptotic Viewpoint - Martin J. Wainwright, Cambridge University Press, 2019.
2. Asymptotic Statistics - Aad van der Vaart, Cambridge University Press, 1998.
3. The Elements of Statistical Learning: Data Mining, Inference, and Prediction - T. Hastie, R. Tibshirani, J. Friedman, Springer, 2009.
4. Empirical Processes in M-Estimation - Sara van de Geer, Cambridge University Press, 2000.
5. All of Statistics: A Concise Course in Statistical Inference - Larry Wasserman, Springer, 2004.
6. Concentration Inequalities: A Nonasymptotic Theory of Independence - S. Boucheron, G. Lugosi, P. Massart, Oxford University Press, 2013.
7. High-Dimensional Probability: An Introduction with Applications in Data Science - R. Vershynin, Cambridge University Press, 2018.
Apart from these references, we will follow lecture notes from similar courses by Prof. Peter Bartlett (UC Berkeley), Prof. Aditya Guntuboyina (UC Berkeley), and Prof. Arya Mazumdar (UC San Diego).
Lecture Notes
1. Lecture 1: Introduction, Binary Classification, M-estimation, Uniform Law of Large Numbers
2. Lecture 2 : Concentration of Measure, McDiarmid's inequality, Hoeffding's inequality, Cramér-Chernoff method
3. Lecture 3 : Central Limit Theorem vs Hoeffding's, Confidence Intervals, Sub-Gaussian random variables
4. Lecture 4 : Properties of Sub-Gaussian random variables, Martingale-based concentration inequalities, Examples, Doob Martingale, Martingale Difference, Azuma-Hoeffding’s inequality, Bernstein’s inequality
5. Lecture 5 : Proof of McDiarmid’s Inequality, Concentration of measure for Uniform Law of Large Numbers, Talagrand's inequality, Control of Expected supremum, Rademacher Complexity, Examples
6. Lecture 6 : Symmetrization argument, Bound on Expected supremum, Massart’s Lemma, Application to Uniform Law of Large Numbers
7. Lecture 7 : Polynomial Discrimination, Shattering, VC Dimension, Examples, VC-dimension and Uniform Law, Sauer-Shelah-Vapnik-Chervonenkis (SSVC) Lemma, Introduction to Chaining
8. Lecture 8 : Covering Number, Packing Number, Examples, Volumetric argument, Covering-Packing of function classes, Parametric classes
9. Lecture 9 : Packing/Covering Numbers of non-parametric function spaces, e.g., Lipschitz functions on the unit interval/cube, Monotone functions on the unit interval, Hölder classes; Sub-Gaussian processes over a metric space, One-Step Discretization, Applications
10. Lecture 10 : Motivation for Chaining, Dudley's Entropy Integral Bound (a sample statement appears after this lecture list), Hierarchical Covering, Chaining Approximation, Proof of Dudley's method
11. Lecture 11 : Final Bound on Expected Supremum via Chaining, Application to Boolean Function Classes, Uniform Law for Boolean Function Classes
12. Lecture 12 : Generalization error for Binary Classification, Glivenko-Cantelli Theorem, From Boolean to real-valued Function Classes, VC Subgraph Dimension, Fat-Shattering dimension, Mendelson and Vershynin bound on expected supremum
13. Lecture 13 : M-Estimation, Examples, M-Estimation as Empirical process, Basic Inequality, Measure of discrepancy, Localization argument
14. Lecture 14 : Rigorous derivation of M-estimation rates, Localization, Rate Theorem, Application in Non-parametric Regression, Lipschitz Regression
15. Lecture 15 : Recap of Rate Theorem, Local Dudley Entropy Integral, Proof for Bounded Lipschitz Regression, Relaxing the assumptions of the Rate Theorem
16. Lecture 16 : Sparse Linear Regression, Applications in compressed sensing, Solution in Noiseless case, Restricted nullspace property, Recovery
17. Lecture 17 : Sufficient Conditions for Restricted nullspace property, Pairwise Incoherence, Restricted Isometry Property, Reconstruction with noise, LASSO, Statistical rates
18. Lecture 18 : Restricted Eigenvalue condition, Bounds on Estimation Error for LASSO, Constrained LASSO and Relaxed Basis Pursuit, Prediction error rates
19. Lecture 19 : Principal Component Analysis (PCA), data compression, low-rank matrix approximation, Eckart-Young-Mirsky Theorem, PCA for solving mixture models
20. Lecture 20 : Perturbation of eigenvalues and eigenvectors, Weyl's theorem, Eigenvector perturbation results, Davis-Kahan sin Θ theorem, Spiked Ensemble Model, Sparse PCA
21. Lecture 21 : Reproducing Kernel Hilbert Spaces (RKHS), Cauchy sequences and Hilbert spaces, Examples of Hilbert spaces, Bounded Linear Functional, Kernel and Feature maps, Reproducing property
22. Lecture 22 : Construction of RKHS from a PSD kernel, Existence, Uniqueness, Another View of RKHS: Linear Functional Perspective, Riesz Representation Theorem, Evaluation functional, Example of Sobolev RKHS
23. Lecture 23 : Mercer's Theorem with examples, Sobolev spaces, Translation-Invariant Kernels, Gaussian Kernel, Interpolation in RKHS, Minimum-norm interpolation
24. Lecture 24 : Fitting via Kernel Ridge Regression, Minimax Bounds, Minimax risk, Packing argument, connection to M-ary testing problem
25. Lecture 25 : Basic Framework for Minimax risks, From Estimation to Testing, Divergence Measures, Pinsker's Inequality, Le Cam's Inequality, Binary Testing and Le Cam's Method, Example: Mean estimation in the Gaussian location family
26. Lecture 26 : Le Cam's method for functionals, Disadvantages of Le Cam's Method, Fano's Method: KL Divergence and Mutual Information, Gaussian Mean Estimation, Some Results using Fano's Method
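The chaining development of Lectures 10-11 culminates in Dudley's entropy integral bound. A standard statement, up to an unspecified universal constant and in generic notation (not the notation of the scribed notes), reads:

```latex
% Dudley's entropy integral: (X_\theta)_{\theta \in T} is a zero-mean sub-Gaussian process
% with respect to a metric d on T, N(T, d, \varepsilon) is the \varepsilon-covering number
% of (T, d), and D = \mathrm{diam}(T). For some universal constant C > 0,
\mathbb{E}\Bigl[ \sup_{\theta \in T} X_\theta \Bigr]
  \le C \int_{0}^{D} \sqrt{\log N(T, d, \varepsilon)} \, d\varepsilon .
```

Combined with the metric entropy estimates of Lectures 8-9 (e.g., for Lipschitz or Hölder classes), this yields the expected-supremum bounds behind the uniform laws of Lecture 11.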
General Guidelines for Homeworks:
Homeworks should be submitted in class. Students are encouraged to discuss the HW problems among themselves; however, each student must write up the solutions on their own. Near-identical submissions will not receive any credit.