Francis Bach, Learning Theory from First Principles
This is the primary textbook for the course. It provides a self-contained, mathematically rigorous introduction to statistical learning theory. Beginning with fundamental tools from probability, linear algebra, and convex analysis, the book gradually builds up the core results of learning theory and their algorithmic counterparts. Each chapter combines theoretical results, analytic examples, and computational experiments. The text emphasizes both intuition and rigor, making it particularly well-suited for mathematically mature students.
Code repository: Python and MATLAB code for reproducing the figures and experiments in the book is available at:
👉 GitHub: Learning Theory from First Principles
Custom handouts prepared for this course summarize the key results and take-home messages from each lecture, and provide quick-reference formulas for concentration inequalities, bias–variance decomposition, regression solutions, and generalization bounds.
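For orientation, two formulas of the kind collected in these handouts (stated here in generic notation, not necessarily that of the handouts) are the bias–variance decomposition of the squared-error risk and the closed-form ridge regression estimator:

\[
\mathbb{E}\big[(\hat f(x) - f^*(x))^2\big]
= \big(\mathbb{E}[\hat f(x)] - f^*(x)\big)^2
+ \mathbb{E}\big[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\big],
\qquad
\hat\theta_\lambda = (X^\top X + n\lambda I)^{-1} X^\top y,
\]

where the first and second terms are the squared bias and the variance, and \(\hat\theta_\lambda\) minimizes \(\frac{1}{n}\|y - X\theta\|_2^2 + \lambda\|\theta\|_2^2\).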
Yaser S. Abu-Mostafa, Learning from Data
A classic Caltech course, taught by Feynman Prize winner Prof. Yaser Abu-Mostafa. The 18 recorded lectures, together with incremental slide decks, provide a highly intuitive introduction to fundamental learning concepts: generalization, bias–variance, linear models, overfitting, regularization, support vector machines, kernel methods, and neural networks. This is an excellent resource for pre-class preparation or conceptual reinforcement.
Roman Vershynin, High-Dimensional Probability: An Introduction with Applications in Data Science
This book develops the modern probabilistic toolkit that underpins much of learning theory. It covers sub-Gaussian and sub-exponential random variables, concentration inequalities, covariance estimation, the Hanson–Wright inequality, and matrix Bernstein bounds. These tools provide the high-dimensional probability background needed for non-asymptotic analysis of regression, stochastic gradient methods, and kernel methods.
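As a representative example of the non-asymptotic bounds developed there, the general Hoeffding inequality for independent, mean-zero sub-Gaussian random variables \(X_1,\dots,X_n\) reads

\[
\mathbb{P}\Big(\Big|\sum_{i=1}^{n} X_i\Big| \ge t\Big)
\le 2 \exp\!\Big(-\frac{c\, t^2}{\sum_{i=1}^{n} \|X_i\|_{\psi_2}^2}\Big)
\qquad \text{for all } t \ge 0,
\]

where \(\|\cdot\|_{\psi_2}\) is the sub-Gaussian norm and \(c > 0\) is an absolute constant.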
Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar, Foundations of Machine Learning (2nd Edition, MIT Press, 2018)
A comprehensive and rigorous graduate-level textbook covering the foundations of statistical learning theory and its algorithmic counterparts. Topics include PAC learning, VC dimension, Rademacher complexity, stability, SVMs, boosting, kernels, online learning, and ranking. The second edition adds expanded chapters on model selection and information-theoretic methods, as well as updated material on concentration inequalities. The text balances theoretical depth with algorithmic analysis, making it an excellent complement to Bach’s book.
Felipe Cucker & Ding-Xuan Zhou, Learning Theory: An Approximation Theory Viewpoint
This book offers a complementary perspective rooted in functional analysis and approximation theory. It explores the decomposition of error into approximation and estimation terms, approximation spaces, RKHS theory, universality of kernels, and detailed convergence rates. It provides a rigorous theoretical foundation for understanding learning as an approximation process in infinite-dimensional spaces.
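The decomposition at the heart of this viewpoint can be stated in one line: for a hypothesis class \(\mathcal{H}\) and an estimator \(\hat f \in \mathcal{H}\),

\[
\mathcal{E}(\hat f) - \mathcal{E}(f^*)
= \Big(\mathcal{E}(\hat f) - \inf_{f \in \mathcal{H}} \mathcal{E}(f)\Big)
+ \Big(\inf_{f \in \mathcal{H}} \mathcal{E}(f) - \mathcal{E}(f^*)\Big),
\]

where \(\mathcal{E}\) is the population risk and \(f^*\) the target; the first term is the estimation (sample) error and the second the approximation error, analyzed with statistical and approximation-theoretic tools respectively.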
Felipe Cucker & Steve Smale, On the Mathematical Foundations of Learning
A seminal survey article published in the Bulletin of the American Mathematical Society (2002). This short paper outlines a unifying view of machine learning, highlighting the interaction of approximation theory, statistical estimation, and computational complexity. It is an excellent conceptual introduction to the mathematical landscape of learning theory, and a highly recommended complement to the more detailed textbooks.
Tomaso Poggio & Steve Smale, The Mathematics of Learning: Dealing with Data
Published in the Notices of the American Mathematical Society (2003, Vol. 50, No. 5, pp. 537–544).
This short survey introduces the mathematical challenges of learning from data, emphasizing the interplay between approximation theory, probability, and computational complexity. It complements the other foundational readings by providing a broad conceptual overview aimed at mathematicians.
Other references and discussion notes from a previous reading group on learning theory are collected below:
Bousquet, Boucheron, Lugosi (2003) — Introduction to Statistical Learning Theory
Concise overview of ERM, uniform convergence, VC dimension, Rademacher complexity, margins, and stability.
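A one-line version of the uniform-convergence argument surveyed there: if \(\hat f\) minimizes the empirical risk \(\hat{\mathcal{E}}_n\) over a class \(\mathcal{F}\), then

\[
\mathcal{E}(\hat f) - \inf_{f \in \mathcal{F}} \mathcal{E}(f)
\le 2 \sup_{f \in \mathcal{F}} \big|\mathcal{E}(f) - \hat{\mathcal{E}}_n(f)\big|,
\]

so bounding the excess risk of ERM reduces to bounding a uniform deviation, which is where VC dimension and Rademacher complexity enter.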
Bartlett & Mendelson (2002) — Rademacher and Gaussian Complexities: Risk Bounds and Structural Results
The modern, data-dependent capacity viewpoint; foundational for Rademacher-based bounds.
👉 PDF
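For reference, the empirical Rademacher complexity of a class \(\mathcal{F}\) on a sample \(z_1,\dots,z_n\), and the resulting risk bound (for functions with values in \([0,1]\), stated up to constants in the deviation term), are

\[
\hat{\mathfrak{R}}_n(\mathcal{F}) = \mathbb{E}_{\sigma}\Big[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \sigma_i f(z_i)\Big],
\qquad
\mathcal{E}(f) \le \hat{\mathcal{E}}_n(f) + 2\,\hat{\mathfrak{R}}_n(\mathcal{F}) + O\!\Big(\sqrt{\tfrac{\log(1/\delta)}{n}}\Big)
\ \ \text{for all } f \in \mathcal{F},
\]

holding with probability at least \(1-\delta\) over the sample, where the \(\sigma_i\) are i.i.d. uniform random signs.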
Bousquet & Elisseeff (2002) — Stability and Generalization (JMLR)
Algorithmic stability as an alternative route to generalization bounds; links naturally to regularization.
👉 PDF
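In rough form (constants differ slightly between statements in the paper): if an algorithm \(A\) is \(\beta\)-uniformly stable and the loss is bounded by \(M\), then with probability at least \(1-\delta\) over a sample \(S\) of size \(n\),

\[
\mathcal{E}(A_S) \le \hat{\mathcal{E}}_n(A_S) + 2\beta + (4 n \beta + M)\sqrt{\frac{\log(1/\delta)}{2n}},
\]

which is non-trivial precisely when \(\beta = O(1/n)\), as holds for Tikhonov-regularized ERM in an RKHS.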
Blumer, Ehrenfeucht, Haussler, Warmuth (1989) — Learnability and the Vapnik–Chervonenkis Dimension
Classic PAC result: finiteness of VC dimension characterizes learnability; clean historical foundation.
👉 PDF
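Quantitatively, in the realizable PAC setting and up to constants, a class of VC dimension \(d\) can be learned to error \(\varepsilon\) with probability \(1-\delta\) from

\[
n(\varepsilon,\delta) = O\!\Big(\frac{1}{\varepsilon}\Big(d \log\frac{1}{\varepsilon} + \log\frac{1}{\delta}\Big)\Big)
\]

samples by any consistent learner, while infinite VC dimension rules out distribution-free learning altogether.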
Guedj (2019) — A Primer on PAC-Bayesian Learning
Self-contained introduction to PAC-Bayes bounds and algorithms; excellent for projects or deeper dives.
👉 PDF
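One classical bound from this family (a McAllester-type bound; constants and logarithmic factors vary across versions): for a prior \(\pi\) chosen before seeing the data, any posterior \(\rho\), and losses in \([0,1]\), with probability at least \(1-\delta\) over an i.i.d. sample of size \(n\),

\[
\mathbb{E}_{h \sim \rho}\big[\mathcal{E}(h)\big]
\le \mathbb{E}_{h \sim \rho}\big[\hat{\mathcal{E}}_n(h)\big]
+ \sqrt{\frac{\mathrm{KL}(\rho\,\|\,\pi) + \log\frac{2\sqrt{n}}{\delta}}{2n}}.
\]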
Motonobu Kanagawa, Philipp Hennig, Dino Sejdinovic, Bharath K. Sriperumbudur, “Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences” (arXiv:1807.02582)
A comprehensive review exploring the deep mathematical and functional connections between Gaussian processes, kernel methods, and reproducing kernel Hilbert spaces (RKHS). The paper highlights equivalences between probabilistic and deterministic formulations of nonparametric learning, bridging the perspectives of Bayesian inference and regularization theory.
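A concrete instance of these equivalences: writing \(K = (k(x_i,x_j))_{i,j}\) for the kernel matrix and \(k_x = (k(x,x_1),\dots,k(x,x_n))^\top\), the Gaussian-process posterior mean with noise variance \(\sigma^2\) and the kernel ridge regression estimator with regularization parameter \(\lambda\) are

\[
m_{\mathrm{GP}}(x) = k_x^\top (K + \sigma^2 I)^{-1} y,
\qquad
\hat f_{\mathrm{KRR}}(x) = k_x^\top (K + n\lambda I)^{-1} y,
\]

so the two coincide when \(\sigma^2 = n\lambda\); here \(\hat f_{\mathrm{KRR}}\) minimizes \(\frac{1}{n}\sum_{i=1}^{n}(f(x_i)-y_i)^2 + \lambda\|f\|_{\mathcal{H}}^2\) over the RKHS \(\mathcal{H}\) of \(k\).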
Bach’s book is the primary guide for lectures, exercises, and the overall structure of the course.
Abu-Mostafa’s lectures are highly recommended for conceptual reinforcement; they give accessible introductions to many of the same topics.
Vershynin’s text provides the probabilistic foundations needed for a deeper understanding of generalization bounds, SGD analysis, and random matrix concentration.
Mohri et al. provides a broad, modern textbook-level complement, with expanded topics (model selection, PAC-Bayes, online learning).
Cucker & Zhou give the approximation-theoretic perspective, which deepens understanding of RKHS, kernel methods, and approximation rates beyond the probabilistic view.
Cucker & Smale’s article is a conceptual map of the field, showing how approximation, probability, and complexity interact.
Poggio & Smale’s article is a concise, conceptual introduction, ideal for situating learning theory within the broader mathematical sciences.