Francis Bach, Learning Theory from First Principles
This is the primary textbook for the course. It provides a self-contained, mathematically rigorous introduction to statistical learning theory. Beginning with fundamental tools from probability, linear algebra, and convex analysis, the book gradually builds up the core results of learning theory and their algorithmic counterparts. Each chapter combines theoretical results, analytic examples, and computational experiments. The text emphasizes both intuition and rigor, making it particularly well-suited for mathematically mature students.
Code repository: Python and MATLAB code for reproducing figures and experiments from the book:
👉 GitHub: Learning Theory from First Principles
Felipe Cucker & Ding-Xuan Zhou, Learning Theory: An Approximation Theory Viewpoint
This is the secondary textbook for the course. It offers a complementary perspective rooted in functional analysis and approximation theory. It explores the decomposition of error into approximation and estimation terms, and introduces approximation spaces, RKHS theory, the universality of kernels, and detailed convergence rates. The book provides a rigorous theoretical foundation for understanding learning as approximation in infinite-dimensional function spaces.
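As a quick illustration of the decomposition that organizes much of the book (notation here is illustrative and may differ from the text): writing $f_\rho$ for the regression function, $f_{\mathcal H}$ for the best approximation to it within the hypothesis space $\mathcal H$, and $\hat f$ for the learned estimator, the excess risk splits as

\[
\mathcal{E}(\hat f) - \mathcal{E}(f_\rho)
\;=\;
\underbrace{\mathcal{E}(\hat f) - \mathcal{E}(f_{\mathcal H})}_{\text{estimation (sample) error}}
\;+\;
\underbrace{\mathcal{E}(f_{\mathcal H}) - \mathcal{E}(f_\rho)}_{\text{approximation error}}.
\]

The approximation term is governed by approximation-theoretic properties of $\mathcal H$, which is the emphasis of Cucker & Zhou; the estimation term is governed by probabilistic and capacity arguments, which is the emphasis of Bach.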
The two textbooks provide different perspectives on learning theory: one from the statistical / probabilistic side, the other from the analytic / operator-theoretic side.
Supplementary Note
Overview: Different Learning Regimes in Modern Learning Theory, From Fixed Design to NTK
This short handout summarizes the key “learning regimes” discussed throughout the course, from fixed-design linear models to kernel methods, Barron spaces, and the Neural Tangent Kernel (NTK) limit.
Yaser S. Abu-Mostafa, Learning from Data
A classic Caltech course, taught by Feynman Prize winner Prof. Yaser Abu-Mostafa. The 18 recorded lectures, together with incremental slide decks, provide a highly intuitive introduction to fundamental learning concepts: generalization, bias–variance, linear models, overfitting, regularization, support vector machines, kernel methods, and neural networks. This is an excellent resource for pre-class preparation or conceptual reinforcement.
Roman Vershynin, High-Dimensional Probability: An Introduction with Applications in Data Science
This book develops the modern probabilistic toolkit that underpins much of learning theory. It covers sub-Gaussian and sub-exponential random variables, concentration inequalities, covariance estimation, the Hanson–Wright inequality, and matrix Bernstein bounds. These tools provide the high-dimensional probability background needed for non-asymptotic analysis of regression, stochastic gradient methods, and kernel methods.
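A representative example of the non-asymptotic statements developed there (stated informally; $c > 0$ is an unspecified absolute constant): if $X_1, \dots, X_n$ are independent, mean-zero, sub-Gaussian random variables with sub-Gaussian norm at most $K$, then for every $t \ge 0$,

\[
\mathbb{P}\!\left( \left| \frac{1}{n} \sum_{i=1}^n X_i \right| \ge t \right) \;\le\; 2 \exp\!\left( - \frac{c\, n\, t^2}{K^2} \right).
\]

Finite-sample bounds of this form, rather than asymptotic limit theorems, are what feed into the generalization and risk bounds used in the course.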
Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar, Foundations of Machine Learning (2nd Edition, MIT Press, 2018)
A comprehensive and rigorous graduate-level textbook covering the foundations of statistical learning theory and its algorithmic counterparts. Topics include PAC learning, VC dimension, Rademacher complexity, stability, SVMs, boosting, kernels, online learning, and ranking. The second edition adds expanded chapters on model selection and information-theoretic methods, as well as updated material on concentration inequalities. The text balances theoretical depth with algorithmic analysis, making it an excellent complement to Bach’s book.
Felipe Cucker & Steve Smale, On the Mathematical Foundations of Learning
A seminal survey article published in the Bulletin of the American Mathematical Society (2002). It outlines a unifying view of machine learning, highlighting the interaction of approximation theory, statistical estimation, and computational complexity. It is an excellent conceptual introduction to the mathematical landscape of learning theory, and a highly recommended complement to the more detailed textbooks.
Tomaso Poggio & Steve Smale, The Mathematics of Learning: Dealing with Data
Published in the Notices of the American Mathematical Society (2003, Vol. 50, No. 5, pp. 537–544).
This short survey introduces the mathematical challenges of learning from data, emphasizing the interplay between approximation theory, probability, and computational complexity. It complements the other foundational readings by providing a broad conceptual overview aimed at mathematicians.
Other references and discussion notes from a previous reading group on learning theory:
Bousquet, Boucheron, Lugosi (2003) — Introduction to Statistical Learning Theory
Concise overview of ERM, uniform convergence, VC dimension, Rademacher complexity, margins, and stability.
Bartlett & Mendelson (2002) — Rademacher and Gaussian Complexities: Risk Bounds and Structural Results
The modern, data-dependent capacity viewpoint; foundational for Rademacher-based bounds.
👉 PDF
Bousquet & Elisseeff (2002) — Stability and Generalization (JMLR)
Algorithmic stability as an alternative route to generalization bounds; links naturally to regularization.
👉 PDF
Blumer, Ehrenfeucht, Haussler, Warmuth (1989) — Learnability and the Vapnik–Chervonenkis Dimension
Classic PAC result: finiteness of VC dimension characterizes learnability; clean historical foundation.
👉 PDF
Guedj (2019) — A Primer on PAC-Bayesian Learning
Self-contained introduction to PAC-Bayes bounds and algorithms; excellent for projects or deeper dives.
👉 PDF
Motonobu Kanagawa, Philipp Hennig, Dino Sejdinovic, Bharath K. Sriperumbudur, “Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences” (arXiv:1807.02582)
A comprehensive review exploring the deep mathematical and functional connections between Gaussian processes, kernel methods, and reproducing kernel Hilbert spaces (RKHS). The paper highlights equivalences between probabilistic and deterministic formulations of nonparametric learning, bridging the perspectives of Bayesian inference and regularization theory.
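One concrete instance of these equivalences (in commonly used, here only illustrative, notation): for a kernel $k$ with Gram matrix $K = \bigl(k(x_i, x_j)\bigr)_{i,j=1}^n$ and $k(x) = \bigl(k(x, x_1), \dots, k(x, x_n)\bigr)^\top$, the Gaussian process posterior mean with observation noise variance $\sigma^2$ and the kernel ridge regression estimate with regularization parameter $\lambda$ are

\[
m_{\mathrm{GP}}(x) = k(x)^\top \bigl(K + \sigma^2 I\bigr)^{-1} y,
\qquad
\hat f_{\mathrm{KRR}}(x) = k(x)^\top \bigl(K + n\lambda I\bigr)^{-1} y,
\]

so the two coincide when $\sigma^2 = n\lambda$. The review develops this correspondence in detail, together with analogues for posterior variances and related quantities.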
Steve Smale & Ding-Xuan Zhou, Learning Theory Estimates via Integral Operators and Their Approximations, Constructive Approximation, 26(2):153–172, August 2007.
This foundational paper establishes quantitative learning theory estimates using the framework of integral operators in reproducing kernel Hilbert spaces (RKHS). It develops non-asymptotic error bounds for regularization schemes, connecting spectral approximation properties of integral operators with statistical estimation error and capacity control.
👉 ResearchGate link
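For orientation, the central object in this line of work is the integral operator associated with the kernel $K$ and the marginal distribution $\rho_X$ (notation illustrative):

\[
(L_K f)(x) = \int_{\mathcal X} K(x, t)\, f(t)\, d\rho_X(t),
\]

a compact, positive, self-adjoint operator on $L^2(\rho_X)$ whose eigenvalue decay encodes the capacity of the RKHS. Roughly speaking, the population-level regularized solution can be written as $f_\lambda = (L_K + \lambda I)^{-1} L_K f_\rho$, and the error bounds compare its empirical counterpart with the regression function $f_\rho$.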
Lorenzo Rosasco, Mikhail Belkin, & Ernesto De Vito, On Learning with Integral Operators, JMLR, 11(30):905–934, 2010.
This paper develops a spectral analysis of learning algorithms based on integral operators, connecting kernel methods to regularization theory and providing detailed sample complexity bounds in terms of the operator spectrum.
👉 https://jmlr.org/papers/v11/rosasco10a.html
Jaouad Mourtada & Lorenzo Rosasco, An Elementary Analysis of Ridge Regression with Random Design, Comptes Rendus. Mathématique, Volume 360 (2022), pp. 1055–1063, https://comptes-rendus.academie-sciences.fr/mathematique/item/10.5802/crmath.367.pdf
This work provides a non-asymptotic analysis of ridge regression under random design assumptions. It offers clear finite-sample bounds using operator-theoretic tools, clarifies the bias–variance decomposition in the spectral domain, and unifies the perspectives of random matrix theory and learning theory. A highly recommended modern treatment for bridging classical linear regression and kernel-based generalization analysis.
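To make the spectral picture concrete, here is the familiar fixed-design version of the decomposition (the paper's point is to extend such statements to random design; notation here is illustrative). With empirical covariance $\hat\Sigma = \tfrac{1}{n} X^\top X$ with eigenpairs $(\hat\lambda_j, v_j)$, noise variance $\sigma^2$, and ridge estimator $\hat\theta_\lambda = (\hat\Sigma + \lambda I)^{-1} \tfrac{1}{n} X^\top y$, the in-sample prediction risk decomposes in the eigenbasis as

\[
\mathbb{E}\,\bigl\|\hat\theta_\lambda - \theta^*\bigr\|_{\hat\Sigma}^2
\;=\;
\underbrace{\sum_j \frac{\lambda^2\, \hat\lambda_j\, \langle \theta^*, v_j\rangle^2}{(\hat\lambda_j + \lambda)^2}}_{\text{bias}^2}
\;+\;
\underbrace{\frac{\sigma^2}{n} \sum_j \frac{\hat\lambda_j^2}{(\hat\lambda_j + \lambda)^2}}_{\text{variance}}.
\]

The variance term is at most $\sigma^2/n$ times the effective dimension $\operatorname{tr}\bigl(\hat\Sigma(\hat\Sigma + \lambda I)^{-1}\bigr)$; Mourtada & Rosasco show how to obtain bounds of the same flavor when the design itself is random.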
Bach’s book is the primary guide for lectures, exercises, and the overall structure of the course.
Cucker & Zhou give the approximation-theoretic perspective, which deepens understanding of RKHS, kernel methods, and approximation rates beyond the probabilistic view.
Abu-Mostafa’s lectures are highly recommended for conceptual reinforcement; they give accessible introductions to many of the same topics.
Vershynin’s text provides the probabilistic foundations needed for a deeper understanding of generalization bounds, SGD analysis, and random matrix concentration.
Mohri et al. provide a broad, modern textbook-level complement, with expanded topics (model selection, PAC-Bayes, online learning).
Cucker & Smale’s article is a conceptual map of the field, showing how approximation, probability, and complexity interact.
Poggio & Smale’s article is a concise, conceptual introduction, ideal for situating learning theory within the broader mathematical sciences.