Francis Bach, Learning Theory from First Principles
This is the primary textbook for the course. It provides a self-contained, mathematically rigorous introduction to statistical learning theory. Beginning with fundamental tools from probability, linear algebra, and convex analysis, the book gradually builds up the core results of learning theory and their algorithmic counterparts. Each chapter combines theoretical results, analytic examples, and computational experiments. The text emphasizes both intuition and rigor, making it particularly well-suited for mathematically mature students.
Code repository: Python and MATLAB code for reproducing the figures and experiments in the book is available at:
👉 GitHub: Learning Theory from First Principles
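To give a flavor of the computational experiments the book pairs with its theory, the sketch below fits ridge regression on synthetic data and compares train and test error across regularization strengths. It is a generic illustration using only NumPy, not code from the repository.

```python
# Minimal sketch (not from the book's repository): a ridge-regression
# experiment of the kind the text pairs with its theoretical results.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear model y = x^T w* + noise, with n < d so that
# regularization visibly matters.
n_train, n_test, d = 50, 1000, 100
w_star = rng.normal(size=d) / np.sqrt(d)
X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
y_train = X_train @ w_star + 0.5 * rng.normal(size=n_train)
y_test = X_test @ w_star + 0.5 * rng.normal(size=n_test)

def ridge(X, y, lam):
    """Closed-form ridge estimator (X^T X + n*lam*I)^{-1} X^T y."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

for lam in [1e-4, 1e-2, 1e0]:
    w_hat = ridge(X_train, y_train, lam)
    train_err = np.mean((X_train @ w_hat - y_train) ** 2)
    test_err = np.mean((X_test @ w_hat - y_test) ** 2)
    print(f"lambda={lam:.0e}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```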
Custom handouts prepared for this course summarize the key results and take-home messages from each lecture, and provide quick-reference formulas for concentration inequalities, bias–variance decomposition, regression solutions, and generalization bounds.
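As an example of the kind of quick-reference formulas involved, two standard statements (given here from standard sources, not quoted from the handouts) are Hoeffding's inequality for bounded i.i.d. variables and the bias–variance decomposition of the expected squared error at a point:

$$
\mathbb{P}\!\left(\frac{1}{n}\sum_{i=1}^n X_i - \mathbb{E}[X_1] \ge t\right) \le \exp\!\left(-\frac{2nt^2}{(b-a)^2}\right), \qquad X_i \in [a,b] \ \text{i.i.d.},
$$

$$
\mathbb{E}\big[(\hat f(x) - f(x))^2\big] \;=\; \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2} \;+\; \underbrace{\mathrm{Var}\big(\hat f(x)\big)}_{\text{variance}}.
$$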
Yaser S. Abu-Mostafa, Learning from Data
A classic Caltech course, taught by Feynman Prize winner Prof. Yaser Abu-Mostafa. The 18 recorded lectures, together with incremental slide decks, provide a highly intuitive introduction to fundamental learning concepts: generalization, bias–variance, linear models, overfitting, regularization, support vector machines, kernel methods, and neural networks. This is an excellent resource for pre-class preparation or conceptual reinforcement.
Roman Vershynin, High-Dimensional Probability: An Introduction with Applications in Data Science
This book develops the modern probabilistic toolkit that underpins much of learning theory. It covers sub-Gaussian and sub-exponential random variables, concentration inequalities, covariance estimation, the Hanson–Wright inequality, and matrix Bernstein bounds. These tools provide the high-dimensional probability background needed for non-asymptotic analysis of regression, stochastic gradient methods, and kernel methods.
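As a representative example of this toolkit (a standard definition, not quoted from the book): a mean-zero random variable $X$ is sub-Gaussian with variance proxy $\sigma^2$ when its moment generating function satisfies the bound below, which via the Chernoff argument immediately yields a Gaussian-type tail.

$$
\mathbb{E}\big[e^{\lambda X}\big] \le e^{\lambda^2 \sigma^2 / 2} \ \ \text{for all } \lambda \in \mathbb{R}
\quad\Longrightarrow\quad
\mathbb{P}\big(|X| \ge t\big) \le 2\, e^{-t^2/(2\sigma^2)}.
$$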
Felipe Cucker & Ding-Xuan Zhou, Learning Theory: An Approximation Theory Viewpoint
This book offers a complementary perspective rooted in functional analysis and approximation theory. It explores the decomposition of error into approximation and estimation terms, approximation spaces, RKHS theory, universality of kernels, and detailed convergence rates. It provides a rigorous theoretical foundation for understanding learning as an approximation process in infinite-dimensional spaces.
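The central decomposition in this viewpoint can be written as follows, where $\hat f$ is the learned predictor, $f_{\mathcal{H}}$ the best predictor in the hypothesis space $\mathcal{H}$, and $f^*$ the Bayes-optimal predictor (a standard statement of the decomposition, not a quotation from the book):

$$
\mathcal{E}(\hat f) - \mathcal{E}(f^*) \;=\;
\underbrace{\mathcal{E}(\hat f) - \mathcal{E}(f_{\mathcal{H}})}_{\text{estimation error}}
\;+\;
\underbrace{\mathcal{E}(f_{\mathcal{H}}) - \mathcal{E}(f^*)}_{\text{approximation error}}.
$$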
Felipe Cucker & Steve Smale, On the Mathematical Foundations of Learning
A seminal survey article published in the Bulletin of the American Mathematical Society (2002). The paper outlines a unifying view of machine learning, highlighting the interaction of approximation theory, statistical estimation, and computational complexity. It is an excellent conceptual introduction to the mathematical landscape of learning theory, and a highly recommended complement to the more detailed textbooks.
Tomaso Poggio & Steve Smale, The Mathematics of Learning: Dealing with Data
Published in the Notices of the American Mathematical Society (2003, Vol. 50, No. 5, pp. 537–544).
This short survey introduces the mathematical challenges of learning from data, emphasizing the interplay between approximation theory, probability, and computational complexity. It complements the other foundational readings by providing a broad conceptual overview aimed at mathematicians.
Other references and discussion notes from a previous reading group on learning theory can be found at:
Bach’s book is the primary guide for lectures, exercises, and the overall structure of the course.
Abu-Mostafa’s lectures are highly recommended for conceptual reinforcement; they give accessible introductions to many of the same topics.
Vershynin’s text provides the probabilistic foundations needed for a deeper understanding of generalization bounds, SGD analysis, and random matrix concentration.
Cucker & Zhou give the approximation-theoretic perspective, which deepens understanding of RKHS, kernel methods, and approximation rates beyond the probabilistic view.
Cucker & Smale’s article is a conceptual map of the field, showing how approximation, probability, and complexity interact.
Poggio & Smale’s article is a concise, conceptual introduction, ideal for situating learning theory within the broader mathematical sciences.