MT 875: Graduate topics in deep learning theory
Fall 2023
Spacetime coordinates: WF, 2-3:15, Maloney 560
Instructor: Eli Grigsby (she/her)
E-mail address: Delete all but one of the j's and replace "at" and "dot" with their symbols in the following expression: grigsbyjjjjjjjj(at)bc(dot)edu. (IMPORTANT: If you accidentally delete ALL the j's, your e-mail will accidentally go to someone that isn't me.)
Office hours: Maloney 522, by appointment (feel free to sign up for multiple adjacent slots, and e-mail me with your rough availability if you don't see a time that works for you)
Course description: This is a course on the mathematical and statistical foundations of learning theory, with an emphasis on applications to neural networks and gradient-based learning algorithms. As my background is in geometric topology, I will focus primarily on the portions of the subject where geometry, topology, and combinatorics play a prominent role.
The course will be divided into thirds:
1/3) Intro to supervised learning, computational learning theory, PAC learning framework: I will follow the first few chapters of Kearns and Vazirani's An Introduction to Computational Learning Theory.* Our goal here will be to understand how the classical notions of VC dimension, Rademacher complexity, and their cousins provide a means for quantifying the degree to which learning with a particular function class and finite sample data affects the ability of the learned model to generalize to unseen data drawn from the same distribution.
2/3) Feedforward ReLU neural networks as a function class: I will prove basic properties of this function class (e.g., that it is precisely the class of continuous piecewise linear functions with finitely many pieces, and hence a universal approximator) and develop the foundations for calculating some of the classical statistical notions of complexity, along with some newer geometric and topological notions of complexity. We will also dig into an established relationship to tropical geometry.
3/3) Additional topics, as time permits: Neural tangent kernel and neural networks in the infinite-width limit, overparameterized networks and double descent, geometry of the loss landscape, the role of step size and noise in gradient dynamics, reinforcement learning, the role of attention in transformer networks for large language models
Prerequisites: No prior background in machine learning or learning theory will be assumed, but I will assume a degree of mathematical maturity. It would be helpful if you've taken the introductory grad G/T sequence or its equivalent (algebraic and differential topology).
Course work: I will ask all registered students to choose a topic and either 1) produce one detailed lecture or tutorial to be given in class or posted on-line, or 2) produce some experiments, along with a visualization and brief blurb to be presented in class or posted on-line.
I include below a list of references I plan to use this semester. If I add more later, I will do so at this Google doc.
PAC Learning, Generalization, & Statistical Notions of Complexity:
Kearns, Vazirani: An Introduction to Computational Learning Theory*
Sontag, E.: VC dimension of neural networks
Shalev-Shwartz, Ben-David: Understanding Machine Learning: From Theory to Algorithms (Part I)
Ovchinnikov, Sergei: Max-min representations of piecewise-linear functions
Basics of supervised learning and geometry of ReLU networks:
M. Nielsen, Neural networks and deep learning
Grigsby, Lindsey: On transversality of bent hyperplane arrangements and the topological expressiveness of ReLU neural networks
M. Masden: Algorithmic determination of the combinatorial structure of the linear regions of ReLU networks
Grigsby, Lindsey, Meyerhoff, Wu: Functional dimension of ReLU networks
Grigsby, Lindsey, Rolnick: Hidden symmetries of ReLU networks
Neural tangent kernel and Gaussian processes:
See references and code at this github page
Tropical geometry and ReLU networks:
Zhang, L. et al.: Tropical geometry of deep neural networks
Haase, C. et al.: Lower bounds on the depth of integral ReLU neural networks via lattice polytopes
Implicit regularization and geometry of loss landscape:
Y. Cooper: Loss landscape of overparameterized networks
Damian, Lee, Ma: Label noise SGD provably prefers flat global minimizers
Attention & Transformer networks:
Vaswani, A. et al.: Attention is all you need
Other useful (for me) background references on geometric topology:
Glen Bredon, Topology and Geometry
Victor Guillemin and Alan Pollack, Differential Topology
Allen Hatcher, Algebraic Topology*
*These texts may or may not be available in electronic form! Email me for assistance if you have trouble obtaining them.
What we actually did:
Week 1: Background on supervised learning problems, decision/optimization problems, probability. Definition: PAC learnable concept classes. The concept class of axis-aligned rectangles in R^2 is efficiently PAC learnable; References: Sections 1.1-1.2, Kearns-Vazirani
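For concreteness, here is a minimal Python sketch of the tightest-fit learning algorithm for axis-aligned rectangles from Section 1.1 of Kearns-Vazirani (the toy target rectangle, sampling distribution, and variable names are just for illustration, not from the text):

    import numpy as np

    def tightest_fit(samples, labels):
        """Smallest axis-aligned rectangle containing all positive examples,
        as (lower-left corner, upper-right corner); None if there are no positives."""
        pos = samples[labels == 1]
        return None if len(pos) == 0 else (pos.min(axis=0), pos.max(axis=0))

    def predict(rect, x):
        """Label x positive iff it lies inside the learned rectangle."""
        return 0 if rect is None else int(np.all(rect[0] <= x) and np.all(x <= rect[1]))

    # Toy run: the target concept is the rectangle [0,1] x [0,1] in R^2.
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 2, size=(200, 2))
    y = ((X >= 0) & (X <= 1)).all(axis=1).astype(int)
    rect = tightest_fit(X, y)

    X_test = rng.uniform(-1, 2, size=(5000, 2))
    y_test = ((X_test >= 0) & (X_test <= 1)).all(axis=1).astype(int)
    err = np.mean(np.array([predict(rect, x) for x in X_test]) != y_test)
    print("learned rectangle:", rect, " test error:", err)

The learned rectangle always sits inside the target, so the only errors are false negatives in the four strips between the two boundaries; the PAC analysis bounds the probability mass of those strips.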
Week 2: Representations of concepts and size; The concept class of conjunctions of boolean literals is efficiently PAC learnable; Definition: Occam learnable concept classes and relation to PAC learnable concept classes (for finite instance spaces); References: Sections 1.2-1.3, 2.1, Kearns-Vazirani
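And a similarly minimal Python sketch of the elimination algorithm for conjunctions of boolean literals (the toy target and names are again just for illustration):

    import itertools, random

    def learn_conjunction(samples, labels):
        """Elimination algorithm: start with all 2n literals and let each positive
        example delete the literals it falsifies.  A literal is encoded as (i, s),
        meaning x_i if s == 1 and NOT x_i if s == 0."""
        n = len(samples[0])
        literals = {(i, s) for i in range(n) for s in (0, 1)}
        for x, y in zip(samples, labels):
            if y == 1:
                literals -= {(i, 1 - x[i]) for i in range(n)}
        return literals

    def evaluate(literals, x):
        return int(all(x[i] == s for (i, s) in literals))

    # Toy run: target concept x_0 AND (NOT x_2) on 4 boolean variables.
    random.seed(0)
    target = lambda x: int(x[0] == 1 and x[2] == 0)
    train = [tuple(random.randint(0, 1) for _ in range(4)) for _ in range(30)]
    h = learn_conjunction(train, [target(x) for x in train])

    # Literals appearing in the target are never deleted, so the hypothesis is at
    # least as restrictive as the target and never errs on its negative examples:
    print(h)
    print(all(evaluate(h, x) <= target(x) for x in itertools.product((0, 1), repeat=4)))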
Week 3: Shattered sets, Vapnik-Chervonenkis (VC) dimension and many examples; extracting boolean-valued concept classes from real-valued concept classes; Definition: Polyhedral set (intersection of finitely many closed affine-linear half-spaces aka the solution set of finitely many affine-linear inequalities); References: Sections 3.1-3.3, Kearns-Vazirani; Sections 1-3, Sontag
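A brute-force shattering check is handy for small examples; the threshold family below is my own illustration (and, since only a finite list of concepts is supplied, the check can certify shattering but only exhibit, not prove, a failure to shatter):

    import numpy as np

    def is_shattered(points, concepts):
        """True iff every one of the 2^|points| dichotomies of `points` is realized
        by some concept in the (finite) list `concepts`."""
        realized = {tuple(int(c(p)) for p in points) for c in concepts}
        return len(realized) == 2 ** len(points)

    # Thresholds on the real line: h_t(x) = [x >= t].
    thresholds = [lambda x, t=t: x >= t for t in np.linspace(-2, 2, 401)]
    print(is_shattered([0.0], thresholds))        # True: a single point is shattered
    print(is_shattered([0.0, 1.0], thresholds))   # False: no threshold labels 0 positive and 1 negative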
Week 4: For concept classes obtained by thresholding a finite-dimensional vector space of real-valued functions, the VC dimension equals the dimension of the vector space. Sauer's lemma (relating the growth function, which counts realizable sign assignments, to the VC dimension) + an unexpected relationship to enumerating regions of (generic) hyperplane arrangements; any boolean function can be represented by a neural network with step activation and one hidden layer; References: Sections 1-3, Sontag
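For reference, the statement of Sauer's lemma in the form we used (here \Pi_C(m) is the growth function of the class C and d is its VC dimension; the shorthand \Phi_d is my own):

    \Pi_C(m) \le \Phi_d(m) := \sum_{i=0}^{d} \binom{m}{i}, \qquad \text{with } \Phi_d(m) \le \left(\frac{em}{d}\right)^{d} \text{ for } m \ge d.

The same quantity \Phi_d(m) counts the regions of a generic arrangement of m hyperplanes in R^d, which is the relationship to hyperplane arrangements alluded to above.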
Week 5: Difference classes, epsilon nets and the fundamental theorem of PAC learning: concept classes with finite VC dimension are PAC learnable from polynomially many samples (and efficiently so, given an efficient consistent-hypothesis finder); References: Section 3.5, Kearns-Vazirani
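Up to the absolute constants worked out in Kearns-Vazirani, the resulting sample size for a concept class of VC dimension d is

    m = O\!\left( \frac{1}{\epsilon}\log\frac{1}{\delta} + \frac{d}{\epsilon}\log\frac{1}{\epsilon} \right),

meaning that with probability at least 1 - \delta, any hypothesis in the class consistent with this many i.i.d. examples has error at most \epsilon.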
Week 6: An upper bound on the VC dimension of the post-composition of a fixed boolean function with k concept classes of finite VC dimension; understanding how regions of hyperplane arrangements in parameter space index sign assignments for a finite set in the instance space
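A crude version of the bound follows from Sauer's lemma alone: writing d for the largest of the k VC dimensions (and taking m \ge d), the growth function of the composed class is at most the product of the individual growth functions, so shattering m points forces

    2^m \le \prod_{i=1}^{k} \Pi_{C_i}(m) \le \left( \frac{em}{d} \right)^{dk}, \qquad \text{hence} \qquad \mathrm{VCdim} = O\!\left(dk\log(dk)\right).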
Week 7: Zaslavsky's theorem and a combinatorial formula for the upper bound on the growth function of sign sequences on m points in R^{d} for the concept class of open affine-linear half-spaces; Introduction to understanding the topology and geometry of decision regions for functions in the concept class of feedforward ReLU neural networks; References: Grunert: Piecewise-linear Morse theory, G-Lindsey: "On transversality...", Masden: "Algorithmic determination..."
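Concretely, the two counts (both stated in the general-position case, which is where the maximum, and hence the upper bound on the growth function, is attained):

    \#\{\text{regions of an arrangement of } m \text{ generic affine hyperplanes in } \mathbb{R}^d\} = \sum_{i=0}^{d} \binom{m}{i},

    \#\{\text{sign sequences realized on } m \text{ generic points by open affine half-spaces in } \mathbb{R}^d\} = 2\sum_{i=0}^{d} \binom{m-1}{i}.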
Week 8: Polyhedral sets, polyhedral complexes and maps that are linear on cells, level set complexes, transversality and supertransversality for polyhedral complexes; the canonical polyhedral complex associated to a ReLU neural network function; Masden's sign sequence cubical complex for computing decision regions and boundaries. References: Grunert: Piecewise-linear Morse theory, G-Lindsey: "On transversality...", Masden: "Algorithmic determination...", G-Lindsey-Masden: "Local and global topological complexity measures..."
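A crude numerical illustration of the sign-sequence point of view (this samples a grid rather than building Masden's cubical complex, so it only lower-bounds the number of linear regions; the architecture, window, and seed are arbitrary choices for the sake of the example):

    import numpy as np

    rng = np.random.default_rng(1)

    # A small random ReLU network on R^2 with hidden layers of widths 4 and 3
    # (the output layer is omitted; it does not affect the hidden-layer regions).
    W1, b1 = rng.standard_normal((4, 2)), rng.standard_normal(4)
    W2, b2 = rng.standard_normal((3, 4)), rng.standard_normal(3)

    def sign_sequence(x):
        """Signs of all hidden-layer preactivations at x (the combinatorial data
        attached to x in the sign-sequence formalism)."""
        z1 = W1 @ x + b1
        z2 = W2 @ np.maximum(z1, 0) + b2
        return tuple(np.sign(np.concatenate([z1, z2])).astype(int))

    # Points sharing a sign sequence lie in a common linear region, so counting
    # distinct sign sequences seen on a fine grid lower-bounds the number of
    # linear regions meeting the window [-3, 3]^2.
    grid = np.linspace(-3, 3, 300)
    regions = {sign_sequence(np.array([x, y])) for x in grid for y in grid}
    print(len(regions), "distinct sign sequences observed")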
Week 9: Almost all (away from a set of Lebesgue measure 0 in parameter space) ReLU networks are generic & supertransversal; Symmetries of parameter space for ReLU networks and functional dimension as a local complexity measure; Inhomogeneity of functional dimension over parameter space. References: G-Lindsey-Meyerhoff-Wu: "Functional dimension...", G-Lindsey-Rolnick: "Hidden symmetries..."
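A numerical sketch of batch functional dimension for a toy (2, 3, 1) ReLU architecture (the batch size, finite-difference step, and rank tolerance below are ad hoc choices for illustration): the functional dimension at a parameter is the rank of the Jacobian of the map sending parameters to the batch of output values.

    import numpy as np

    rng = np.random.default_rng(2)
    n0, n1 = 2, 3                      # input dimension, hidden width
    P = n1 * n0 + n1 + n1 + 1          # 13 parameters in total

    def f(theta, x):
        """Output of the (n0, n1, 1) ReLU network with parameters packed into theta."""
        W1 = theta[:n1 * n0].reshape(n1, n0)
        b1 = theta[n1 * n0 : n1 * n0 + n1]
        w2 = theta[n1 * n0 + n1 : n1 * n0 + 2 * n1]
        b2 = theta[-1]
        return w2 @ np.maximum(W1 @ x + b1, 0) + b2

    # Jacobian of theta |-> (f(theta, x_1), ..., f(theta, x_m)) for a generic batch,
    # estimated by central finite differences.
    theta = rng.standard_normal(P)
    X = rng.standard_normal((40, n0))
    eps = 1e-6
    J = np.zeros((len(X), P))
    for j in range(P):
        e = np.zeros(P)
        e[j] = eps
        J[:, j] = [(f(theta + e, x) - f(theta - e, x)) / (2 * eps) for x in X]

    # Each hidden unit's positive rescaling symmetry (scale its incoming weights
    # and bias by c > 0 and its outgoing weight by 1/c) contributes a kernel
    # direction, so the rank is at most P - n1 = 10; when the bound is attained,
    # and how the rank varies over parameter space, is the subject of G-L-M-W.
    print("numerical rank:", np.linalg.matrix_rank(J, tol=1e-4))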
Week 10: Agnostic PAC learnability, uniform convergence; Detailed derivation of a generalization bound for 0-1 loss and binary classification on a class with finite VC dimension, using Hoeffding's inequality. References: Chapters 4 & 6 of Understanding Machine Learning: From Theory to Algorithms
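For the record, the two ingredients (constants as in Shalev-Shwartz & Ben-David; the VC-dimension version is stated only up to its order of magnitude). Hoeffding's inequality, for i.i.d. random variables Z_1, ..., Z_m with values in [0, 1]:

    \mathbb{P}\!\left( \left| \frac{1}{m}\sum_{i=1}^{m} Z_i - \mathbb{E}[Z_1] \right| \ge \epsilon \right) \le 2 e^{-2m\epsilon^2}.

A union bound over a finite class H then gives: with probability at least 1 - \delta over an i.i.d. sample S of size m,

    |L_D(h) - L_S(h)| \le \sqrt{ \frac{\log(2|H|/\delta)}{2m} } \quad \text{for every } h \in H,

and when H instead has finite VC dimension d, the \log|H| term is replaced (via the growth function, Sauer's lemma, and a symmetrization argument) by a term of order d\log(m/d).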
Week 11: Regularization, stability, and strong convexity; Stable learning algorithms don't overfit; Kernels, Hilbert spaces, and the representer theorem. References: Chapters 13 & 16 of Understanding Machine Learning: From Theory to Algorithms
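A minimal numerical illustration of the representer theorem in the special case of kernel ridge regression (the kernel, regularization strength, and toy data are arbitrary choices for illustration): the regularized empirical risk minimizer over the RKHS can be written as f(x) = sum_i alpha_i k(x_i, x), so it suffices to solve a finite linear system for alpha.

    import numpy as np

    rng = np.random.default_rng(3)

    def rbf_kernel(A, B, gamma=1.0):
        """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)

    # Kernel ridge regression: minimize (1/m) * sum_i (f(x_i) - y_i)^2 + lam * ||f||_H^2.
    # By the representer theorem the minimizer lies in span{k(x_i, .)}, and the
    # coefficients solve (K + lam * m * I) alpha = y.
    X = rng.uniform(-3, 3, size=(50, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
    lam = 0.1
    K = rbf_kernel(X, X)
    alpha = np.linalg.solve(K + lam * len(X) * np.eye(len(X)), y)

    X_test = np.linspace(-3, 3, 7)[:, None]
    f_test = rbf_kernel(X_test, X) @ alpha      # f(x) = sum_i alpha_i k(x_i, x)
    print(np.round(f_test, 2))
    print(np.round(np.sin(X_test[:, 0]), 2))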
Week 12: Neural networks in the infinite width limit as Gaussian processes, Neural tangent kernel and lazy training in the infinite width limit. References: Jacot-Gabriel-Hongler: "NTK: Convergence and Generalization...," Hanin: "Random NNs in the Infinite Width Limit...", Hanin: "Neural networks at large depth and width"
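A back-of-the-envelope check of NTK concentration for a one-hidden-layer network in the NTK parameterization (the widths, inputs, and seed are arbitrary; the parameter gradients are written out by hand rather than with an autodiff library):

    import numpy as np

    rng = np.random.default_rng(4)

    def empirical_ntk(x1, x2, W, v):
        """Empirical NTK <grad_theta f(x1), grad_theta f(x2)> for
        f(x) = (1/sqrt(n)) * sum_j v_j * relu(w_j . x)   (NTK parameterization)."""
        n = len(v)
        a1, a2 = np.maximum(W @ x1, 0), np.maximum(W @ x2, 0)              # hidden activations
        d1, d2 = (W @ x1 > 0).astype(float), (W @ x2 > 0).astype(float)    # ReLU derivatives
        term_v = a1 @ a2                                  # from gradients w.r.t. v
        term_W = (v**2 * d1 * d2).sum() * (x1 @ x2)       # from gradients w.r.t. W
        return (term_v + term_W) / n

    x1 = np.array([1.0, 0.5])
    x2 = np.array([-0.3, 1.2])
    for n in (10, 100, 10_000):
        W = rng.standard_normal((n, 2))
        v = rng.standard_normal(n)
        print(n, empirical_ntk(x1, x2, W, v))
    # As the width n grows, the empirical kernel at initialization concentrates
    # around the deterministic infinite-width NTK of Jacot-Gabriel-Hongler.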
Exercises that may be useful for learning this material will appear here.