Spacetime coordinates:
Aug. 26-30: MW 1-2:15, Boston College, Maloney 536
Sep. 3-Sep. 5: Harvard CMSA Math&ML program opening workshop, 20 Garden St, Cambridge, Room G10
Sep. 9-Oct. 25: TTh 2:30-3:45, Harvard CMSA Math&ML program, 20 Garden St, Cambridge, Room G10
Oct. 28-Dec. 6: WF 1:30-2:45, Boston College, 245 Beacon St, Room 104
Instructor: Eli Grigsby (she/her)
Here is the YouTube playlist of the CMSA lectures, updated periodically.
Course description: This is a course on geometric aspects of deep learning theory. Broadly speaking, we'll investigate the question: How might human-interpretable concepts be expressed in the geometry of their data encodings, and how does this geometry interact with the computational units and higher-level algebraic structures in various parameterized function classes, especially neural network classes? During the portion of the course (Sep. 3-Nov. 1) presented as part of the Math and Machine Learning program, we will explore the current state of research on mechanistic interpretability of transformers, the architecture underlying large language models like ChatGPT.
I will begin with a brief introduction to the basics of supervised learning using neural networks, and then discuss:
the geometry and combinatorics of feedforward ReLU neural networks as piecewise linear function classes
neural networks are universal approximators: discrete and non-discrete versions
the role of the superposition hypothesis in mechanistic interpretability of neural networks
neural network architectures for sequence-to-sequence processing: recurrent neural networks and transformers
representing finite state automata using sequence-to-sequence architectures
geometric distortion in deep networks and the importance of residual connections
symmetries of overparameterized ReLU neural networks, optimization, and generalization
algorithmic computation of topological invariants of decision boundaries/regions for feedforward ReLU neural networks
ReLU neural networks and tropical geometry (time permitting)
Prerequisites: This course is primarily targeted at graduate students and advanced undergraduates in mathematics and theoretical computer science. No prior background in machine learning or learning theory will be assumed, but I will assume a degree of mathematical maturity. I will assume as a baseline that you've mastered the standard undergraduate mathematics sequence (linear algebra, algebra, analysis), and it would be very helpful if you've been through a standard introductory graduate sequence in geometry and topology (algebraic topology from e.g. Hatcher, differential geometry from e.g. Lee or Guillemin-Pollack).
Course work: I will ask all registered students at Boston College to choose a topic and either 1) produce one detailed lecture or tutorial to be given in class or posted on-line, or 2) produce some experiments, along with a visualization and brief blurb to be presented in class or posted on-line.
E-mail address: Delete all but one of the j's and replace "at" and "dot" with their symbols in the following expression: grigsbyjjjjjjjj(at)bc(dot)edu. (IMPORTANT: If you accidentally delete ALL the j's, your e-mail will accidentally go to someone that isn't me.)
Office hours: By appointment (e-mail me, or just find me before or after class to make an appointment)
I include below a list of references I plan to use this semester. If I add more later, I will do so at this Google doc:
Introductions to learning theory for neural networks:
Nielsen, M.: Neural networks and deep learning
Shalev-Shwartz, S., Ben-David, S.: Part I of Understanding Machine Learning: From Theory to Algorithms
Sontag, E.: VC dimension of neural networks
Anthony, M. and Bartlett, P.: Neural network learning: theoretical foundations*
Kearns, Vazirani: An introduction to computational learning theory*
Geometry of feedforward ReLU networks and piecewise linear function classes:
Hanin, B.: Universal function approximation by deep neural nets with bounded width and ReLU activations
Grigsby, Lindsey: On transversality of bent hyperplane arrangements and the topological expressiveness of ReLU neural networks
Masden, M.: Algorithmic determination of the combinatorial structure of the linear regions of ReLU networks
Grigsby, Lindsey, Meyerhoff, Wu: Functional dimension of ReLU networks
Ovchinnikov, S.: Max-min representations of piecewise-linear functions
Mechanistic interpretability, superposition, and transformers:
Rudin, C., et al.: Interpretable ML: Fundamental Principles and Grand Challenges
Anthropic, Elhage, et al. (Dec. '21): A mathematical framework for transformer circuits
Anthropic, Elhage, et al. (Sep. '22): Toy Models of Superposition
Anthropic, Bricken, et al. (Oct. '23): Towards Monosemanticity
Anthropic, Templeton, et al. (May '24): Scaling Monosemanticity
Liu, Zhang, et al.: Transformers learn shortcuts to automata
Feng, Steinhardt: How do language models bind entities in context?
Representing finite state automata with sequence-to-sequence architectures:
Siegelmann, H.: Recurrent neural networks and finite automata
Liu, Ash, Goel, Krishnamurthy, Zhang: Transformers learn shortcuts to automata
Tropical geometry and ReLU networks:
Zhang, L., et al.: Tropical geometry of deep neural networks
Haase, C., et al.: Lower bounds on the depth of integral ReLU neural networks via lattice polytopes
Wang, Sun: Generalization of hinging hyperplanes
Other useful (for me) background references on geometric topology:
Glen Bredon, Topology and Geometry
Victor Guillemin and Alan Pollack, Differential Topology*
Allen Hatcher, Algebraic Topology*
*These texts may or may not be available in electronic form! Email me for assistance if you have trouble obtaining them.
What we actually did:
Week 1: Crash course in supervised learning, introduction to feedforward ReLU neural networks as a parameterized function class, perceptrons as simple binary classifiers, universal representation theorem for finite PL functions on the closed unit cube, as presented in B. Hanin's 2017 paper. For some history about universal approximation results and some of the older literature on the effects of architecture on expressiveness, depth separation, etc., check out the notes and references from Lectures 1-3 of E. Mossel's 2017 MIT course Mathematics of Deep Learning.
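To fix notation for later weeks, here is a minimal numpy sketch (my own, with arbitrary toy widths and random weights, not code from any reference) of a feedforward ReLU network as a parameterized function class, together with a perceptron-style binary classifier obtained by thresholding an affine map.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def feedforward_relu(x, weights, biases):
    # weights[i], biases[i] define layer i; hidden layers use ReLU,
    # and the output layer is affine (no activation).
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(W @ a + b)
    return weights[-1] @ a + biases[-1]

def perceptron(x, w, b):
    # A perceptron-style binary classifier: the sign of an affine function.
    return np.sign(w @ x + b)

rng = np.random.default_rng(0)
widths = [2, 4, 3, 1]        # toy architecture (n_0, n_1, n_2, n_3)
Ws = [rng.standard_normal((widths[i + 1], widths[i])) for i in range(len(widths) - 1)]
bs = [rng.standard_normal(widths[i + 1]) for i in range(len(widths) - 1)]
print(feedforward_relu(np.array([0.5, -1.0]), Ws, bs))
print(perceptron(np.array([0.5, -1.0]), np.array([1.0, 2.0]), -0.5))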
Week 2: Universal representation theorem (discrete version): every binary function can be represented by a neural network with sign activation, via disjunctive normal form. Introduced polyhedral complexes in preparation for presenting (next week) the following: for almost all parameters, there is an algorithm, beginning with the discretization of a ReLU neural network by a sign neural network, that computes the Z/2Z homology groups of all decision regions and decision boundaries of the associated ReLU neural network. See, e.g., Grunbaum's Convex polytopes for background on polyhedral sets and Stanley's Introduction to hyperplane arrangements for background on polyhedral sets and hyperplane arrangements.
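As an illustration of the discrete universal representation theorem, here is a small sketch (my own construction in the spirit of the DNF argument, not code from any reference) that builds, for any binary function on {0,1}^n taking the value 1 at least once, a depth-2 threshold ("sign"-type) network whose hidden units are the AND-clauses of its disjunctive normal form.

import itertools
import numpy as np

def step(z):
    # Heaviside-style threshold; stands in for the sign activation.
    return (np.asarray(z) >= 0).astype(float)

def dnf_threshold_net(f, n):
    # One hidden unit per point of f^{-1}(1); each unit fires exactly on its point.
    # (Assumes f is not identically zero.)
    ones = [np.array(x, dtype=float) for x in itertools.product([0, 1], repeat=n) if f(x) == 1]
    W = np.array([2 * x - 1 for x in ones])           # weights in {-1, +1}
    b = np.array([0.5 - x.sum() for x in ones])       # unit fires iff the input equals its point
    def net(x):
        h = step(W @ np.asarray(x, dtype=float) + b)  # AND layer (the DNF clauses)
        return float(h.sum() >= 0.5)                  # OR layer: at least one clause fires
    return net

xor = lambda x: int(x[0] != x[1])
net = dnf_threshold_net(xor, 2)
assert all(net(x) == xor(x) for x in itertools.product([0, 1], repeat=2))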
Week 3: Defined and discussed the right notion of a typical ReLU neural network, in terms of transversality of the cells (polyhedral sets) in a canonical polyhedral complex described as a level set complex in the terminology of Grunert's Piecewise Linear Morse Theory, cf. this paper. Described Masden's main result in Algorithmic determination of the combinatorial structure of the linear regions of ReLU networks: that for all but a measure zero subset of parameters, the Z/2Z homology groups of every decision region and boundary of every non-input neuron can be algorithmically computed as an imbedded subcomplex of the cube complex, [-1,1]^N (where N is the number of non-input neurons).
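This is not Masden's algorithm itself, but a toy numpy sketch of the labeling it is built on: each input point is assigned the vector of signs of the pre-activations of the N non-input neurons, which one can view as a vertex (or lower-dimensional face) of the cube [-1,1]^N; roughly speaking, points in the interior of a common cell of the canonical polyhedral complex share a label. Architecture and weights below are arbitrary toy choices.

import numpy as np

def sign_vector(x, weights, biases):
    # Record the sign (-1, 0, or +1) of every non-input neuron's pre-activation.
    signs, a = [], x
    for W, b in zip(weights, biases):
        z = W @ a + b
        signs.append(np.sign(z))
        a = np.maximum(z, 0.0)          # ReLU; the final layer's output is not used here
    return tuple(np.concatenate(signs))

rng = np.random.default_rng(1)
widths = [2, 3, 3, 1]                   # toy architecture with N = 7 non-input neurons
Ws = [rng.standard_normal((widths[i + 1], widths[i])) for i in range(len(widths) - 1)]
bs = [rng.standard_normal(widths[i + 1]) for i in range(len(widths) - 1)]

grid = [np.array([s, t]) for s in np.linspace(-2, 2, 60) for t in np.linspace(-2, 2, 60)]
print(len({sign_vector(x, Ws, bs) for x in grid}), "distinct sign vectors on the sample")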
Week 4: Briefly introduced N-gram language models following Jurafsky & Martin; gave a high-level introduction to the transformer architecture following Elhage et al., and a partial explanation of their statement that "Zero-layer transformers model bigram statistics." Described how to compute and understand attention heads from low-rank Query, Key, Value, and Output matrices (all with trainable parameters). For each attention head, the application of the attention pattern and the computation of the token values are linear, and act on the right and the left of the imbedded token matrix, respectively (hence commute). Compositions of attention heads across layers can therefore be viewed as "virtual" attention heads.
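Here is a numpy sketch of that linearity/composition observation, using the convention that the rows of the token matrix X are positions (so the attention pattern acts on the left and W_V W_O on the right; the lecture's convention may place them on the opposite sides). Residual connections and layer normalization are ignored, and all dimensions and weights are arbitrary toy choices. With the attention patterns held fixed, two heads applied in sequence agree with a single "virtual" head.

import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pattern(X, W_Q, W_K):
    # Row-stochastic attention pattern computed from queries and keys.
    scores = (X @ W_Q) @ (X @ W_K).T / np.sqrt(W_Q.shape[1])
    return softmax(scores, axis=-1)

def head_weights(rng, d_model, d_head):
    W_Q = rng.standard_normal((d_model, d_head))
    W_K = rng.standard_normal((d_model, d_head))
    W_V = rng.standard_normal((d_model, d_head))
    W_O = rng.standard_normal((d_head, d_model))
    return W_Q, W_K, W_V, W_O

rng = np.random.default_rng(2)
n_tok, d_model, d_head = 5, 8, 4
X = rng.standard_normal((n_tok, d_model))            # rows = token positions
W_Q1, W_K1, W_V1, W_O1 = head_weights(rng, d_model, d_head)
W_Q2, W_K2, W_V2, W_O2 = head_weights(rng, d_model, d_head)

A1 = attention_pattern(X, W_Q1, W_K1)                # mixes positions (left action)
H1 = A1 @ X @ W_V1 @ W_O1                            # mixes features (right action)
A2 = attention_pattern(H1, W_Q2, W_K2)

# With A1 and A2 frozen, applying the two heads in sequence equals a single
# "virtual" head: (A2 @ A1) on positions and (W_V1 @ W_O1 @ W_V2 @ W_O2) on features.
sequential = A2 @ H1 @ W_V2 @ W_O2
virtual = (A2 @ A1) @ X @ (W_V1 @ W_O1 @ W_V2 @ W_O2)
assert np.allclose(sequential, virtual)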
Week 5: Interpretable models have neurons aligned with human-interpretable concepts. What is meant by this statement? We then gave a mathematical presentation of the superposition hypothesis, following this Circuits thread from Anthropic. (No class Thursday.)
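For concreteness, here is a numpy sketch of the setup in the toy-models-of-superposition framework (forward pass only, with random untrained weights; the feature count, hidden dimension, and sparsity level are made up): n sparse features are compressed into m < n hidden dimensions by a matrix W and reconstructed as ReLU(W^T W x + b), and superposition refers to the regime in which the columns of W are non-orthogonal but the resulting interference is tolerable because features rarely co-occur.

import numpy as np

rng = np.random.default_rng(3)
n_features, m_hidden, sparsity = 6, 2, 0.9      # arbitrary toy choices

W = rng.standard_normal((m_hidden, n_features))
W /= np.linalg.norm(W, axis=0)                  # unit-norm "feature directions" (columns)
b = np.zeros(n_features)

def toy_model(x):
    h = W @ x                                   # compress n_features -> m_hidden
    return np.maximum(W.T @ h + b, 0.0)         # reconstruct with a ReLU

# Sparse input: each feature is present with probability 1 - sparsity.
x = rng.uniform(0, 1, n_features) * (rng.uniform(size=n_features) > sparsity)
print("input             ", np.round(x, 2))
print("reconstruction    ", np.round(toy_model(x), 2))
print("interference W^T W:\n", np.round(W.T @ W, 2))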
Week 6: An introduction to parameter space symmetries, and the role they may play in explaining the double-descent phenomenon for modern parameterized function classes. Definition of global, local, and data-dependent symmetries in parameterized function classes for supervised learning. An enumeration of obvious global symmetries for feedforward neural networks, coming from the structure of the computational graph and the activation function, following this paper of Zhao, Ganev, Walters, Yu, Dehmamy. Relationship to the functional dimension, a local complexity measure defined here (see also this independent work), and explored further here. Brief mention of relationship between functional dimension and the neural tangent kernel.
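A quick numerical check (toy one-hidden-layer example of my own, with random weights) of two of the "obvious" global symmetries mentioned above: permuting the neurons of a hidden layer, and rescaling a hidden neuron's incoming weights and bias by c > 0 while dividing its outgoing weights by c, both leave the realized function unchanged, since ReLU(cz) = c ReLU(z) for c > 0.

import numpy as np

def net(x, W1, b1, W2, b2):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

rng = np.random.default_rng(4)
W1, b1 = rng.standard_normal((5, 3)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((2, 5)), rng.standard_normal(2)

# Scaling symmetry at hidden neuron 0: scale its row of (W1, b1) by c,
# divide its column of W2 by c.
c = 3.7
W1s, b1s, W2s = W1.copy(), b1.copy(), W2.copy()
W1s[0], b1s[0], W2s[:, 0] = c * W1[0], c * b1[0], W2[:, 0] / c

# Permutation symmetry: permute hidden rows of (W1, b1) and columns of W2 together.
perm = rng.permutation(5)
W1p, b1p, W2p = W1[perm], b1[perm], W2[:, perm]

x = rng.standard_normal(3)
assert np.allclose(net(x, W1, b1, W2, b2), net(x, W1s, b1s, W2s, b2))
assert np.allclose(net(x, W1, b1, W2, b2), net(x, W1p, b1p, W2p, b2))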
Week 7: Circuit complexity and the parity function. Representing the parity function and other symmetric binary functions using transformer architectures. We referenced Circuit complexity and Neural Networks by Parberry and reported on a construction in Inductive biases and variable creation in self-attention mechanisms by Edelman, Goel, Kakade, Zhang.
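The following toy computation (my own, and much cruder than the constructions in the references) isolates the key reduction: parity is a symmetric Boolean function, so it factors through the bit count, and the bit count is exactly what a single head with uniform attention can read off. The final readout below is a direct lookup on the count, standing in for the feedforward readout an actual transformer construction would use.

import itertools
import numpy as np

def parity_via_uniform_attention(bits):
    x = np.asarray(bits, dtype=float)
    n = len(x)
    attn = np.full(n, 1.0 / n)      # a single head attending uniformly to all positions
    mean = attn @ x                 # = (number of ones) / n
    count = int(round(mean * n))
    return count % 2                # symmetric readout of the count (here: a lookup)

for bits in itertools.product([0, 1], repeat=4):
    assert parity_via_uniform_attention(bits) == sum(bits) % 2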
Week 8: Discussed representing semiautomata using transformer architectures, following Transformers learn shortcuts to automata, by Liu, Ash, Goel, Krishnamurthy, and Zhang. Noted that causal masking allows for easy representation of uniform attention to past tokens, by setting all query-key weights to zero. Defined transformation semigroups of automata, defined and discussed trivial and split extensions of groups and how they arise from the composition series of a finite group.
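A numpy check of the remark about causal masking: if all query-key weights are zero, every pre-softmax score is 0, and the causal mask then makes each position attend uniformly to itself and all earlier positions, so such a head computes running prefix averages of the value vectors.

import numpy as np

def causal_uniform_attention(n_tok):
    scores = np.zeros((n_tok, n_tok))                      # all query-key weights zero
    mask = np.triu(np.ones((n_tok, n_tok)), k=1).astype(bool)
    scores[mask] = -np.inf                                 # causal mask: no attention to the future
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)                # softmax row by row

A = causal_uniform_attention(4)
expected = np.array([[1,   0,   0,   0],
                     [1/2, 1/2, 0,   0],
                     [1/3, 1/3, 1/3, 0],
                     [1/4, 1/4, 1/4, 1/4]])
assert np.allclose(A, expected)                            # uniform attention over each prefix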
Week 9: In the first half of the week, we discussed non-linear representation theory, as a framework for understanding the construction(s) of G-actions for a finite group G from its composition series (on the residual stream of a transformer), as described in Transformers learn shortcuts to automata. Gave a first pass at a definition of a (non-linear) G-representation and (non-linear) G-equivariant maps between G-representations. In the second half of the week, we started reading Equivariant neural networks and piecewise-linear representation theory by Gibson, Tubbenhauer, Williamson. The G-equivariant neural networks (for finite G) that they study are a special type of non-linear G-representation, where the G-reps themselves are linear and the non-linear G-equivariant maps are of a very particular form (the composition of a linear G-equivariant map and a uniformly-applied activation function).
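To have something concrete on the page, here is one way to write the equivariance condition from this week (my paraphrase of the first-pass definitions; the precise definitions used in lecture and in Gibson-Tubbenhauer-Williamson may differ in details):

\[
  f(\rho_V(g)\,v) \;=\; \rho_W(g)\,f(v) \qquad \text{for all } g \in G,\ v \in V,
\]

where a (possibly non-linear) G-representation is a homomorphism \rho_V from G to the group of piecewise-linear homeomorphisms of V (linear if each \rho_V(g) is a linear map), and f: V -> W is a piecewise-linear map. The Gibson-Tubbenhauer-Williamson networks are then the special case in which \rho_V and \rho_W are linear and f is a coordinatewise activation function composed with a linear G-equivariant map.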
Week 10: Continued lecturing on Equivariant neural networks and piecewise-linear representation theory by Gibson, Tubbenhauer, Williamson. For finite G, the dimension of the parameter space for a G-equivariant neural network (whose underlying graded vector space V is, by definition, the permutation module associated to the action of G on a privileged basis, B, of V) is understood as follows: G acts diagonally on BxB, and the G-orbits of this action index a basis for the space of G-equivariant linear maps V to V, so we have one parameter for every G-orbit of BxB. Proved that the component-wise application of a (non-odd) piecewise-linear activation function is G-equivariant with respect to a linear G-rep iff the G-rep is a permutation representation with respect to the privileged basis.
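Here is a small sanity check of that parameter count in a toy example of my own choosing: take G = Z/3Z acting on the basis B = {0, 1, 2} by cyclic shift, so that V is the 3-dimensional permutation module. The G-equivariant linear maps V to V are exactly the circulant matrices, a 3-dimensional space, and indeed the diagonal G-action on BxB has exactly 3 orbits.

import itertools

B = [0, 1, 2]
G = [lambda i, k=k: (i + k) % 3 for k in range(3)]   # G = Z/3Z acting on B by cyclic shift

def orbit(pair):
    # Orbit of (b, b') under the diagonal action g.(b, b') = (g.b, g.b').
    return frozenset((g(pair[0]), g(pair[1])) for g in G)

orbits = {orbit(p) for p in itertools.product(B, B)}
print(len(orbits), "G-orbits on B x B (= dim of the space of G-equivariant linear maps)")
for o in sorted(orbits, key=lambda s: sorted(s)):
    print(sorted(o))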
Week 11: Continued lecturing on Equivariant neural networks and piecewise-linear representation theory by Gibson, Tubbenhauer, Williamson. Discussed the more general piecewise-linear representation category they introduce (objects are linear G-representations, and morphisms are piecewise-linear G-equivariant maps). (Mostly) proved G-T-W's piecewise-linear Schur's Lemma: there is a non-zero morphism from an irreducible G-rep V to an irreducible G-rep W iff the kernel of the G-rep for V is contained in the kernel of the G-rep for W. In particular, faithful G-irreps admit non-zero PL morphisms to all other G-irreps. The constructive direction of the proof involves a clever imbedding of the regular G-representation into the PL endomorphisms of any faithful linear G-rep.
Week 12: Recalled (from this paper with Kathryn Lindsey) the definition of functional dimension for a piecewise polynomial parameterized function class, and introduced a few other notions of complexity for parameterized function classes that are relevant for establishing generalization bounds (i.e., for obtaining upper bounds on the difference between the true risk (error/loss) for a parameter and the empirical risk on a finite batch of data). These are the VC dimension and the pseudodimension, which are natural statistical measures of capacity of a binary-valued function class and real-valued function class, respectively. We proved that the (batch) functional dimension is a lower bound on the (batch) persistent pseudodimension, a local version of the pseudodimension. We also proved that the batch persistent pseudodimension is bounded above by the row rank (over the real numbers) of the algebraic Jacobian matrix on the batch.
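As a concrete (if crude) illustration of the first of these quantities, the following numpy sketch estimates the batch functional dimension of a toy ReLU network numerically, as the rank of a finite-difference Jacobian of the map sending the parameter vector to the vector of network outputs on a fixed batch; the architecture, batch size, step size, and rank tolerance are all arbitrary choices of mine.

import numpy as np

widths = [2, 3, 1]      # toy architecture

def unpack(theta):
    # Reshape a flat parameter vector into weight matrices and bias vectors.
    Ws, bs, i = [], [], 0
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        Ws.append(theta[i:i + n_in * n_out].reshape(n_out, n_in))
        i += n_in * n_out
        bs.append(theta[i:i + n_out])
        i += n_out
    return Ws, bs

def outputs_on_batch(theta, batch):
    # The map theta |-> (F_theta(x_1), ..., F_theta(x_k)), flattened to one vector.
    Ws, bs = unpack(theta)
    out = []
    for x in batch:
        a = x
        for W, b in zip(Ws[:-1], bs[:-1]):
            a = np.maximum(W @ a + b, 0.0)
        out.append(Ws[-1] @ a + bs[-1])
    return np.concatenate(out)

def batch_functional_dimension(theta, batch, eps=1e-6, tol=1e-4):
    # Numerical rank of the Jacobian of outputs_on_batch with respect to theta.
    f0 = outputs_on_batch(theta, batch)
    J = np.stack([(outputs_on_batch(theta + eps * e, batch) - f0) / eps
                  for e in np.eye(len(theta))], axis=1)   # one column per parameter
    return np.linalg.matrix_rank(J, tol=tol)

rng = np.random.default_rng(5)
n_params = sum(a * b + b for a, b in zip(widths[:-1], widths[1:]))   # 13 parameters here
theta = rng.standard_normal(n_params)
batch = [rng.standard_normal(widths[0]) for _ in range(20)]
print("batch functional dimension:", batch_functional_dimension(theta, batch))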
Week 13: Proved that for the class of ReLU networks we have another upper bound on the batch persistent pseudodimension, in terms of the rank of the activation matrix, which encodes the data of active neurons in the network, then explained why we conjecture that the functional dimension is equal to the persistent pseudodimension for ReLU network classes. Discussed how one obtains generalization bounds using the pseudodimension of a function class, following Chapters 4, 11, and 12 of Neural network learning: theoretical foundations by Anthony and Bartlett.
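For illustration only, here is one concrete way to tabulate "the data of active neurons in the network" on a batch: a binary matrix with one row per batch point and one column per hidden neuron, recording whether that neuron's pre-activation is positive. This is my guess at a convenient form of that data; the rank bound discussed in lecture refers to the precise activation-matrix definition in the relevant paper, which may differ.

import numpy as np

def activation_matrix(batch, Ws, bs):
    # One row per batch point; one column per hidden neuron; entry 1 iff that neuron is active.
    rows = []
    for x in batch:
        pattern, a = [], x
        for W, b in zip(Ws[:-1], bs[:-1]):     # hidden layers only
            z = W @ a + b
            pattern.append((z > 0).astype(float))
            a = np.maximum(z, 0.0)
        rows.append(np.concatenate(pattern))
    return np.array(rows)

rng = np.random.default_rng(6)
widths = [2, 4, 3, 1]
Ws = [rng.standard_normal((widths[i + 1], widths[i])) for i in range(len(widths) - 1)]
bs = [rng.standard_normal(widths[i + 1]) for i in range(len(widths) - 1)]
batch = [rng.standard_normal(2) for _ in range(25)]
M = activation_matrix(batch, Ws, bs)
print("activation matrix shape:", M.shape, " rank:", np.linalg.matrix_rank(M))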