# BC Math & Machine Learning Seminar

Fall 2023

11/21/23: The question of identifiability for ReLU neural networks, Joachim Bona-Pellissier (University of Toulouse)

The motivations for studying identifiability in the case of neural networks are diverse, ranging from the theoretical to the very practical. One is related to privacy and protection against model inversion attacks, another is related to interpretability, and yet another is related to obtaining theoretical guarantees for the optimization, such as good properties of the objective function and its local minima, or such as the reproducibility of the optimization process. Finally, an important motivation for studying identifiability is to characterize the complexity of the functions implemented by neural networks, a question related to the implicit regularization during the optimization process and to the generalization capabilities of neural networks. Intuitively, the more redundancies there are in the parameters, i.e. the less identifiable a network is, the less rich and complex the space of functions represented by the network is, and the more we can expect good generalization properties.

In this talk, I will present three results from my Ph.D. thesis. First, I will present two identifiability results for fully-connected feedforward ReLU neural networks. Then, I will present a result on geometry-induced implicit regularization that derives from this work.

Spring 2023

2/9/23: Symmetries of deep learning models and their internal representations, Charlie Godfrey (Pacific Northwest National Labs)

Abstract: Symmetry has been a fundamental tool in the exploration of a broad range of complex systems. In machine learning, symmetry has been explored in both models and data. We seek to connect the symmetries arising from the architecture of a family of models with the symmetries of that family’s internal representation of data. We do this by calculating a set of fundamental symmetry groups, which we call the intertwiner groups of the model. Each of these arises from a particular nonlinear layer of the model, and different nonlinearities result in different symmetry groups. These groups change the weights of a model in such a way that the underlying function that the model represents remains constant but the internal representations of data inside the model may change. We connect intertwiner groups to a model’s internal representations of data through a range of experiments that probe similarities between hidden states across models with the same architecture. Our work suggests that the symmetries of a network are propagated into the symmetries in that network’s representation of data, providing us with a better understanding of how architecture affects the learning and prediction process. Finally, we speculate that for ReLU networks, the intertwiner groups may provide a justification for the common practice of concentrating model interpretability exploration on the activation basis in hidden layers, rather than arbitrary linear combinations thereof. Joint work with Davis Brown, Tegan Emerson and Henry Kvinge.
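
As a small illustration of a weight change that leaves the represented function fixed while altering hidden representations, the sketch below permutes and positively rescales the hidden units of a two-layer ReLU network. The weights, sizes, and helper names are hypothetical choices made for this example and are not taken from the talk; the only fact relied on is that ReLU commutes with permutations and positive scalings.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def hidden(W1, x):
    """Hidden-layer activations relu(W1 x)."""
    return relu(matvec(W1, x))

def network(W1, W2, x):
    """Two-layer ReLU network x -> W2 relu(W1 x), without biases."""
    return matvec(W2, hidden(W1, x))

# Hypothetical weights: 3 inputs, 4 hidden units, 2 outputs.
W1 = [[1.0, 0.0, -1.0], [0.5, 2.0, 0.0], [-1.0, 1.0, 1.0], [0.0, -1.0, 2.0]]
W2 = [[1.0, -1.0, 0.5, 2.0], [0.0, 3.0, -2.0, 1.0]]

perm = [2, 0, 3, 1]            # new hidden unit i is old unit perm[i]
scale = [0.5, 2.0, 3.0, 0.25]  # positive factor applied to new unit i

# First layer: rows are permuted and scaled.
W1_new = [[scale[i] * w for w in W1[perm[i]]] for i in range(4)]
# Second layer: columns are permuted and scaled by the inverse factors.
W2_new = [[row[perm[i]] / scale[i] for i in range(4)] for row in W2]

x = [1.0, 2.0, -1.0]
y_old, y_new = network(W1, W2, x), network(W1_new, W2_new, x)
h_old, h_new = hidden(W1, x), hidden(W1_new, x)
```

Here `y_old` and `y_new` agree, while `h_old` and `h_new` differ, which is the kind of behavior the abstract describes.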

2/23/23: On the Role of Neural Collapse in Transfer and Few-Shot Learning, Tomer Galanti (MIT)

In a variety of machine learning applications, we have access to a limited amount of data from the task that we would like to solve, as labeled data is oftentimes scarce and/or expensive. In such cases, training directly on the available data is unlikely to produce a model that performs well on new, unseen test samples.

A prominent solution to this problem is transfer learning. This approach pre-trains a model on a large-scale source task, such as ImageNet, and then fine-tunes it to fit the available data from the downstream task. Recent studies have shown that a single classifier's learned representations over multiple classes can be easily adapted to new classes with very few samples.

In this talk, we provide an explanation for this behavior based on the recently observed phenomenon of neural collapse. We examine the few-shot error of the learned feature map, which is the classification error of the nearest class-center classifier using centers learned from a small number of random samples from each class. We show that the few-shot error generalizes from the training data to unseen test samples and to new classes. This suggests that pre-trained models can provide feature maps that are transferable to new downstream tasks even with limited data available.
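
The nearest class-center classifier and its few-shot error can be sketched as follows. The 2D points stand in for the outputs of a pre-trained feature map; the data and all function names are hypothetical and chosen only to make the definition concrete.

```python
import math

def class_centers(features, labels):
    """Average the feature vectors of each class to get one center per class."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(f)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], f)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(centers, f):
    """Assign f to the class whose center is closest in Euclidean distance."""
    return min(centers, key=lambda y: math.dist(centers[y], f))

def few_shot_error(centers, features, labels):
    """Fraction of samples that the nearest class-center rule misclassifies."""
    wrong = sum(predict(centers, f) != y for f, y in zip(features, labels))
    return wrong / len(labels)

# Two samples per class play the role of the few labeled examples.
support_x = [[0.0, 0.0], [1.0, 0.0], [4.0, 4.0], [5.0, 4.0]]
support_y = ["a", "a", "b", "b"]
centers = class_centers(support_x, support_y)

# Held-out samples; the last one lies closer to the "b" center but is labeled "a".
test_x = [[0.2, 0.5], [1.0, 1.0], [4.0, 3.0], [3.0, 3.0]]
test_y = ["a", "a", "b", "a"]
error = few_shot_error(centers, test_x, test_y)
```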

3/2/23: A Mathematical Lens on the Inner Workings of Deep Learning Models, Henry Kvinge (Pacific Northwest National Labs)

As both models and data grow, experiments have played an increasingly important role in driving the field of deep learning (DL) forward. We argue that, even in this empirical setting, mathematics has a lot to offer DL in terms of concepts, analytical tools, and frameworks. We provide two examples of this in this talk. In the first, we describe an approach to directly estimating the equivariance of a DL model to a specific group action. We show that targeted evaluations using such an approach can illuminate important aspects of model robustness and learning. Next, we describe how the notion of a frame can help us glimpse the ways that DL models process data and the manifolds from which they are drawn.

Here are the slides from the talk.

3/9/23: No seminar (spring break)

3/16/23: A Solvable Model of Neural Scaling Laws, Dan Roberts (MIT/Salesforce)

Large language models with a huge number of parameters, when trained on a near-internet-sized number of tokens, have been empirically shown to obey neural scaling laws: specifically, their performance behaves predictably as a power law in either parameters or dataset size until bottlenecked by the other resource. To understand this better, we first identify the necessary properties allowing such scaling laws to arise and then propose a statistical model -- a joint generative data model and random feature model -- that captures this neural scaling phenomenology. By solving this model in the dual limit of large training set size and large number of parameters, we gain insight into (i) the statistical structure of datasets and tasks that lead to scaling laws, (ii) the way nonlinear feature maps, such as those provided by neural networks, enable scaling laws when trained on these datasets, (iii) the optimality of the equiparameterization scaling of training sets and parameters, and (iv) whether such scaling laws can break down and how they behave when they do. Key findings are the manner in which the power laws that occur in the statistics of natural datasets are extended by nonlinear random feature maps and then translated into power-law scalings of the test loss and how the finite extent of the data's spectral power law causes the model's performance to plateau.

Here are the slides from the talk.

3/23/23: Experimental Math & Machine Learning Lab Kick-off event, Moon Duchin (Tufts)

Professor Duchin will give an informal talk and demo to math department faculty, postdocs, and grad students about how to download and work with some of the census data she uses in her work on geometry and gerrymandering. There will also be a mini-intro to Google Colab/Jupyter notebooks and Python.

5/4/23: Which loss functions are Morse?, Yaim Cooper (Notre Dame)

Abstract: At the heart of many contemporary machine learning systems is a loss function that is minimized by a gradient based algorithm. One basic property of any function is whether it is Morse or not. In this talk, we ask if and when the loss function of a deep neural network is Morse. We focus on the setting of feedforward neural networks with an added regularizer and discuss some cases that are known, and some that are not known.

Fall 2022

11/8/22: The geometry of linear convolutional networks, Kathlén Kohn (KTH, Stockholm)

We discuss linear convolutional neural networks (LCNs) and their critical points. We observe that the function space (i.e., the set of functions represented by LCNs) can be identified with polynomials that admit certain factorizations, and we use this perspective to describe the impact of the network’s architecture on the geometry of the function space. For instance, for LCNs with one-dimensional convolutions having stride one and arbitrary filter sizes, we provide a full description of the boundary of the function space. We further study the optimization of an objective function over such LCNs: We characterize the relations between critical points in function space and in parameter space and show that there do exist spurious critical points. We compute an upper bound on the number of critical points in function space using Euclidean distance degrees and describe dynamical invariants for gradient descent. This talk is based on joint work with Thomas Merkh, Guido Montúfar, and Matthew Trager.

11/17/22, Thursday 4 PM, NOTE UNUSUAL DAY/TIME: Radial neural networks: universal approximation and model compression, Iordan Ganev (Radboud, Institute for Computing and Information Sciences)

Neural network activations conventionally apply the same function to each coordinate. In this talk, we will discuss alternative activations that rescale feature vectors by a function depending only on the norm. The resulting networks are called radial neural networks, and their parameter spaces exhibit rich orthogonal change-of-basis symmetries. Factoring out these symmetries leads to a practical lossless model compression algorithm; we provide a precise relationship between gradient descent optimization of the original and compressed models. Additionally, we explain a universal approximation theorem for such networks.
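
A minimal sketch of such an activation is given below, together with a check that it commutes with a rotation, which is one example of the orthogonal change-of-basis symmetry mentioned above. The particular rescaling function (`shifted_relu_factor`) and the helper names are assumptions made for illustration and are not taken from the talk.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def radial_activation(v, rescale):
    """Multiply the whole vector v by a scalar that depends only on its norm."""
    factor = rescale(norm(v))
    return [factor * x for x in v]

def shifted_relu_factor(r, bias=1.0):
    """Hypothetical rescaling: shrink the norm by `bias`, clamping at zero."""
    return max(0.0, r - bias) / r if r > 0 else 0.0

def rotate(v, angle):
    """Apply a 2D rotation, which is an orthogonal transformation."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

v = [3.0, 4.0]   # norm 5, so the rescaling factor is (5 - 1) / 5 = 0.8
angle = 0.7
activated = radial_activation(v, shifted_relu_factor)
rotate_then_activate = radial_activation(rotate(v, angle), shifted_relu_factor)
activate_then_rotate = rotate(activated, angle)
```

Because the rotation preserves the norm, `rotate_then_activate` and `activate_then_rotate` agree up to floating-point error. A coordinate-wise activation such as ReLU does not have this property in general.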

11/29/22: Designing Domain-specific Representations for Deep Learning in Connectomics, Donglai Wei (Boston College Computer Science)

Abstract:

The field of connectomics aims to reconstruct the brain's wiring diagram from nanometer-resolution 3D microscopy image volumes to enable new insights into the workings of brains. These new insights could inspire novel artificial intelligence algorithms and benefit the development of treatments for neurodegenerative diseases. One primary computer vision task in connectomics is neuron segmentation, grouping raw image pixels into individual neurons or neuron compartments. However, existing deep learning models are ineffective on such tasks due to the limited and costly annotation and complex neuron morphology. In this talk, I will present domain-specific input and target representation designs for deep learning models to achieve state-of-the-art segmentation performance, leveraging knowledge about biological structures. First, for 3D volumetric segmentation, we designed the boundary-to-pixel direction representation (CVPR 2020), multi-view representation (MICCAI 2021), and skeleton-based distance transformation (under review). Next, for 3D point cloud segmentation, we applied the Frenet-Serret formulas to twisted tubular structures to make the deep learning model invariant to tube morphology (under review).

Bio: Donglai Wei is an assistant professor in the Computer Science Department at Boston College. His research focuses on developing novel registration and reconstruction algorithms for large-scale (currently petabyte-scale) connectomics datasets to empower neuroscience discoveries. During his Ph.D. at MIT under Prof. William Freeman, he worked on video understanding problems, including arrow of time and Vimeo-90K benchmark. Since his postdoc at Harvard University, he has embarked on the quest to reconstruct the brain's wiring diagram in collaboration with Prof. Hanspeter Pfister, Prof. Jeff Lichtman, and Prof. Ed Boyden.

Summer 2022

7/12/22: Demystify Deep Network Architectures: from Theory to Applications, Wuyang Chen (University of Texas, Austin)

Deep neural networks significantly power the success of machine learning. Over the past decade, the community has kept designing architectures with deep layers and complicated connections. However, the gap between deep learning theory and application keeps growing. This talk will center on this challenge and try to bridge the gap between the two worlds. By theoretically analyzing a network’s Jacobian, NNGP, and NTK, we find an intrinsic trade-off in network architectures. Given a space of architectures, a network cannot be optimal in its expressivity, trainability, and generalization at the same time, and it has to keep a balance between its depth and width. In other words, separately optimizing expressivity, trainability, and generalization will give us different network architectures. This analysis has further practical implications. Automated machine learning (AutoML) is a powerful tool to address design problems, yet at the price of heavy computation costs during model training. Our theory serves as accurate and efficient guidance for architecture design. We propose to significantly accelerate AutoML with our theory-grounded, training-free metrics. Without any training cost, our TE-NAS framework can automatically design novel and accurate network architectures on ImageNet in only four GPU hours.

Bio: Wuyang Chen is a Ph.D. candidate in Electrical and Computer Engineering at the University of Texas at Austin. Wuyang’s research focuses on theoretical understanding of deep network architectures and AutoML applications. Wuyang has also worked on domain adaptation/generalization and self-supervised learning. His work has been published at ICLR, ICML, CVPR, ICCV, etc. Wuyang completed research internships at NVIDIA and Google Brain. Wuyang chaired the 4th and 5th editions of the UG2+ workshop and challenge at CVPR 2021 and 2022. Wuyang is also a board member of the One World Seminar Series on the Mathematics of Machine Learning.

Website: https://chenwydj.github.io/

Spring 2022

5/3/22: Exact Combinatorial and Topological Data for ReLU Networks' Linear Regions, Marissa Masden (University of Oregon)

Abstract: One goal at the interface of topological data analysis and machine learning is to “tune” the topology of a network to that of the data. To that end, we report substantial progress on understanding the topology of ReLU networks. We study the canonical polyhedral complex, recently defined by E. Grigsby and K. Lindsey, which encodes a ReLU network's decomposition of input space and thus its decision boundary. We find that while the polyhedral complex is composed of arbitrary combinatorial types, generically its geometric dual is a cubical complex. We call this the sign sequence cubical complex, and establish additional algebraic structure on it, extending similar structure from the theory of hyperplane arrangements. We use this to show that the locations and sign sequences of the vertices of a network's polyhedral complex, which can be computed recursively through network layers, fully determine the complex combinatorially and topologically. Computing the polyhedral complex by taking advantage of this structure is robust to floating point errors which can arise through standard approaches to polyhedral intersection, giving an effective algorithm to fully encode decision boundaries. Running empirics preliminarily indicates that the distribution of topological properties of shallow networks' decision boundaries at initialization is roughly constant as width varies, but those topological properties vary with width for deeper networks.

2/8/22: The Computability of PAC Learning, Julian Asilis (Boston College, Computer Science)

A recent research direction seeks to align learning theory with its computational intentions by considering PAC learning under the restriction that learners be computable (rather than merely measurable). I discuss several works in this area, beginning with a brief treatment of relevant concepts in computability theory. First, I discuss recent advances made toward characterizing the computability of PAC learning over N, including the demonstration of a class of finite VC dimension without any proper computable PAC learners. Subsequently, I present joint work on the computability of learning over more general metric spaces, including the demonstration of a computable learner whose sample functions are all noncomputable. Finally, I consider open questions in the area.

1/25/22: A Neural Network Ensemble Approach to System Identification, Elisa Negrini (Worcester Polytechnic Institute)

We present a new algorithm for learning unknown governing equations from trajectory data, using an ensemble of neural networks. Given samples of solutions x(t) to an unknown dynamical system dx/dt=f(t,x(t)), we approximate the function f using an ensemble of neural networks. We express the equation in integral form and use the Euler method to predict the solution at every successive time step, using at each iteration a different neural network as a prior for f. This procedure yields M-1 time-independent networks, where M is the number of time steps at which x(t) is observed. Finally, we obtain a single function f(t,x(t)) by neural network interpolation. Unlike our earlier work, where we numerically computed the derivatives of data and used them as targets in a Lipschitz regularized neural network to approximate f, our new method avoids numerical differentiation, which is unstable in the presence of noise. We test the new algorithm on multiple examples both with and without noise in the data. We empirically show that generalization and recovery of the governing equation improve by adding a Lipschitz regularization term in our loss function and that this method improves on our previous one, especially in the presence of noise, when numerical differentiation provides low-quality target data. Finally, we compare our proposed method with other algorithms for system identification.
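
To make the Euler prediction step concrete: over one time step of length h, the integral form x(t+h) = x(t) + ∫ f(s, x(s)) ds is approximated by x(t+h) ≈ x(t) + h·f(t, x(t)). The sketch below rolls this step forward over M observation times. It uses a known right-hand side (`f_true`, a hypothetical stand-in) where the method described above would use a neural network, so it illustrates only the time-stepping and not the speaker's learning algorithm.

```python
import math

def euler_rollout(f, x0, times):
    """Predict x at each time in `times` using forward Euler steps."""
    xs = [x0]
    for t_prev, t_next in zip(times, times[1:]):
        h = t_next - t_prev
        xs.append(xs[-1] + h * f(t_prev, xs[-1]))
    return xs

def f_true(t, x):
    # Stand-in for the learned right-hand side: dx/dt = -x.
    return -x

M = 101                                   # number of observation times
times = [k / (M - 1) for k in range(M)]   # uniform grid on [0, 1]
xs = euler_rollout(f_true, 1.0, times)    # M - 1 Euler steps

# Compare with the exact solution x(t) = exp(-t) at t = 1.
error_at_end = abs(xs[-1] - math.exp(-1.0))
```

Note that M observation times give M-1 steps, which matches the M-1 networks mentioned in the abstract.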

Here are the slides.

Fall 2021

12/7/21: Understanding and Accelerating Neural Architecture Search with Theory-Grounded Metrics, Atlas Wang (University of Texas, Austin)

In this talk, I will discuss how to design a unified, training-free, and DL theory-grounded framework for Neural Architecture Search (NAS), with high performance, very low cost, and interpretability. NAS has been studied extensively as a way to automate the discovery of top-performing neural networks, but it suffers from heavy resource consumption and often incurs search bias due to truncated training or approximations. Recent NAS works have started to explore indicators that can predict a network's performance without training. By rigorous correlation analysis, we present a unified framework to understand and accelerate NAS, by disentangling essential theory-inspired characteristics of searched networks – Trainability, Expressivity, and Generalization (which we call “TEG”), all assessed in a training-free manner. Our indicators could be scaled up and integrated with various NAS search methods, including both supernet and single-path approaches. Extensive studies validate the effective and efficient guidance from our framework. Moreover, we visualize search trajectories on the landscapes of those characteristics, which leads to the first interpretable analysis of various NAS algorithms’ behaviors on different benchmarks.

11/16/21: Neural nets in natural language processing, Emily Prud'hommeaux (Boston College, Computer Science)

After many false starts over the past forty years, neural networks have become the dominant approach to machine learning for natural language processing (NLP). In this talk, I will describe and demonstrate a few simple but interesting ways neural nets are used in NLP, from representing lexical semantics to recognizing speech. Interested listeners can follow along with the demos using Colab or Python installed on their own machine.

10/19/21: Convex codes, neural networks, and oriented matroids, Alex Kunin (Baylor College of Medicine/University of Houston, Neuroscience/Mathematics)

Starting with a story about the hippocampus, I'll motivate an algebraic-topological view of neural activity, which mostly boils down to the coincidentally-named nerve theorem. This leads to a generalized sort of "inverse nerve" problem: given some collection of sets (not necessarily a simplicial complex), does it encode the intersections of some convex sets in R^d? I'll share my progress on answering this question and its connections to deep learning, both of which stem from hyperplane arrangements.

9/28/21: Beyond the Bias-Complexity Tradeoff, Jean-Baptiste Tristan (Boston College, Computer Science)

I will present some of the most important results in statistical learning theory (SLT) to give some context to the ongoing efforts to explain deep learning with neural networks. First, I will present the fundamental theorem of statistical learning theory that characterizes the generalizability of learning algorithms using VC theory. Second, I will explain how recalcitrant learning algorithms were analyzed in the framework of structural risk minimization, using concepts such as stability or duality. Finally, I will explain why existing results have failed to provide a convincing explanation of deep learning and review some of the promising approaches in light of SLT's past successes.

This talk will contain no original research.

Summer 2021

8/10/21: Intro to Graph neural networks, Cihan Soylu

8/3/21: Neural differential equations, Kathryn Lindsey

7/13/21: A fractional approach to regularization, Adebo Sijuwade (Washington State University)

Fractional calculus is an effective tool that has recently been used to improve the performance of gradient descent methods, some of the most common methods used to optimize neural networks. Caputo-based gradient methods have been effective compared with their integer-order equivalents due to their long memory characteristics, but they are limited in that convergence to a local optimum is not guaranteed. To avoid overfitting, it is of interest to consider the role of gradient methods in regularization problems. In this talk, I will discuss a recently proposed gradient method based on a fractional derivative operator with smooth kernel and address its compatibility with L1 regularization.

Slides are here.

7/6/21: The critical locus of overparameterized neural networks (Y. Cooper), Elisenda Grigsby

Spring 2021

4/6/21: Recent advances in the analysis of the implicit bias of gradient descent on deep networks, Matus Telgarsky (UIUC)

The purpose of this talk is to highlight three recent directions in the study of implicit bias --- one of the current promising approaches to trying to develop a tight generalization theory for deep networks, one interwoven with optimization. The first direction is a warm-up with purely linear predictors: here, the implicit bias perspective gives the fastest known hard-margin SVM solver! The second direction is on the early training phase with shallow networks: here, implicit bias leads to good training and testing error, with not just narrow networks but also arbitrarily large ones. The talk concludes with deep networks, providing a variety of structural lemmas which capture foundational aspects of how weights evolve for any width and sufficiently large amounts of training.

Fall 2020

12/1/20: On the topological expressiveness of neural networks, Elisenda Grigsby

I will describe a joint on-going project with K. Lindsey aimed at developing a general framework for understanding how the architecture of a neural network constrains the topological features of its decision regions.

11/24/20: No seminar (Thanksgiving)

11/17/20: Translation and Attention, Dalton Fung

I will start by introducing the language translation task, which is one of the many important tasks in Natural Language Processing (NLP). I'll then go over what the attention mechanism is and how it was invented to address some flaws of traditional models, and finally explain why it eventually became one of the most important architectures in NLP today.
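
For readers who want a concrete picture, the sketch below implements one widely used form of attention, scaled dot-product attention, for a single query. Whether the talk used this variant is an assumption; the toy vectors and function names are hypothetical. In a translation model, the query would typically come from the decoder and the keys and values from the encoded source sentence.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to one."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Return a weighted average of `values`, weighted by query-key similarity."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights

# Three source positions, each with a key and a value vector.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [4.0, 0.0]   # most similar to the first and third keys

output, weights = attention(query, keys, values)
```

The second position receives a much smaller weight than the other two, so `output` is dominated by the first and third value vectors.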

11/10/20: What are Random Forests? (Answer: neural networks), Adam Saltz

I'll give an overview of how random forests work, then describe why every random forest is a neural network.

11/03/20: Neural Network Initialization Processes, Marissa Masden

Abstract: Before neural networks are trained, their weights need to be assigned initial values. The initial distribution of weights affects properties of the network during training. I will discuss some common methods of random initialization for fully-connected feedforward networks, and some interesting corresponding properties of networks at initialization. I will then introduce a novel geometrically-inspired algorithm for initializing fully-connected networks called Linear Discriminant Sorting, developed together with my advisor, Dev Sinha. The success of this technique brings some intuition toward the geometric properties of neural networks.

10/27/20: On connected sublevel sets in deep learning (Q. Nguyen), Elisenda Grigsby

I'll be talking about this paper, which proves that every sublevel set of the loss function for certain feedforward neural network architectures (assuming activation functions that are homeomorphisms R-->R) is connected. I'll define all the terms I mentioned in the previous sentence, and will spend much of my time putting the result in context.

10/20/20: Explainability Methods for Neural Networks, Cihan Soylu

Abstract: Neural networks have become an important machine learning tool for achieving human-level performance on many learning tasks. However, due to the black-box nature of these models, it is difficult to understand which features of a given input are driving the decision of the learned network. This understanding is crucial for tasks such as medical diagnostics. In this talk, we will go over various explainability methods proposed for neural networks and ways to evaluate these methods.