Schedule

July 24, 2021 (Pacific Time)

9:00 - 9:05 Opening remarks



9:05 - 10:00 Invited Talk: Peter Bartlett (UC Berkeley)

Adversarial Examples in Random Deep Networks



10:00 - 10:55 Invited Talk: Misha Belkin (UCSD)

The Polyak-Lojasiewicz condition as a framework for over-parameterized optimization and its application to deep learning



10:55 - 11:40 Contributed talks

Distributional Generalization: A New Kind of Generalization

Preetum Nakkiran (Harvard University); Yamini Bansal (Harvard University)


Understanding the effect of sparsity on neural networks robustness

Lukas Timpl (TU Graz); Rahim Entezari (TU Graz); Hanie Sedghi (Google); Behnam Neyshabur (Google); Olga Saukh (TU Graz)


On the Generalization Improvement from Neural Network Pruning

Tian Jin (MIT); Gintare Karolina Dziugaite (ServiceNow); Daniel Roy (Vector Institute); Michael Carbin (MIT); Jonathan Frankle (MIT)




11:40 - 12:30 Poster Session




12:30 - 13:25 Invited Talk: Lenka Zdeborová (EPFL)

Overparameterization: Insights from solvable models




13:25 - 14:20 Invited Talk: Andrea Montanari (Stanford)

The Generalization Behavior of Random Feature and Neural Tangent Models




14:20 - 15:05 Contributed talks


Towards understanding how momentum improves generalization in deep learning

Samy Jelassi (Princeton University); Yuanzhi Li (CMU)


Feature Learning in Infinite-Width Neural Networks

Greg Yang (Microsoft Research AI); Edward J. Hu (Microsoft Dynamics AI)


A Universal Law of Robustness via Isoperimetry

Sébastien Bubeck (Microsoft Research); Mark Sellke (Stanford University)




15:05 - 15:55 Poster Session




15:55 - 16:50 Invited Talk: Tengyuan Liang (University of Chicago)

Universal Prediction Band, Semidefinite Programming and Variance Interpolation




16:50 - 17:45 Invited Talk: Suriya Gunasekar (Microsoft Research)

Function space view of Multi-Channel Linear Convolutional Networks with Bounded Weight Norm




17:45 - 18:15 Contributed talks


Value-Based Deep Reinforcement Learning Requires Explicit Regularization

Aviral Kumar (UC Berkeley); Rishabh Agarwal (Google Research, Brain Team); Aaron Courville (University of Montreal); Tengyu Ma (Stanford); George Tucker (Google Brain); Sergey Levine (UC Berkeley)


Beyond Implicit Regularization: Avoiding Overfitting via Regularizer Mirror Descent

Navid Azizan (Stanford University); Ali Sahin Lale (California Institute of Technology); Babak Hassibi (California Institute of Technology)




18:15 - 18:20 Final remarks

Invited Talk Abstracts

Peter Bartlett (UC Berkeley), Adversarial Examples in Random Deep Networks

Because the phenomenon of adversarial examples in deep networks poses a serious barrier to the reliable and robust application of this methodology, there has been considerable interest in why it arises. We consider ReLU networks of constant depth with independent Gaussian parameters, and show that small perturbations of input vectors lead to large changes of outputs. Building on insights of Daniely and Schacham (2020) and of Bubeck et al. (2021), we show that adversarial examples arise in these networks because the functions that they compute are very close to linear. The main result is for networks of constant depth, but we also show that some constraint on depth is necessary for a result of this kind: there are suitably deep networks that, with constant probability, compute a function that is close to constant. This, combined with results characterizing benign overfitting in linear regression, suggests two potential mechanisms behind adversarial examples in overparameterized settings, one arising from label noise and the other from random initialization.

Joint work with Sébastien Bubeck and Yeshwanth Cherapanamjeri.
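
The near-linearity phenomenon is easy to observe numerically. The sketch below is an illustration only, not the construction analyzed in the talk; the network sizes, initialization scales, and perturbation size are arbitrary choices. It builds a random ReLU network with Gaussian weights and perturbs a random unit-norm input along its gradient:

# A minimal numerical sketch (illustration only): in a random ReLU network with
# i.i.d. Gaussian weights, a small perturbation of the input along the gradient
# direction typically produces a large, often sign-flipping, change in the output.
import numpy as np

rng = np.random.default_rng(0)
d, width = 512, 512                          # arbitrary illustrative sizes
n_hidden = 2                                 # two ReLU layers, then a linear readout

# He-style scaling keeps hidden-layer norms of order one.
Ws = [rng.normal(0.0, np.sqrt(2.0 / d), size=(width, d))]
Ws += [rng.normal(0.0, np.sqrt(2.0 / width), size=(width, width)) for _ in range(n_hidden - 1)]
v = rng.normal(0.0, 1.0, size=width)         # readout weights; f(x) is then of order one

def forward_and_grad(x):
    """Return f(x) and the gradient of f with respect to the input x."""
    h, masks = x, []
    for W in Ws:
        pre = W @ h
        masks.append(pre > 0)
        h = np.maximum(pre, 0.0)
    out = v @ h
    # Backpropagate to the input: g = W_1^T D_1 ... W_L^T D_L v, with D_l = diag(ReLU masks).
    g = v.copy()
    for W, m in zip(reversed(Ws), reversed(masks)):
        g = W.T @ (g * m)
    return out, g

x = rng.normal(size=d)
x /= np.linalg.norm(x)                       # random input on the unit sphere
f0, g = forward_and_grad(x)

eps = 0.1                                    # perturbation size, small relative to ||x|| = 1
x_adv = x - eps * np.sign(f0) * g / np.linalg.norm(g)
f1, _ = forward_and_grad(x_adv)
print(f"f(x) = {f0:+.4f},  f(x + delta) = {f1:+.4f},  ||delta|| / ||x|| = {eps}")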



Misha Belkin (UCSD), The Polyak-Lojasiewicz condition as a framework for over-parameterized optimization and its application to deep learning

The success of deep learning is due, to a large extent, to the remarkable effectiveness of gradient-based optimization methods applied to large neural networks. In this talk I will discuss some general mathematical principles that allow for efficient optimization in over-parameterized non-linear systems, a setting that includes deep neural networks. I will argue that the optimization problems corresponding to these systems are not convex, even locally, but instead satisfy the Polyak-Lojasiewicz (PL) condition on most of the parameter space, allowing for efficient optimization by gradient descent or SGD. I will connect the PL condition of these systems to the condition number associated with the tangent kernel and show how a non-linear theory for these systems parallels classical analyses of over-parameterized linear equations. As a separate but related development, I will discuss a perspective on the remarkable recently discovered phenomenon of transition to linearity (constancy of the NTK) in certain classes of large neural networks. I will show how this transition to linearity results from the scaling of the Hessian with the size of the network, controlled by certain functional norms. Combining these ideas, I will show how the transition to linearity can be used to establish the PL condition and convergence for a general class of wide neural networks. Finally, I will comment on systems which are "almost" over-parameterized, a setting that appears to be common in practice.

Based on joint work with Chaoyue Liu and Libin Zhu.
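
For reference, the PL condition in one standard formulation (the talk's constants and normalization may differ): a loss $\mathcal{L}$ with infimum $\mathcal{L}^*$ satisfies the PL condition with constant $\mu > 0$ on a set $S$ if

$$\tfrac{1}{2}\,\|\nabla \mathcal{L}(w)\|^2 \;\ge\; \mu\,\bigl(\mathcal{L}(w) - \mathcal{L}^*\bigr) \quad \text{for all } w \in S.$$

If $\mathcal{L}$ is in addition $\beta$-smooth, gradient descent $w_{t+1} = w_t - \tfrac{1}{\beta}\,\nabla \mathcal{L}(w_t)$ converges linearly as long as the iterates remain in $S$:

$$\mathcal{L}(w_t) - \mathcal{L}^* \;\le\; \Bigl(1 - \tfrac{\mu}{\beta}\Bigr)^{t}\,\bigl(\mathcal{L}(w_0) - \mathcal{L}^*\bigr).$$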



Lenka Zdeborová (EPFL), Overparameterization: Insights from solvable models


This talk gives an overview of recent results in a line of theoretical work that started three decades ago in statistical physics. We will first discuss the teacher-student setting of generalized linear regression. We illustrate the presence of the interpolation peak for classification with ridge loss and its vanishing with regularization. We show that, in the spherical perceptron, optimally regularized logistic regression closely approaches the Bayes-optimal accuracy. We contrast this with the non-convex case of phase retrieval, where canonical empirical risk minimization performs poorly compared to the Bayes-optimal error. We then move towards learning with hidden units and analyze double descent in learning with generic fixed features and any convex loss. The formulas we obtain are generic enough to describe the learning of the last layer of neural networks for realistic data and networks. Finally, for phase retrieval, we are able to analyze gradient descent in the feature-learning regime of a two-layer neural network, where we show that overparametrization allows a considerable reduction of the sample complexity. Concretely, an overparametrized neural network needs only twice as many samples as the input dimension, while a non-overparametrized network needs a constant factor more, and kernel regression needs a number of samples quadratic in the input dimension.
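
The interpolation peak mentioned above can be reproduced with a toy simulation (an illustration only, not the solvable-model analysis of the talk; the teacher, feature model, and sizes are arbitrary choices). It fits minimum-norm least squares on random ReLU features of data from a noisy linear teacher, sweeping the number of features p through the interpolation threshold p = n:

# A toy simulation of the interpolation peak (illustration only): minimum-norm
# least squares on p random ReLU features of data from a noisy linear teacher.
# The test error typically spikes near p = n and decreases again for p >> n.
import numpy as np

rng = np.random.default_rng(1)
d, n, n_test, noise = 50, 200, 2000, 0.1     # arbitrary illustrative sizes

w_star = rng.normal(size=d) / np.sqrt(d)     # linear teacher
X, X_te = rng.normal(size=(n, d)), rng.normal(size=(n_test, d))
y = X @ w_star + noise * rng.normal(size=n)
y_te = X_te @ w_star + noise * rng.normal(size=n_test)

def test_error(p):
    """Test MSE of the minimum-norm least-squares fit on p random ReLU features."""
    F = rng.normal(size=(d, p)) / np.sqrt(d)         # random, untrained first layer
    Phi, Phi_te = np.maximum(X @ F, 0), np.maximum(X_te @ F, 0)
    a = np.linalg.lstsq(Phi, y, rcond=None)[0]       # min-norm solution when p > n
    return np.mean((Phi_te @ a - y_te) ** 2)

for p in [20, 50, 100, 150, 190, 200, 210, 250, 400, 800]:
    print(f"p = {p:4d}   test MSE = {test_error(p):.3f}")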




Andrea Montanari (Stanford), The Generalization Behavior of Random Feature and Neural Tangent Models

I consider two-layer neural networks trained with the square loss in the linear (lazy) regime. Under overparametrization, gradient descent converges to the minimum-norm interpolant, and I consider this as well as the whole ridge regularization path. From a statistical viewpoint, these approaches are random features models, albeit of a special type. They are also equivalent to kernel ridge regression with a random kernel of rank N*d (where N is the number of hidden neurons and d the input dimension). I will describe a precise characterization of the generalization error when N, d and the sample size are polynomially related (and for covariates that are uniform on the d-dimensional sphere). I will then discuss the limitations of these approaches, and explain how sparse random feature models can be learned efficiently to address these limitations.

Based on joint work with Michael Celentano, Song Mei, Theodor Misiakiewicz, Yiqiao Zhong.
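
For concreteness, the estimators referred to above in standard notation (normalizations may differ from the talk): with feature map $\phi$, data $(x_i, y_i)_{i \le n}$, and feature matrix $\Phi$ whose rows are $\phi(x_i)^\top$, the ridge path and its ridgeless (minimum-norm interpolating) limit are

$$\hat\theta_\lambda = \arg\min_{\theta}\; \frac{1}{n}\sum_{i=1}^n \bigl(y_i - \langle \theta, \phi(x_i)\rangle\bigr)^2 + \lambda\,\|\theta\|_2^2 = \Phi^\top\bigl(\Phi\Phi^\top + n\lambda I\bigr)^{-1} y, \qquad \hat\theta_0 = \lim_{\lambda \to 0^+} \hat\theta_\lambda = \arg\min\bigl\{\|\theta\|_2 : \Phi\theta = y\bigr\},$$

where the last equality holds when interpolation is feasible, as in the overparametrized regime. Predictions take the kernel ridge regression form $\hat f_\lambda(x) = k(x)^\top (K + n\lambda I)^{-1} y$ with $K_{ij} = \langle \phi(x_i), \phi(x_j)\rangle$; when only the first-layer weights of the two-layer network are trained in the lazy regime, the tangent features have dimension $N \cdot d$, so the kernel has rank at most $N \cdot d$, as stated above.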



Tengyuan Liang (University of Chicago), Universal Prediction Band, Semidefinite Programming and Variance Interpolation

A frequent criticism from the statistics community of modern machine learning is the lack of rigorous uncertainty quantification. Instead, the machine learning community would argue that conventional uncertainty quantification based on idealized distributional assumptions may be too restrictive for real data. This paper makes progress in resolving this inference dilemma. We propose a computationally efficient method to construct nonparametric, heteroskedastic prediction bands for uncertainty quantification, with or without any user-specified predictive model. The data-adaptive prediction band is universally applicable with minimal distributional assumptions, has strong non-asymptotic coverage properties, and is easy to implement using standard convex programs. Our approach can be viewed as a novel variance interpolation with confidence, and it leverages techniques from semidefinite programming and sum-of-squares optimization. Theoretical and numerical performance of the proposed approach for uncertainty quantification is analyzed.



Suriya Gunasekar (Microsoft Research), Function space view of Multi-Channel Linear Convolutional Networks with Bounded Weight Norm


The magnitude of the weights of a neural network is a fundamental measure of complexity that plays a crucial role in the study of implicit and explicit regularization. For example, recent work shows that gradient descent updates in overparameterized models asymptotically lead to solutions that implicitly minimize the ell_2 norm of the parameters of the model, resulting in an inductive bias that is highly architecture dependent. To investigate the properties of learned functions, it is natural to consider a function space view given by the minimum ell_2 norm of weights required to realize a given function with a given network. We call this the “induced regularizer” of the network. Building on a line of recent work, we study the induced regularizer of linear convolutional neural networks, with a focus on the role of kernel size and the number of channels. We introduce an SDP relaxation of the induced regularizer, which we show is tight for networks with a single input channel. Using this SDP formulation, we show that the induced regularizer is independent of the number of output channels for single-input-channel networks, and for multi-input-channel networks we show independence given sufficiently many output channels. Moreover, we show that as the kernel size increases, the induced regularizer interpolates between a basis-invariant norm and a basis-dependent norm that promotes sparse structures in Fourier space.


Based on joint work with Meena Jagadeesan and Ilya Razenshteyn.
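
A hedged formalization of the induced regularizer described above (the talk's normalization, e.g. squared versus unsquared norm, may differ): for an architecture whose parameters $w$ compute the function $h_w$, the induced regularizer of a function $f$ is

$$R(f) \;=\; \min\bigl\{\, \|w\|_2^2 \;:\; h_w = f \,\bigr\},$$

with $R(f) = \infty$ if no parameter setting realizes $f$. The talk studies this quantity for multi-channel linear convolutional networks, as a function of the kernel size and the numbers of input and output channels, via the SDP relaxation mentioned above.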