Schedule

8:30 - 8:40 : Opening remarks

8:40 - 9:10 : Invited talk: Andrea Montanari, Linearized two-layers neural networks in high dimension [Video/Slides]

9:10 - 9:40 : Invited talk: Lenka Zdeborova, Loss landscape and behaviour of algorithms in the spiked matrix-tensor model [Video/Slides]

9:40 - 10:20 : Poster spotlights [Video/Slides, Slides only]

10:20 - 11:00 : Break and poster discussion

11:00 - 11:30 : Invited talk: Kyle Cranmer, On the Interplay between Physics and Deep Learning [Video/Slides]

11:30 - 12:00 : Invited talk: Michael Mahoney, Why Deep Learning Works: Traditional and Heavy-Tailed Implicit Self-Regularization in Deep Neural Networks [Video/Slides]

12:00 - 12:15 : Contributed talk: SGD dynamics for two-layer neural networks in the teacher-student setup [Video/Slides]

12:15 - 12:30 : Contributed talk: Convergence Properties of Neural Networks on Separable Data [Video/Slides]

12:30 - 14:00 : Lunch

14:00 - 14:30 : Invited talk: Sanjeev Arora, Is Optimization a sufficient language to understand Deep Learning? [Video/Slides]

14:30 - 14:45 : Contributed talk: Towards Understanding Regularization in Batch Normalization [Video/Slides]

14:45 - 15:00 : Contributed talk: How Noise during Training Affects the Hessian Spectrum [Video/Slides]

15:00 - 15:30 : Break and poster discussion

15:30 - 16:00 : Invited talk: Jascha Sohl-Dickstein, Understanding overparameterized neural networks [Video/Slides]

16:00 - 16:15 : Contributed talk: Asymptotics of Wide Networks from Feynman Diagrams [Video/Slides]

16:15 - 16:30 : Contributed talk: A Mean Field Theory of Quantized Deep Networks: The Quantization-Depth Trade-Off [Video/Slides]

16:30 - 16:45 : Contributed talk: Deep Learning on the 2-Dimensional Ising Model to Extract the Crossover Region [Video/Slides]

16:45 - 17:00 : Contributed talk: Learning the Arrow of Time [Video/Slides]

17:00 - 18:00 : Poster session

Poster spotlights

[Video/Slides]

    • A Quantum Field Theory of Representation Learning
    • Covariance in Physics and Convolutional Neural Networks
    • Scale Steerable Filters for Locally Scale-Invariant Convolutional Neural Networks
    • Towards a Definition of Disentangled Representations
    • Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes
    • Finite size corrections for neural network Gaussian processes
    • Pathological Spectrum of the Fisher Information Matrix in Deep Neural Networks
    • Inferring the quantum density matrix with machine learning
    • Jet grooming through reinforcement learning

Invited Talk Abstracts

Andrea Montanari, Linearized two-layers neural networks in high dimension

We consider the problem of learning an unknown function f on the d-dimensional sphere with respect to the square loss, given i.i.d. samples (y_i, x_i) where x_i is a feature vector uniformly distributed on the sphere and y_i = f(x_i). We study two popular classes of models that can be regarded as linearizations of two-layers neural networks around a random initialization: (RF) the random feature model of Rahimi-Recht; (NT) the neural tangent kernel model of Jacot-Gabriel-Hongler. Both these approaches can also be regarded as randomized approximations of kernel ridge regression (with respect to different kernels), and hence enjoy universal approximation properties when the number of neurons N diverges, for a fixed dimension d. We prove that, if both d and N are large, the behavior of these models is instead remarkably simpler. If N is of smaller order than d^2, then RF performs no better than linear regression with respect to the raw features x_i, and NT performs no better than linear regression with respect to degree-one and degree-two monomials in the x_i's. More generally, if N is of smaller order than d^{k+1}, then RF fits at most a degree-k polynomial in the raw features, and NT fits at most a degree-(k+1) polynomial. We then focus on the case of quadratic functions, and N = O(d). We show that the gap in generalization error between fully trained neural networks and the linearized models is potentially unbounded. [Based on joint work with Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz]
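To make the (RF) model concrete, here is a minimal NumPy sketch, not the authors' experiment: the dimensions, the ReLU features, and the ridge penalty are illustrative choices. It fits a quadratic target on the sphere with N much smaller than d^2 random features and compares the result with plain linear regression on the raw features, the regime in which the abstract predicts no advantage for RF.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, N = 50, 2000, 200            # dimension, samples, random features (N << d^2)

# Inputs uniform on the unit sphere, quadratic target f(x) = <x, A x>
X = rng.standard_normal((n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
A = rng.standard_normal((d, d)); A = (A + A.T) / 2
y = np.einsum('ni,ij,nj->n', X, A, X)

# (RF) random-feature model: ridge regression on relu(X @ W) with frozen random W
W = rng.standard_normal((d, N))
Phi = np.maximum(X @ W, 0.0)

def ridge(F, targets, lam=1e-3):
    # Closed-form ridge regression coefficients on feature matrix F
    return np.linalg.solve(F.T @ F + lam * np.eye(F.shape[1]), F.T @ targets)

coef_rf, coef_lin = ridge(Phi, y), ridge(X, y)   # RF fit vs. raw-feature linear fit

# Relative test error on fresh samples from the same distribution
Xt = rng.standard_normal((500, d)); Xt /= np.linalg.norm(Xt, axis=1, keepdims=True)
yt = np.einsum('ni,ij,nj->n', Xt, A, Xt)
rel_err = lambda pred: np.mean((pred - yt) ** 2) / np.mean(yt ** 2)
print("RF test error:    ", rel_err(np.maximum(Xt @ W, 0.0) @ coef_rf))
print("linear test error:", rel_err(Xt @ coef_lin))
```

With these sizes the two errors come out comparable, which is the qualitative point of the N << d^2 regime; a more careful study would vary N and d as in the talk.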


Lenka Zdeborova, Loss landscape and behaviour of algorithms in the spiked matrix-tensor model

A key question of current interest is: How are properties of optimization and sampling algorithms influenced by the properties of the loss function in noisy high-dimensional non-convex settings? Answering this question for deep neural networks is a landmark goal of many ongoing works. In this talk I will answer this question in unprecedented detail for the spiked matrix-tensor model. Information-theoretic limits and a Kac-Rice analysis of the loss landscape will be compared to the analytically studied performance of message-passing algorithms, of Langevin dynamics, and of gradient flow. Several rather non-intuitive results will be unveiled and explained.
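As a toy illustration of the objects in this talk, the sketch below runs Langevin dynamics on a spiked matrix-tensor style loss with a planted spike. The signal-to-noise scalings, step size, and temperature are illustrative guesses rather than the conventions used in the talk, and the analytic tools mentioned above (message passing, Kac-Rice) are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)
N, snr_mat, snr_ten = 100, 3.0, 3.0       # size and (illustrative) signal strengths
steps, lr, temp = 1500, 0.02, 0.01        # Langevin step size and temperature

# Planted spike x* on the sphere of radius sqrt(N)
x_star = rng.standard_normal(N); x_star *= np.sqrt(N) / np.linalg.norm(x_star)

# Noisy matrix and noisy 3-tensor observations of the spike
Y = snr_mat / N * np.outer(x_star, x_star) + rng.standard_normal((N, N)) / np.sqrt(N)
Y = (Y + Y.T) / 2
T3 = (snr_ten / N**2 * np.einsum('i,j,k->ijk', x_star, x_star, x_star)
      + rng.standard_normal((N, N, N)) / N)

def grad(x):
    """Gradient of H(x) = -0.5 * x'Yx - (1/3) * sum_ijk T3[i,j,k] x_i x_j x_k."""
    M = T3 @ x                          # M[i, j] = sum_k T3[i, j, k] x_k
    Q = np.tensordot(x, T3, axes=1)     # Q[j, k] = sum_i x_i T3[i, j, k]
    g_cubic = (M @ x + Q @ x + x @ Q) / 3.0   # three permutations of the cubic term
    return -Y @ x - g_cubic

# Langevin dynamics, projected back onto the sphere after every step
x = rng.standard_normal(N); x *= np.sqrt(N) / np.linalg.norm(x)
for t in range(steps + 1):
    if t % 300 == 0:
        print(f"step {t:4d}   overlap |<x, x*>| / N = {abs(x @ x_star) / N:.3f}")
    x = x - lr * grad(x) + np.sqrt(2 * lr * temp) * rng.standard_normal(N)
    x *= np.sqrt(N) / np.linalg.norm(x)
```

At these (easy) signal strengths the printed overlap grows quickly; the interesting regimes discussed in the talk are the ones where the landscape traps such dynamics while other algorithms still succeed.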


Kyle Cranmer, On the Interplay between Physics and Deep Learning

The interplay between physics and deep learning is typically divided into two themes. The first is “physics for deep learning”, where techniques from physics are brought to bear on understanding the dynamics of learning. The second is “deep learning for physics,” which focuses on the application of deep learning techniques to physics problems. I will present a more nuanced view of this interplay, with examples of how the structure of physics problems has inspired advances in deep learning and how it yields insights on topics such as inductive bias, interpretability, and causality.


Michael Mahoney, Why Deep Learning Works: Traditional and Heavy-Tailed Implicit Self-Regularization in Deep Neural Networks

Random Matrix Theory (RMT) is applied to analyze the weight matrices of Deep Neural Networks (DNNs), including both production-quality, pre-trained models and smaller models trained from scratch. Empirical and theoretical results clearly indicate that the DNN training process itself implicitly implements a form of self-regularization, implicitly sculpting a more regularized energy or penalty landscape. In particular, the empirical spectral density (ESD) of DNN layer matrices displays signatures of traditionally regularized statistical models, even in the absence of exogenously specifying traditional forms of explicit regularization. Building on relatively recent results in RMT, most notably its extension to Universality classes of Heavy-Tailed matrices, and applying them to these empirical results, we develop a theory to identify 5+1 Phases of Training, corresponding to increasing amounts of implicit self-regularization. For smaller and/or older DNNs, this implicit self-regularization is like traditional Tikhonov regularization, in that there appears to be a “size scale” separating signal from noise. For state-of-the-art DNNs, however, we identify a novel form of heavy-tailed self-regularization, similar to the self-organization seen in the statistical physics of disordered systems. This implicit self-regularization can depend strongly on the many knobs of the training process. In particular, by exploiting the generalization gap phenomenon, we demonstrate that we can cause a small model to exhibit all 5+1 phases of training simply by changing the batch size. This demonstrates that, all else being equal, DNN optimization with larger batch sizes leads to less well implicitly regularized models, and it provides an explanation for the generalization gap phenomenon. Coupled with work on energy landscapes and heavy-tailed spin glasses, it also suggests an explanation of why deep learning works. Joint work with Charles Martin of Calculation Consulting, Inc.
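A minimal sketch of the kind of diagnostic the abstract describes: compute the empirical spectral density (ESD) of a layer's correlation matrix and a crude tail-exponent estimate. The Gaussian matrix below is only a stand-in (it should look “random-like”, i.e. Marchenko-Pastur); running the same code on a pre-trained layer's weights is where heavy-tailed ESDs would appear. The Hill estimator here is a simplification of the power-law fitting used in the actual work.

```python
import numpy as np

def esd_and_tail_exponent(W, k_tail=50):
    """Eigenvalues of X = W^T W / n (the ESD support) and a crude Hill estimate
    of the power-law exponent of the upper tail (small alpha suggests heavy tails)."""
    n, m = W.shape
    eigs = np.linalg.eigvalsh(W.T @ W / n)
    tail = np.sort(eigs)[-k_tail:]                            # largest k eigenvalues
    alpha = 1.0 + k_tail / np.sum(np.log(tail / tail[0]))     # Hill estimator
    return eigs, alpha

# A random Gaussian "layer": ESD follows Marchenko-Pastur, no heavy tail expected.
rng = np.random.default_rng(0)
W_random = rng.standard_normal((1000, 500))
eigs, alpha = esd_and_tail_exponent(W_random)
print(f"ESD support: [{eigs.min():.3f}, {eigs.max():.3f}],  Hill alpha ~ {alpha:.2f}")
# Substituting a fully connected weight matrix from a trained model would show the
# heavy-tailed phases of training discussed in the talk.
```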


Sanjeev Arora, Is Optimization a sufficient language to understand Deep Learning?

There is an old debate in neuroscience about whether or not learning has to boil down to optimizing a single cost function. This talk will suggest that even to understand mathematical properties of deep learning, we have to go beyond the conventional view of "optimizing a single cost function". The reason is that phenomena occur along the gradient descent trajectory that are not fully captured in the value of the cost function. I will illustrate briefly with three new results that involve such phenomena:

(i) (joint work with Cohen, Hu, and Luo) How deep matrix factorization solves matrix completion better than classical algorithms https://arxiv.org/abs/1905.13655 (a toy sketch of this setup appears after this list)

(ii) (joint with Du, Hu, Li, Salakhutdinov, and Wang) How to compute (exactly) with an infinitely wide net ("mean field limit", in physics terms) https://arxiv.org/abs/1904.11955

(iii) (joint with Kuditipudi, Wang, Hu, Lee, Zhang, Li, and Ge) Explaining mode connectivity for real-life deep nets (the phenomenon that low-cost solutions found by gradient descent are interconnected in parameter space via low-cost paths; see Garipov et al. '18 and Draxler et al. '18)
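For (i), here is a toy sketch of deep matrix factorization for matrix completion: plain gradient descent on a depth-3 product of full-size factors, started from small initialization and fit only on the observed entries. The matrix size, depth, learning rate, initialization scale, and observation fraction are illustrative choices, not the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rank, depth, lr, steps = 30, 2, 3, 0.1, 5000     # illustrative hyperparameters

# Rank-2 ground truth with O(1) singular values, and a 50% observation mask
U, _ = np.linalg.qr(rng.standard_normal((n, rank)))
V, _ = np.linalg.qr(rng.standard_normal((n, rank)))
M = U @ np.diag([1.0, 0.6]) @ V.T
mask = rng.random((n, n)) < 0.5

# Deep factorization W_3 W_2 W_1 with full-size factors and small initialization
Ws = [0.1 * rng.standard_normal((n, n)) for _ in range(depth)]

def prod(mats):
    # Ordered product of the listed factors (identity for an empty list)
    if len(mats) == 0:
        return np.eye(n)
    return mats[0] if len(mats) == 1 else np.linalg.multi_dot(mats)

for t in range(steps + 1):
    P = prod(Ws[::-1])                  # P = W_depth ... W_2 W_1
    R = (P - M) * mask                  # residual on observed entries only
    if t % 1000 == 0:
        test = np.linalg.norm((P - M) * ~mask) / np.linalg.norm(M * ~mask)
        print(f"step {t:5d}   relative error on unobserved entries: {test:.3f}")
    # Gradient of 0.5 * ||mask * (W_depth ... W_1 - M)||^2 w.r.t. each factor
    grads = [prod(Ws[i + 1:][::-1]).T @ R @ prod(Ws[:i][::-1]).T for i in range(depth)]
    for W, g in zip(Ws, grads):
        W -= lr * g
```

The point of the construction is the implicit bias: nothing in the loss asks for low rank, yet the deep parameterization with small initialization tends to recover the unobserved entries.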


Jascha Sohl-Dickstein, Understanding overparameterized neural networks

As neural networks become highly overparameterized, their accuracy improves, and their behavior becomes easier to analyze theoretically. I will give an introduction to a rapidly growing body of work which examines the learning dynamics and prior over functions induced by infinitely wide, randomly initialized neural networks. Core results that I will discuss include: that the distribution over functions computed by a wide neural network often corresponds to a Gaussian process with a particular compositional kernel, both before and after training; that the predictions of wide neural networks are linear in their parameters throughout training; and that this perspective enables analytic predictions for how trainability depends on hyperparameters and architecture. These results enable surprising capabilities: for instance, evaluating the test-set predictions of an infinitely wide trained neural network without ever instantiating a neural network, or rapidly training 10,000+ layer convolutional networks. I will argue that this growing understanding of neural networks in the limit of infinite width is foundational for future theoretical and practical understanding of deep learning.
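As a pointer to what “Gaussian process with a compositional kernel” means concretely, here is a minimal NumPy sketch of the kernel recursion for an infinitely wide, fully connected ReLU network, using the closed-form ReLU (arc-cosine) expectation. The weight/bias variances and depth are illustrative; libraries such as neural-tangents implement this machinery, including the corresponding tangent kernel and infinite-width training, in full generality.

```python
import numpy as np

def nngp_kernel(x1, x2, depth=3, sigma_w2=2.0, sigma_b2=0.0):
    """Compositional kernel of an infinitely wide fully connected ReLU network:
    recursively maps the input covariance through each layer in closed form."""
    d = len(x1)
    k11, k22, k12 = x1 @ x1 / d, x2 @ x2 / d, x1 @ x2 / d
    for _ in range(depth):
        theta = np.arccos(np.clip(k12 / np.sqrt(k11 * k22), -1.0, 1.0))
        # E[relu(u) relu(v)] for (u, v) jointly Gaussian with the current covariance
        k12 = sigma_w2 * np.sqrt(k11 * k22) / (2 * np.pi) * (
            np.sin(theta) + (np.pi - theta) * np.cos(theta)) + sigma_b2
        k11 = sigma_w2 * k11 / 2 + sigma_b2
        k22 = sigma_w2 * k22 / 2 + sigma_b2
    return k12

x, y = np.random.default_rng(0).standard_normal((2, 10))
print("prior covariance K(x, y) =", nngp_kernel(x, y))
print("prior variance   K(x, x) =", nngp_kernel(x, x))
```

Evaluating this kernel on a dataset and doing ordinary Gaussian-process regression is one way to obtain the “predictions of an infinitely wide network without instantiating a network” mentioned above.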