Invited Speakers

Bren Professor of Computing and Mathematical Sciences, California Institute of Technology

Enabling Zero-Shot Generalization in AI4Science

Abstract: AI holds immense promise for enabling scientific breakthroughs and discoveries in diverse areas. However, most such scenarios do not fit the standard supervised learning framework: AI4Science often requires zero-shot generalization to entirely new scenarios not seen during training. For instance, drug discovery requires predicting properties of new molecules that can differ vastly from the training data, and AI-based PDE solvers must handle any instance of the PDE family. Such zero-shot generalization requires infusing domain knowledge and structure. I will present recent success stories in using AI to obtain 1000x speedups in solving PDEs and in quantum chemistry calculations.

Principal Member of Technical Staff at the Computer Science Research Institute, Sandia National Laboratories

A Layer-Parallel Approach for Training Deep Neural Networks

Abstract: Deep neural networks are a powerful machine learning tool with the capacity to “learn” complex nonlinear relationships described by large data sets. Despite their success, training these models remains a challenging and computationally intensive undertaking. In this talk we will present a layer-parallel training algorithm that exploits a multigrid scheme to accelerate both forward and backward propagation. Introducing a parallel decomposition between layers requires inexact propagation of the neural network. The multigrid method used in this approach stitches these subdomains together with sufficient accuracy to ensure rapid convergence. We demonstrate an order-of-magnitude wall-clock speedup over the serial approach, opening a new avenue for parallelism that is complementary to existing approaches. In a more recent development, we apply the layer-parallel methodology to recurrent neural networks. In particular, we study the gated recurrent unit (GRU) architecture and demonstrate its relation to a simple ODE formulation that facilitates application of the layer-parallel approach. Results demonstrating performance improvements on a human activity recognition (HAR) data set are presented.
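To make the layer-parallel idea concrete, the sketch below (a minimal illustration under standard assumptions, not the speaker's implementation; the multigrid/MGRIT machinery itself is omitted) shows forward propagation of a residual network written as forward-Euler integration of an ODE. This serial sweep over layers is exactly the structure that the multigrid scheme replaces with parallel, inexact propagation plus coarse-grid correction.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, h = 32, 16, 0.1                       # layers, width, "time" step (illustrative)
W = rng.standard_normal((L, d, d)) / np.sqrt(d)

def forward(x):
    # ResNet forward propagation as forward-Euler integration of the ODE
    # x'(t) = tanh(W(t) x(t)); each layer is one serial time step. The
    # layer-parallel method treats the layer index as time and applies a
    # multigrid-in-time iteration, so blocks of layers are propagated
    # inexactly in parallel and then stitched together.
    for l in range(L):
        x = x + h * np.tanh(W[l] @ x)
    return x

print(forward(np.ones(d))[:3])
```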

The Walter E. Koss Professor and a Distinguished Professor of Mathematics at Texas A&M University

Thoughts on Deep Learning

Abstract: This talk will take a mathematical view of the problem of learning a function from data and then examine how this view fits into our current knowledge of deep learning.

Associate Professor of Computer Science at Columbia University

Computational Lower Bounds for Tensor PCA

Abstract: Tensor PCA is a model statistical inference problem introduced by Montanari and Richard in 2014 for studying method-of-moments approaches to parameter estimation in latent variable models. Unlike the matrix counterpart of the problem, Tensor PCA exhibits a computational-statistical gap in the sample-size regime where the problem is information-theoretically solvable but no computationally efficient algorithm is known. I will describe unconditional computational lower bounds on classes of algorithms for solving Tensor PCA that shed light on limitations of commonly-used solution approaches, including gradient descent and power iteration, as well as the role of overparameterization. This talk is based on joint work with Rishabh Dudeja.
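For concreteness, the standard rank-one spiked tensor model (stated here as background; the talk's conventions may differ) observes a k-th order tensor

    T = \lambda\, v^{\otimes k} + G,

where v \in R^n is an unknown unit vector, \lambda > 0 is the signal-to-noise ratio, and G is a tensor with i.i.d. standard Gaussian entries; the goal is to estimate v. Estimation is information-theoretically possible once \lambda is of order \sqrt{n}, while all known polynomial-time algorithms require \lambda of order roughly n^{k/4}; this is the computational-statistical gap referred to above.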

The Charles Pitts Robinson and John Palmer Barstow Professor of Applied Mathematics and Engineering, Brown University

Approximating Functions, Functionals, and Operators Using Deep Neural Networks for Diverse Applications

Abstract: We will present a new approach to developing a data-driven, learning-based framework for predicting outcomes of physical systems and for discovering hidden physics from noisy data. We will introduce a deep learning approach based on neural networks (NNs) and generative adversarial networks (GANs). Unlike other approaches that rely on big data, here we “learn” from small data by exploiting the information provided by physical conservation laws, which are used to obtain informative priors or to regularize the neural networks; this is the basis of physics-informed neural networks (PINNs). We will demonstrate the power of PINNs for several inverse problems, and we will demonstrate how we can use multi-fidelity modeling in monitoring ocean acidification levels in the Massachusetts Bay.

We will also introduce new NNs that learn functionals and nonlinear operators from functions and corresponding responses for system identification. The universal approximation theorem for operators is suggestive of the potential of NNs in learning from scattered data any continuous operator or complex system. We first generalize the theorem to deep neural networks, and subsequently we apply it to design a new composite NN with small generalization error, the deep operator network (DeepONet), consisting of a NN for encoding the discrete input function space (the branch net) and another NN for encoding the domain of the output functions (the trunk net). We demonstrate that DeepONet can learn various explicit operators, e.g., integrals, Laplace transforms, and fractional Laplacians, as well as implicit operators that represent deterministic and stochastic differential equations. More generally, DeepONet can learn multiscale operators spanning many scales, trained by diverse sources of data simultaneously.
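As a concrete illustration of the branch/trunk structure described above, here is a minimal DeepONet sketch; the layer widths, activations, and sensor count are illustrative assumptions, not the configuration from the talk.

```python
import torch
import torch.nn as nn

class DeepONet(nn.Module):
    """Minimal DeepONet sketch: the branch net encodes the input function u
    sampled at m fixed sensor locations; the trunk net encodes a query point y.
    The prediction G(u)(y) is the inner product of the two embeddings."""
    def __init__(self, m_sensors=100, y_dim=1, p=64, width=128):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Linear(m_sensors, width), nn.Tanh(), nn.Linear(width, p))
        self.trunk = nn.Sequential(
            nn.Linear(y_dim, width), nn.Tanh(), nn.Linear(width, p))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, u_samples, y):
        b = self.branch(u_samples)        # (batch, p): encodes the function u
        t = self.trunk(y)                 # (batch, p): encodes the query point y
        return (b * t).sum(dim=-1, keepdim=True) + self.bias

net = DeepONet()
u = torch.randn(8, 100)   # 8 input functions, each sampled at 100 sensors
y = torch.rand(8, 1)      # one query location per function
print(net(u, y).shape)    # torch.Size([8, 1])
```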

Professor in the Department of Applied Mathematics and the School of Mechanical Engineering, and Director of the Data Science Consulting Service, Purdue University

Scalable Algorithms for Bayesian Deep Learning via Stochastic Gradient Monte Carlo and Beyond

Abstract: Replica exchange Monte Carlo (reMC), also known as parallel tempering, is an important technique for accelerating the convergence of conventional Markov chain Monte Carlo (MCMC) algorithms. However, the method requires evaluating the energy function on the full dataset and is therefore not scalable to big data, and a naïve mini-batch implementation of reMC introduces large biases; as a result, it cannot be directly extended to stochastic gradient MCMC (SGMCMC), the standard method for sampling from deep neural networks (DNNs). In this talk, we propose an adaptive replica exchange SGMCMC (reSGMCMC) that automatically corrects the bias, and we study its properties. The analysis implies an acceleration-accuracy trade-off in the numerical discretization of a Markov jump process in a stochastic environment. Empirically, we test the algorithm through extensive experiments on various setups and obtain state-of-the-art results on CIFAR-10, CIFAR-100, and SVHN in both supervised and semi-supervised learning tasks.
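For background, the sketch below shows plain replica exchange on top of Langevin dynamics with exact energies on a toy double-well. The talk's contribution, the adaptive bias correction that makes the swap rule valid with noisy mini-batch energy estimates, is deliberately not shown, since it is the subject of the work.

```python
import numpy as np

rng = np.random.default_rng(0)

def U(theta):                     # toy double-well energy
    return (theta**2 - 1.0)**2

def gradU(theta):
    return 4.0 * theta * (theta**2 - 1.0)

T1, T2 = 0.1, 1.0                 # low (exploit) and high (explore) temperatures
temps = np.array([T1, T2])
eta = 1e-3                        # step size
theta = np.array([1.0, -1.0])     # one chain per temperature

for step in range(10_000):
    # Langevin update for each replica at its own temperature
    noise = rng.standard_normal(2)
    theta = theta - eta * gradU(theta) + np.sqrt(2 * eta * temps) * noise
    # Attempt a temperature swap every 100 steps (Metropolis rule)
    if step % 100 == 0:
        log_a = (1/T1 - 1/T2) * (U(theta[0]) - U(theta[1]))
        if np.log(rng.uniform()) < log_a:
            theta = theta[::-1].copy()
```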

Associate Professor of Computer Science, Statistics, and Computational and Applied Mathematics, The University of Chicago

Equivariance and Fourier Space Neural Networks

Abstract: Equivariance to the action of symmetry groups has emerged as one of the fundamental design principles of neural networks for modeling physical systems. In this talk I will review this emerging field, highlighting how much of the same machinery can be used across the board, from equivariance to continuous Lie groups, such as the Euclidean group of rotations and translations, to finite groups, such as the symmetric group acting on the vertices of a graph. Specifically, after introducing the formalism of noncommutative Fourier transforms, I will argue for neural networks that operate in the Fourier domain (with respect to the symmetry group in question) and show how certain concepts from representation theory allow us to implement both the equivariant linear operations and the nonlinearities in Fourier space.
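As background (the standard definition, not a statement specific to this talk): a map \phi between spaces on which a group G acts through representations \rho_{in} and \rho_{out} is equivariant if

    \phi(\rho_{in}(g)\, x) = \rho_{out}(g)\, \phi(x)  for all g \in G,

i.e., transforming the input is equivalent to transforming the output. For graphs, G is the symmetric group S_n permuting vertices; for physical systems in R^3, G is typically the Euclidean group of rotations and translations.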

Assistant Professor of Electrical and Computer Engineering and Computer Science at Princeton University

On the Foundations of Deep Learning: SGD, Overparametrization, and Generalization

Abstract: We provide new results on the effectiveness of SGD and overparametrization in deep learning.

a) SGD: We show that SGD converges to stationary points for general nonsmooth, nonconvex functions, and that stochastic subgradients can be efficiently computed via automatic differentiation. For smooth functions, we show that gradient descent, coordinate descent, ADMM, and many other algorithms avoid saddle points and converge to local minimizers. For a large family of problems, including matrix completion and shallow ReLU networks, this guarantees that gradient descent converges to a global minimum.

b) Overparametrization: We show that gradient descent finds global minimizers of the training loss of overparametrized deep networks in polynomial time.

c) Generalization: For general neural networks, we establish a margin-based theory. The minimizer of the cross-entropy loss with weak regularization is a max-margin predictor, and enjoys stronger generalization guarantees as the amount of overparametrization increases.

d) Algorithmic and Implicit Regularization: We analyze the implicit regularization effects of various optimization algorithms on overparametrized networks. In particular, we prove that for least squares solved with mirror descent, the algorithm converges to the solution closest to the initialization in Bregman divergence. For linearly separable classification problems, we prove that steepest descent with respect to a norm solves the SVM problem with respect to the same norm. For overparametrized nonconvex problems such as matrix sensing or neural networks with quadratic activations, we prove that gradient descent converges to the minimum nuclear norm solution, which allows for both meaningful optimization and generalization guarantees.
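The mirror descent claim in d) can be illustrated numerically. The sketch below is a toy instance with the entropy mirror map (exponentiated gradient); the problem sizes and step size are arbitrary choices, not taken from the talk. On an underdetermined least-squares problem, the theory predicts convergence to the interpolating solution closest to the initialization in the corresponding Bregman divergence, here the KL divergence.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 20))        # underdetermined system: many solutions
x_true = np.abs(rng.standard_normal(20))
b = A @ x_true

# Mirror descent with the entropy mirror map psi(x) = sum_i x_i log x_i,
# i.e. exponentiated gradient on f(x) = ||Ax - b||^2 / 2. Iterates stay
# positive, and the implicit-regularization result says the limit is the
# solution of Ax = b closest to x0 in KL (Bregman) divergence.
x = np.ones(20)                         # x0: uniform initialization
eta = 1e-2
for _ in range(200_000):
    grad = A.T @ (A @ x - b)
    x = x * np.exp(-eta * grad)         # mirror step: update in grad-psi space

print(np.linalg.norm(A @ x - b))        # small: x interpolates the data
```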

Professor of Mathematics, Chemistry, and Physics at Duke University

A Priori Error Analysis for Solving High-Dimensional PDEs based on Neural Networks

Abstract: The numerical solution of high-dimensional PDEs has been one of the central challenges in scientific computing due to the curse of dimensionality. In recent years, we have seen tremendous progress in applying neural networks to solve high-dimensional PDEs, while the analysis of such methods is still lacking. In this talk, we will discuss some of these numerical methods for high-dimensional PDEs and their challenges, as well as some initial attempts at numerical analysis for high-dimensional elliptic PDEs and eigenvalue problems (based on joint works with Ziang Chen, Yulong Lu, and Min Wang).

Professor of Electrical Engineering, Statistics, and Mathematics at Stanford University

Minimum Complexity Interpolation in Random Features Models

Abstract: Despite their many appealing properties, kernel methods are heavily affected by the curse of dimensionality. For instance, in the case of inner product kernels in R^d, the Reproducing Kernel Hilbert Space (RKHS) norm is often very large for functions that depend strongly on a small subset of directions (ridge functions); correspondingly, such functions are difficult to learn using kernel methods. This observation has motivated the study of generalizations of kernel methods in which the RKHS norm (equivalent to a weighted \ell_2 norm) is replaced by a weighted functional \ell_p norm, which we refer to as the F_p norm. Unfortunately, the tractability of these approaches is unclear: the kernel trick is not available, and minimizing these norms requires solving an infinite-dimensional convex problem. We study random features approximations to these norms and show that, for p > 1, the number of random features required to approximate the original learning problem is upper bounded by a polynomial in the sample size. Hence, learning with F_p norms is tractable in these cases. We introduce a proof technique based on uniform concentration in the dual, which may be of broader interest in the study of overparametrized models. Joint work with Michael Celentano and Theodor Misiakiewicz.
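In the standard setup behind this abstract (my paraphrase; the talk's conventions may differ slightly), one represents functions as f(x) = \int \sigma(\langle w, x \rangle)\, a(w)\, \mu(dw) for a feature map \sigma and a probability measure \mu over weights, and defines

    \|f\|_{F_p} = \inf \{ \|a\|_{L^p(\mu)} : f = \int \sigma(\langle w, \cdot \rangle)\, a(w)\, \mu(dw) \}.

Taking p = 2 recovers the RKHS norm of the associated kernel; since \|a\|_{L^1(\mu)} \le \|a\|_{L^2(\mu)} for a probability measure \mu, smaller p yields a strictly richer function class, which is what makes ridge functions with huge RKHS norm accessible.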

Assistant Professor of Mathematics and Statistics, University of California, Los Angeles

Geometry of Linear Convolutional Networks

Abstract: We study the geometry of linear convolutional neural networks. Whereas the sets of functions represented by dense linear networks can be characterized by rank constraints, the restricted connectivity and shared weights of convolutional layers can give rise to inequality constraints among the entries of the represented matrices. We investigate the optimization of an objective function over such constrained sets of functions, looking at the number of critical points, critical points induced by the parametrization, and invariants of the gradient flow. This is joint work with Kathlén Kohn, Thomas Merkh, and Matthew Trager.
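A one-line numerical check of the underlying parametrization (a generic illustration, not code from the paper): composing linear convolutional layers yields a single convolution whose filter is the convolution of the layer filters, and the image of this multiplication map is the constrained set of functions whose geometry the talk studies.

```python
import numpy as np

# Two linear convolutional layers compose into one convolution whose filter
# is the convolution of the layer filters; the set of end-to-end maps is
# therefore a constrained subset of all linear maps.
rng = np.random.default_rng(0)
w1, w2 = rng.standard_normal(3), rng.standard_normal(3)
x = rng.standard_normal(10)

y_composed = np.convolve(np.convolve(x, w1), w2)   # layer-by-layer
y_merged = np.convolve(x, np.convolve(w1, w2))     # single merged filter
print(np.allclose(y_composed, y_merged))           # True
```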

Tan Chin Tuan Centennial Professor of Mathematics at the National University of Singapore

Deep Approximation via Deep Learning

Abstract: The primary task of many applications is approximating or estimating a function from samples drawn from a probability distribution on the input space. Deep approximation approximates a function by compositions of many layers of simple functions, which can be viewed as a series of nested feature extractors. The key idea of a deep learning network is to convert the layers of compositions into layers of tunable parameters that can be adjusted through a learning process, so that the network achieves a good approximation with respect to the input data. In this talk, we shall discuss the mathematical theory behind this new approach and the approximation rates of deep networks; we will also show how this new approach differs from classic approximation theory, and how the new theory can be used to understand and design deep learning networks.
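Schematically, assuming the standard feed-forward parametrization, the deep approximation of a target f takes the form

    f(x) \approx (T_L \circ T_{L-1} \circ \cdots \circ T_1)(x),  where T_i(x) = \sigma(W_i x + b_i),

with each layer T_i a simple affine-plus-activation map and the weights (W_i, b_i) the tunable parameters adjusted during learning.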

Associate Professor of Electrical Engineering at Columbia University

Deep Networks and the Multiple Manifold Problem

Abstract: Data with low-dimensional nonlinear structure are ubiquitous in engineering and scientific problems. We study a model problem with such structure: a binary classification task that uses a deep fully-connected neural network to classify data drawn from two disjoint smooth curves on the unit sphere. Aside from mild regularity conditions, we place no restrictions on the configuration of the curves. We prove that when (i) the network depth is large relative to certain geometric properties that set the difficulty of the problem and (ii) the network width and number of samples are polynomial in the depth, randomly-initialized gradient descent quickly learns to correctly classify all points on the two curves with high probability. To our knowledge, this is the first generalization guarantee for deep networks with nonlinear data that depends only on intrinsic data properties. Our analysis draws on ideas from harmonic analysis and martingale concentration to handle statistical dependencies in the initial (random) network. We sketch applications to invariant vision and to gravitational wave astronomy, where leveraging low-dimensional structure leads to statistically optimal tests for identifying signals in noise. Joint work with Sam Buchanan, Dar Gilboa, Tim Wang, and Jingkai Yan.

The Verne M. Willaman Professor of Mathematics at the Pennsylvania State University

Training and Analysis of Numerical PDEs by Neural Networks

Abstract: Recently, deep neural networks have dramatically improved the state of the art in difficult machine learning tasks, most notably computer vision and natural language processing. This has generated great interest in using deep neural networks for other tasks, such as scientific computing. A major emerging research problem is to determine whether and how deep neural networks improve upon traditional approaches, such as finite element methods, for problems in scientific computing. When DNNs are used to solve partial differential equations (PDEs), stochastic gradient descent (SGD) methods and their variants have been the natural choice for the underlying constrained optimization problem in most of the existing literature. Beyond the lack of theoretical justification, one big drawback of this type of optimization algorithm is its poor accuracy: it is extremely challenging to use SGD-type algorithms to obtain numerical solutions for which any reasonable asymptotic convergence rate can be observed empirically. In this talk, we show that the constrained optimization problem arising from solving PDEs with neural networks can be solved efficiently using a class of greedy algorithms instead of stochastic gradient descent. The error arising from discretizing the energy integrals is bounded in both the deterministic and the stochastic case, and we provide an a priori error analysis for methods that solve PDEs using neural networks. The greedy algorithm is tested on several benchmark examples in both 1D and 2D, confirming the error analysis.
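To illustrate the flavor of greedy training, here is a minimal one-dimensional sketch of an orthogonal-greedy fit with a ReLU dictionary. The plain L2 regression target standing in for a PDE energy functional, the random finite dictionary, and the selection rule are all simplifying assumptions, not the algorithm from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 400)
f = np.sin(2 * np.pi * x)               # target (stand-in for a PDE solution)

def relu_neuron(w, b):
    return np.maximum(w * x + b, 0.0)

# Orthogonal greedy sketch: at each step pick the dictionary element most
# correlated with the current residual, then refit all coefficients by
# least squares on the selected neurons.
selected = []
residual = f.copy()
for step in range(30):
    # random finite candidate dictionary (in practice the inner
    # maximization over neurons is solved more carefully)
    cands = [relu_neuron(w, b) for w, b in rng.uniform(-3, 3, (500, 2))]
    best = max(cands, key=lambda g: abs(residual @ g) / (np.linalg.norm(g) + 1e-12))
    selected.append(best)
    G = np.stack(selected, axis=1)
    coef, *_ = np.linalg.lstsq(G, f, rcond=None)
    residual = f - G @ coef

print(np.linalg.norm(residual) / np.linalg.norm(f))  # relative L2 error
```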

Professor of Mathematics at Stanford University

The Sobolev Regularization Effect of Stochastic Gradient Descent

Abstract: We explore the multiplicative structure of the parameters and input data in the first layer of neural networks to build a connection between the landscape of the loss function with respect to the parameters and the landscape of the model function with respect to the input data. Through this connection, it is shown that flat minima regularize the gradient of the model function, which explains the good generalization performance of flat minima. We then go beyond flatness and consider higher-order moments of the gradient noise, and show by a linear stability analysis of SGD around global minima that Stochastic Gradient Descent (SGD) tends to impose constraints on these moments. Together with the multiplicative structure, we identify the Sobolev regularization effect of SGD: SGD regularizes the Sobolev seminorms of the model function with respect to the input data. Finally, bounds on the generalization error and adversarial robustness are provided for solutions found by SGD under assumptions on the data distribution. Joint work with Chao Ma.
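To make the "multiplicative structure" concrete (my gloss on a standard computation, not a statement from the talk): if the first layer computes z = W x, then

    \partial f / \partial W = (\partial f / \partial z)\, x^T  and  \nabla_x f = W^T (\partial f / \partial z),

so the loss landscape in the first-layer weights and the model's sensitivity to its inputs share the common factor \partial f / \partial z. Flatness with respect to W at a minimum therefore controls \nabla_x f, which is precisely the kind of first-order Sobolev seminorm bound described above.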

Chancellor's Professor of Statistics, Electrical Engineering, and Computer Sciences at the University of California, Berkeley

Interpreting Deep Neural Networks

Abstract: Recent deep learning models have achieved impressive predictive performance by learning complex functions of many variables, often at the cost of interpretability. We will discuss our recent work aiming to interpret neural networks by attributing importance to features and feature interactions for individual predictions. Importantly, the proposed method (agglomerative contextual decomposition, or ACD) disentangles the importance of features in isolation from the interactions between groups of features. These attributions yield insights across domains, including NLP and computer vision, and can be used to directly improve generalization in interesting ways.


We focus on scientific machine learning problems. In cosmology, it is crucial to interpret how a model trained on simulations predicts fundamental cosmological parameters. By extending ACD to interpret transformations of input features, we vet the model by analyzing attributions in the frequency domain. Furthermore, we have recently developed a method of adaptive wavelet distillation (AWD) that can be both predictive (possibly with improvements over DNNs) and interpretable for cosmological parameter prediction. If time permits, we will demonstrate how AWD works similarly in a molecular partner prediction problem in cell biology.


Paper links: hierarchical interpretations (ICLR 2019), interpreting transformations in cosmology (ICLR workshop 2020), penalizing explanations (ICML 2020)