Nov. 22nd - 23rd, 2021

The 2nd Workshop on Seeking Low-dimensionality in Deep Neural Networks (SLowDNN)

The recorded talks will be uploaded to YouTube soon. In the meanwhile, you can find most of them via the link: https://mediaspace.msu.edu/media/t/1_gk1dmedc, https://mediaspace.msu.edu/media/t/1_upd7be8m

Overview

The resurgence of deep neural networks has led to revolutionary success across almost all the areas in engineering and science. However, despite recent endeavors, the underlying principles behind its success still remain a mystery. On the other hand, the connections between deep neural networks and low dimensional models emerge at multiple levels:

The structural connection between a deep neural network and a sparsifying algorithm has been well observed and acknowledged in the literature, which has also transformed the ways that we are solving inverse problems with intrinsic low-dimensionality.
Low-dimensional modeling has recently been shown as a commonly used testbed for understanding generalization, (implicit) regularization, expressivity, and robustness in over-parameterized deep learning models. For example, the learned representations of deep networks often possess certain benign low-dimensional structures, leading to better generalization and robustness.
Various theoretical and numerical evidence supports that enforcing certain isometry properties within the network often leads to improved performance for both training, generalization, and robustness.
Low-dimensional priors learned through deep networks demonstrated significantly improved performances over traditional methods in signal processing and machine learning.

Given these exciting, while less exploited connections, this two-day workshop aims to bring together experts in machine learning, applied mathematics, signal processing, and optimization, and share recent progress, and foster collaborations on mathematical foundations of deep learning. We would like to stimulate vibrate discussions towards bridging the gap between the theory and practice of deep learning by developing a more principled and unified mathematical framework based on the theory and methods for learning low-dimensional models in high-dimensional space.

Invited Speakers

Joan Bruna

NYU, CS

David Donoho

Stanford, Stats

Yonina Eldar

Weizmann Institute of Science

Daniel Hsu

Columbia, CS

Paul Hand

Northeastern, Math

Sijia Liu

Michigan State, CS

Wenjing Liao

Georgia Tech, Math

Stéphane Mallat

Collège de France

Yi Ma

UC Berkeley, EECS

Mert Pilanci

Stanford, EE

Saiprasad Ravishankar

Michigan State, CMSE

Dror Simon

Technion, CS

Yuting Wei

UPenn, Stats

Zhihui Zhu

Denver, ECE

Tentative Schedule (Subject to Change)

All time based on Eastern Time (EST)

Day 1 (Monday, Nov. 22nd)

Session 1 (8:30 am - 12:40 pm, EST) - Moderator: Qing Qu

8:50 am - 9:00 am

Opening Remark

9:00 am - 10:00 am

On the Approximation Power of Two-layer Networks of Random ReLUs

How well can depth-two ReLU networks with random bottom-level weights represent simple functions? We give near-matching upper- and lower-bounds for $L_2$-approximation in terms of the Lipschitz constant, the desired accuracy, and the dimension of the problem, as well as similar results in terms of Sobolev norms. Our positive results employ tools from harmonic analysis and ridgelet representation theory, while our lower-bounds are based on (robust versions of) dimensionality arguments. Joint work with Clayton Sanford, Rocco Servedio, and Emmanouil-Vasileios Vlatakis-Gkaragkounis.

Daniel Hsu

(Columbia, CS)

10:00 am - 10:40 am

Regression and Doubly Robust Off-policy Learning on Low-dimensional Manifolds by Neural Networks

Many data in real-world applications are in a high-dimensional space but exhibit low-dimensional structures. In mathematics, these data can be modeled as random samples on a low-dimensional manifold. Our goal is to estimate a target function or learn an optimal policy using neural networks. This talk is based on an efficient approximation theory of deep ReLU networks for functions supported on a low-dimensional manifold. We further establish the sample complexity for regression and off-policy learning with finite samples of data. When data are sampled on a low-dimensional manifold, the sample complexity crucially depends on the intrinsic dimension of the manifold instead of the ambient dimension of the data. These results demonstrate that deep neural networks are adaptive to low-dimensional geometric structures of data sets. This is a joint work with Minshuo Chen, Haoming Jiang, Hao Liu, Tuo Zhao at Georgia Institute of Technology.

Wenjing Liao

(Georgia Tech, Math)

10:40 am - 11:00 am

Coffee Break

11:00 am - 12:00 pm

On the Role of Data Structure in High-dimensional Learning

High-dimensional learning remains an outstanding task where empirical successes often coexist alongside mathematical and statistical curses. In this talk, we will describe two vignettes of this tension that underscore the importance of distributional assumptions. First, we will describe the role of invariance and symmetry priors in a non-parametric learning setup, by studying the gains in sample complexity brought by incorporating these priors into the learning model. Next, we will describe the role of data structure on the computational side, by studying computational-to-statistical gaps arising in the seemingly simple problem of learning a single neuron.

Joint work with Alberto Bietti and Luca Venturi (first part), and Min Jae Song and Ilias Zadik (second part).

Joan Bruna

(NYU, CS & CDS)

12:00 pm - 12:40 pm

Advancing Algorithmic Foundation of Robust Deep Learning Through The Lens of Bi-Level Optimization

Adversarial training (AT) has become a widely recognized defense mechanism to improve the robustness of deep neural networks against adversarial attacks. However, it is often difficult to scale and is tied to a fixed attack-defense setup. Beyond AT, this talk will foster a technological breakthrough for robust deep learning (DL) through the lens of bi-level optimization (BLO). First, I will show that BLO covers AT as a special case. One can even prove that AT is equivalent to the linearized BLO along the direction given by the sign of input gradient. Second, with the aid of BLO, I will introduce a new systematic, theoretically-grounded, and scalable AT framework, termed bi-level AT (BAT), built upon implicit gradient theory. In contrast to AT, it has the least restriction to attack-defense setup and accuracy-robustness tradeoff. Third, I will empirically demonstrate the effectiveness of BAT in tackling various image classification tasks. Lastly, I will conclude this talk with discussion and open challenges remained in the field.

Sijia Liu

(Michigan State, CSE)

Lunch Break (12:40 pm - 1:30 pm, EST)

Session 2 (1:30 pm - 5:30 pm, EST) - Moderator: Chong You

1:30 pm - 2:30 pm

Closed-Loop Data Transcription via Minimaxing Rate Reduction

This work proposes a new computational framework for learning an explicit generative model for real-world datasets. More specifically, we propose to learn a closed-loop transcription between a multi-class multi-dimensional data distribution and a linear discriminative representation (LDR) in the feature space that consists of multiple independent linear subspaces. We argue that the optimal encoding and decoding mappings sought can be formulated as the equilibrium point of a two-player minimax game between the encoder and decoder. A natural utility function for this game is the so-called rate reduction, a simple information-theoretic measure for distances between mixtures of subspace-like Gaussians in the feature space. Our formulation draws inspiration from closed-loop error feedback from control systems and avoids expensive evaluating and minimizing approximated distances between arbitrary distributions in either the data space or the feature space. To a large extent, this new formulation unifies the concepts and benefits of Auto-Encoding and GAN and naturally extends them to the settings of learning a both discriminative and generative representation for multi-class and multi-dimensional real-world data. Our extensive experiments on many benchmark imagery datasets demonstrate tremendous potential of this new closed-loop formulation: we notice that the so learned features of different classes are explicitly mapped onto approximately independent principal subspaces in the feature space; and diverse visual attributes within each class are modeled by the independent principal components within each subspace. This work opens many deep mathematical problems regarding learning submanifolds in high-dimensional spaces as well as suggests potential computational mechanisms about how memory can be formed through a purely internal close-loop process.

Yi Ma

(UC Berkeley, EECS)

2:30 pm - 3:10 pm

Signal Recovery with Generative Priors

Recovering images from very few measurements is an important task in imaging problems. Doing so requires assuming a model of what makes some images natural. Such a model is called an image prior. Classical priors such as sparsity have led to the speedup of Magnetic Resonance Imaging in certain cases. With the recent developments in machine learning, neural networks have been shown to provide efficient and effective priors for inverse problems arising in imaging. In this talk, we will discuss the use of neural network generative models for inverse problems in imaging. We will present a rigorous recovery guarantee at optimal sample complexity for compressed sensing and other inverse problems under a suitable random model. We will see that generative models enable an efficient algorithms for phase retrieval and spiked matrix recovery from generic measurements with optimal sample complexity. In contrast, no efficient algorithm is known for these problem in the case of sparsity priors. We will discuss strengths, weaknesses, and future opportunities of neural networks and generative models as image priors. These works are in collaboration with Vladislav Voroninski, Reinhard Heckel, Ali Ahmed, Wen Huang, Oscar Leong, Jorio Cocola, Muhammad Asim, and Max Daniels.

Paul Hand

(NEU, Math)

3:10 pm - 3:30 pm

Coffee Break

3:30 pm - 4:30 pm

Panel Discussion

Panelists: Yi Ma, Joan Bruna, Saiprasad Ravishankar, John Wright

Panel Moderator:

Jeremias Sulam

(JHU, BME)

Day 2 (Tuesday, Nov. 23rd)

Session 3 (9:00 am - 12:40 pm, EST) - Moderator: Zhihui Zhu

9:00 am - 10:00 am

Model Based Deep Learning: Applications to Imaging and Communications

Deep neural networks provide unprecedented performance gains in many real-world problems in signal and image processing. Despite these gains, the future development and practical deployment of deep networks are hindered by their black-box nature, i.e., a lack of interpretability and the need for very large training sets. On the other hand, signal processing and communications have traditionally relied on classical statistical modeling techniques that utilize mathematical formulations representing the underlying physics, prior information and additional domain knowledge. Simple classical models are useful but sensitive to inaccuracies and may lead to poor performance when real systems display complex or dynamic behavior. Here we introduce various approaches to model based learning which merge parametric models with optimization tools leading to efficient, interpretable networks from reasonably sized training sets. We will consider examples of such model-based deep networks to image deblurring, image separation, super resolution in ultrasound and microscopy, efficient communications systems, and finally we will see how model-based methods can also be used for efficient diagnosis of COVID19 using X-ray and ultrasound.

Yonina Eldar

(Weizmann Institute of Science)

10:00 am - 11:00 am

Concentration with Renormalisation Group and Convolution Nets

Deep neural network performances seem to rely on concentration phenomena, in supervised and unsupervised applications. Estimating the Gibbs energy of a high-dimensional probability distributions is unstable, specially near phase transitions. We revisit this topic with low-dimensional models resulting from multiscale harmonic analysis and neural networks. We show that renormalisation group calculations in wavelet bases amounts to precondition the estimation. Stable estimations are shown on the phi^4 model and weak lensing Cosmological data. Multiscale models of turbulences are also computed. It amounts to implement a deep network, whose filters are wavelets. Similar results are obtained for image classification. ResNet accuracy is reached on ImageNet with a deep network with wavelet filters, by only learning 1x1 convolutional kernels across channels.

Stéphane Mallat

(Collège de France)

11:00 am - 11:10 am

Coffee Break

11:10 am - 11:50 am

When and How can Deep Generative Models be Inverted?

Recently, deep generative models have been used as signal priors for solving various inverse problems. In this talk, we aim to study the invertibility conditions of such models and derive practical and theoretically certified algorithms for such an inversion. Building upon sparse representation theory, we define conditions that rely on the cardinalities of the hidden layers and introduce several inversion pursuit algorithms for inverting generative networks of arbitrary depth. This is a joint work with Aviad Aberdam and Michael Elad.

Dror Simon

(Technion, CS)

12:00 pm - 12:40 pm

Learning Regularizers for Inverse Problems

In this talk, we present two approaches to learning regularizers for inverse problems in imaging, particularly sparsity-based and deep network-based regularizers.

First, we present a method for supervised learning of sparsity-promoting (\ell_1 norm) regularizers, where the parameters of the regularizer are learned to minimize reconstruction error on a paired training set. Training involves a challenging bilevel optimization problem with a nonsmooth lower-level objective. We derive an expression for the gradient of the training loss using the implicit closed-form solution of the lower-level variational problem given by its dual problem, and provide an accompanying gradient descent algorithm (dubbed BLORC) to minimize the loss. Our experiments on 1D signals and natural images show that the gradient computation is efficient and the proposed method learns meaningful operators for signal reconstruction.

In the second part of the talk, we present an approach for unified supervised-unsupervised (SUPER) learning of regularizers that combines classical model-based image reconstruction (MBIR) optimization and unsupervised transform learning together with recent supervised deep learning in a common framework. For the unsupervised part, a union of sparsifying transforms is pre-learned to cluster CT image patches into multiple groups, with a specific transform well-matched to each group. We provide multiple interpretations of the proposed unified framework from a fixed point iteration analysis or bilevel training optimization standpoint. We show that the unified approach provides much better image reconstructions in low-dose X-ray computed tomography than the constitutent models, with limited training data.

Saiprasad Ravishankar

(Michigan State, CMSE)

Lunch Break (12:40 pm - 1:30 pm, EST)

Session 4 (1:30 pm - 5:10 pm, EST) - Moderator: Atlas Wang

1:30 pm - 2:30 pm

TBA

Dave Donoho

(Stanford, Stats)

2:30 pm - 3:10 pm

A Geometric Analysis of Neural Collapse with Unconstrained Features

Neural collapse is an intriguing empirical phenomenon that Papyan, Han, and Donoho recently discovered. This phenomenon implies that (i) the class means and the last-layer classifiers all collapse to the vertices of a Simplex Equiangular Tight Frame (ETF) up to scaling, and (ii) cross-example within-class variability of last-layer activations collapses to zero.

In this talk, we will study the problem based on a simplified unconstrained feature model. In this context, we show that the classical cross-entropy loss with weight decay has a benign global landscape, in the sense that the only global minimizers are the Simplex ETFs while all other critical points are strict saddles whose Hessian exhibit negative curvature directions. We also exploit these findings to improve training efficiency: we can set the feature dimension equal to the number of classes and fix the last-layer classifier to be a Simplex ETF for network training, reducing memory cost by over 20% on ResNet18 without sacrificing the generalization performance. We will also discuss similar results for the mean squared error (MSE) loss. (Joint work with Tianyu Ding, Xiao Li, Qing Qu, Jeremias Sulam, Chong You, and Jinxin Zhou.)

Zhihui Zhu

(Denver, EE)

3:10 pm - 3:30 pm

Coffee Break

3:30 pm - 4:15 pm

Minimum L1-norm Interpolators: Precise Asymptotics and Multiple Descent

An evolving line of machine learning works observes empirical evidence that suggests interpolating estimators --- the ones that achieve zero training error --- may not necessarily be harmful. In this talk, we pursue theoretical understanding for an important type of interpolators: the minimum $\ell_{1}$-norm interpolator, which is motivated by the observation that several learning algorithms favor low $\ell_1$-norm solutions in the over-parameterized regime. Concretely, we consider the noisy sparse regression model under Gaussian design, focusing on linear sparsity and high-dimensional asymptotics (so that both the number of features and the sparsity level scale proportionally with the sample size).

We observe, and provide rigorous theoretical justification for, a curious \emph{multi-descent} phenomenon; that is, the generalization risk of the minimum $\ell_1$-norm interpolator undergoes multiple (and possibly more than two) phases of descent and ascent as one increases the model capacity. This phenomenon stems from the special structure of the minimum $\ell_1$-norm interpolator as well as the delicate interplay between the over-parameterized ratio and the sparsity, thus unveiling a fundamental distinction in geometry from the minimum $\ell_2$-norm interpolator. Our finding is built upon an exact characterization of the risk behavior, which is governed by a system of two non-linear equations with two unknowns.

Yuting Wei

(UPenn, Stats)

4:15 pm - 5:00 pm

The Hidden Convex Optimization Landscape of Deep Neural Networks

The popularity of Deep Neural Networks (DNNs) continues to grow as a result of the great empirical success in a large number of machine learning tasks. However, despite their prevalence in machine learning and the dramatic surge of interest, there are major gaps in our understanding of the fundamentals of neural net models. Understanding the mechanism behind their extraordinary generalization properties remains an open problem. A significant challenge arises in the non-convexity of training DNNs. In non-convex optimization, the choice of optimization method and its internal parameters such as initialization, mini-batching and step sizes have a considerable effect on the quality of the learned model. This is in sharp contrast to convex optimization problems, where these optimization parameters have no effect, and globally optimal solutions can be obtained in a very robust, efficient, transparent and reproducible manner.

In this talk, we introduce exact convex optimization formulations of multilayer neural network training problems. We show that two and three layer neural networks with ReLU or polynomial activations can be globally trained via convex programs with the number of variables polynomial in the number of training samples and number of hidden neurons. Our results provide an equivalent characterization of neural networks as convex models where a mixture of locally linear models are fitted to the data with sparsity inducing convex regularization. Moreover, we show that certain standard two and three layer convolutional neural networks can be globally optimized in fully polynomial time. We discuss extensions to batch normalization and generative adversarial networks. Finally, we present numerical simulations verifying our claims and illustrating that standard local search heuristics such as stochastic gradient descent can be inefficient compared to the proposed convex program.

Mert Pilanci

(Stanford EE)

5:00 pm - 5:10 pm

Closing Remarks

Organizers

(University of Michigan)

(Michigan State University)

(Johns Hopkins University)

(University of Texas at Austin)

(University of Denver)

(Google Research)

(Carnegie Mellon University)

(University of California at Berkeley)

Co-Sponsors

More To be added