## Nov. 23rd - 24th, 2020

## Workshop on Seeking Low-dimensionality in Deep Neural Networks (SLowDNN)

**Attend: Zoom Meeting (link will be sent to registered participants)**

*RSVP via* **Google Form**

### (The workshop has ended; please refer to https://www.youtube.com/playlist?list=PL7P834wcUN3eSh7Nn4wzPBjEnme4Rls7n for all the talk videos.)

# Overview

### The resurgence of deep neural networks has led to revolutionary success across almost all areas of engineering and science. However, despite recent endeavors, the underlying principles behind this success remain a mystery. On the other hand, connections between deep neural networks and low-dimensional models emerge at multiple levels:

- The structural connection between deep neural networks and sparsifying algorithms has been well observed and acknowledged in the literature, and it has transformed the way we solve inverse problems with intrinsic low-dimensionality.

- Low-dimensional modeling has recently emerged as a common testbed for understanding generalization, (implicit) regularization, expressivity, and robustness in over-parameterized deep learning models.

- Various theoretical and numerical evidence suggests that enforcing certain isometry properties within the network often leads to improved training, generalization, and robustness.

- Low-dimensional priors learned through deep networks have demonstrated significantly improved performance over traditional methods in signal processing and machine learning.

### Given these exciting yet underexplored connections, this two-day workshop aims to bring together experts in machine learning, applied mathematics, signal processing, and optimization to share recent progress and foster collaborations on the mathematical foundations of deep learning. We would like to stimulate vibrant discussions toward bridging the gap between the theory and practice of deep learning by developing more principled and unified mathematical frameworks based on the theory and methods for learning low-dimensional models in high-dimensional spaces.
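The first connection above can be made concrete: one iteration of ISTA for sparse coding is an affine map followed by a soft-thresholding nonlinearity, so truncating it to K iterations yields a K-layer network (the idea behind unrolled architectures such as LISTA). A minimal NumPy sketch, with problem sizes chosen only for illustration:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 -- a two-sided shifted ReLU,
    playing the role of the network's nonlinearity."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(A, y, lam, n_layers=200):
    """Each ISTA iteration = one 'layer': an affine map of x followed by
    the soft-thresholding nonlinearity."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_layers):
        x = soft_threshold(x - step * A.T @ (A @ x - y), step * lam)
    return x

# Toy sparse-recovery instance
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100)) / np.sqrt(50)
x_true = np.zeros(100)
x_true[rng.choice(100, size=5, replace=False)] = rng.standard_normal(5)
y = A @ x_true
x_hat = ista(A, y, lam=0.01)
```

Fixing the number of layers and learning the affine maps from data, instead of deriving them from A, is exactly what turns this iteration into a trainable network.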

# Speakers

CMU, ECE

UT Austin, ECE

NYU Courant & CDS

UMD, CS

### Young Researchers

# Schedule

**All times are in Eastern Time (EST).**

## Day 1 (Monday, Nov. 23rd)

### Session 1 (8:50 am - 12:00 pm, EST) - Moderator: Qing Qu

### 9:00 am - 10:00 am

**Optimal Transport, CycleGAN, and AdaIN for Unsupervised Learning in Inverse Problems**

Deep learning approaches have been extensively studied as fast, high-performance alternatives for inverse problems. However, most existing deep learning works require matched reference data for supervised training, which are often difficult to obtain in inverse problems. To address this, we first derive a novel cycleGAN framework from an optimal transport formulation of unsupervised learning problems, by simultaneously minimizing the statistical distances in both the measurement and image domains. One of the most important advantages of this formulation is that, depending on the knowledge of the forward operator, distinct variations of the cycleGAN architecture can be derived: for example, one with two pairs of generators and discriminators, and another with only a single pair of generator and discriminator. Even in the two-generator case, we show that structural knowledge of the forward operator can lead to a simpler generator architecture, which significantly simplifies the neural network training. Furthermore, to overcome the limitation of the standard cycleGAN framework with two deep generators, we present a novel tunable cycleGAN architecture using a single generator. In particular, the single generator is implemented using adaptive instance normalization (AdaIN) layers so that the baseline generator can be switched between the forward and inverse generators by simply changing the AdaIN code. Thanks to the shared baseline network, the additional memory requirement and weight increase can be minimized, and training can be done more stably even with small training data. Furthermore, by interpolating the AdaIN codes between the two domains, we can investigate continuous variation of the network output along the optimal transport path. We also present various real-world applications such as CT, MRI, ultrasound, phase retrieval, etc.

### (KAIST)

### 10:00 am - 11:00 am

**Towards a Better Global Landscape of GANs: How 2 Lines of Code Change Makes a Difference**


In this talk, we present a global landscape analysis of GANs (generative adversarial nets). We prove that a class of separable GANs (SepGANs), including the popular JS-GAN and hinge-GAN, have exponentially many bad strict local minima which, moreover, correspond to mode-collapse patterns. We prove that relativistic pairing GANs (RpGANs) have no bad basins. RpGAN can be viewed as an unconstrained variant of W-GAN: RpGAN keeps the "pairing" idea of W-GAN, but adds an upper-bounded shell function (e.g., the logistic loss in JS-GAN). The empirical benefit of RpGANs has been demonstrated by practitioners in, e.g., ESRGAN and RealnessGAN. We demonstrate that the better landscape of RpGANs makes a difference in practice with a 2-line code change. We predict that RpGANs have a bigger advantage over SepGANs for high-resolution data (e.g., LSUN 256×256), imbalanced data, and narrower nets (1/2 or 1/4 width), and our experiments on real data verify these predictions.

### (UIUC, ISE)
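The difference between a separable loss and the relativistic pairing loss can be seen directly on discriminator scores. A NumPy sketch with a logistic shell function (the pairing of real and fake samples within a batch is assumed):

```python
import numpy as np

def softplus(t):
    """Numerically stable log(1 + exp(t))."""
    return np.maximum(t, 0.0) + np.log1p(np.exp(-np.abs(t)))

def sep_gan_d_loss(real_scores, fake_scores):
    """Separable (e.g. JS-GAN) discriminator loss: each real and each
    fake score enters the loss on its own."""
    return np.mean(softplus(-real_scores)) + np.mean(softplus(fake_scores))

def rp_gan_d_loss(real_scores, fake_scores):
    """Relativistic pairing loss: only the *difference* of paired
    real/fake scores enters -- the essence of the 2-line change."""
    return np.mean(softplus(-(real_scores - fake_scores)))
```

Note that the pairing loss is invariant to shifting both score vectors by the same constant, while the separable loss is not; this reflects the "pairing" idea inherited from W-GAN.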

### 11:00 am - 12:00 pm

**Deep Networks from First Principles**


In this talk, we offer an entirely “white box” interpretation of deep (convolutional) networks. In particular, we show how modern deep architectures, linear (convolution) operators and nonlinear activations, and parameters of each layer can be derived from the principle of rate reduction (and invariance). All layers, operators, and parameters of the network are explicitly constructed via forward propagation, instead of learned via back propagation. All components of such a network have precise optimization, geometric, and statistical meaning. There are also several nice surprises from this principled approach that shed new light on fundamental relationships between forward (optimization) and backward (variation) propagation, between invariance and sparsity, and between deep networks and Fourier analysis.

### Lunch Break (12:00 pm - 1:00 pm, EST)

### Session 2 (1:00 pm - 4:30 pm, EST) - Moderator: Chong You

### 1:00 pm - 2:00 pm

**Early-Learning Regularization Prevents Memorization of Noisy Labels**


In this talk we will discuss the problem of classification in the presence of noisy annotations. When trained on noisy labels, deep neural networks have been observed to first fit the training data with clean labels during an early learning phase, before eventually memorizing the examples with false labels. We show that early learning and memorization are fundamental phenomena in high-dimensional classification tasks, proving that they occur even in simple linear models. Motivated by these findings, we develop a new technique for noisy classification tasks, which exploits the progress of the early learning phase.

### (NYU Courant & CDS)
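As a rough sketch of how such a technique can exploit the early learning phase (the exact form here follows my reading of the ELR paper and should be treated as an assumption, not the authors' reference implementation): maintain a running average of each example's softmax prediction and penalize the current prediction for drifting away from it, anchoring the model to its early-phase outputs.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class EarlyLearningRegularizer:
    """Keep a running average t_i of each example's softmax prediction and
    penalize log(1 - <p_i, t_i>): the penalty grows as the current
    prediction p_i drifts away from the averaged (early-phase) one."""

    def __init__(self, n_samples, n_classes, beta=0.7, lam=3.0):
        self.targets = np.full((n_samples, n_classes), 1.0 / n_classes)
        self.beta, self.lam = beta, lam

    def __call__(self, idx, logits):
        p = softmax(logits)
        # temporal ensembling of predictions
        self.targets[idx] = self.beta * self.targets[idx] + (1 - self.beta) * p
        inner = np.sum(p * self.targets[idx], axis=-1)
        return self.lam * np.mean(np.log(1.0 - inner + 1e-8))

reg = EarlyLearningRegularizer(n_samples=1, n_classes=3)
idx = np.array([0])
for _ in range(5):
    penalty_consistent = reg(idx, np.array([[8.0, 0.0, 0.0]]))
penalty_drift = reg(idx, np.array([[0.0, 8.0, 0.0]]))  # sudden label flip
```

In training, this penalty would be added to the cross-entropy loss; predictions consistent with the early phase are rewarded, while late-phase drift toward noisy labels is penalized.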

### 2:00 pm - 3:00 pm

**Preconditioning Helps: Faster Convergence in Statistical and Reinforcement Learning**


Exciting progress has been made in demystifying the efficacy of vanilla gradient methods in solving challenging nonconvex problems in statistical estimation and machine learning. In this talk, we discuss how the trick of preconditioning further boosts the convergence speed with minimal computation overhead through two examples: low-rank matrix estimation in statistical learning and policy optimization in entropy-regularized reinforcement learning. For low-rank matrix estimation, we present a new algorithm, called scaled gradient descent, that achieves linear convergence at a rate independent of the condition number of the low-rank matrix at near-optimal sample complexities for a variety of tasks. For policy optimization, we develop the first fast non-asymptotic convergence guarantee for entropy-regularized natural policy gradient methods in the tabular setting for discounted Markov decision processes. By establishing its global linear convergence at a near dimension-free rate, we provide theoretical footing for the empirical success of entropy-regularized natural policy gradient methods. Based on arXiv:2005.08898 and arXiv:2007.06558.

### (CMU, ECE)
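For the low-rank estimation part, the preconditioning amounts to right-multiplying each factor's gradient by the inverse Gram matrix of the other factor, which removes the dependence on the condition number of the matrix being estimated. A sketch on a fully observed, ill-conditioned toy instance (following arXiv:2005.08898; sizes, step size, and initialization are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 20, 2

# Rank-r ground truth with condition number 100
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
Y = U @ np.diag([100.0, 1.0]) @ V.T

# Balanced factors of Y, slightly perturbed so the iteration is nontrivial
L = U * np.sqrt([100.0, 1.0]) + 0.01 * rng.standard_normal((n, r))
R = V * np.sqrt([100.0, 1.0]) + 0.01 * rng.standard_normal((n, r))

eta = 0.5
for _ in range(100):
    residual = L @ R.T - Y
    # ScaledGD: precondition each factor's gradient by the inverse Gram
    # matrix of the other factor; the local convergence rate then no
    # longer depends on the condition number of Y
    L_new = L - eta * residual @ R @ np.linalg.inv(R.T @ R)
    R = R - eta * residual.T @ L @ np.linalg.inv(L.T @ L)
    L = L_new

rel_err = np.linalg.norm(L @ R.T - Y) / np.linalg.norm(Y)
```

Dropping the two inverse-Gram factors recovers vanilla gradient descent, whose step size and rate degrade with the condition number (here 100).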

### 3:00 pm - 4:00 pm

### Panel Discussion

**Panelists:** *Yuejie Chi, Tom Goldstein, Ambar Pal, Rene Vidal*


### Panel Moderator:

## Day 2 (Tuesday, Nov. 24th)

### Session 3 (9:00 am - 12:00 pm, EST) - Moderator: Zhihui Zhu

### 9:00 am - 10:00 am

**Dual Principal Component Pursuit**


We consider the problem of learning a union of subspaces from data corrupted by outliers. Classical approaches based on convex optimization are designed for subspaces whose dimension is small relative to the ambient dimension. Here we propose a non-convex approach called Dual Principal Component Pursuit (DPCP), which is specifically designed for subspaces whose dimension is close to the ambient dimension. We provide theoretical guarantees under which every global solution to DPCP is a vector in the orthogonal complement to one of the subspaces. We also study various optimization algorithms for solving the DPCP problem and analyze their convergence to a global minimum. Experiments show that the proposed methods are able to handle more outliers and higher relative dimensions than state-of-the-art methods.

### (JHU, BME & MINDS)
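In its basic form, DPCP seeks a unit vector b minimizing the l1 norm of the data projected onto b; its global solutions are orthogonal to an inlier subspace. A projected-subgradient sketch on a toy instance, with the hyperplane taken as z = 0 without loss of generality (step sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# 500 inliers on the hyperplane z = 0 (normal vector e3), 100 outliers
inliers = rng.standard_normal((500, 3))
inliers[:, 2] = 0.0
outliers = rng.standard_normal((100, 3))
X = np.vstack([inliers, outliers])
X /= np.linalg.norm(X, axis=1, keepdims=True)   # normalize each data point

def dpcp_psgm(X, iters=100, mu0=1e-3, decay=0.9):
    """Projected subgradient method for min ||X b||_1 s.t. ||b|| = 1."""
    # Spectral initialization: direction of least variance in the data
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    b = Vt[-1]
    mu = mu0
    for _ in range(iters):
        g = X.T @ np.sign(X @ b)    # subgradient of b -> ||X b||_1
        b = b - mu * g
        b /= np.linalg.norm(b)      # project back onto the sphere
        mu *= decay                 # geometrically decaying step size
    return b

b = dpcp_psgm(X)
alignment = abs(b[2])               # |<b, e3>|; 1 means exact recovery
```

Inliers contribute nothing to the objective at the true normal (their projections vanish), which is why the minimizer points orthogonal to the inlier hyperplane even with many outliers.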

### 10:00 am - 11:00 am

**Neural Networks at Finite Width and Large Depth**


Deep neural networks are often considered to be complicated "black boxes," for which a full systematic analysis is not only out of reach but also impossible. In this talk, which is based on ongoing joint work with Daniel Adam Roberts and Sho Yaida, I will make the opposite claim. Namely, that deep neural networks at initialization are exactly solvable models. Our approach applies to networks at finite width n and large depth L, the regime in which they are used in practice. A key point will be the emergence of a notion of "criticality," which involves a fine-tuning of model parameters (weight and bias variances). At criticality, neural networks are particularly well-behaved but still exhibit a tension between large values of n and L, with large values of n tending to make neural networks more like Gaussian processes and large values of L amplifying higher cumulants. Our analysis has a range of consequences, both for understanding feature learning and for optimal settings of hyperparameters such as weight/bias variances and learning rates.

### (Princeton U, ORFE)

### 11:00 am - 12:00 pm

**Deep Networks and the Multiple Manifold Problem**


We study the multiple manifold problem, a binary classification task modeled on applications in machine vision, in which a deep fully-connected neural network is trained to separate two low-dimensional submanifolds of the unit sphere. We provide an analysis of the one-dimensional case, proving for a simple manifold configuration that when the network depth *L* is large relative to certain geometric and statistical properties of the data, the network width *n* grows as a sufficiently large polynomial in *L*, and the number of i.i.d. samples from the manifolds is polynomial in *L*, randomly-initialized gradient descent rapidly learns to classify the two manifolds perfectly with high probability. Our analysis demonstrates concrete benefits of depth and width in the context of a practically-motivated model problem: the depth acts as a fitting resource, with larger depths corresponding to smoother networks that can more readily separate the class manifolds, and the width acts as a statistical resource, enabling concentration of the randomly-initialized network and its gradients. The argument centers around the “neural tangent kernel” of Jacot et al. and its role in the nonasymptotic analysis of training overparameterized neural networks; to this literature, we contribute essentially optimal rates of concentration for the neural tangent kernel of deep fully-connected networks, requiring width *n* ≳ *L* poly(*d*_0) to achieve uniform concentration of the initial kernel over a *d*_0-dimensional submanifold of the unit sphere S^(*n*_0−1), and a nonasymptotic framework for establishing generalization of networks trained in the “NTK regime” with structured data. The proof makes heavy use of martingale concentration to optimally treat statistical dependencies across layers of the initial random network. This approach should be of use in establishing similar results for other network architectures.

Joint work with Sam Buchanan and Dar Gilboa

### (Columbia U, EE & DSI)

### Lunch Break (12:00 pm - 1:00 pm, EST)

### Session 4 (1:00 pm - 4:30 pm, EST) - Moderator: Atlas Wang

### 1:00 pm - 2:00 pm

**Deep Generative Models and Inverse Problems**

Modern deep generative models like GANs, VAEs and invertible flows are achieving amazing results on modeling high-dimensional distributions, especially for images. We will show how they can be used to solve inverse problems by generalizing compressed sensing beyond sparsity, extending the theory of Restricted Isometries to sets created by generative models. We will present the general framework, new results and open problems in this space.

### (UT Austin, ECE)
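Recovery with a generative prior replaces "x is sparse" with "x = G(z) for some latent z", and minimizes ||A G(z) − y||² over z. A toy sketch in which a fixed linear map stands in for a trained generator (a real G would be a GAN or VAE decoder; sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 100, 40, 5        # ambient dim, #measurements, latent dim

W = rng.standard_normal((n, k))                 # toy "generator": G(z) = W z
A = rng.standard_normal((m, n)) / np.sqrt(m)    # Gaussian measurement matrix

z_true = rng.standard_normal(k)
x_true = W @ z_true
y = A @ x_true              # m < n: underdetermined in x, but not in z

# Gradient descent on g(z) = 0.5 * ||A G(z) - y||^2: the search space is
# the k-dimensional range of the generator, not all of R^n
AW = A @ W
eta = 1.0 / np.linalg.norm(AW, 2) ** 2
z = np.zeros(k)
for _ in range(500):
    z -= eta * AW.T @ (AW @ z - y)

rel_err = np.linalg.norm(W @ z - x_true) / np.linalg.norm(x_true)
```

With a nonlinear trained generator the objective becomes nonconvex in z, which is where the restricted-isometry-style analysis over the range of G comes in.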

### 2:00 pm - 3:00 pm

### Seeking High Dimensionality in Neural Networks: a Scientific Perspective on How High-dimensional Parameter Spaces Enable Generalization of Low-dimensional Image Manifolds

This talk will dissect the role of dimensionality in neural networks using scientific and empirical methods. First, I'll discuss the role of low dimensionality in learning, and present evidence for low dimensionality in complex image sets. We obtain this evidence by applying dimensionality estimators to sets of natural images (e.g., ImageNet) and discovering that their apparent dimensionality is surprisingly low. Furthermore, we observe that more data is needed to learn image manifolds of higher dimensionality. Second, I'll discuss the role of high-dimensionality in learning. Using visualization and empirical studies, I'll discuss the sharp/flat hypothesis, and present evidence that the curse of dimensionality in high dimensional parameter spaces plays an important role in network generalization.

### (UMD, CS)

### 3:00 pm - 4:30 pm

### Young Researcher Spotlight Talks

**Talk 1:** Optimization of Deep Networks from an Infinite-Depth Point of View

**Speaker:** Yiping Lu, Stanford University

**Time:** 3:00 pm - 3:10 pm

**Abstract:** Neural networks have become state-of-the-art models in numerous machine learning tasks, and strong empirical performance is often achieved by deeper networks. One landmark example is the residual network (ResNet), which can be efficiently optimized even at extremely large depths such as 1000 layers. However, there exists a gap between this empirical success and the theoretical understanding: ResNets can be trained to almost zero loss with standard stochastic gradient descent, yet it is known that larger depth leads to an increasingly non-convex landscape even in the presence of residual connections. In this talk we consider an ODE as the infinite-depth limit of the neural network and give a mean-field analysis of training the infinite-depth network. With the ODE analysis, we exploit the structure of the optimization objective encountered with deep neural networks. The new point of view enables us to develop a fast NN-specific algorithm for adversarial training, called YOPO, which reduces the total number of full forward and backward propagations to only one for each group of adversary updates. This talk is based on the speaker’s work published at NeurIPS 2019 and ICML 2020.

**Talk 2:** A Banach Space Representer Theorem for Single-Hidden Layer Neural Networks

**Speaker:** Rahul Parhi, University of Wisconsin - Madison

**Time:** 3:10 pm - 3:20 pm

**Abstract:** We develop a variational framework to understand the properties of the functions learned by neural networks fit to data. We propose and study a family of continuous-domain linear inverse problems with total variation-like regularization in the Radon domain subject to data fitting constraints. We derive a representer theorem showing that finite-width, single-hidden layer neural networks are solutions to these inverse problems. The representer theorem is reminiscent of the classical reproducing kernel Hilbert space representer theorem, but we show that the neural network problem is posed over a non-Hilbertian Banach space. While the learning problems are posed in the continuous-domain, similar to kernel methods, the problems can be recast as finite-dimensional neural network training problems. These neural network training problems have regularizers which are related to the well-known weight decay and path-norm regularizers. Thus, our result gives insight into functional characteristics of trained neural networks and also into the design of neural network regularizers.

**Talk 3:** Distributional Generalization: A New Kind of Generalization

**Speaker:** Yamini Bansal, Harvard University

**Time:** 3:20 pm - 3:30 pm

**Abstract:** We introduce a new notion of generalization -- Distributional Generalization -- which roughly states that outputs of a classifier at train and test time are close *as distributions*, as opposed to close in just their average error. For example, if we mislabel 30% of dogs as cats in the train set of CIFAR-10, then a ResNet trained to interpolation will in fact mislabel roughly 30% of dogs as cats on the *test set* as well, while leaving other classes unaffected. This behavior is not captured by classical generalization, which would only consider the average error and not the distribution of errors over the input domain. Our formal conjectures, which are much more general than this example, characterize the form of distributional generalization that can be expected in terms of problem parameters: model architecture, training procedure, number of samples, and data distribution. We give empirical evidence for these conjectures across a variety of domains in machine learning, including neural networks, kernel machines, and decision trees. Our results thus advance our empirical understanding of interpolating classifiers. Based on the paper https://arxiv.org/abs/2009.08092. Joint work with Preetum Nakkiran.

**Talk 4:** Provable Representation Learning

**Speaker:** Qi Lei, Princeton University

**Time:** 3:30 pm - 3:40 pm

**Abstract:** Modern machine learning models are transforming applications in various domains at the expense of a large amount of hand-labeled data. In contrast, humans and animals first establish their concepts or impressions from data observations. The learned concepts then help them to learn specific tasks with minimal external instructions. Deep representation learning seeks a similar procedure: 1) learn a data representation that forms a concise impression of the data; 2) then transfer the data representation to downstream tasks with few labeled samples and simple models. In this talk, we study two forms of representation learning: supervised pre-training from multiple tasks and self-supervised learning. With supervised pre-training, we prove that it can pool the data from all source tasks to learn a good representation that transfers to downstream tasks (possibly with covariate shift) with few labeled examples. For self-supervised learning, we prove that under an approximate conditional independence assumption, it learns representations that linearly separate downstream targets. For both frameworks, representation learning drastically reduces sample complexity for downstream tasks.

**Talk 5:** Learned Generative Priors for Compressive Phase Retrieval

**Speaker:** Oscar Leong, Rice University

**Time:** 3:40 pm - 3:50 pm

**Abstract:** Deep generative models, such as Generative Adversarial Networks (GANs), have quickly become the state-of-the-art for natural image generation, leading to new approaches for enforcing structural priors in a variety of inverse problems. In contrast to traditional approaches enforcing sparsity, GANs provide a low-dimensional parameterization of the natural signal manifold, allowing for signal recovery to be posed as a direct optimization problem whose search space is restricted to the range of the GAN. Moreover, the dimensionality of this parameterization can be lower than that of the sparsity level of a particular signal class. In this talk, we investigate the use of GAN priors in a nonlinear inverse problem known as compressive phase retrieval, which asks to recover a signal from undersampled quadratic measurements. We show that a subgradient descent algorithm on a nonsmooth, nonconvex least squares problem can provably recover a signal in the range of a generative model given an optimal number of measurements with respect to the input dimension of the generator. This overcomes a notorious theoretical bottleneck in phase retrieval, where the best known efficient algorithms under a sparsity prior exhibit a sub-optimal quadratic sample complexity. We corroborate these theoretical results with empirics showing that exploiting generative priors in phase retrieval tasks can significantly outperform sparsity priors.

**Talk 6:** Explicit and Implicit Regularization in Overparameterized Least Squares Regression

**Speaker:** Denny Wu, University of Toronto

**Time:** 3:50 pm - 4:00 pm

**Abstract:** We study the generalization properties of the generalized ridge regression estimator in the overparameterized regime. We derive the exact prediction risk (generalization error) in the proportional asymptotic limit, which allows us to rigorously characterize the surprising empirical observation that the optimal ridge regularization strength can be *negative*, as well as the benefit of weighted regularization. We then connect the ridgeless limit of this estimator to the implicit bias of preconditioned gradient descent (e.g., natural gradient descent). We compare the generalization performance of first- and second-order optimizers, and identify different factors that affect this comparison. Our theoretical finding also aligns with empirical observation in various neural network experiments.

**Talk 7:** Understanding Deep Architectures with Reasoning Layer

**Speaker:** Xinshi Chen, Georgia Institute of Technology

**Time:** 4:00 pm - 4:10 pm

**Abstract:** Recently, there has been a surge of interest in combining deep learning models with reasoning in order to handle more sophisticated learning tasks. In many cases, a reasoning task can be solved by an iterative algorithm. This algorithm is often unrolled, and used as a specialized layer in the deep architecture, which can be trained end-to-end with other neural components. Although such hybrid deep architectures have led to many empirical successes, the theoretical foundation of such architectures, especially the interplay between algorithm layers and other neural layers, remains largely unexplored. In this paper, we take an initial step towards an understanding of such hybrid deep architectures by showing that properties of the algorithm layers, such as convergence, stability, and sensitivity, are intimately related to the approximation and generalization abilities of the end-to-end model. Furthermore, our analysis matches closely our experimental observations under various conditions, suggesting that our theory can provide useful guidelines for designing deep architectures with reasoning layers.

**Talk 8:** A Game Theoretic Analysis of Additive Adversarial Attacks and Defenses

**Speaker:** Ambar Pal, Johns Hopkins University

**Time:** 4:10 pm - 4:20 pm

**Abstract:** Research in adversarial learning follows a cat and mouse game between attackers and defenders where attacks are proposed, they are mitigated by new defenses, and subsequently new attacks are proposed that break earlier defenses, and so on. However, it has remained unclear as to whether there are conditions under which no better attacks or defenses can be proposed. In this talk, I will describe a game-theoretic framework for studying attacks and defenses which exist in equilibrium. Under a locally linear decision boundary model for the underlying binary classifier, we will see that the Fast Gradient Method attack and a Randomized Smoothing defense form a Nash Equilibrium. We will see that this equilibrium defense can be approximated given finitely many samples from a data-generating distribution, and look at a generalization bound for the performance of this approximation.
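For intuition about the attack side of this equilibrium: the fast gradient method perturbs an input by a fixed-size step along the normalized input-gradient of the loss. A sketch on a linear logistic classifier (weights and ε are arbitrary illustrative values):

```python
import numpy as np

def logistic_loss(w, x, y):
    """y in {-1, +1}; loss = log(1 + exp(-y * w.x)), computed stably."""
    t = -y * (w @ x)
    return np.maximum(t, 0.0) + np.log1p(np.exp(-np.abs(t)))

def fgm_attack(w, x, y, eps):
    """Fast gradient method: move eps along the normalized input-gradient
    of the loss; for a linear model the direction is parallel to -y * w."""
    g = -y * w / (1.0 + np.exp(y * (w @ x)))    # d loss / d x
    return x + eps * g / np.linalg.norm(g)

w = np.array([2.0, -1.0, 0.5])
x = np.array([0.3, 0.2, -0.1])
y = 1.0
x_adv = fgm_attack(w, x, y, eps=0.5)
clean_loss = logistic_loss(w, x, y)
adv_loss = logistic_loss(w, x_adv, y)
```

For a linear decision boundary the gradient direction is independent of x, which is part of what makes the equilibrium analysis against a smoothing defense tractable.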

### Talk Moderator:

# Organizers

# Sponsors

## The workshop is sponsored by the IEEE Computational Imaging Technical Committee

*This workshop is a special event of the **Signal Processing And Computational imagE formation (SPACE)** webinar.*