Contributed posters

Observational Overfitting in Reinforcement Learning, Xingyou Song (Google Brain); YiDing Jiang (Google); Yilun Du (MIT); Behnam Neyshabur (Google)

A major component of overfitting in model-free reinforcement learning (RL) involves the case where the agent mistakenly correlates reward with spurious features of the observations generated by the Markov Decision Process (MDP). We provide a general framework for analyzing this scenario, which we use to design multiple synthetic benchmarks by modifying only the observation space of an MDP. We term this failure mode, in which an agent overfits to different observation spaces even though the underlying MDP dynamics are unchanged, observational overfitting. Our experiments expose intriguing properties, especially with regard to implicit regularization, and also corroborate results from previous work in RL generalization and supervised learning (SL).
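
A minimal sketch of the kind of benchmark this framework suggests (not the paper's actual suite): a gym observation wrapper that leaves the MDP dynamics untouched but appends spurious features that are constant within an episode, so an agent can latch onto them. The wrapper name and feature scheme are illustrative assumptions.

```python
import numpy as np
import gym

class SpuriousObservationWrapper(gym.ObservationWrapper):
    """Append nuisance features without changing the underlying dynamics."""

    def __init__(self, env, num_spurious=8, seed=0):
        super().__init__(env)
        # Fixed random projection of a per-episode "level id" -> spurious features.
        self.projection = np.random.RandomState(seed).randn(num_spurious)
        self.level_id = 0.0
        low = np.concatenate([env.observation_space.low, np.full(num_spurious, -np.inf)])
        high = np.concatenate([env.observation_space.high, np.full(num_spurious, np.inf)])
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float64)

    def reset(self, **kwargs):
        self.level_id = np.random.rand()  # features vary across episodes, not within
        return super().reset(**kwargs)

    def observation(self, obs):
        return np.concatenate([obs, self.level_id * self.projection])

env = SpuriousObservationWrapper(gym.make("CartPole-v1"))
```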


A Reparameterization-Invariant Flatness Measure for Deep Neural Networks, Henning Petzka (Lund University); Linara Adilova (Fraunhofer IAIS); Michael Kamp (University of Bonn); Cristian Sminchisescu (Lund University, Google Research)

The performance of deep neural networks is often attributed to their automated, task-related feature construction. It remains an open question, though, why this leads to solutions with good generalization, even in cases where the number of parameters is larger than the number of samples. Back in the 90s, Hochreiter and Schmidhuber observed that flatness of the loss surface around a local minimum correlates with low generalization error. For several flatness measures, this correlation has been empirically validated. However, it has recently been shown that existing measures of flatness cannot theoretically be related to generalization due to a lack of invariance with respect to reparameterizations. We propose a natural modification of existing flatness measures that results in invariance to reparameterization.


Mix & Match: training convnets with mixed image sizes for improved accuracy, speed and scale resiliency, Elad Hoffer (Habana Labs); Berry Weinstein (Habana Labs); Itay Hubara (Technion); Tal Ben-Nun (ETH Zurich); Torsten Hoefler (ETH Zurich); Daniel Soudry (Technion)

Convolutional neural networks (CNNs) are commonly trained using a fixed spatial image size predetermined for a given model. Although trained on images of a specific size, it is well established that CNNs can evaluate a wide range of image sizes at test time by adjusting the size of intermediate feature maps. In this work, we describe and evaluate a novel mixed-size training regime that mixes several image sizes at training time. We demonstrate that models trained using our method are more resilient to image size changes and generalize well even on small images. This allows faster inference by using smaller images at test time. For instance, we achieve 76.43% top-1 accuracy on ImageNet using ResNet50 with an image size of 160, matching the accuracy of the baseline model at half the computational cost. Furthermore, for a given image size used at test time, we show this method can either accelerate training or improve the final test accuracy. For example, we reach 79.27% accuracy with a model evaluated at a spatial size of 288, a relative improvement of 14% over the baseline.
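
A hedged sketch of the core training idea (the paper's full regime, with size schedules and batch adaptation, is more involved): resize each training batch to a randomly chosen spatial size, so the model never commits to a single resolution. The function name and size list are illustrative.

```python
import torch
import torch.nn.functional as F

def mixed_size_batches(loader, sizes=(128, 160, 192, 224)):
    """Yield training batches resized to a randomly chosen spatial size."""
    for images, labels in loader:
        size = sizes[torch.randint(len(sizes), (1,)).item()]
        images = F.interpolate(images, size=(size, size),
                               mode="bilinear", align_corners=False)
        yield images, labels

# usage: for x, y in mixed_size_batches(train_loader):
#            loss = criterion(model(x), y); loss.backward(); ...
```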


X-BERT: eXtreme Multi-label Text Classification using Bidirectional Encoder from Transformers, Wei-Cheng Chang (Carnegie Mellon University); Hsiang-Fu Yu (Amazon); Kai Zhong (Amazon); Yiming Yang (Carnegie Mellon University); Inderjit S. Dhillon (UT Austin & Amazon)

Extreme multi-label text classification (XMC) concerns tagging input text with the most relevant labels from an extremely large set. While the use of pretrained models such as BERT has achieved significant progress on many NLP tasks, including sentence classification with small label sets, there are several challenges in extending BERT to the XMC problem, such as (i) the difficulty of capturing dependencies or correlations among labels, and (ii) scalability to the extreme label setting because of the Softmax bottleneck. To overcome these challenges, we propose X-BERT, the first scalable solution for finetuning BERT models on the XMC problem. Specifically, X-BERT leverages both the labels and the input text to build label representations, which induce semantic label clusters that better model label dependencies. At the heart of X-BERT is a procedure to finetune BERT models to capture the contextual relations between input text and the induced label clusters. Finally, an ensemble of BERT models trained on heterogeneous label clusters yields our best final model, a state-of-the-art XMC method. In particular, on a Wiki dataset with around 0.5 million labels, X-BERT achieves a precision@1 of 67.87%, a substantial improvement over the neural baseline fastText and the state-of-the-art XMC approach Parabel, which achieve 32.58% and 60.91% precision@1, respectively.


Are Perceptually-Aligned Gradients a General Property of Robust Classifiers?, Jeremy Cohen (Carnegie Mellon University); Simran Kaur (Carnegie Mellon University); Zachary Lipton (Carnegie Mellon University)

Optimizing over the input pixels to a standard convolutional network to maximize the score of some target class generally produces a grainy-looking version of the original image. However, researchers have demonstrated that for adversarially-trained neural networks, this optimization produces images that uncannily resemble the target class. In this paper, we demonstrate that these perceptually-aligned gradients also occur under randomized smoothing, an alternative means of constructing adversarially-robust classifiers. Our finding suggests that perceptually-aligned gradients may be a general property of robust classifiers, and not a curious consequence of adversarial training. We hope that our results will inspire research aimed at explaining this link between perceptually-aligned gradients and adversarial robustness.
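
To make the experiment concrete, here is a minimal sketch (hyperparameters are illustrative) of maximizing a target-class score through a Gaussian-smoothed classifier: the gradient of the smoothed score is estimated by averaging gradients over noisy copies of the input.

```python
import torch

def smoothed_score_ascent(model, x, target, sigma=0.25, n=32, steps=100, lr=0.1):
    """Gradient ascent on a target-class score of a randomized-smoothing classifier.

    x: input image of shape (1, C, H, W); returns the optimized image.
    """
    x = x.clone().requires_grad_(True)
    for _ in range(steps):
        noise = sigma * torch.randn((n,) + x.shape[1:], device=x.device)
        score = model(x + noise)[:, target].mean()  # Monte Carlo smoothed score
        grad, = torch.autograd.grad(score, x)
        with torch.no_grad():
            x += lr * grad / (grad.norm() + 1e-12)  # normalized ascent step
    return x.detach()
```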


On Orthogonal Jacobian Regularization in Deep Neural Networks, Sang Keun Choe (Carnegie Mellon University); Hosan Jeong (HDXWILL); Jaime Carbonell (Carnegie Mellon University)

While the strong representational capacity of deep neural networks has led to significant advances in various fields, including computer vision and natural language processing, this advantage often comes with difficulties in optimization. In particular, the vanishing/exploding gradient problem complicates the training of deep neural networks, as backpropagation iteratively multiplies the error signal by contractive/expansive Jacobian matrices. In this work, we aim to mitigate this issue by proposing orthogonal Jacobian regularization, which encourages the Jacobian matrix of each layer to be norm-preserving, thereby stabilizing the norm of gradients throughout backpropagation. Furthermore, we theoretically show that the proposed regularization is closely related to minimizing a tied-weights autoencoder reconstruction loss for each layer, and empirically demonstrate that the latter can also effectively alleviate the vanishing/exploding gradient problem. Tested on the MNIST and CIFAR-100 image classification benchmarks, orthogonal Jacobian regularization consistently improves final classification accuracy and stabilizes the training of deep neural networks.
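
As a hedged illustration of the simplest special case: for a fully-connected layer the input-output Jacobian (ignoring the nonlinearity) is just the weight matrix W, so encouraging W Wᵀ ≈ I makes the layer approximately norm-preserving. The paper's regularizer targets layer Jacobians more generally; this sketch covers only linear layers.

```python
import torch
import torch.nn as nn

def orthogonality_penalty(model):
    """Sum of ||W W^T - I||_F^2 over all linear layers."""
    penalty = 0.0
    for m in model.modules():
        if isinstance(m, nn.Linear):
            gram = m.weight @ m.weight.t()
            eye = torch.eye(gram.shape[0], device=gram.device)
            penalty = penalty + ((gram - eye) ** 2).sum()
    return penalty

# usage: loss = task_loss + reg_coeff * orthogonality_penalty(model)
```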


Understanding 3D CNN Behavior for Alzheimer's Disease Diagnosis from Brain PET Scan, Jyoti Islam (Georgia State University); Yanqing Zhang (Georgia State University)

In recent years, Convolutional Neural Networks (CNNs) have demonstrated impressive performance in medical image analysis. However, there is a lack of clear understanding of why and how CNNs perform so well on image analysis tasks. How a CNN analyzes an image and discriminates among samples of different classes is usually considered opaque. As a result, it is difficult to apply CNN-based approaches in clinical procedures and automated disease diagnosis systems. In this paper, we address this issue by visualizing and understanding the decisions of a Convolutional Neural Network for Alzheimer's Disease (AD) diagnosis. We develop a 3D deep convolutional neural network for AD diagnosis using brain PET scans and propose using five visualization techniques - Sensitivity Analysis (Backpropagation), Guided Backpropagation, Occlusion, Brain Area Occlusion, and Layer-wise Relevance Propagation (LRP) - to understand the decisions of the CNN by highlighting the relevant areas in the PET data.
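
Of the five techniques, occlusion is the easiest to sketch. The following is a minimal, hedged version for volumetric inputs (not the authors' implementation): slide a filled cube over the volume and record how much the target-class score drops.

```python
import torch

def occlusion_sensitivity_3d(model, volume, target, patch=8, stride=8, fill=0.0):
    """Occlusion map for a 3D CNN; volume has shape (1, 1, D, H, W)."""
    model.eval()
    with torch.no_grad():
        base = model(volume)[0, target].item()
        _, _, D, H, W = volume.shape
        ds = range(0, D - patch + 1, stride)
        hs = range(0, H - patch + 1, stride)
        ws = range(0, W - patch + 1, stride)
        heat = torch.zeros(len(ds), len(hs), len(ws))
        for i, d in enumerate(ds):
            for j, h in enumerate(hs):
                for k, w in enumerate(ws):
                    occluded = volume.clone()
                    occluded[..., d:d+patch, h:h+patch, w:w+patch] = fill
                    heat[i, j, k] = base - model(occluded)[0, target].item()
    return heat  # large values mark regions the prediction relies on
```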


A Simple Dynamic Learning Rate Tuning Algorithm For Automated Training of DNNs, Koyel Mukherjee (IBM Research - India); Alind Khare (Georgia Institute of Technology); Yogish Sabharwal (IBM Research - India); Ashish Verma (IBM Research)

Training neural networks on image datasets generally requires extensive experimentation to find the optimal learning rate (LR) regime. In particular, for adversarial training or for training a newly synthesized model, the best LR regime is unknown. We propose an automated adaptive algorithm for determining the learning rate trajectory that works across datasets and models for both natural and adversarial training, without requiring any manual tuning. We theoretically discuss the algorithm's convergence and empirically validate it extensively.


Is Feature Diversity Necessary in Neural Networks Initialization?, Yaniv Blumenfeld (Technion); Daniel Soudry (Technion); Dar Gilboa (Columbia University)

Standard practice in training neural networks involves initializing the weights in an independent fashion. The results of recent work suggest that feature "diversity" at initialization plays an important role in training the network. However, other initialization schemes with reduced feature diversity have also been shown to be viable. In this work, we conduct a series of experiments aimed at elucidating the importance of feature diversity at initialization. Experimenting on a shallow network, we show that a complete lack of diversity is harmful to training, but its effect can be counteracted by a relatively small addition of noise. Furthermore, we construct a deep convolutional network with identical features at initialization that can be trained to reach accuracy matching its standard-initialized counterpart.
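
A sketch of a reduced-diversity initialization of the kind studied here (illustrative, not the paper's exact scheme): every output unit of a layer starts as the same feature plus a small amount of independent noise. With noise_std=0 the units are identical and, by symmetry, remain identical under full-batch gradient descent; a small noise_std breaks the tie.

```python
import torch
import torch.nn as nn

def init_identical_features(layer: nn.Linear, noise_std=1e-2):
    """Initialize all output units to one shared feature plus small noise."""
    with torch.no_grad():
        shared = torch.randn(1, layer.in_features) / layer.in_features ** 0.5
        layer.weight.copy_(shared.expand_as(layer.weight))
        layer.weight.add_(noise_std * torch.randn_like(layer.weight))
        if layer.bias is not None:
            layer.bias.zero_()
```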


A GAN based solver idea for derivative-free optimization problems, Hubert Ramsauer (LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Austria); Johannes Brandstetter (LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Austria); Michael Gillhofer (LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Austria); Sepp Hochreiter (LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Austria)

We propose a GAN-based approach for derivative-free function optimization. The idea is to formulate the optimization process as an adversarial game in which the generator proposes new samples and the discriminator assesses the quality of the samples with respect to the black-box function f. However, instead of attempting to approximate f directly, the discriminator only has to solve a binary classification task in the local region populated by the generated samples. We demonstrate the efficacy of our approach by applying it to an artificially generated topology optimization problem. We show that, despite not having access to derivatives of f, our method yields results similar to those of more traditional topology optimization methods. We hypothesize that our approach is a potential neural counterpart to gradient-free optimization methods, and we aim at a more thorough, theoretically grounded study.


Measuring Arithmetic Extrapolation Performance, Andreas Madsen (Computationally Demanding); Alexander Rosenberg Johansen (Technical University of Denmark)

The Neural Arithmetic Logic Unit (NALU) is a neural network layer that can learn exact arithmetic operations between the elements of a hidden state. The goal of the NALU is to learn perfect extrapolation, which requires learning the exact underlying logic of an unknown arithmetic problem. Evaluating the performance of the NALU is non-trivial, as one arithmetic problem may have many solutions. As a consequence, single-instance MSE has been used to evaluate and compare performance between models. However, it can be hard to interpret what magnitude of MSE represents a correct solution, and MSE obscures a model's sensitivity to initialization. We propose using a success criterion to measure if and when a model converges. With a success criterion we can summarize the success rate over many initialization seeds and calculate confidence intervals. We contribute a generalized version of the previous arithmetic benchmark to measure a model's sensitivity under different conditions. This is, to our knowledge, the first extensive evaluation of the convergence of the NALU and its sub-units. Using a success criterion to summarize 4800 experiments, we find that consistently learning arithmetic extrapolation is challenging, particularly for multiplication.
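
A minimal sketch of the evaluation protocol (the tolerance and interval method here are illustrative assumptions, not the paper's exact criterion): declare a seed successful if its extrapolation error falls below a threshold, then report the success rate with a Wilson score confidence interval.

```python
import numpy as np

def success_rate_ci(extrapolation_errors, tolerance=1e-5, z=1.96):
    """Success rate over seeds with a Wilson score confidence interval."""
    errors = np.asarray(extrapolation_errors)
    n = errors.size
    p = float((errors < tolerance).mean())
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, (center - half, center + half)

rate, (lo, hi) = success_rate_ci(np.random.rand(100) * 1e-4)
```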


Mode Connectivity and Sparse Neural Networks, Jonathan Frankle (MIT); Gintare Karolina Dziugaite (University of Cambridge & Element AI); Daniel M. Roy (University of Toronto); Michael Carbin (MIT)

We uncover a connection between two seemingly unrelated empirical phenomena: mode connectivity and sparsity. There is a growing catalog of situations where, across multiple runs, SGD learns weights that fall into minima that are connected (mode connectivity). A striking example is described by Nagarajan & Kolter (2019): they observe that test error on MNIST does not change along the linear path connecting the end points of two independent SGD runs that start from the same random initialization. On the other hand, there is the lottery ticket hypothesis of Frankle & Carbin (2019), whereby dense, randomly initialized networks have sparse subnetworks capable of training in isolation to full accuracy. However, neither phenomenon scales beyond small vision networks. We start by proposing a technique to find sparse subnetworks after initialization. We observe that these subnetworks train to full accuracy only when two SGD runs for the same subnetwork are connected by linear paths with no change in test error. Our findings connect the existence of sparse subnetworks that train to high accuracy on ImageNet with the dynamics of optimization via mode connectivity.
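
The linear mode connectivity check is easy to state in code. A hedged sketch: interpolate between the final weights of two runs and evaluate test error along the path; a flat curve means the solutions are linearly connected. Here `evaluate` is any callable mapping a model to test error.

```python
import copy
import torch

def linear_path_eval(model, state_a, state_b, evaluate, steps=11):
    """Evaluate test error along the linear path between two weight sets."""
    results = []
    for t in torch.linspace(0.0, 1.0, steps):
        interp = {k: (1 - t) * state_a[k] + t * state_b[k] for k in state_a}
        probe = copy.deepcopy(model)
        probe.load_state_dict(interp)
        results.append((t.item(), evaluate(probe)))
    return results
```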


Modelling the influence of data structure on learning in neural networks, Sebastian Goldt (Institut de Physique théorique, Paris); Marc Mézard (Ecole normale supérieure); Florent Krzakala (École Normale Supérieure); Lenka Zdeborova (CEA Saclay)

The lack of crisp mathematical models that capture the structure of real-world data sets is a major obstacle to the detailed theoretical understanding of deep neural networks. Here, we first demonstrate the effect of structured data sets by experimentally comparing the dynamics and the performance of two-layer networks trained on two different data sets: (i) an unstructured synthetic data set containing random i.i.d. inputs, and (ii) a simple canonical data set containing MNIST images. Our analysis reveals two phenomena related to the dynamics of the networks and their ability to generalise that appear only when training on structured data sets. Second, we introduce a generative model for data sets, where high-dimensional inputs lie on a lower-dimensional manifold and have labels that depend only on their position within this manifold. We call it the hidden manifold model (HMM), and we experimentally demonstrate that training networks on data sets drawn from this model reproduces both phenomena seen during training on MNIST.
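
A sketch of a hidden-manifold-style generator (the specific nonlinearities are illustrative): inputs are a fixed nonlinear map of low-dimensional latent variables, and labels depend only on the latents.

```python
import numpy as np

def hidden_manifold_data(n, dim, latent_dim, rng=None):
    """Inputs lie on a latent_dim-dimensional manifold in R^dim;
    labels depend only on the latent coordinates."""
    rng = rng or np.random.default_rng(0)
    F = rng.standard_normal((latent_dim, dim))   # fixed feature map
    w = rng.standard_normal(latent_dim)          # label direction in latent space
    C = rng.standard_normal((n, latent_dim))     # latent coordinates
    X = np.tanh(C @ F / np.sqrt(latent_dim))     # high-dimensional inputs
    y = np.sign(C @ w)                           # labels ignore the ambient space
    return X, y

X, y = hidden_manifold_data(n=10000, dim=784, latent_dim=10)
```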


A Non-Parametric Method to Study Overfitting, Alan Mishchenko (University of California at Berkeley); Satrajit Chatterjee (Google AI)

A novel method is proposed to study overfitting in any machine learning model that can be translated into an equivalent gate-level Boolean circuit. The method relies only on the training data and does not use the evaluation data or knowledge of the type and parameters of the machine learning model. The proposed method has been successfully applied to lookup tables, random forests, and neural networks, yielding insight into why neural networks generalize.


Non-Gaussianity of Stochastic Gradient Noise, Abhishek Panigrahi (Microsoft Research); Raghav Somani (University of Washington); Navin Goyal (Microsoft Research India); Praneeth Netrapalli (Microsoft Research)

What enables Stochastic Gradient Descent (SGD) to achieve better generalization than Gradient Descent (GD) in neural network training? This question has attracted much attention. In this paper, we study the distribution of the Stochastic Gradient Noise (SGN) vectors during training. We observe that for batch sizes 256 and above, the distribution is best described as Gaussian, at least in the early phases of training. This holds across datasets, architectures, and other choices.
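
A hedged sketch of one way to probe this (compute_gradient and sample_batch are assumed helpers, not library functions): collect SGN vectors as minibatch gradient minus full-batch gradient, project onto a fixed direction, and run a normality test.

```python
import torch
from scipy import stats

def sgn_gaussianity(model, loss_fn, dataset, batch_size, n_batches=200):
    """Test Gaussianity of stochastic gradient noise along a random direction."""
    # compute_gradient(...) is assumed to return the flattened gradient vector.
    full_grad = compute_gradient(model, loss_fn, dataset)
    direction = torch.randn_like(full_grad)
    direction /= direction.norm()
    samples = []
    for _ in range(n_batches):
        batch = sample_batch(dataset, batch_size)  # assumed helper
        noise = compute_gradient(model, loss_fn, batch) - full_grad
        samples.append(torch.dot(noise, direction).item())
    stat, p_value = stats.shapiro(samples)  # small p-value suggests non-Gaussianity
    return samples, p_value
```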


Comparing Fine-tuning and Rewinding in Neural Network Pruning, Alex Renda (MIT); Jonathan Frankle (MIT); Michael Carbin (MIT)

Neural network pruning is a popular technique for reducing inference costs by removing connections, neurons, or other structure from the network. In the literature, pruning typically follows a standard procedure: train the network, remove unwanted structure (pruning), and train the resulting network further to recover accuracy (fine-tuning). In this paper, we explore an alternative to fine-tuning: rewinding. Rather than continuing to train the resultant pruned network (fine-tuning), rewind the remaining weights to their values from earlier in training, and re-train the resultant network for the remainder of the original training process. We find that this procedure, which repurposes the strategy for finding lottery tickets presented by Frankle et al. (2019), makes it possible to prune networks further than is possible with fine-tuning for a given target accuracy, provided that the weights are rewound to a suitable point in training. We also find that there is a wide range of suitable rewind points that achieve higher accuracy than fine-tuning across all tested networks. Based on these results, we argue that practitioners should explore rewinding as an alternative to fine-tuning for neural network pruning.
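The lead-in below sketches one prune-then-rewind step under simple assumptions (per-tensor magnitude pruning; rewind_state is a checkpoint saved earlier in training); it is a minimal illustration, not the paper's full pipeline.

```python
import torch

def prune_and_rewind(model, rewind_state, sparsity=0.8):
    """Magnitude-prune trained weights, then rewind survivors to an earlier checkpoint."""
    masks = {}
    for name, param in model.named_parameters():
        if param.dim() > 1:  # prune weight tensors, leave biases intact
            k = max(1, int(sparsity * param.numel()))
            threshold = param.abs().flatten().kthvalue(k).values
            masks[name] = (param.abs() > threshold).float()
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.copy_(rewind_state[name] * masks[name])
    return masks  # keep weights masked during retraining (param *= mask after each step)
```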


The Generalization-Stability Tradeoff in Neural Network Pruning, Brian Bartoldson (Florida State University); Ari S Morcos (Facebook AI Research (FAIR)); Adrian Barbu (Florida State University, USA); Gordon Erlebacher (Florida State University)

Pruning neural network parameters is often viewed as a means to compress models, but pruning has also been motivated by the desire to prevent overfitting. This motivation is particularly relevant given the perhaps surprising observation that a wide variety of pruning approaches increase test accuracy despite sometimes massive reductions in parameter counts. To better understand this phenomenon, we analyze the behavior of pruning over the course of training, finding that pruning's effect on generalization relies more on the instability it generates (defined as the drops in test accuracy immediately following pruning) than on the final size of the pruned model. Further, we show similarities between pruning and regularizing by injecting noise, suggesting a mechanism for pruning-based generalization improvements that is compatible with the strong generalization recently observed in over-parameterized networks.
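The instability measure itself is simple; a minimal sketch under the paper's definition, where `evaluate` returns test accuracy and `prune_step` prunes the model in place:

```python
def pruning_instability(model, evaluate, prune_step):
    """Instability = test accuracy just before a pruning event minus just after,
    measured before any retraining."""
    acc_before = evaluate(model)
    prune_step(model)
    acc_after = evaluate(model)
    return acc_before - acc_after
```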


HighRes-net: Multi-Frame Super-Resolution by Recursive Fusion, Michel Deudon (Element AI); Alfredo Kalaitzis (Element AI); Israel Goytom (Mila); Md Rifat Arefin (Mila); Zhichao Lin (Element AI); Kris Sankaran (Mila); Vincent Michalski (Universite de Montreal); Samira Ebrahimi Kahou (McGill/Mila); Julien Cornebise (Element AI); Yoshua Bengio (Mila)

Generative deep learning has sparked a new wave of Super-Resolution (SR) algorithms that enhance single images with impressive aesthetic results, albeit with imaginary details. Multi-frame Super-Resolution (MFSR) offers a more grounded approach to the ill-posed problem, by conditioning on multiple low-resolution views. This is important for satellite monitoring of human impact on the planet from deforestation to human rights violations that depend on reliable imagery. To this end, we present HighRes-net, the first deep learning approach to MFSR that learns its sub-tasks in an end-to-end fashion: (i) co-registration, (ii) fusion, (iii) up-sampling, and (iv) registration-at-the-loss. Co-registration of low-res views is learned implicitly through a reference-frame channel, with no explicit registration mechanism. We learn a global fusion operator that is applied recursively on an arbitrary number of low-res pairs. We introduce a registered loss, by learning to align the SR output to a ground-truth through ShiftNet. We show that by learning deep representations of multiple views, we can super-resolve low-resolution signals and enhance Earth observation data at scale. Our approach recently topped the European Space Agency's MFSR competition on real-world satellite imagery.


Implicit Regularization in Deep Learning: A View from Function Space, Aristide Baratin (Mila, Université de Montréal)

A key factor underlying the generalization ability of deep learning models is the implicit regularization effect of the training procedure. How does gradient descent control the capacity of neural networks? In this paper, we approach this problem from a geometrical point of view. We show that, even in the case of linear models, there is no free lunch: meaningful capacity-control norms depend both on the geometry of the function class and on the structure of the data. Our analysis pinpoints an important mechanism inducing implicit regularization, suggesting that deep learning models generalize by learning optimized low-rank kernels aligned with the data.


The intriguing role of module criticality in the generalization of deep networks, Niladri S Chatterji (UC Berkeley); Behnam Neyshabur (Google); Hanie Sedghi (Google)

We study the phenomenon that some modules of a deep neural network (DNN) are more critical than others: rewinding their parameter values back to initialization, while keeping other modules fixed at their trained values, results in a large drop in the network's performance. Our analysis reveals interesting properties of the loss landscape, which lead us to propose a complexity measure, called module criticality, based on the shape of the valleys that connect the initial and final values of the module parameters. We formulate how generalization relates to module criticality and show that this measure can explain the superior generalization performance of some architectures over others, where earlier measures fail to do so.
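
A hedged sketch of the basic probe behind this measure (module criticality itself considers the whole path between initial and final values; this only measures the endpoint drop): rewind one module to its initialization while keeping the rest trained.

```python
import copy
import torch

def module_rewind_drop(model, init_state, module_prefix, evaluate):
    """Accuracy drop from resetting one module's parameters to initialization."""
    baseline = evaluate(model)
    probe = copy.deepcopy(model)
    with torch.no_grad():
        for name, param in probe.named_parameters():
            if name.startswith(module_prefix):
                param.copy_(init_state[name])
    return baseline - evaluate(probe)

# e.g. module_rewind_drop(resnet, init_state, "layer3.0", evaluate)
```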


Neural Tangents: Fast and Easy Infinite Neural Networks in Python, Roman Novak (Google Brain); Lechao Xiao (Google Brain); Jiri Hron (University of Cambridge); Jaehoon Lee (Google Brain); Jascha Sohl-Dickstein (Google Brain); Samuel S Schoenholz (Google Brain)

Neural Tangents is a library for working with infinite-width neural networks. It provides a high-level API for specifying complex and hierarchical neural network architectures. These networks can then be trained and evaluated either at finite-width as usual, or in their infinite-width limit. For the infinite-width networks, Neural Tangents performs exact inference either via Bayes' rule or gradient descent, and generates the corresponding Neural Network Gaussian Process and Neural Tangent kernels. Additionally, Neural Tangents provides tools to study gradient descent training dynamics of wide but finite networks. The entire library runs out-of-the-box on CPU, GPU, or TPU. All computations can be automatically distributed over multiple accelerators with near-linear scaling in the number of devices. Neural Tangents is available at https://github.com/google/neural-tangents. We also provide an accompanying interactive Colab notebook at https://colab.sandbox.google.com/github/google/neural-tangents/blob/master/notebooks/neural_tangents_cookbook.ipynb.
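
A minimal usage sketch following the library's cookbook (toy random data; see the linked Colab for the canonical examples):

```python
import neural_tangents as nt
from neural_tangents import stax
from jax import random

# A network definition yields initialization/apply functions plus an
# analytic kernel function for the infinite-width limit.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(1),
)

k1, k2, k3 = random.split(random.PRNGKey(0), 3)
x_train = random.normal(k1, (20, 10))
y_train = random.normal(k2, (20, 1))
x_test = random.normal(k3, (5, 10))

# Exact infinite-width inference: NNGP posterior (Bayes) and NTK (gradient descent).
predict_fn = nt.predict.gradient_descent_mse_ensemble(kernel_fn, x_train, y_train)
y_nngp, y_ntk = predict_fn(x_test=x_test, get=("nngp", "ntk"))
```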


Individual Predictions Matter: an Example from Deep Learning for Medical Imaging, Jessica Forde (Brown University); John Zech (New York Presbyterian - Columbia)

We reproduced the results of CheXNet with fixed hyperparameters and 50 different random seeds to identify 14 diagnoses in chest X-rays. Because CheXNet fine-tunes a pre-trained DenseNet, the seed affects the data batches but not the initialized weights. There was substantial variability in diagnostic predictions for a patient's X-ray between model runs (mean log(Pmax/Pmin) 2.45, coefficient of variation 0.543). This variability at the level of individual X-rays was not fully reflected in the variability of the overall AUC, which was relatively stable on a large test set due to the law of large numbers. Averaging predictions from 10 separately trained CNNs reduced variability by nearly 70% (mean coefficient of variation reduced from 0.543 to 0.169, t-test 15.96, p-value < 0.0001). We encourage scientists interested in reproducibility, and engineers deploying real-world systems in domains such as medicine, to be aware of the variability of CNN predictions on a given test image due to the seed used in training, and to note that ensembling predictions from multiple models can minimize this effect, producing more consistent predictions for individuals.
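
The ensembling analysis reduces to averaging per-seed probabilities and tracking their spread. A minimal sketch (the array layout is an assumption):

```python
import numpy as np

def ensemble_and_cv(per_seed_probs):
    """per_seed_probs: array of shape (n_seeds, n_images, n_labels).
    Returns the ensemble prediction and the mean coefficient of variation."""
    probs = np.asarray(per_seed_probs)
    mean = probs.mean(axis=0)
    cv = probs.std(axis=0) / (mean + 1e-12)  # across-seed variability per prediction
    return mean, cv.mean()
```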


Complex Transformer: A Framework for Modeling Complex-Valued Sequence, Martin Ma (Carnegie Mellon University); Muqiao Yang (Carnegie Mellon University); Dongyu Li (Carnegie Mellon University); Yao-Hung Tsai (Carnegie Mellon University); Ruslan Salakhutdinov (Carnegie Mellon University)

Most deep learning models make little use of complex numbers. However, speech, signal, and audio data are naturally complex after a Fourier transform, and studies have shown that complex-valued networks offer potentially richer representations. We propose the Complex Transformer, which incorporates the transformer model as a backbone and develops attention and encoder-decoder networks that operate on complex-valued input. The model achieves state-of-the-art performance on the MusicNet dataset and an In-phase Quadrature (IQ) signal dataset, showing that complex-valued networks can capture richer information. An anonymous version of the implementation which reproduces the experimental results is available at https://anonymous.4open.science/r/60540470-3193-46ca-9392-72f07a0e8cd1/.


Non-Gaussian Processes and Neural Networks at Finite Widths, Sho Yaida (Facebook AI Research)

Gaussian processes are ubiquitous in nature and engineering. A case in point is a class of neural networks in the infinite-width limit, whose priors correspond to Gaussian processes. Here we perturbatively extend this correspondence to finite-width neural networks, yielding non-Gaussian processes as priors. Our new recursive formalism allows us to track the flow of preactivation distributions by progressively integrating out random variables from lower to higher layers, reminiscent of renormalization-group flow. We further perform Bayesian inference with non-Gaussian priors, showing the regularization effects of finite widths.


Asymptotics of Wide Networks from Feynman Diagrams, Guy Gur-Ari (Google); Ethan Dyer (Google)

Understanding the asymptotic behavior of wide networks is of considerable interest. In this work, we present a general method for analyzing this large width behavior. The method is an adaptation of Feynman diagrams, a standard tool for computing multivariate Gaussian integrals. We apply our method to study training dynamics, improving existing bounds and deriving new results on wide network evolution during stochastic gradient descent. Going beyond the strict large width limit, we present closed-form expressions for higher-order terms governing wide network training, and test these predictions empirically.


Fantastic Generalization Measures and Where to Find Them, YiDing Jiang (Google); Behnam Neyshabur (Google); Dilip Krishnan (Google); Hossein Mobahi (Google Research); Samy Bengio (Google Research, Brain Team)

Generalization of deep networks has been of great interest in recent years, resulting in a number of theoretical bounds and empirically motivated measures. However, most papers proposing such measures study only a small set of models, leaving open the question of whether the conclusions drawn from those experiments would generalize to other settings. We present the first large-scale study of generalization bounds and measures in deep networks. We train over two thousand convolutional networks with systematic changes to important hyper-parameters. Hoping to uncover potentially causal relationships between each measure and generalization, we run carefully controlled experiments and use a modified form of the rank correlation coefficient to compare the measures both overall and within individual experiment categories. We analyze the results and show surprising failures of some measures, as well as promising measures for further research.
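
The basic comparison is a rank correlation across trained models. A hedged sketch using plain Kendall's tau (the paper's modified coefficient additionally controls for hyper-parameter groups):

```python
import numpy as np
from scipy.stats import kendalltau

def measure_vs_generalization(measure_values, generalization_gaps):
    """Does ranking models by a complexity measure agree with ranking by
    their generalization gap? Returns Kendall's tau and its p-value."""
    return kendalltau(measure_values, generalization_gaps)

tau, p = measure_vs_generalization(np.random.rand(100), np.random.rand(100))
```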


Training Batchnorm and Only Batchnorm, Jonathan Frankle (MIT); David J Schwab (ITS, CUNY Graduate Center); Ari S Morcos (Facebook AI Research (FAIR))

Batch normalization is an indispensable tool for training deep neural networks. Here, we ask a simple question: to what extent can we train networks in which only the batch normalization parameters are trainable? Surprisingly, we found that we can train networks with random features and learned batch normalization parameters to accuracies well above chance. To further study this effect, we explored separately training with only the affine parameters, and, in contrast to the traditional normalization-based motivation of batch normalization, found that the affine parameters alone were sufficient for this effect (in shallower ResNets). For example, on a sufficiently deep residual network, we achieve 82% accuracy on CIFAR-10 by training in this fashion. These experiments highlight the under-appreciated role of the non-normalization aspects of batch normalization.
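
A minimal sketch of the setup (the model choice is illustrative): freeze everything except the batch normalization affine parameters and train only those.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def freeze_all_but_batchnorm(model):
    """Leave only batch-norm scale/shift parameters trainable."""
    for param in model.parameters():
        param.requires_grad = False
    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d):
            if module.weight is not None:
                module.weight.requires_grad = True
            if module.bias is not None:
                module.bias.requires_grad = True

model = resnet18(num_classes=10)   # randomly initialized features stay frozen
freeze_all_but_batchnorm(model)
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1, momentum=0.9)
```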