Contributed talks

Complex Transformer: A Framework for Modeling Complex-Valued Sequence, Martin Ma (Carnegie Mellon University); Muqiao Yang (Carnegie Mellon University); Dongyu Li (Carnegie Mellon University); Yao-Hung Tsai (Carnegie Mellon University); Ruslan Salakhutdinov (Carnegie Mellon University)

Most deep learning models barely use complex numbers. However, speech, signal, and audio data are naturally complex-valued after the Fourier transform, and studies suggest that complex-valued networks can yield richer representations. We propose the Complex Transformer, which uses the transformer model as a backbone and develops attention and encoder-decoder networks that operate on complex-valued input. The model achieves state-of-the-art performance on the MusicNet dataset and an In-phase Quadrature (IQ) signal dataset, showing that complex networks can capture richer information. An anonymized implementation that reproduces the experimental results is available at https://anonymous.4open.science/r/60540470-3193-46ca-9392-72f07a0e8cd1/.
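The key ingredient in the abstract is attention that operates on complex-valued sequences. Below is a minimal sketch of one way to read that: expand the complex dot product Q K^H and attend on the magnitude of the complex score. The magnitude-based weighting, shapes, and function names are illustrative assumptions, not the authors' exact formulation.

```python
# Illustrative sketch only: scaled dot-product attention on complex inputs,
# with the complex score Q K^H reduced to a real weight via its magnitude.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def complex_attention(Q, K, V):
    """Q, K, V: complex arrays of shape (seq_len, d_model)."""
    d = Q.shape[-1]
    # Complex score matrix (Qr + iQi)(Kr - iKi)^T, i.e. Q @ K.conj().T.
    scores = Q @ K.conj().T / np.sqrt(d)
    # One simple choice (an assumption here): attend on the score magnitude.
    weights = softmax(np.abs(scores), axis=-1)
    return weights @ V  # complex-valued output, same shape as V

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8)) + 1j * rng.standard_normal((16, 8))
print(complex_attention(x, x, x).shape)  # (16, 8)
```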


Non-Gaussian Processes and Neural Networks at Finite Widths, Sho Yaida (Facebook AI Research)

Gaussian processes are ubiquitous in nature and engineering. A case in point is a class of neural networks in the infinite-width limit, whose priors correspond to Gaussian processes. Here we perturbatively extend this correspondence to finite-width neural networks, yielding non-Gaussian processes as priors. Our new recursive formalism allows us to track the flow of preactivation distributions by progressively integrating out random variables from lower to higher layers, reminiscent of renormalization-group flow. We further perform Bayesian inference with non-Gaussian priors, showing the regularization effects of finite widths.
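The deviation from Gaussianity at finite width can be seen directly by sampling. The sketch below is our own Monte Carlo check, not the paper's recursive formalism: it draws random two-layer tanh networks at several widths and measures the excess kurtosis of a second-layer preactivation, which vanishes for a Gaussian. The widths, single scalar input, and initialization scales are illustrative assumptions.

```python
# Monte Carlo check: finite-width preactivation priors are non-Gaussian,
# approaching a Gaussian process as the width grows (excess kurtosis -> 0).
import numpy as np

def preactivation_samples(width, n_samples=200_000, seed=0):
    rng = np.random.default_rng(seed)
    x = 1.0  # a single scalar input (illustrative)
    W1 = rng.standard_normal((n_samples, width))                    # first layer, d_in = 1
    h1 = np.tanh(W1 * x)                                            # hidden activations
    W2 = rng.standard_normal((n_samples, width)) / np.sqrt(width)   # variance 1/width
    return (W2 * h1).sum(axis=1)                                    # one output preactivation

for width in (2, 10, 100):
    z = preactivation_samples(width)
    excess_kurtosis = np.mean(z**4) / np.mean(z**2) ** 2 - 3.0
    print(f"width={width:4d}  excess kurtosis={excess_kurtosis:+.3f}")
```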


Asymptotics of Wide Networks from Feynman Diagrams, Guy Gur-Ari (Google); Ethan Dyer (Google)

Understanding the asymptotic behavior of wide networks is of considerable interest. In this work, we present a general method for analyzing this large width behavior. The method is an adaptation of Feynman diagrams, a standard tool for computing multivariate Gaussian integrals. We apply our method to study training dynamics, improving existing bounds and deriving new results on wide network evolution during stochastic gradient descent. Going beyond the strict large width limit, we present closed-form expressions for higher-order terms governing wide network training, and test these predictions empirically.
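The diagrammatic bookkeeping rests on Wick's (Isserlis') theorem for multivariate Gaussian integrals: a fourth moment decomposes into a sum over pairwise contractions, each pairing corresponding to one diagram. The numerical check below is ours, included only to illustrate that underlying identity; the random covariance and sample size are arbitrary choices.

```python
# Wick / Isserlis identity behind the diagrams: for zero-mean Gaussian z with
# covariance K, E[z_a z_b z_c z_d] = K_ab K_cd + K_ac K_bd + K_ad K_bc.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
K = A @ A.T  # a random positive-definite covariance

z = rng.multivariate_normal(np.zeros(4), K, size=2_000_000)
monte_carlo = np.mean(z[:, 0] * z[:, 1] * z[:, 2] * z[:, 3])

# Sum over the three pairings of {0,1,2,3}: (01)(23), (02)(13), (03)(12).
wick_sum = K[0, 1] * K[2, 3] + K[0, 2] * K[1, 3] + K[0, 3] * K[1, 2]
print(f"Monte Carlo estimate: {monte_carlo:.3f}   Wick sum: {wick_sum:.3f}")
```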


Fantastic Generalization Measures and Where to Find Them, YiDing Jiang (Google); Behnam Neyshabur (Google); Dilip Krishnan (Google); Hossein Mobahi (Google Research); Samy Bengio (Google Research, Brain Team)

Generalization of deep networks has been of great interest in recent years, resulting in a number of theoretical bounds and empirically motivated measures. However, most papers proposing such measures study only a small set of models, leaving open the question of whether the conclusions drawn from those experiments would generalize to other settings. We present the first large-scale study of generalization bounds and measures in deep networks. We train over two thousand convolutional networks with systematic changes in important hyper-parameters. Hoping to uncover potentially causal relationships between each measure and generalization, we run carefully controlled experiments and use a modified form of the rank correlation coefficient to compare the different measures, both overall and in individual experiment categories. We analyze the results and show surprising failures of some measures as well as promising measures for further research.
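The comparison machinery in the abstract amounts to rank-correlating each candidate measure with the observed generalization gap across many trained models. The sketch below uses plain Kendall's tau on synthetic numbers rather than the paper's modified coefficient or its actual training runs; all variable names and values are hypothetical.

```python
# Rank correlation between a complexity measure and the generalization gap,
# on synthetic data (plain Kendall's tau; the paper uses a modified variant).
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
n_models = 200
gen_gap = rng.uniform(0.0, 0.3, size=n_models)                      # hypothetical gaps
tracking_measure = gen_gap + 0.02 * rng.standard_normal(n_models)   # tracks the gap
unrelated_measure = rng.standard_normal(n_models)                   # pure noise

for name, measure in [("tracking", tracking_measure), ("unrelated", unrelated_measure)]:
    tau, _ = kendalltau(measure, gen_gap)
    print(f"{name:9s} measure: Kendall tau = {tau:+.2f}")
```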


Training Batchnorm and Only Batchnorm, Jonathan Frankle (MIT); David J Schwab (ITS, CUNY Graduate Center); Ari S Morcos (Facebook AI Research (FAIR))

Batch normalization is an indispensable tool for training deep neural networks. Here, we ask a simple question: to what extent can we train networks in which only the batch normalization parameters are trainable? Surprisingly, we find that networks with random features and learned batch normalization parameters can be trained to accuracies well above chance. To study this effect further, we separately explored training with only the affine parameters and, in contrast to the traditional normalization-based motivation of batch normalization, found that the affine parameters alone were sufficient for this effect (in shallower ResNets). For example, on a sufficiently deep residual network, we achieve 82% accuracy on CIFAR-10 by training in this fashion. These experiments highlight the under-appreciated role of the non-normalization aspects of batch normalization.
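The setup described here is easy to reproduce in outline: freeze every weight of a randomly initialized network and leave only the batch normalization affine parameters trainable. The PyTorch sketch below is a minimal reading of that setup, assuming a torchvision ResNet-18 and arbitrary SGD settings; it is not the authors' exact code or architecture.

```python
# Minimal "train only BatchNorm" setup: random features stay frozen, only the
# BatchNorm affine parameters (gamma, beta) receive gradients. Illustrative only.
import torch
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(num_classes=10)  # randomly initialized features, CIFAR-10-sized head

for p in model.parameters():      # freeze everything ...
    p.requires_grad = False

for m in model.modules():         # ... then unfreeze the BatchNorm affine parameters
    if isinstance(m, nn.BatchNorm2d):
        m.weight.requires_grad = True
        m.bias.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable parameters")

optimizer = torch.optim.SGD(trainable, lr=0.1, momentum=0.9)  # arbitrary settings
# A standard CIFAR-10 training loop over `optimizer` would follow here.
```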