Abstracts

Samuel Schoenholz

Title: Priors for Deep Infinite Networks

Abstract: In the current practice of machine learning there are two related, yet distinct, concepts that have become conflated. On the one hand, models have a theoretically optimal performance on a given dataset or task. In principle, we would like to perform model selection to maximize this peak performance. On the other hand, we have the volume of hyperparameter space over which the model will achieve reasonable performance. This leads to a scenario where it becomes impossible to tell whether an architectural tweak actually improves model performance or whether it just makes the model more trainable. In this talk we will describe an ongoing effort to make hyperparameter selection more systematic. To do this we will consider the random initialization of a neural network as placing a prior over the space of functions. We show that in the infinite-size limit, this prior approaches a well-defined mean-field limit for a wide range of network architectures. Moreover, we argue that fluctuations around mean field theory can be computed using random matrix theory and statistical field theory. By understanding the properties of this prior we can precisely characterize the volume of hyperparameter space where models may be trained.
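
A minimal numerical sketch of the kind of mean-field recursion referred to here, for a fully-connected tanh network (not the speaker's code; the variance map is the standard one from the mean-field literature, and the hyperparameter values are illustrative):

```python
# Iterate the mean-field variance map for a fully-connected tanh network,
#   q^{l+1} = sigma_w^2 * E_{z~N(0,1)}[ tanh(sqrt(q^l) z)^2 ] + sigma_b^2,
# to see whether signals reach a stable fixed point (trainable regime)
# or explode/vanish with depth.
import numpy as np

def variance_map(q, sigma_w, sigma_b, n_mc=100000, phi=np.tanh):
    z = np.random.randn(n_mc)
    return sigma_w**2 * np.mean(phi(np.sqrt(q) * z)**2) + sigma_b**2

def depth_profile(sigma_w, sigma_b, q0=1.0, depth=50):
    """Track the pre-activation variance layer by layer."""
    qs = [q0]
    for _ in range(depth):
        qs.append(variance_map(qs[-1], sigma_w, sigma_b))
    return np.array(qs)

for sigma_w in [0.5, 1.0, 1.5, 2.5]:       # illustrative weight scales
    qs = depth_profile(sigma_w, sigma_b=0.05)
    print(f"sigma_w={sigma_w}: q after 50 layers = {qs[-1]:.4f}")
```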

Sho Yaida - slides

Title: Fluctuation-Dissipation Relation for Stochastic Gradient Descent

Abstract: The notion of the stationary equilibrium ensemble has played a central role in statistical mechanics. In machine learning as well, training serves as generalized equilibration that drives the probability distribution of model parameters toward stationarity. In this talk, I present a stationary fluctuation-dissipation relation that links measurable quantities and hyperparameters in the stochastic gradient descent algorithm. This relation can, in particular, be used to adaptively set training schedules.
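
A minimal sketch of the kind of measurement such a relation suggests, assuming a PyTorch-style training loop; the two observables monitored below, and the idea of comparing them to detect stationarity, are an assumed illustration rather than the exact relation derived in the talk:

```python
# Illustrative sketch (assumed form, not the paper's exact relation): during
# SGD, monitor <theta . grad L> against (eta/2) <|grad L|^2>.  Near
# stationarity such observables are expected to be linked, and their running
# ratio can serve as a signal for when to decay the learning rate.
import torch

def fdr_observables(model, lr):
    """Return theta.grad and (lr/2)*|grad|^2 for the current minibatch."""
    params = [p for p in model.parameters() if p.grad is not None]
    theta_dot_grad = sum((p.detach() * p.grad).sum() for p in params)
    grad_sq = sum((p.grad ** 2).sum() for p in params)
    return theta_dot_grad.item(), 0.5 * lr * grad_sq.item()

# usage inside a training loop, after loss.backward() and before optimizer.step():
#   lhs, rhs = fdr_observables(model, lr)
#   running averages of lhs and rhs approaching each other signals stationarity.
```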

Levent Sagun - slides

Title: Over-parametrization in neural networks: observations and a definition

Abstract: An excursion around the ideas for why the stochastic gradient descent algorithm works well for training deep neural networks leads to considerations about the underlying geometry of the related loss function. Recently, we have gained considerable insight into how tuning SGD leads to better or worse generalization properties on a given model and task. Furthermore, we have a reasonably large set of observations leading to the conclusion that more parameters typically lead to better accuracies as long as the training process is not hampered. In this talk, I will speculatively argue that as long as the model is over-parameterized (OP), all solutions are equivalent up to finite-size fluctuations.

We will start by reviewing some of the recent literature on the geometry of the loss function, and how SGD navigates the landscape in the OP regime. Then we will see how to define OP by finding a sharp transition described by the model's ability to fit its training set. Finally, we will discuss how this critical threshold is connected to the generalization properties of the model, and argue that life beyond this threshold is (more or less) as good as it gets.

Ekin Cubuk

Title: Generalization under noise, adversarial examples, and data augmentation

Abstract: A significant obstacle to machine learning having more impact in science and technology is its inability to generalize. For example, vision models fail catastrophically on small noise distributions that they were not directly trained on, they are not invariant to small translations and scale changes, and they are very sensitive to adversarial examples. We argue that these three problems can all be explained using a single, geometric framework. Next, we show that the quantitative aspects of model sensitivity are universal across tasks, datasets and models. We will conclude by asking whether the current approaches that try to mitigate this problem, including architecture engineering and data augmentation, can be successful.

Dmitri Chklovskii

Title: Neuroscience-based machine learning

Abstract: Although traditional artificial neural networks were inspired by the brain, they resemble biological neural networks only superficially. Successful machine learning algorithms like backpropagation violate fundamental biophysical observations, suggesting that the brain employs other algorithms to analyze high-dimensional datasets streamed by our sensory organs. We have been developing neuroscience-based machine learning by deriving algorithms and neural networks from objective functions based on the principle of similarity preservation. Similarity-based neural networks rely exclusively on biologically plausible local learning rules and solve important unsupervised learning tasks such as dimensionality reduction, clustering and manifold learning. In addition to modeling biological networks, similarity-based algorithms are competitive for Big Data applications. For further information please see our posts on the "Off the convex path" blog: http://www.offconvex.org/2018/12/03/MityaNN2/

Dmitry Krotov - slides

Title: Unsupervised Learning on a Sphere

Abstract: In spite of the great success of deep learning, a question remains as to what extent the computational properties of deep neural networks (DNNs) are similar to those of the human brain. The particularly non-biological aspect of deep learning is the supervised training process with the backpropagation algorithm, which requires massive amounts of labeled data, and a non-local learning rule for changing the synapse strengths. In this talk I will describe a learning algorithm that does not suffer from these two problems. It learns the weights of the lower layer of a neural network in a completely unsupervised fashion. The entire algorithm utilizes local learning rules which have conceptual biological plausibility. The performance of this algorithm is comparable to the performance of standard feedforward networks trained end-to-end with a backpropagation algorithm on simple tasks.

Jack Hidary & Stefan Leichenauer

Title: Neural network classifiers on NISQ-regime processors

Abstract: With the advent of NISQ-regime quantum computers, we investigate the implementation of neural networks as classifiers on these platforms. Farhi and Neven (https://arxiv.org/pdf/1802.06002.pdf) and others have described how to construct such a classifier and used a modified MNIST training set to demonstrate generalization in a quantum simulator.

In this talk we will review the Farhi-Neven proposal and the specifics of its implementation on a NISQ processor using Cirq code. We conclude with a discussion of avenues of further inquiry.

Dar Gilboa - slides

Title: Signal propagation and gradient stability in the LSTM and GRU

Abstract: The ability to train neural networks on certain tasks depends strongly on the choice of initialization. By considering neural networks with random weights in the infinite-width limit, one can calculate the effect of the initialization hyper-parameters on signal propagation through the network and on gradient stability using mean field theory. This work presents such a calculation for the GRU and LSTM - two recurrent cells that are the basis of many state-of-the-art models.

We calculate characteristic time scales of signal propagation from the inputs to the loss and find empirically that these can predict trainability of the network on certain tasks. To control the gradient, we also calculate the moments of the spectrum of the state-to-state Jacobian. Guided by these results, one obtains initialization schemes that provide orders-of-magnitude speedups in training on long time sequence tasks, and enable training on tasks where training with a standard initialization fails. Additionally, we observe a beneficial effect on generalization on benchmark tasks. I will also touch on the potential to extend this control to trained networks.

Joint work with Bo Chang, Minmin Chen, Sam Schoenholz and Jeffrey Pennington.

Felix Draxler - slides

Title: Essentially No Barriers In Neural Network Energy Landscape

Abstract: Training neural networks involves finding minima of a high-dimensional non-convex loss function. Knowledge of the structure of this energy landscape is sparse. Relaxing from linear interpolations, we construct continuous paths between minima of recent neural network architectures on CIFAR10 and CIFAR100. Surprisingly, the paths are essentially flat in both the training and test landscapes. This implies that neural networks have enough capacity for structural changes, or that these changes are small between minima. Also, each minimum has at least one vanishing Hessian eigenvalue in addition to those resulting from trivial invariance.

Vladimir Kirilin - slides

Title: Gradient and hessian properties in logistic regression

Abstract: TBA

Roger Grosse

Title: Scalable Natural Gradient Training of Neural Networks

Abstract: Neural networks have recently driven significant progress in machine learning applications as diverse as vision, speech, and text understanding. Despite much engineering effort to boost the computational efficiency of neural net training, most networks are still trained using variants of stochastic gradient descent. Natural gradient descent, a second-order optimization method, has the potential to speed up training by correcting for the curvature of the loss function. Unfortunately, the exact natural gradient is impractical to compute for large networks because it requires solving a linear system involving the Fisher matrix, whose dimension may be in the millions for modern neural network architectures. The key challenge is to develop approximations to the Fisher matrix which are efficiently invertible, yet accurately reflect its structure.

The Fisher matrix is the covariance of log-likelihood derivatives with respect to the weights of the network. I will present techniques to approximate the Fisher matrix using structured probabilistic models of the computation of these derivatives. Using probabilistic modeling assumptions motivated by the structure of the computation graph and empirical analysis of the distribution over derivatives, I derive approximations to the Fisher matrix which allow for efficient approximation of the natural gradient. The resulting optimization algorithm is invariant to some common reparameterizations of neural networks, suggesting that it automatically enjoys the computational benefits of these reparameterizations. I show that this method gives significant speedups in the training of neural nets for image classification and reinforcement learning.
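
A schematic of one such structured approximation, in the spirit of Kronecker-factored approaches: for a single fully-connected layer, the Fisher block is approximated by a Kronecker product of the second moments of the layer inputs and of the back-propagated gradients, so inverting it only requires inverting two small matrices. Shapes, damping, and variable names below are illustrative, not the talk's exact algorithm:

```python
# Sketch of a Kronecker-factored natural-gradient step for one dense layer
# with weights W of shape [out, in].  The layer's Fisher block is approximated
# by a Kronecker product of A = E[a a^T] (layer inputs a) and G = E[g g^T]
# (back-propagated output gradients g), so the preconditioned gradient is
# G^{-1} dW A^{-1}.
import numpy as np

def kfac_style_step(dW, acts, grads, damping=1e-3):
    """acts: [batch, in] inputs, grads: [batch, out] output grads, dW: [out, in]."""
    A = acts.T @ acts / len(acts) + damping * np.eye(acts.shape[1])
    G = grads.T @ grads / len(grads) + damping * np.eye(grads.shape[1])
    return np.linalg.solve(G, dW) @ np.linalg.inv(A)   # G^{-1} dW A^{-1}

# usage: W -= learning_rate * kfac_style_step(dW, acts, grads)
```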

Zohar Ringel - slides

Title: The role of a layer in supervised learning

Abstract: An important question in the theory of deep learning concerns the role played by each layer in a deep network. Although a priori unclear, the efficiency of transfer learning suggests that a layer's contribution to the network can often be separated from the rest. In this talk, I'll analyze the issue of separability (or non-co-adaptation) from the perspective of layerwise greedy optimization. In particular, a set of explicit loss functions will be presented, tailored for each layer in a network. Furthermore, I'll give numerical evidence that layerwise greedy supervised training using these loss functions reaches Stochastic Gradient Descent (SGD) performance. In contrast, layerwise greedy supervised training based on an information-bottleneck loss (with a Gaussian regulator) does not reach SGD performance.

Jeffrey Pennington - slides

Title: Statistics of Random Neural Networks

Abstract: Neural networks with random parameters play many important roles in the theory of deep learning. In addition to defining the loss landscape and the class of functions at the beginning of optimization, they can also provide a robust framework for modeling learning dynamics and properties of the learned function at convergence. In this talk, we demonstrate how powerful tools from statistical field theory and random matrix theory can provide insight into random neural networks.

Sam McCandlish - slides

Title: An Empirical Model of Large-Batch Training

Abstract: How quickly can neural networks be trained using large batch sizes? The limits of data parallelism seem to differ from domain to domain, ranging from batches of tens of thousands in ImageNet classifiers to batches of millions in RL agents that play the game Dota 2. We describe a simple and easy-to-measure statistic called the gradient noise scale that predicts the largest useful batch size across many applications. Our empirically-motivated theory also describes the tradeoff between compute-efficiency and time-efficiency, and provides a rough model of the benefits of adaptive batch-size training.
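
A minimal sketch of how one might estimate such a noise scale from a collection of minibatch gradients; the "simple" estimator written here (trace of the per-example gradient covariance over the squared full-batch gradient norm) follows the abstract's description, and the variable names are illustrative:

```python
# Illustrative estimate of a "simple" gradient noise scale,
#   B_simple ~ tr(Sigma) / |G|^2,
# where G is the (estimated) full-batch gradient and Sigma the per-example
# gradient covariance, estimated from flattened gradients of minibatches of
# size b.  A larger noise scale suggests a larger useful batch size.
import numpy as np

def gradient_noise_scale(minibatch_grads, batch_size):
    """minibatch_grads: array [n_batches, n_params] of flattened gradients."""
    g_mean = minibatch_grads.mean(axis=0)                  # estimate of G
    # per-minibatch variance, scaled back up to the per-example covariance trace
    trace_sigma = batch_size * minibatch_grads.var(axis=0, ddof=1).sum()
    return trace_sigma / (g_mean @ g_mean)

# usage: collect flattened gradients from many minibatches of size b, then
#   B_simple = gradient_noise_scale(grads, b)
```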

Katherine Quinn - slides

Title: Visualizing Probabilities: Intensive Principal Component Analysis

Abstract: Obtaining low-dimensional representations of complex, high-dimensional data is frustrated by the 'curse of dimensionality'. A trade-off must often be made between preserving global versus local structure. This problem is exceptionally hard in the case of infinite dimensions, such as for probability distributions. Inspired by replica theory from statistical mechanics, we consider replicas of the system to tune the dimensionality and take the limit as the number of replicas goes to zero. The result is the intensive embedding, which is not only isometric (preserving local distances) but allows global structure to be more transparently visualized. We develop the Intensive Principal Component Analysis (InPCA) and demonstrate clear improvements in visualizations of the Ising model of magnetic spins, a neural network, and the dark energy cold dark matter (ΛCDM) model as applied to the Cosmic Microwave Background. arXiv link: https://arxiv.org/abs/1810.02877

Aristide Baratin

Title: On the Spectral Bias of Neural Networks

Abstract: Neural networks are known to be a class of highly expressive functions able to fit even random input-output mappings with 100% accuracy. In this talk, we present properties of neural networks that complement this aspect of expressivity. By using tools from Fourier analysis, we show that deep networks are biased towards low frequency functions. Intuitively, this property is in line with the observation that over-parameterized networks find simple patterns that generalize across data samples. We also investigate how the shape of the data manifold affects expressivity by showing evidence that, somewhat counter-intuitively, learning high frequencies gets easier with increasing manifold complexity.
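
A small empirical probe in the spirit of this result: fit a one-dimensional target built from several sinusoids with a small fully-connected network and track, via an FFT of the model output on a grid, how quickly each frequency component is learned. The architecture, frequencies, and training settings below are illustrative:

```python
# Illustrative probe of spectral bias: fit y(x) = sum_k sin(2*pi*f_k*x) with a
# small MLP and watch the Fourier amplitude of the model output at each target
# frequency during training.  Low frequencies are expected to be fit first.
import numpy as np
import torch
import torch.nn as nn

freqs = [1, 5, 10]                                     # illustrative frequencies
x = torch.linspace(0, 1, 256).unsqueeze(1)
y = sum(torch.sin(2 * np.pi * f * x) for f in freqs)

model = nn.Sequential(nn.Linear(1, 256), nn.ReLU(),
                      nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(5001):
    opt.zero_grad()
    loss = ((model(x) - y) ** 2).mean()
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        spectrum = np.abs(np.fft.rfft(model(x).detach().numpy().ravel()))
        amps = {f: round(2 * float(spectrum[f]) / len(x), 3) for f in freqs}
        print(step, amps)   # approximate learned amplitude at each frequency
```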

Joshua Batson

Title: Noise2Self: Blind Denoising with Self-Supervision

Abstract: Blind denoising is the problem of recovering a signal from noisy measurements with little knowledge of the distribution of the signal or the noise. We reframe denoising as iterated imputation, and prove recovery guarantees in cases where the noise exhibits a conditional independence structure. We use the corresponding objective function to train convolutional neural net denoisers with state-of-the-art performance without ground truth data, to calibrate classical denoising methods such as NL-means and median filters, and to find optimal hyperparameters for models of single-cell gene expression. This generalizes recent work on denoising (Noise2Noise by Lehtinen et al) and classical work on cross-validation for matrix factorization.
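
A minimal sketch of a self-supervised masking objective in the spirit of the approach described here: hide a subset of pixels, predict them from the unmasked context, and score the prediction only on the hidden coordinates, so the denoiser cannot simply copy the noise it was shown. The masking scheme and model interface are placeholders:

```python
# Sketch of a self-supervised (J-invariant style) denoising loss: mask a
# random subset J of pixels, let the model predict the image from the masked
# input, and compute the loss only on J.  With element-wise independent noise,
# the model cannot lower this loss by memorizing the noise.
import torch
import torch.nn.functional as F

def masked_denoising_loss(model, noisy, mask_frac=0.1):
    """noisy: [batch, channels, H, W] noisy images; model: image -> image."""
    mask = (torch.rand_like(noisy) < mask_frac).float()    # pixels in J
    # replace masked pixels with a local average so the net cannot see them
    blurred = F.avg_pool2d(noisy, 3, stride=1, padding=1)
    masked_input = noisy * (1 - mask) + blurred * mask
    pred = model(masked_input)
    return ((pred - noisy) ** 2 * mask).sum() / mask.sum()
```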

Joan Bruna

Title: Geometric Deep Learning in a NonCommutative World

Abstract: Convolutional Neural Networks have transformed Computer Vision thanks to their unique balance between inductive biases and expressivity. Whereas the mathematical picture of this success is still far from complete, the geometrical priors of natural image distributions, the ability of CNNs to perform multi-scale analysis, and the generalization bias of stochastic gradient descent appear as key ingredients.

These ingredients need to be generalized in order to extend the success of deep learning beyond signals defined on regular grids. In this talk, I will describe geometric deep learning models defined on graphs, focusing on three problems arising in several areas of physics which highlight a new modeling challenge due to non-commutative algebraic structures: the problem of efficient representation of unitary operators using Givens rotations, Zauner's conjecture, and learning under symplectic symmetry.

Alexander Alemi - slides

Title: TherML: A thermodynamics of machine learning

Abstract: Over the years, many connections have been made between machine learning and physics. By viewing learning through a principally representational and information theoretic lens, deep analogies can be made between a wide class of existing and popular modern machine learning objectives (e.g. Bayesian neural networks, variational autoencoders, variational information bottleneck classifiers) and thermodynamics. In this talk, I will motivate this analogy, compare to previous attempts to unify learning and physics, and explore just how deep the analogy goes.

Jonathan Kadmon

Title: High-dimensional manifold classification by multilayered perceptrons

Abstract: The multilayered perceptron is the underlying structure of many current artificial neural networks. The output of each neuron in the system is a nonlinear function of a weighted sum of the activities in the antecedent layer. For a single layer, the performance and capacity of the network to correctly classify data with arbitrary error margins were found by Gardner three decades ago. Recently, Chung et al. extended these results to the classification of arbitrary manifolds. However, it is not clear how these results extend to multilayered perceptrons. In this talk, I will present a completely solvable simplified model of feedforward networks, trained to classify high-dimensional manifolds. The layer-by-layer dynamics of the representational manifolds is reduced, using mean-field methods, to a non-linear iterative equation of a single variable. I will show the effects of depth, width, and thresholding on the performance of simple linear classification at the penultimate layer. For finite-size networks, a depth-width tradeoff results in an optimal architecture for given input statistics. Finally, I will discuss how this theory relates to more practical examples of deep learning.

Gavin Hartnett - slides

Title: Replica Symmetry Breaking in Bipartite Spin Glasses and Neural Networks

Abstract: Some interesting recent advances in the theoretical understanding of neural networks have been informed by results from the physics of disordered many-body systems. Motivated by these findings, this work uses the replica technique to study the mathematically tractable bipartite Sherrington-Kirkpatrick (SK) spin glass model, which is formally similar to a Restricted Boltzmann Machine (RBM) neural network. The bipartite SK model has been previously studied assuming replica symmetry; here this assumption is relaxed and a replica symmetry breaking analysis is performed. The bipartite SK model is found to have many features in common with Parisi's solution of the original, unipartite SK model, including the existence of a multitude of pure states which are related in a hierarchical, ultrametric fashion. As an application of this analysis, the optimal cost for a graph partitioning problem is shown to be simply related to the ground state energy of the bipartite SK model. As a second application, empirical investigations reveal that the Gibbs sampled outputs of an RBM trained on the MNIST data set are more ultrametrically distributed than the input data itself.

Yasaman Bahri

Title: Wide, Deep Neural Networks as Gaussian Processes

Abstract: Modern deep neural networks are seemingly complex systems whose design, optimization, and performance are at present not well-understood even within the context of supervised learning. In the theoretical sciences, a common approach towards tackling complexity is to identify, solve, and perturb about useful and accessible limits of a problem. Motivated by the trend in practice towards networks of increasing size, which are found to perform better than their smaller-sized counterparts, we target an understanding of deep fully-connected and convolutional neural networks in the limit of infinite width by deriving an exact correspondence to certain classes of Gaussian Processes. This also enables a route towards exact Bayesian inference with deep nets, which we compare against gradient-descent based optimization in networks of large, finite width.
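
A minimal sketch of the standard kernel recursion underlying this correspondence, for a deep fully-connected ReLU network, using the closed-form arc-cosine expectation; the weight and bias variances are illustrative. Exact Bayesian prediction then reduces to ordinary GP regression with the resulting kernel:

```python
# Sketch of the NNGP kernel recursion for a deep fully-connected ReLU network:
#   K^{l+1}(x, x') = sigma_b^2 + sigma_w^2 * E[relu(u) relu(v)],
# with (u, v) jointly Gaussian with covariance K^l.  For ReLU the expectation
# has a closed (arc-cosine) form.  GP regression with the final kernel gives
# exact Bayesian inference for the infinitely wide network.
import numpy as np

def nngp_kernel(X1, X2, depth=3, sigma_w2=2.0, sigma_b2=0.1):
    K12 = sigma_b2 + sigma_w2 * X1 @ X2.T / X1.shape[1]
    K11 = sigma_b2 + sigma_w2 * np.sum(X1**2, 1) / X1.shape[1]
    K22 = sigma_b2 + sigma_w2 * np.sum(X2**2, 1) / X2.shape[1]
    for _ in range(depth):
        norms = np.sqrt(np.outer(K11, K22))
        theta = np.arccos(np.clip(K12 / norms, -1.0, 1.0))
        K12 = sigma_b2 + sigma_w2 / (2 * np.pi) * norms * (
            np.sin(theta) + (np.pi - theta) * np.cos(theta))
        K11 = sigma_b2 + sigma_w2 * K11 / 2    # E[relu(u)^2] = K11 / 2
        K22 = sigma_b2 + sigma_w2 * K22 / 2
    return K12   # cross-kernel between the two sets of inputs
```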

Behnam Neyshabur - slides

Title: The Role of Over Parametrization in the Generalization of Neural Networks

Abstract: TBA

Samuel Smith - slides

Title: What can stochastic differential equations tell us about stochastic gradient descent and natural gradient descent?

Abstract: A small number of optimization algorithms are dominant in the deep learning community. Two algorithms in particular stand out; stochastic gradient descent is empirically the most successful, while natural gradient descent has the strongest theoretical motivation. In this talk I show that we can obtain practical insights into both algorithms by treating the parameter update as the discretization of a continuous-time stochastic differential equation. We predict that, for sufficiently small learning rates, the learning rate, batch size and momentum coefficient are related by simple scaling rules, and we confirm empirically that these scaling rules are often applicable for practical hyper-parameters. These scaling rules enable us to increase the batch size without sacrificing test accuracy and without expensive hyper-parameter tuning. Finally, we show that the stationary distribution of minibatch natural gradient descent is close to a Bayesian posterior near local minima.
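
An illustrative helper for the kind of scaling rule such an analysis yields, under the common identification of an SGD noise scale proportional to lr / (batch size * (1 - momentum)); the exact constants and regime of validity are as in the talk, and this particular form is an assumption here:

```python
# Sketch of a scaling rule: if the SGD noise scale behaves roughly like
#   g ~ lr / (B * (1 - momentum))   (at fixed training-set size),
# then test accuracy should be approximately preserved along changes that
# keep g fixed, e.g. increasing the batch size in proportion to the
# learning rate.
def rescale_batch_size(lr_old, B_old, lr_new, momentum_old=0.9, momentum_new=0.9):
    """Batch size that keeps the (assumed) SGD noise scale fixed when the
    learning rate and/or momentum change."""
    g_old = lr_old / (B_old * (1 - momentum_old))
    return lr_new / (g_old * (1 - momentum_new))

# e.g. going from lr=0.1, B=256 to lr=0.4 at fixed momentum suggests B ~ 1024:
print(rescale_batch_size(0.1, 256, 0.4))   # -> 1024.0
```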

Paul Ginsparg - talk

[Public Lecture]

Title: Rise of the machines: deep learning from Backgammon to Skynet

Abstract: Over the past seven years, there have been significant advances in applications of artificial intelligence, machine learning, and specifically deep learning, to a variety of familiar tasks. From image and speech recognition, self-driving cars, and machine translation, to beating the Go champion, it's been difficult to stay abreast of all the breathless reports of superhuman machine performance. There has also been a recent surge in applications of machine learning ideas to research problems in the hard sciences and medicine. I will endeavor to provide an outsider's overview of the ideas underlying these recent advances and their evolution over the past few decades, and project some prospects and pitfalls for the near future.

Uros Seljak - slides

Title: EL2O and MPM: alternatives to Variational Inference and MAP/MLE

Abstract: For machine learning applications, MCMC sampling is usually too expensive, and one resorts to approximate inference such as maximum likelihood/MAP or variational inference based on the KL divergence. I will present two alternatives to these methods and argue what their advantages are.

Andrea Montanari

Title: A mean field view of the landscape of two layers neural networks

Abstract: Learning a neural network requires optimizing a non-convex, high-dimensional objective (risk function), a problem which is usually attacked using stochastic gradient descent (SGD). Does SGD converge to a global optimum of the risk or only to a local optimum? In the first case, does this happen because local minima are absent, or because SGD somehow avoids them? In the second, why do local minima reached by SGD have good generalization properties?

We consider a simple case, namely two-layer neural networks, and prove that, in a suitable scaling limit, the SGD dynamics is captured by a certain non-linear partial differential equation (PDE) that we call distributional dynamics (DD). From a physics perspective, this is a PDE describing the evolution of the spatial density of a gas, with short range repulsion, evolving in an external potential. This description allows one to 'average out' some of the complexities of the landscape of neural networks, and can be used to prove a general convergence result for noisy SGD. We then focus on a particular class of networks, and use tools from optimal transport theory (in particular, displacement convexity) to characterize convergence rates.

Based on joint work with Song Mei, Phan-Minh Nguyen, Adel Javanmard and Marco Mondelli.

Maxime Gabella - slides

Title: Introduction to Topological Data Analysis

Abstract: The main goal of Topological Data Analysis (TDA) is to derive insights from the shape of data. Over the last decade, TDA has adapted approaches from algebraic topology to the study of datasets represented as point clouds in the space of features. I will give a pedagogical review of some of the main TDA tools, such as persistent homology and the “Mapper” graph. I will then present some ideas on how TDA can be used to monitor the training of neural networks.
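
A self-contained toy example of the simplest piece of persistent homology: zero-dimensional persistence (connected components of the Vietoris-Rips filtration), computed with a union-find over edges sorted by length. Real TDA libraries also compute loops and voids, but this conveys the "birth/death as a function of scale" idea; the data are placeholders:

```python
# Toy persistent homology in dimension 0: as the scale grows, points merge
# into clusters; each merge "kills" a component, and the (birth, death) pairs
# form the H0 persistence diagram.
import numpy as np

def h0_persistence(points):
    n = len(points)
    parent = list(range(n))

    def find(i):                                   # union-find with compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # all pairwise edges, sorted by length (the Rips filtration in dimension 0)
    edges = sorted((np.linalg.norm(points[i] - points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    diagram = []            # (birth, death) pairs; every component is born at 0
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            diagram.append((0.0, d))   # one component dies at scale d
    diagram.append((0.0, np.inf))      # the last component never dies
    return diagram

# two well-separated clusters -> one long-lived finite bar plus the infinite one
pts = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 8.0])
bars = h0_persistence(pts)
print(max(d for _, d in bars if np.isfinite(d)))   # roughly the cluster gap
```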

Alex Cole - slides

Title: Local Deformations and Global Features in Persistent Homology, Machine Learning, and Physics

Abstract: Persistent homology, which underlies the relatively new field of Topological Data Analysis, computes the multiscale topology of a data set by using a sequence of discrete complexes. Roughly speaking, persistent homology allows us to compute the “shape” of a data set. In this talk we use persistent homology as a toy model for more sophisticated machine learning classification techniques. Persistent homology can be regarded as an algorithm for classifying noisy, discretized manifolds by their topological invariants. One attractive aspect of persistent homology is its stability theorem: given perturbations to the data set, the change in the output is small. We attempt to connect this notion of stability with the machine learning task of classification. In this setting, a deformation changes one representative of a class to another representative of the same class. More broadly, we explore how deformations and deformation-invariant quantities play roles in (applied) topology, machine learning, and physics. We also briefly discuss how persistent homology can be fruitfully applied to the string landscape, cosmology, and phase transitions.

Jesse Thaler - slides

Title: Collision Course: Particle Physics as a Machine Learning Testbed

Abstract: Modern machine learning has had an outsized impact on many scientific fields, and particle physics is no exception. What is special about particle physics, though, is the vast amount of theoretical and experimental knowledge that we already have about many problems in the field. Using the Large Hadron Collider as an example, I explain the fascinating interface between theoretical principles and machine learning architectures. As one example, by trying to catalog the space of all possible collider measurements, we (re)discovered technology relevant for self-driving cars. As another example, by trying to partition collider events into well-defined categories, we (re)discovered technology relevant for analyzing documents. These examples and others suggest that particle physics is a rich domain for further research in artificial intelligence.

David Spergel

Title: ML for sub-grid physics

Abstract: TBA

Jascha Sohl-Dickstein - slides

Title: Parameter Estimation in Intractable Probabilistic Models by Minimum Probability Flow Learning

Abstract: Fitting energy-based models to data is often extremely difficult, due to the general intractability of the partition function. We propose a new parameter fitting method which bypasses this difficulty by considering only small perturbations from the data distribution toward the model distribution. Parameter estimation using this method is demonstrated for several probabilistic models, including an Ising spin glass where it outperforms other techniques by at least an order of magnitude in convergence time with lower error in the recovered coupling parameters. The application of this method to pattern storage in a Hopfield associative memory is also discussed.
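
A minimal sketch of a minimum-probability-flow style objective for a binary (Ising-like) model with single-spin-flip connectivity; the energy function, data, and exact normalization are placeholders, written from the standard presentation of the method rather than the talk's code:

```python
# Sketch of a minimum-probability-flow style objective for a pairwise Ising
# model with single-spin-flip connectivity:
#   K(J) ~ sum over data states x, sum over one-flip neighbors x' of
#            exp( (E(x; J) - E(x'; J)) / 2 ).
# This is minimized when no probability flows out of the data, and it never
# requires the partition function; being differentiable in J, it can be
# minimized with any gradient-based optimizer.
import numpy as np

def ising_energy(x, J):            # x in {-1,+1}^n, J symmetric with zero diagonal
    return -0.5 * x @ J @ x

def mpf_objective(J, data):
    """data: array [n_samples, n_spins] of +/-1 spins."""
    total = 0.0
    for x in data:
        for i in range(len(x)):
            x_flip = x.copy()
            x_flip[i] *= -1        # single-spin-flip neighbor
            total += np.exp(0.5 * (ising_energy(x, J) - ising_energy(x_flip, J)))
    return total / len(data)
```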

Adam Scherlis

Title: The Goldilocks zone: Towards better understanding of neural network loss

Abstract: We explore the loss landscape of fully-connected and convolutional neural networks using random, low-dimensional hyperplanes and hyperspheres. Evaluating the Hessian, H, of the loss function on these hypersurfaces, we observe 1) an unusual excess of the number of positive eigenvalues of H, and 2) a large value of Tr(H) / ||H|| at a well-defined range of configuration space radii, corresponding to a thick, hollow, spherical shell we refer to as the Goldilocks zone. We observe this effect for fully-connected neural networks over a range of network widths and depths on MNIST and CIFAR-10 datasets with the ReLU and tanh non-linearities, and a similar effect for convolutional networks. Using our observations, we demonstrate a close connection between the Goldilocks zone, measures of local convexity/prevalence of positive curvature, and the suitability of a network initialization. We show that the high and stable accuracy reached when optimizing on random, low-dimensional hypersurfaces is directly related to the overlap between the hypersurface and the Goldilocks zone, and as a corollary demonstrate that the notion of intrinsic dimension is initialization-dependent. We note that common initialization techniques initialize neural networks in this particular region of unusually high convexity/prevalence of positive curvature, and offer a geometric intuition for their success. Furthermore, we demonstrate that initializing a neural network at a number of points and selecting for high measures of local convexity such as Tr(H) / ||H||, number of positive eigenvalues of H, or low initial loss, leads to statistically significantly faster training on MNIST. Based on our observations, we hypothesize that the Goldilocks zone contains an unusually high density of suitable initialization configurations.
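
A sketch of how one might measure Tr(H) / ||H|| at different initialization radii, for a model small enough that the full Hessian with respect to a flattened parameter vector fits in memory; the norm is taken to be the Frobenius norm here (an assumption), and the model and data are placeholders. For realistic networks one would use stochastic trace estimators and Lanczos instead:

```python
# Sketch: compute Tr(H) / ||H|| (Frobenius norm assumed) for the loss of a
# tiny 10-16-2 MLP at initializations placed on spheres of different radii.
import torch
import torch.nn.functional as F

x = torch.randn(128, 10)                          # placeholder data
y = torch.randint(0, 2, (128,))
n_params = 10 * 16 + 16 + 16 * 2 + 2              # flattened parameter count

def loss_fn(theta):
    W1, b1, W2, b2 = torch.split(theta, [160, 16, 32, 2])
    h = torch.tanh(x @ W1.view(10, 16) + b1)
    logits = h @ W2.view(16, 2) + b2
    return F.cross_entropy(logits, y)

for radius in [0.1, 1.0, 3.0, 10.0]:              # illustrative radii
    theta = torch.randn(n_params)
    theta = radius * theta / theta.norm()         # place the init on a sphere
    H = torch.autograd.functional.hessian(loss_fn, theta)
    print(radius, float(torch.trace(H) / H.norm()))   # Tr(H) / ||H||_F
```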

Anders Andreassen - slides

Title: JUNIPR: a Framework for Unsupervised Machine Learning in Particle Physics

Abstract: In applications of machine learning to particle physics, a persistent challenge is how to go beyond discrimination to learn about the underlying physics. In this talk, we will present a new framework: JUNIPR, Jets from UNsupervised Interpretable PRobabilistic models, which uses unsupervised learning to learn the intricate high-dimensional contours of the data upon which it is trained, without reference to pre-established labels. In order to approach such a complex task, JUNIPR is structured intelligently around a leading-order model of the physics underlying the data. In addition to making unsupervised learning tractable, this design actually alleviates existing tensions between performance and interpretability. Applications to discrimination, data-driven Monte Carlo generation and reweighting of events will be discussed.

David Shih - slides

Title: Searching for New Physics with Deep Autoencoders

Abstract: The search for new physics Beyond the Standard Model (BSM) has been ongoing for over 40 years. Despite knowing that new physics is out there, and despite having many well-motivated theoretical expectations for what the new physics should look like, so far no concrete hints of it have shown up in experiments. There is growing recognition that we need new approaches and ideas to search for the unexpected in the data.

I will review the specific nature of the problem, focusing on the Large Hadron Collider (LHC), the highest-energy particle collider ever built. Then I will introduce a potentially powerful new method of searching for new physics at the LHC, using autoencoders and unsupervised deep learning. I will show how an autoencoder trained on Standard Model backgrounds can "discover" anomalous new physics events, despite having never seen them before, using the reconstruction error as an anomaly threshold. This opens up the exciting possibility of training directly on actual data to discover new physics with no prior expectations or theory prejudice.
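
A schematic of the anomaly-detection recipe described above, in generic PyTorch: train an autoencoder on background-only events and rank new events by reconstruction error. The architecture, feature dimension, and threshold choice are placeholders:

```python
# Sketch of autoencoder-based anomaly detection: train on background events
# only, then flag events whose reconstruction error is anomalously large.
import torch
import torch.nn as nn

n_features = 40                                    # placeholder event representation
autoencoder = nn.Sequential(
    nn.Linear(n_features, 16), nn.ReLU(), nn.Linear(16, 4),      # encoder
    nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, n_features))      # decoder
opt = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)

def train_step(background_batch):
    opt.zero_grad()
    loss = ((autoencoder(background_batch) - background_batch) ** 2).mean()
    loss.backward()
    opt.step()
    return loss.item()

def anomaly_score(events):
    with torch.no_grad():
        return ((autoencoder(events) - events) ** 2).mean(dim=1)  # per-event error

# events with anomaly_score above a threshold set on background data are
# candidates for new physics.
```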

Lechao Xiao - slides

Title: Why do ConvNets generalize better than fully-connected networks? An explanation via a mean field approach

Abstract: In many computer vision tasks, convolutional neural networks outperform fully-connected networks by a large margin. In this talk, we will present a possible mathematical explanation via a mean field approach. First, we demonstrate that CNNs can propagate a multi-dimensional signal along different Fourier modes. In contrast, fully-connected networks can only propagate a scalar signal. Second, we will present some empirical evidence that these Fourier modes play an important role in the test performance of CNNs. Finally, we will discuss one challenge of training deep CNNs coming from the uncertainty principle of the Fourier transform and present a solution to it.

Peter Eckersley - slides

Title: Impossibility and Uncertainty Theorems in AI value alignment (or why your AI should not have a utility function)

Abstract: Utility functions or their equivalents (value functions, objective functions, loss functions, reward functions, preference orderings) are a central tool in most current machine learning systems. These mechanisms for defining goals and guiding optimization run into practical and conceptual difficulty when there are independent, multi-dimensional objectives that need to be pursued simultaneously and cannot be reduced to each other. Ethicists have proved several impossibility theorems that stem from this origin; those results appear to show that there is no way of formally specifying what it means for an outcome to be good for a population without violating strong human ethical intuitions (in such cases, the objective function is a social welfare function). We argue that this is a practical problem for any machine learning system (such as medical decision support systems or autonomous weapons) or rigidly rule-based bureaucracy that will make high stakes decisions about human lives: such systems should not use objective functions in the strict mathematical sense.

We explore the alternative of using uncertain objectives represented for instance as partially ordered preferences, or as probability distributions over total orders. We show that previously known impossibility theorems can be transformed into uncertainty theorems in both of those settings, and prove lower bounds on how much uncertainty is implied by the impossibility results. We close by proposing two conjectures about the relationship between uncertainty in objectives and severe unintended consequences from AI systems.

Masoud Mohseni

Title: Approximate optimization and inference with tensor networks

Abstract: We devise a deterministic classical algorithm to reveal the structure of the low-energy spectrum for certain spin-glass systems that encode classical discrete optimization problems. We employ tensor networks to represent probability distributions of all possible configurations. We then develop efficient techniques for approximately extracting the relevant information from the networks for a class of quasi-two-dimensional Ising Hamiltonians. To this end, we apply a branch-and-bound approach over marginal probability distributions by approximately evaluating tensor contractions. Our approach identifies configurations with the largest Boltzmann weights corresponding to low energy states. We discover spin-glass droplet structures at finite temperatures by exploiting the local nature of the problems.

This droplet-finding algorithm naturally encompasses sampling from high-quality solutions within a given approximation ratio. It is thus established that tensor network techniques can provide profound insight into the structure of large low-dimensional spin-glass problems, with ramifications both for machine learning and noisy intermediate-scale quantum devices. At the same time, limitations of our approach highlight alternative directions to establish quantum speed-up and possible quantum supremacy experiments.

Michael Douglas - slides

Title: Machine Learning by a String (Theorist)

Abstract: An important frontier in AI is to integrate symbolic and differentiable methods. Arguably, the best understood problem of this type is to learn a representation of a graph by continuous data such as an embedding or a neural network. We survey a variety of techniques for this and test them on models of random graphs, some originating in physics such as the stochastic block model, and others intended as toy models of graphs arising in biology, in string theory and in automated theorem verification. If time permits, I will also discuss the analysis of large graphs arising from the verification of the Feit-Thompson theorem in Coq, facilitated by the GamePad environment of Huang et al.

Daniel Park - slides

Title: Optimal SGD Hyperparameters for Fully-connected Networks

Abstract: We train fully-connected networks on classification tasks and find an empirical formula that relates the optimal hyper-parameters of SGD (which maximize final test accuracy) to the initialization distribution of the parameters.

Work with Sam Smith, Jascha Sohl-Dickstein and Quoc Le.

Samuel Ocko - slides

Title: Diversity versus Depth in Neural Representations

Abstract: A trend in vertebrate visual processing is that retinas of larger-brained animals encode scenes quasi-linearly, while retinas of small-brained animals perform sophisticated non-linear computations extracting features directly relevant to behavior. We have recently observed this trend by training neural networks with architectures simulating those of vertebrates; a deep “brain” leads to linear, non-lossy representations at the retinal bottleneck, while a shallow “brain” yields highly non-linear computations. This suggests a deeper underlying principle at work. We hypothesize that the bottleneck forces a tradeoff between the diversity and depth of representations that a single network of this architecture is capable of computing. By varying bottleneck architectures and tasks, we are able to probe the relative benefits of diversity vs. depth of representations.

Miles Stoudenmire - slides

Title: "Wavefunctions" of data: applying tensor network methods from physics to machine learning

Abstract: TBA