Talk Abstracts

Christos Papadimitriou

On biologically plausible artificial neural networks

TBD


Phil Long + Hanie Sedghi

Size-free generalization bounds for convolutional neural networks

We prove bounds on the generalization error of convolutional networks. The bounds are in terms of the training loss, the number of parameters, the Lipschitz constant of the loss, and the distance from the weights to the initial weights. They are independent of the number of pixels in the input and of the height and width of hidden feature maps. We present experiments on CIFAR-10 and a scaled-down variant, in which we vary the hyperparameters of a deep convolutional network, comparing our bounds with the empirically observed generalization gaps.
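
Since the bounds depend on the trained weights only through their distance to initialization, the quantity tracked in such experiments looks like the following minimal sketch (the exact norm and per-layer aggregation used in the paper may differ):

import numpy as np

def distance_to_init(trained_layers, init_layers):
    # Euclidean distance between trained and initial weights, over all
    # parameters jointly (a sketch; the paper's bounds may aggregate
    # per-layer or use a different norm)
    sq = 0.0
    for w, w0 in zip(trained_layers, init_layers):
        sq += float(np.sum((w - w0) ** 2))
    return np.sqrt(sq)

# toy example: two "conv kernels" drifting slightly from initialization
rng = np.random.default_rng(0)
init = [rng.normal(size=(3, 3, 16, 32)), rng.normal(size=(3, 3, 32, 64))]
trained = [w + 0.01 * rng.normal(size=w.shape) for w in init]
print(distance_to_init(trained, init))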

Rina Panigrahy

How does the mind store information in memory?

We present a mechanism to compute a sketch (succinct summary) of how a complex modular deep network processes its inputs. The sketch summarizes essential information about the inputs and outputs of the network and can be used to quickly identify key components and summary statistics of the inputs. Furthermore, the sketch is recursive and can be unrolled to identify sub-components of these components and so forth, capturing a potentially complicated DAG structure. These sketches erase gracefully; even if we erase a fraction of the sketch at random, the remainder still retains the "high-weight" information present in the original sketch. The sketches can also be organized in a repository to implicitly form a "knowledge graph"; it is possible to quickly retrieve sketches in the repository that are related to a sketch of interest; arranged in this fashion, the sketches can also be used to learn emerging concepts by looking for new clusters in sketch space. Finally, in the scenario where we want to learn a ground truth deep network, we show that augmenting input/output pairs with these sketches can theoretically make it easier to do so.
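
As a toy illustration of graceful erasure (a generic random-projection sketch, not the paper's recursive construction): a linear sketch decoded from a randomly erased half of its coordinates still recovers the "high-weight" content, only with more noise.

import numpy as np

rng = np.random.default_rng(0)
n, m = 64, 2048                       # signal dimension, sketch dimension
R = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))   # E[R.T @ R] = I_n

x = rng.normal(size=n)                # the content to be sketched
s = R @ x                             # linear sketch

keep = rng.random(m) > 0.5            # erase a random half of the sketch
x_full = R.T @ s                      # decode from the intact sketch
x_erased = (R[keep].T @ s[keep]) * (m / keep.sum())   # rescaled decode

print(np.linalg.norm(x - x_full) / np.linalg.norm(x))     # small error
print(np.linalg.norm(x - x_erased) / np.linalg.norm(x))   # larger, but not catastrophic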

Raman Arora

Understanding the inductive bias due to dropout

We investigate the role of explicit algorithmic regularization due to dropout in linear regression, matrix completion, and deep neural networks trained with the square loss. We show that dropout at the final hidden layer yields a data-dependent (weighted) trace-norm regularizer in matrix problems and a path regularizer in neural networks. We introduce a novel adaptive dropout procedure that induces data-independent max-$\ell_p$ regularization for any p; this, we believe, is an interesting algorithmic approach to study in its own right. Finally, we conclude with some data-dependent generalization bounds and show that the proposed bounds correlate well with real-world experiments.
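
As a warm-up illustrating the flavor of such results (a standard calculation, not the talk's trace-norm theorem): for linear regression with i.i.d. dropout masks $b_i \sim \mathrm{Bernoulli}(p)$ applied to the features with the usual $1/p$ rescaling, the expected loss decomposes into the original loss plus a data-dependent quadratic penalty,

\[
\mathbb{E}_b\Big[\big(y - w^\top (b \odot x)/p\big)^2\Big]
  = (y - w^\top x)^2 + \frac{1-p}{p}\sum_{i} w_i^2\, x_i^2 .
\]

Averaged over the data, the penalty weights each $w_i^2$ by the second moment of feature $i$, which is the sense in which the induced regularizer is data-dependent.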

Vitaly Feldman

Does Learning Require Memorization? A Short Tale about a Long Tail

State-of-the-art results on image recognition tasks are achieved using over-parameterized learning algorithms that (nearly) perfectly fit the training set. This phenomenon is referred to as data interpolation or, informally, as memorization of the training data. The question of why such algorithms generalize well to unseen data is not adequately addressed by the standard theoretical frameworks and, as a result, significant theoretical and experimental effort has been devoted to understanding the properties of such algorithms.

We provide a simple model for prediction problems in which interpolating the dataset is necessary for achieving close-to-optimal generalization error. The model is motivated and supported by the results of several recent empirical works. In our model, data is sampled from a mixture of subpopulations, and the frequencies of these subpopulations are chosen from some prior. The model makes it possible to quantify the effect of not fitting the training data on the generalization performance of the learned classifier, and it demonstrates that memorization is necessary whenever the frequencies are long-tailed. Image and text data are known to follow such distributions, and our results therefore establish a formal link between these empirical phenomena. To the best of our knowledge, this is the first general framework that demonstrates statistical benefits of plain memorization for learning. Our results also have concrete implications for the cost of ensuring differential privacy in learning.
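
A quick simulation conveys why long tails force memorization: under a Zipf-like prior, a noticeable fraction of the test-time mass falls on subpopulations seen exactly once in training, and a learner that declines to fit (memorize) those singleton examples forfeits roughly that mass in accuracy. A sketch with hypothetical parameters:

import numpy as np

rng = np.random.default_rng(0)
N, n = 10_000, 10_000      # number of subpopulations, training-set size

freq = 1.0 / np.arange(1, N + 1)   # hypothetical Zipf-like frequency prior
freq /= freq.sum()

sample = rng.choice(N, size=n, p=freq)       # subpopulation of each example
counts = np.bincount(sample, minlength=N)

singleton_mass = freq[counts == 1].sum()     # future weight of once-seen groups
print(f"mass on subpopulations seen exactly once: {singleton_mass:.3f}")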

Leon Bottou

Learning Representations Using Causal Invariance

Learning algorithms often capture spurious correlations present in the training data distribution instead of addressing the task of interest. Such spurious correlations occur because the data collection process is subject to uncontrolled confounding biases. Suppose, however, that we have access to multiple datasets exemplifying the same concept but whose distributions exhibit different biases. Can we learn something that is common across all these distributions, while ignoring the spurious ways in which they differ? This can be achieved by projecting the data into a representation space that satisfies a causal invariance criterion. This idea differs in important ways from previous work on statistical robustness or adversarial objectives. Like recent work on invariant feature selection, the aim is to discover the actual mechanism underlying the data rather than to model its superficial statistics.
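
A minimal numpy sketch of one way to operationalize such a criterion, in the spirit of an invariance penalty: for a fixed representation, check whether the same readout is simultaneously optimal in every environment by penalizing each environment's risk gradient with respect to a scalar multiplier at w = 1. The setup and penalty form below are illustrative assumptions, not the talk's exact criterion.

import numpy as np

def env_risk_grad(phi, y, w=1.0):
    # d/dw of the squared-error risk E[(w * phi - y)^2], evaluated at w
    return np.mean(2.0 * phi * (w * phi - y))

def invariance_penalty(envs):
    # sum of squared per-environment risk gradients at w = 1; it vanishes
    # exactly when one shared readout is stationary in every environment
    return sum(env_risk_grad(phi, y) ** 2 for phi, y in envs)

rng = np.random.default_rng(0)
y1, y2 = rng.normal(size=1000), rng.normal(size=1000)
# a "causal" feature relates to y identically in both environments,
# while a spurious feature's correlation flips sign across environments
causal = [(y1 + 0.1 * rng.normal(size=1000), y1),
          (y2 + 0.1 * rng.normal(size=1000), y2)]
spurious = [(+y1 + 0.1 * rng.normal(size=1000), y1),
            (-y2 + 0.1 * rng.normal(size=1000), y2)]
print(invariance_penalty(causal))    # small: one readout works everywhere
print(invariance_penalty(spurious))  # large: the optimal readout flips per environment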

Jieming Mao

On the Power of Interactivity in Local Differential Privacy

We study the power of interactivity in local differential privacy. We first focus on the difference between fully interactive and sequentially interactive protocols. Sequentially interactive protocols may query users adaptively in sequence, but they cannot return to previously queried users. The vast majority of existing lower bounds for local differential privacy apply only to sequentially interactive protocols, and before this work it was not known whether fully interactive protocols are more powerful. We resolve this question. We classify locally private protocols by their compositionality, the multiplicative factor k ≥ 1 by which the sum of a protocol's single-round privacy parameters exceeds its overall privacy guarantee. We then show how to efficiently transform any fully interactive k-compositional protocol into an equivalent sequentially interactive protocol with an O(k) blowup in sample complexity. Next, we show that our reduction is tight by exhibiting a family of problems such that for any k, there is a fully interactive k-compositional protocol which solves the problem, while no sequentially interactive protocol can solve the problem without at least a factor of Ω(k) more examples (up to logarithmic factors). We then turn our attention to hypothesis testing problems. We show that for a large class of compound hypothesis testing problems (which includes all simple hypothesis testing problems as a special case), a simple noninteractive test is optimal among the class of all (possibly fully interactive) tests.
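
In symbols, restating the abstract's definition: a protocol with single-round privacy parameters $\epsilon_1, \epsilon_2, \ldots$ and overall privacy guarantee $\epsilon$ is $k$-compositional for

\[
k = \frac{\sum_t \epsilon_t}{\epsilon} \ge 1,
\]

so a protocol whose rounds' privacy costs sum to exactly its overall guarantee is 1-compositional.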

Nisheeth Vishnoi

Physics-inspired Algorithms: Hamiltonian Monte Carlo for Sampling

Over hundreds of years of understanding natural systems, physicists have developed a wealth of dynamics and viewpoints. Some of these methods, when viewed (and abstracted) through the lens of computation, can lead to new optimization and sampling techniques for problems in machine learning, statistics, and theoretical computer science. I will present a recent example from my research on such interactions between physics and algorithms: a Hamiltonian-dynamics-inspired algorithm for sampling from continuous distributions.
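
For concreteness, here is a minimal sketch of the classical Hamiltonian Monte Carlo step (leapfrog integration of the dynamics plus a Metropolis correction), the textbook starting point for this line of work:

import numpy as np

def hmc_step(x, U, grad_U, rng, eps=0.1, L=20):
    # one HMC step for a target density proportional to exp(-U(x))
    p = rng.normal(size=x.shape)                  # resample momentum
    x_new, p_new = x.copy(), p - 0.5 * eps * grad_U(x)   # initial half kick
    for _ in range(L):                            # leapfrog integration
        x_new = x_new + eps * p_new
        p_new = p_new - eps * grad_U(x_new)
    p_new = p_new + 0.5 * eps * grad_U(x_new)     # undo the extra half kick
    # accept or reject based on the change in total energy H = U + |p|^2 / 2
    dH = (U(x_new) + p_new @ p_new / 2) - (U(x) + p @ p / 2)
    return x_new if np.log(rng.random()) < -dH else x

# sample from a standard Gaussian, where U(x) = |x|^2 / 2
rng = np.random.default_rng(0)
U, grad_U = (lambda x: x @ x / 2), (lambda x: x)
x, samples = np.zeros(2), []
for _ in range(2000):
    x = hmc_step(x, U, grad_U, rng)
    samples.append(x)
print(np.mean(samples, axis=0), np.var(samples, axis=0))   # ~0 and ~1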

Manish Purohit

Three Flavors of Predictions in Online Algorithms

In this work we study how machine-learned predictions can be used to improve the performance of online algorithms. We consider two classical problems, ski rental and non-clairvoyant job scheduling, and obtain new online algorithms that use predictions to make their decisions. These algorithms are oblivious to the performance of the predictor: they improve with better predictions, but do not degrade much when the predictions are poor.
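
For ski rental (renting costs 1 per day, buying costs b), one deterministic strategy from this line of work illustrates the guarantee: a parameter lam in (0, 1] interpolates between trusting the prediction (competitive ratio near 1 + lam when it is right) and worst-case safety (never worse than 1 + 1/lam). A sketch, with details possibly differing from the talk's algorithms:

import math

def ski_rental_with_prediction(b, predicted_days, lam, actual_days):
    # buy early if the prediction says the season is long, late if short
    if predicted_days >= b:
        buy_day = math.ceil(lam * b)      # trust "long": commit early
    else:
        buy_day = math.ceil(b / lam)      # trust "short": delay buying
    if actual_days < buy_day:
        return actual_days                # rented every day
    return (buy_day - 1) + b              # rented, then bought

# good prediction: near-optimal cost; bad prediction: still bounded
print(ski_rental_with_prediction(b=10, predicted_days=100, lam=0.5, actual_days=100))  # 14 vs OPT 10
print(ski_rental_with_prediction(b=10, predicted_days=1, lam=0.5, actual_days=100))    # 29 vs OPT 10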

Sergei Vassilvitskii

Learning in Algorithms

Classic algorithm design focuses on optimizing worst-case complexity, resulting in algorithms that perform well for any possible input. In many applications, however, the typical input is not just far from the worst case but has predictable characteristics. We give an overview of recent results that incorporate machine-learned predictions into algorithm design. These approaches improve the algorithms' performance when the predictions are correct, for instance by lowering the competitive ratios of online algorithms and the space complexity of streaming algorithms, and they nearly match the best known worst-case bounds when the predictions are wrong.

Claudio Gentile

On Active Learning with Zooming

We present recent research activity in online importance-weighted active learning. We consider an active learning algorithm that adaptively zooms into the current version space, thereby reducing the best-in-class error relative to the starting hypothesis set. We show that this algorithm admits favorable theoretical guarantees, and that these guarantees can be further improved by combining them with properties of disagreements among hypotheses. Time permitting, we will also show experimental results on multiple datasets demonstrating that the proposed algorithm achieves far better test performance than standard importance-weighted active learning algorithms for the same labeling budget.

Joint work with C. Cortes, G. DeSalvo, M. Mohri, and N. Zhang.

Steve Hanneke

Toward Optimal Agnostic Active Learning

One of the oldest and most important questions in the theory of active learning is characterizing the optimal sample complexity of agnostic active learning. While the best known lower bound is always smaller than the sample complexity of passive learning, the best known upper bounds are sometimes no better than passive learning. This gap has persisted in the literature for over a decade, and it is conjectured that the lower bound is the correct answer: that is, that agnostic active learning can always have smaller sample complexity than passive learning. Proving this necessarily requires developing new approaches to agnostic active learning. This talk describes recent progress toward that goal.

Karthik Sridharan

Towards Building Non-polarizing Recommendation Algorithms

Machine learning algorithms are used to predict the preferences of users and to recommend content to them. Users, in turn, are influenced by other users and by the recommendations and content choices they are presented with. Such influence can often have an extremizing effect that polarizes users, and current ML-based recommender systems do not account for these polarizing effects. In this talk I will present some of our initial attempts at building theory and algorithm-design principles for machine learning systems that aim not only to predict or recommend with high accuracy, but also to avoid further polarizing their users. Specifically, we assume that the users of the system are interconnected via a social network and that the machine learning algorithm has access to the structure of this network. We then build on and extend existing opinion-formation models to capture how users are influenced both by recommendations and by their friends in the social network. Finally, we provide an algorithm for aggregating recommendations that aims not only for high accuracy but simultaneously for reducing a natural measure of polarization that we propose. The prediction problem under these polarization dynamics essentially boils down to online learning with non-linear dynamics governing how the states (the users' opinions) evolve. We show that under our model of opinion-formation dynamics (which subsumes existing opinion-dynamics models) our recommendation algorithm provably has a low polarization effect and obtains a favorable performance guarantee relative to the best non-polarizing mixture of recommendations.
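
The flavor of the underlying dynamics can be seen in a small Friedkin-Johnsen-style simulation; the recommendation coupling below is an illustrative assumption, not the talk's exact model:

import numpy as np

def opinion_dynamics(W, s, rec, alpha=0.5, steps=100):
    # each user's opinion moves toward a mix of their innate opinion s,
    # their friends' average opinion (network weights W), and a
    # recommendation signal rec
    z = s.copy()
    for _ in range(steps):
        social = W @ z / np.maximum(W.sum(axis=1), 1e-12)
        z = (s + social + alpha * rec) / (2.0 + alpha)
    return z

def polarization(z):
    return np.var(z)          # one natural (illustrative) polarization measure

rng = np.random.default_rng(0)
n = 50
W = (rng.random((n, n)) < 0.1).astype(float)
np.fill_diagonal(W, 0)
s = rng.uniform(-1, 1, size=n)
print(polarization(opinion_dynamics(W, s, rec=np.sign(s))))   # reinforcing recommendations
print(polarization(opinion_dynamics(W, s, rec=np.zeros(n))))  # neutral recommendations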

Brendan McMahan and Peter Kairouz

Open problems in federated learning

federated.withgoogle.com/#learn

Ananda Theertha Suresh

Agnostic Federated Learning

A key learning scenario in large-scale applications is that of federated learning, where a centralized model is trained based on data originating from a large number of clients. We argue that, with existing training and inference methods, federated models can be biased towards different clients. Instead, we propose a new framework of agnostic federated learning, where the centralized model is optimized for any target distribution formed by a mixture of the client distributions. We further show that this framework naturally yields a notion of fairness. We present data-dependent Rademacher complexity guarantees for learning with this objective, which guide the definition of an algorithm for agnostic federated learning. We also give a fast stochastic optimization algorithm for solving the corresponding optimization problem, for which we prove convergence bounds, assuming a convex loss function and a convex hypothesis set. We further demonstrate the benefits of our approach empirically on several datasets. Beyond federated learning, our framework and algorithm can be of interest to other learning scenarios, such as cloud computing, domain adaptation, and drifting distributions, and to other contexts where the training and test distributions do not coincide.
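
Concretely, the agnostic objective replaces the usual fixed average of client losses with a worst-case mixture over an allowed set $\Lambda$ of mixture weights: with $\mathcal{L}_k(h)$ denoting the expected loss of hypothesis $h$ on client $k$'s distribution, one solves

\[
\min_{h \in \mathcal{H}} \; \max_{\lambda \in \Lambda} \; \sum_{k} \lambda_k \, \mathcal{L}_k(h) .
\]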

Kunal Talwar

On the Error Resistance of Hinge Loss Minimization

Commonly used classification algorithms in machine learning, such as support vector machines, minimize a convex surrogate loss on training examples. In practice, these algorithms are surprisingly robust to errors in the training data. In this work, we identify a set of conditions on the data under which such surrogate loss minimization algorithms provably learn the correct classifier. This allows us to establish, in a unified framework, the robustness of these algorithms under various models of the data as well as of the errors. For example, we show that if the data is linearly classifiable with a slightly non-trivial margin (i.e., a margin of at least $C/\sqrt{d}$ for $d$-dimensional unit vectors), and the class-conditional distributions are near-isotropic and logconcave, then surrogate loss minimization has negligible error on the uncorrupted data even when a constant fraction of examples are adversarially mislabeled.

Pranjal Awasthi

On robustness to adversarial examples and polynomial optimization

TBD

Andrew Cotter

On Making Stochastic Classifiers Deterministic

Stochastic classifiers arise in a number of machine learning problems, and have become especially prominent of late, as they often result from constrained optimization problems, e.g. for fairness, churn, or custom loss optimization. Despite their utility, the inherent randomness of stochastic classifiers may make them undesirable for a variety of practical reasons. In this talk, I'll attempt to answer the theoretical question of how well a stochastic classifier can be approximated by a deterministic one, and compare several possible approaches, proving lower and upper bounds.
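
As one natural baseline among the approaches compared, consider thresholding the stochastic classifier's expected prediction; a toy sketch (one construction, not the talk's full comparison):

import numpy as np

rng = np.random.default_rng(0)

# a stochastic classifier: a distribution over three linear classifiers
thetas = [np.array([1.0, 0.2]), np.array([0.8, -0.1]), np.array([1.1, 0.5])]
probs = np.array([0.5, 0.3, 0.2])

def stochastic_predict(x, rng):
    theta = thetas[rng.choice(len(thetas), p=probs)]
    return int(x @ theta > 0)

def derandomized_predict(x):
    # threshold the expected prediction of the stochastic classifier
    expected = sum(p * (x @ th > 0) for p, th in zip(probs, thetas))
    return int(expected > 0.5)

X = rng.normal(size=(10_000, 2))
stoch = np.array([stochastic_predict(x, rng) for x in X])
det = np.array([derandomized_predict(x) for x in X])
print("disagreement with one stochastic draw:", np.mean(stoch != det))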

Shivani Agarwal

Multiclass Learning with General Losses: What is the Right Output Coding and Decoding?

Multiclass learning with 0-1 loss is fairly well understood; however, in practice, many real-world problems require performance to be measured via a more complex loss. What is the right way to design principled learning algorithms for multiclass problems with general losses? Theoretically, an elegant and rigorous way to design statistically consistent learning algorithms is via the design of convex calibrated surrogate losses. In practice, an approach that is often favored is that of output coding, which reduces multiclass learning to a set of binary classification problems. In this talk, we will discuss recent progress in bringing together these approaches under a unifying lens.

This is joint work with Harish G. Ramaswamy, Balaji S. Babu, Ambuj Tewari, and Bob Williamson.
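
To make the output-coding side concrete, here is a minimal sketch of coding and decoding with a one-vs-all code matrix and least-squares binary learners (an illustration of the general recipe, not the talk's calibrated surrogates):

import numpy as np

rng = np.random.default_rng(0)

# code matrix: rows = classes, columns = binary tasks (one-vs-all code)
M = 2 * np.eye(4) - 1                       # +1 on the class, -1 elsewhere

# toy data: four Gaussian blobs
n = 200
X = np.vstack([rng.normal(loc=c, size=(n, 2))
               for c in ([0, 0], [4, 0], [0, 4], [4, 4])])
y = np.repeat(np.arange(4), n)

# one least-squares "binary learner" per column of the code
Xb = np.hstack([X, np.ones((len(X), 1))])   # add a bias feature
W = np.linalg.lstsq(Xb, M[y], rcond=None)[0]

# decoding: predict the class whose codeword best matches the scores
scores = Xb @ W
pred = np.argmax(scores @ M.T, axis=1)
print("accuracy:", np.mean(pred == y))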

Thodoris Lykouris

Adversarial robustness for stochastic bandit learning

Bandit learning has proved to be a useful paradigm for effective online decision-making in a variety of applications, including online advertising, recommender systems, and clinical trials. However, classical approaches either require that the data be i.i.d. (stochastic bandits) or provide only worst-case guarantees (adversarial bandits). In this talk, I will briefly review approaches that go beyond these two strong assumptions. I will then discuss recent progress towards addressing settings that are not completely i.i.d. but still exhibit well-behaved randomness. The talk is mostly based on a working paper with Vahab Mirrokni and Renato Paes Leme.

Haipeng Luo

Beating Stochastic and Adversarial Semi-bandits Optimally and Simultaneously

We develop the first general semi-bandit algorithm that simultaneously achieves O(log T) regret in stochastic environments and O(√T) regret in adversarial environments, without prior knowledge of the regime or of the number of rounds T. The leading problem-dependent constants of our bounds are not only optimal in a certain worst-case sense studied previously, but also optimal for two concrete instances of semi-bandit problems. Our algorithm and analysis extend the recent work of Zimmert & Seldin (2019) for the special case of multi-armed bandits, but importantly require a novel hybrid regularizer designed specifically for semi-bandits. Experimental results on synthetic data show that our algorithm indeed performs well across different environments. Finally, we provide a preliminary extension of our results to full bandit feedback.

Alex Slivkins

Incentivized exploration

In a wide range of scenarios, individual decision-makers consume information revealed by previous decisions and produce information that may help future decisions. Each decision-maker would individually prefer to "exploit" (optimize the current reward), but would like the other agents to "explore" (try out various alternatives). A social planner, by means of carefully designed information disclosure, can incentivize the agents to balance exploration and exploitation so as to maximize social welfare. Various aspects of this topic have been studied; we survey some of the recent progress.

Joint work with Nicole Immorlica (MSR-NYC), Jieming Mao (Google), and Steven Wu (University of Minnesota).

Naman Agarwal + Karan Singh

Provably Robust Control

We study the control of a linear dynamical system with adversarial disturbances (as opposed to statistical noise). The objective we consider is one of regret: we desire an online control procedure that does nearly as well as a procedure that has full knowledge of the disturbances in hindsight. Our main result is an efficient algorithm that provides nearly tight regret bounds for this problem. From a technical standpoint, this work generalizes previous work in two main ways: our model allows for adversarial noise in the dynamics, and it allows for general convex costs.
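
In symbols (a restatement of the setting in standard notation): the state evolves as $x_{t+1} = A x_t + B u_t + w_t$ with adversarially chosen disturbances $w_t$, and the controller's regret against a comparator class $\Pi$ of policies is

\[
\mathrm{Regret}_T = \sum_{t=1}^{T} c_t(x_t, u_t) \;-\; \min_{\pi \in \Pi} \sum_{t=1}^{T} c_t(x_t^{\pi}, u_t^{\pi}),
\]

where the $c_t$ are (here, per the abstract, general convex) cost functions.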

Akshay Krishnamurthy

Provably Efficient Reinforcement Learning with Rich Observations

We study the exploration problem in episodic MDPs with rich observations generated from a small number of latent states. Under certain identifiability assumptions, we demonstrate how to estimate a mapping from observations to latent states inductively, through a sequence of regression and clustering steps in which previously decoded latent states provide labels for later regression problems, and how to use this mapping to construct good exploration policies. We provide finite-sample guarantees on the quality of the learned state-decoding function and exploration policies, and we complement our theory with an empirical evaluation on a class of hard exploration problems. Our method exponentially improves over Q-learning with naïve exploration, even when Q-learning is given cheating access to the latent states.

Jon Schneider + Renato Paes Leme

Contextual Search via Intrinsic Volumes

We study the problem of contextual search, a multidimensional generalization of binary search that captures many problems in contextual decision-making. In contextual search, a learner is trying to learn the value of a hidden vector $v \in [0,1]^d$. Every round the learner is provided an adversarially chosen context $u_t \in \mathbb{R}^d$, submits a guess $p_t$ for the value of $\langle u_t, v \rangle$, learns whether $p_t < \langle u_t, v \rangle$, and incurs loss $\ell(\langle u_t, v \rangle, p_t)$ (for some loss function $\ell$). The learner's goal is to minimize their total loss over the course of $T$ rounds.

We present an algorithm for the contextual search problem for the symmetric loss function $\ell(\theta, p) = |\theta - p|$ that achieves $O_d(1)$ total loss. We present a new algorithm for the dynamic pricing problem (which can be realized as a special case of the contextual search problem) that achieves $O_d(\log\log T)$ total loss, improving on the previous best known upper bounds of $O_d(\log T)$ and matching the known lower bounds (up to a polynomial dependence on $d$). Both algorithms make significant use of ideas from the field of integral geometry, most notably the notion of intrinsic volumes of a convex set. To the best of our knowledge this is the first application of intrinsic volumes to algorithm design.
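
In the one-dimensional case with $u_t = 1$, contextual search reduces to classical binary search, and the geometric shrinking of the knowledge set already yields O(1) total loss under the symmetric loss; a toy simulation of the protocol (not the paper's intrinsic-volumes algorithm):

import numpy as np

rng = np.random.default_rng(0)
v = rng.random()                  # hidden value; the d = 1 case with u_t = 1
lo, hi, total_loss = 0.0, 1.0, 0.0

for t in range(30):
    p = (lo + hi) / 2             # guess the midpoint of the knowledge set
    if p < v:                     # feedback: only whether the guess was low
        lo = p
    else:
        hi = p
    total_loss += abs(v - p)      # symmetric loss |theta - p|

print(f"hidden v = {v:.6f}, final interval = [{lo:.6f}, {hi:.6f}]")
print(f"total loss = {total_loss:.6f}")   # a geometric series: O(1) overall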

Satyen Kale

Escaping Saddle Points with Adaptive Gradient Methods

Adaptive methods such as Adam and RMSProp are widely used in deep learning but are not well understood. In this paper, we seek a crisp, clean and precise characterization of their behavior in nonconvex settings. To this end, we first provide a novel view of adaptive methods as preconditioned SGD, where the preconditioner is estimated in an online manner. By studying the preconditioner on its own, we elucidate its purpose: it rescales the stochastic gradient noise to be isotropic near stationary points, which helps escape saddle points. Furthermore, we show that adaptive methods can efficiently estimate the aforementioned preconditioner. By gluing together these two components, we provide the first (to our knowledge) second-order convergence result for any adaptive method. The key insight from our analysis is that, compared to SGD, adaptive methods escape saddle points faster, and can converge faster overall to second-order stationary points.
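
A minimal sketch of the preconditioned-SGD view, using an RMSProp-style update (a simplification; Adam adds momentum and bias corrections):

import numpy as np

def rmsprop(grad, x0, rng, lr=0.01, beta=0.99, eps=1e-8, steps=1000):
    # maintain an online estimate v of the gradient second moment, and
    # scale the step by the preconditioner diag(v)^(-1/2)
    x, v = x0.copy(), np.zeros_like(x0)
    for _ in range(steps):
        g = grad(x, rng)
        v = beta * v + (1 - beta) * g * g       # online preconditioner estimate
        x = x - lr * g / (np.sqrt(v) + eps)     # preconditioned SGD step
    return x

def grad(x, rng):
    # f(x) = (x0^2 - 1)^2 + x1^2: a saddle at the origin, minima at (+-1, 0)
    g = np.array([4 * x[0] * (x[0] ** 2 - 1), 2 * x[1]])
    return g + rng.normal(scale=0.1, size=2)    # stochastic gradient noise

rng = np.random.default_rng(0)
print(rmsprop(grad, np.array([1e-3, 0.5]), rng))   # moves off the saddle toward (+-1, 0)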

Dylan Foster

The Complexity of Making the Gradient Small in Stochastic Optimization

We give nearly matching upper and lower bounds on the oracle complexity of finding $\epsilon$-stationary points ($\|\nabla F(x)\| \le \epsilon$) in stochastic convex optimization. We jointly analyze the oracle complexity in both the local stochastic oracle model and the global oracle (or statistical learning) model. This allows us to decompose the complexity of finding near-stationary points into optimization complexity and sample complexity, and it reveals some surprising differences between the complexity of stochastic optimization and that of learning. Notably, we show that in the global oracle/statistical learning model, only logarithmic dependence on smoothness is required to find a near-stationary point, whereas polynomial dependence on smoothness is necessary in the local stochastic oracle model. In other words, the separation in complexity between the two models can be exponential, and the folklore understanding that smoothness is required to find stationary points is only weakly true for statistical learning.

Our upper bounds are based on extensions of a recent "recursive regularization" technique proposed by Allen-Zhu (2018). We show how to extend the technique to achieve near-optimal rates, and in particular show how to leverage the extra information available in the global oracle model. Our algorithm for the global model can be implemented efficiently through finite sum methods, and suggests an interesting new computational-statistical tradeoff.

Francesco Orabona

Momentum-Based Variance Reduction in Non-Convex SGD

Variance reduction has emerged in recent years as a strong competitor to stochastic gradient descent in non-convex problems, providing the first algorithms to improve upon the convergence rate of stochastic gradient descent for finding first-order critical points. However, variance reduction techniques typically require carefully tuned learning rates and a willingness to use excessively large "mega-batches" in order to achieve their improved results. We present a new variance reduction algorithm, STORM, that does not require any batches and makes use of adaptive learning rates, enabling simpler implementation and less hyperparameter tuning. Our technique for removing the batches uses a variant of momentum to achieve variance reduction in non-convex optimization. On smooth losses, STORM finds a point $x$ with $\mathbb{E}[\|\nabla F(x)\|] \le O(1/\sqrt{T} + \sigma^{1/3}/T^{1/3})$ in $T$ iterations, where $\sigma^2$ is the variance in the gradients; this matches the best known rate, but without requiring knowledge of $\sigma$.
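
The core of the method is the gradient estimate below; a minimal sketch with constant step sizes (the paper uses adaptive schedules), where the key detail is that both gradients in a round are evaluated on the same fresh sample:

import numpy as np

def storm(grad, x0, rng, lr=0.05, a=0.1, steps=2000):
    # d_t = grad(x_t; xi_t) + (1 - a) * (d_{t-1} - grad(x_{t-1}; xi_t)),
    # with BOTH gradients in round t using the same sample xi_t;
    # no mega-batches or checkpoints required
    x_prev = x0.copy()
    d = grad(x_prev, int(rng.integers(1 << 31)))   # first estimate: plain gradient
    x = x_prev - lr * d
    for _ in range(steps):
        xi = int(rng.integers(1 << 31))            # one fresh sample per step
        d = grad(x, xi) + (1 - a) * (d - grad(x_prev, xi))
        x_prev, x = x, x - lr * d
    return x

def grad(x, seed):
    # stochastic gradient of f(x) = |x|^2 / 2 with sample-dependent noise
    noise = np.random.default_rng(seed).normal(scale=1.0, size=x.shape)
    return x + noise

rng = np.random.default_rng(0)
print(np.linalg.norm(storm(grad, np.full(10, 5.0), rng)))   # near the stationary point 0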

Renato Paes Leme + Vahab Mirrokni

Adaptivity in Submodular Optimization

TBD

Rong Ge

Explaining Landscape Connectivity of Low-cost Solutions for Multilayer Nets

Mode connectivity (Garipov et al., 2018; Draxler et al., 2018) is a surprising phenomenon in the loss landscape of deep nets. Optima, at least those discovered by gradient-based optimization, turn out to be connected by simple paths on which the loss function is almost constant. Often, these paths can be chosen to be piecewise linear, with as few as two segments. We give mathematical explanations for this phenomenon, assuming generic properties (such as dropout stability and noise stability) of well-trained deep nets, which have previously been identified as part of understanding the generalization properties of deep nets. Our explanation holds for realistic multilayer nets, and experiments are presented to verify the theory.
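
The experimental check behind such claims is simple: evaluate the loss along a two-segment piecewise-linear path between two solutions. A toy sketch with a linear stand-in model (for a deep net one substitutes the network's loss and a learned bend point):

import numpy as np

def two_segment_path(theta_a, theta_b, bend, t):
    # point at position t in [0, 1] on the path theta_a -> bend -> theta_b
    if t < 0.5:
        return theta_a + 2 * t * (bend - theta_a)
    return bend + (2 * t - 1) * (theta_b - bend)

def loss(theta, X, y):
    return np.mean((X @ theta - y) ** 2)   # stand-in for a deep net's loss

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5)
theta_star = np.linalg.lstsq(X, y, rcond=None)[0]
theta_a = theta_star + 1e-3 * rng.normal(size=5)    # two nearby low-loss solutions
theta_b = theta_star - 1e-3 * rng.normal(size=5)
bend = (theta_a + theta_b) / 2 + 1e-3 * rng.normal(size=5)

for t in np.linspace(0, 1, 5):
    print(f"t = {t:.2f}  loss = {loss(two_segment_path(theta_a, theta_b, bend, t), X, y):.6f}")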

Stephanie Jegelka

How Powerful are Graph Neural Networks?

Graph Neural Networks (GNNs) have become a popular tool for learning graph representations. However, despite GNNs revolutionizing graph representation learning, there is limited understanding of their representational power and limitations. In this work, we study the discriminative power of message-passing networks as a function of architecture choices. We show that certain popular architectures cannot learn to distinguish certain simple graph structures. Our results also identify the choices that make a model in this class maximally expressive.

This is joint work with Keyulu Xu, Weihua Hu, and Jure Leskovec.
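
A three-line experiment conveys the discriminative-power issue: mean and max aggregators collapse neighborhoods that differ as multisets, while sum aggregation keeps them apart. A toy sketch:

import numpy as np

# two neighborhoods (multisets of neighbor features): {a, b} vs {a, a, b, b}
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
hood1 = np.stack([a, b])
hood2 = np.stack([a, a, b, b])

for name, agg in [("mean", lambda h: h.mean(axis=0)),
                  ("max", lambda h: h.max(axis=0)),
                  ("sum", lambda h: h.sum(axis=0))]:
    m1, m2 = agg(hood1), agg(hood2)
    print(f"{name}: distinguishes = {not np.allclose(m1, m2)}")
# mean and max fail to distinguish the two neighborhoods; sum, as used in
# maximally expressive architectures of this class, succeeds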