Modern Trends in Nonconvex Optimization for Machine Learning

Invited Speakers

Yoshua Bengio (U Montreal)

Title: On stochastic gradient descent, flatness and generalization

Abstract: Why is it that an algorithm as simple as back-propagation with stochastic gradient descent (SGD) is so successful at training deep neural networks? On the one hand, optimization in very high-dimensional problems may be easier than initially thought, with bad local minima probably less of an issue than was feared for many decades. On the other hand, back-propagation, even when apparently crippled by lots of noise, still converges very fast, and much faster than perturbation-based methods. SGD variants end up having surprisingly good properties, both in terms of optimization and in terms of generalization. SGD focuses first on what the training examples have in common, on the factors that can explain many examples, and only much later on the exceptions, outliers and noisy examples which would otherwise lead to overfitting. In fact, it appears that the noise arising from smaller minibatches and gradient covariances may help both optimization (fast convergence) and generalization. An open question is how to reconcile the apparently better generalization associated with flatter minima with the fact that reparametrization can yield the same generalization from higher-curvature regions as from lower-curvature regions.
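
To make that last point concrete, here is the standard reparametrization observation behind it (our notation, not part of the abstract): an invertible change of parameters changes curvature but not function values, and hence not generalization.

```latex
% Illustration only; notation is ours, not the speaker's.
% Let L be the loss and \theta = T\phi an invertible linear reparametrization.
\[
\tilde{L}(\phi) := L(T\phi)
\quad\Longrightarrow\quad
\nabla^2 \tilde{L}(\phi) = T^{\top}\, \nabla^2 L(T\phi)\, T ,
\]
% so a suitable T can make a minimum arbitrarily sharp or flat while
% leaving the loss values, and thus generalization, unchanged.
```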

Bio: Yoshua Bengio (computer science PhD, 1991, McGill University; post-docs at MIT and Bell Labs) has been a computer science professor at U. Montréal since 1993. He has authored three books and over 500 publications (h-index 116, over 120k citations), mostly in deep learning; holds a Canada Research Chair in Statistical Learning Algorithms; is an Officer of the Order of Canada and a recipient of the Marie-Victorin Quebec Prize 2017; and is a CIFAR Senior Fellow who co-directs its Learning in Machines and Brains program. He is scientific director of the Montreal Institute for Learning Algorithms (MILA), currently the largest academic research group on deep learning. He is on the NIPS foundation board (previously program chair and general chair), co-created the ICLR conference (specialized in deep learning), and is scientific co-director of IVADO. He pioneered deep learning, and his goal is to help uncover the principles giving rise to intelligence through learning, as well as to favour the development of AI for the benefit of all.

Coralia Cartis (Oxford)

Title: Stochastic variants of nonconvex optimization methods, with complexity guarantees

Abstract: We present global rates of convergence/complexity for a general class of methods for nonconvex smooth optimization that includes linesearch, trust-region and regularisation strategies, but that allows inaccurate, probabilistic problem information. This framework subsumes certain stochastic gradient analyses and derivative-free techniques based on random sampling of function values, and also applies to stochastic optimization. We show that, in terms of the order of accuracy, the evaluation complexity of such methods is the same as that of their counterparts that use deterministic, accurate models.

On the way, I will review some existing results regarding cubic regularisation methods. This work is joint with Katya Scheinberg (Lehigh).
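
For context, the cubic regularisation step minimises a cubically regularised local model of the objective; in the probabilistic variants discussed in the talk, the model's ingredients are random and only required to be sufficiently accurate with a fixed probability. A standard form of the model (our notation) is:

```latex
% Standard cubic-regularisation model; notation is ours.
\[
m_k(s) \;=\; f(x_k) + g_k^{\top} s + \tfrac{1}{2}\, s^{\top} B_k\, s
        + \tfrac{\sigma_k}{3}\, \|s\|^{3},
\]
% where g_k and B_k are (possibly random, inexact) gradient and Hessian
% estimates at x_k, and the regularisation weight \sigma_k > 0 is adapted
% from iteration to iteration.
```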

Bio: Coralia Cartis (BSc Mathematics, Babeș-Bolyai University, Romania; PhD Mathematics, University of Cambridge) joined the Mathematical Institute at Oxford and Balliol College in 2013 as Associate Professor in Numerical Optimization. Previously, she worked as a research scientist at Rutherford Appleton Laboratory and as an associate professor in the School of Mathematics, University of Edinburgh. Her research addresses the analysis and development of (hopefully) useful algorithms for solving difficult optimization problems: nonconvex, with and without derivatives, local and global. She has also worked on compressed sensing and parameter estimation for climate modelling.

Dmitriy Drusvyatskiy (U Washington)

Title: Convergence rates of stochastic algorithms for nonsmooth nonconvex problems

Abstract: Stochastic iterative algorithms, such as the stochastic subgradient, Gauss-Newton, and proximal point methods, are the bedrock of large-scale computation. Though these methods are routinely and successfully used in practice, their guarantees are not yet well understood for nonsmooth nonconvex formulations. In this talk, I will try to address this disparity by focusing on the class of so-called "weakly convex problems". These are the problems that can be locally convexified by adding a sufficiently large quadratic. Weakly convex functions provide an appealing abstraction for structured nonsmooth nonconvex models that are common in data science. I will show that when applied to such problems, the three algorithms above drive a natural stationarity measure to zero at a controlled rate. One noteworthy consequence, deviating from the current literature, is that minibatching is not necessary to ensure convergence of stochastic proximal methods. Not all interesting functions are weakly convex, however. Neural networks with nonsmooth activation functions do not fit in the above framework. Nonetheless, I will end the talk by showing that for a virtually exhaustive class of optimization problems arising in data science, every limit point of the stochastic subgradient method is at the very least first-order stationary.

This is joint work with Damek Davis, Sham Kakade, and Jason D. Lee.
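
For readers new to the terminology, the two central objects, written here in our notation but matching the abstract's description, are weak convexity and the Moreau envelope, whose gradient norm serves as the natural stationarity measure:

```latex
% Notation is ours; the definitions are the standard ones the abstract refers to.
\[
f \ \text{is } \rho\text{-weakly convex} \iff
x \mapsto f(x) + \tfrac{\rho}{2}\,\|x\|^{2} \ \text{is convex},
\]
\[
f_{\lambda}(x) \;=\; \min_{y}\ f(y) + \tfrac{1}{2\lambda}\,\|y - x\|^{2},
\]
% and the stated rates drive \|\nabla f_{\lambda}(x_t)\| to zero.
```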

Bio: Dmitriy Drusvyatskiy received his PhD from the Operations Research and Information Engineering department at Cornell University in 2013, followed by a post-doctoral appointment in the Combinatorics and Optimization department at the University of Waterloo, 2013-2014. He joined the Mathematics department at the University of Washington as an Assistant Professor in 2014. Dmitriy’s research broadly focuses on designing and analyzing algorithms for large-scale nonsmooth optimization problems, often motivated by applications in data science. He has received a number of awards, including a finalist citation for the Tucker Prize (2015), an Air Force Office of Scientific Research (AFOSR) Young Investigator Program (YIP) Award, and an NSF CAREER award. Dmitriy is currently a co-PI of a new NSF-funded Transdisciplinary Research in Principles of Data Science (TRIPODS) institute at the University of Washington.

Elad Hazan (Princeton)

Title: Adaptive Regularization Strikes Back

Abstract: Stochastic gradient descent is the workhorse behind the recent deep learning revolution. This simple and age-old algorithm has been supplemented with a variety of enhancements to improve its practical performance, and sometimes its theoretical guarantees. Perhaps the most controversial and least-understood such enhancements are adaptive preconditioning methods. Since their inception, adaptive regularization methods have come in diagonal and full-matrix variants. However, only the former have enjoyed widespread adoption in training large-scale deep models, owing to the computational overhead of manipulating a full matrix in high dimension.

In this talk we show how to make full-matrix adaptive regularization practical and useful. We'll describe GGT, a truly scalable full-matrix adaptive optimizer. At the heart of this algorithm is an efficient, GPU-friendly method for multiplying a vector by the inverse square root of a low-rank matrix. We show that GGT converges to first-order local minima and characterize its rate relative to SGD, providing a first rigorous theoretical analysis of adaptive regularization in non-convex optimization, and we describe preliminary experimental results.

Joint work with Naman Agarwal, Brian Bullins, Xinyi Chen, Karan Singh, Cyril Zhang and Yi Zhang.
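
As a rough illustration of the low-rank trick at the heart of the algorithm: with G a tall d-by-r window of recent gradients (r much smaller than d), a vector can be multiplied by the inverse square root of G Gᵀ using only an r-by-r eigendecomposition. The following minimal NumPy sketch is ours; the function name and the eps damping are illustrative placeholders, not the paper's exact scheme.

```python
import numpy as np

def inv_sqrt_multiply(G, g, eps=1e-4):
    """Apply (G G^T)^{-1/2} to g using only an r x r eigendecomposition.

    G : (d, r) window of the r most recent gradients, r << d
    g : (d,)  current gradient
    Illustrative sketch; eps damps tiny eigenvalues in place of the
    paper's exact regularization.
    """
    M = G.T @ G                  # r x r Gram matrix; if G = U S V^T then M = V S^2 V^T
    lam, V = np.linalg.eigh(M)   # eigenvalues lam_i = s_i^2
    lam = np.maximum(lam, eps)
    # On the span of G: (G G^T)^{-1/2} g = G V diag(lam^{-3/2}) V^T G^T g
    return G @ (V @ ((V.T @ (G.T @ g)) / lam**1.5))
```

The point of the factorization is that the only cubic-cost operation involves the small r x r matrix, which is what makes a full-matrix preconditioner feasible on GPUs.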

Bio: Elad Hazan is a professor of computer science at Princeton University. His research focuses on the design and analysis of algorithms for basic problems in machine learning and optimization. Amongst his contributions are the co-development of the AdaGrad algorithm and the first sublinear-time algorithms for convex optimization. He is a two-time recipient of the IBM Goldberg best paper award: in 2012 for contributions to sublinear-time algorithms for machine learning, and in 2008 for decision making under uncertainty. He has also received a European Research Council grant, a Marie Curie fellowship, a Google Research Award, and the Bell Labs Prize. He served on the steering committee of the Association for Computational Learning and was program chair for COLT 2015.

Sham Kakade (U Washington)

Title: Provably Correct Automatic Subdifferentiation

Abstract: Machine learning libraries for Automatic Differentiation (AD), such as TensorFlow and PyTorch, have had a transformative effect on applied machine learning and optimization. Their effectiveness is based on the "cheap gradient principle" (Griewank, 2008) --- the computational cost of computing a d-dimensional vector of partial derivatives of a scalar function is nearly the same as that of simply computing the scalar function itself; this is what allows us to quickly obtain (high-dimensional) gradients of scalar loss functions, which are subsequently used in black-box gradient-based optimization procedures. The striking Baur-Strassen Theorem (1983) formalizes this, showing that this constant is a (dimension-free) factor of 5 (in the algebraic circuit complexity model).

The current state of affairs is markedly different with regards to automatically computing subgradients, because the chain rule for differentiation is no longer valid when composing non-smooth functions (such as the ReLU). More troubling is that all widely used ML libraries --- TensorFlow, PyTorch, AutoGrad, MxNet, etc. --- do not correctly compute subgradients, which can be observed even on simple examples. This work considers the question: is there a "cheap subgradient principle"? Our main result shows that, under certain restrictions on our library of nonsmooth functions (standard in nonlinear programming), provably correct subgradients can be computed at a computational cost that is within a (dimension-free) factor of 6 of the cost of computing the scalar function itself.

This is joint work with: Dmitriy Drusvyatskiy and Jason D. Lee.
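
A concrete instance of the failure mode described above (our example, in PyTorch): relu(x) - relu(-x) equals x everywhere, so its derivative at 0 is 1, yet propagating the usual convention relu'(0) = 0 through the chain rule returns 0, which is not a valid subgradient.

```python
import torch

x = torch.tensor(0.0, requires_grad=True)
# f(x) = relu(x) - relu(-x) is exactly the identity function, so f'(0) = 1.
f = torch.relu(x) - torch.relu(-x)
f.backward()
print(x.grad)  # tensor(0.) -- the AD answer; not a valid subgradient of f at 0
```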

Bio: Sham Kakade is a Washington Research Foundation Data Science Chair, with a joint appointment in both the Computer Science and Engineering and the Statistics departments at the University of Washington. He works on both theoretical and applied questions in machine learning and artificial intelligence, focusing on designing (and implementing) statistically and computationally efficient algorithms. Amongst his contributions are establishing a number of principled approaches and frameworks in reinforcement learning (including the natural policy gradient and conservative policy iteration) and the co-development of spectral algorithms for statistical estimation (including provably efficient statistical estimation for mixture models, topic models, hidden Markov models, and overlapping communities in social networks). His more recent contributions are on faster algorithms for nonconvex optimization as well as stochastic optimization (and stochastic approximation). He is the recipient of the IBM Goldberg best paper award (2007) for contributions to fast nearest neighbor search, and of the INFORMS Revenue Management and Pricing Section Prize for best paper (2014). He was program chair for COLT 2011.

Sham completed his Ph.D. at the Gatsby Computational Neuroscience Unit at University College London, advised by Peter Dayan, and did a postdoc at UPenn under the supervision of Michael Kearns. He has held positions at Microsoft Research New England, the Department of Statistics at Wharton (UPenn), and the Toyota Technological Institute at Chicago.

Sergey Levine (UC Berkeley)

Title: Meta-Learning of Gradient-Based Learners

Abstract: Handling small training sets is often considered to be one of the major weaknesses of deep learning methods, for both supervised learning tasks and reinforcement learning problems. Meta-learning aims to address this by learning to learn: using large amounts of past experience on other learning problems to acquire a more efficient learning procedure that can learn new tasks with just a handful of examples or samples. Initially, most deep meta-learning methods made use of recurrent models that directly ingest the entire task training set, essentially "learning" the fast learning algorithm in the connections of a large neural network. However, this procedure can be unpredictable and brittle in practice, since the learned adaptation procedure blurs the line between evaluation and learning, and usually does not provide a convergent learning procedure, nor a mechanism for incorporating larger amounts of data as they become available. Model-agnostic meta-learning (MAML) is a technique for meta-learning that instead optimizes the initial parameter values of a standard parametric model (e.g., a neural network) such that this network can adapt rapidly to new learning tasks. While the premise of this method may at first seem surprising -- that simply choosing the initial parameters of a deep network is enough to enable fast, sample-efficient learning -- I will discuss how this procedure in fact admits a universality result, which means that it is as expressive as any alternative method (e.g., a recurrent network), and that it can achieve good results in practice. I will also discuss how MAML can be extended to a variety of reinforcement learning scenarios, and can accommodate uncertainty via a Bayesian formulation.
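
As a sketch of the mechanism: the MAML meta-objective evaluates, on each task's query set, the parameters produced by one differentiable gradient step on that task's support set, so the outer gradient flows through the inner adaptation. The PyTorch snippet below is our minimal rendering under assumed interfaces (tasks yielding (support, query) batches and a differentiable loss_fn(params, batch)); it is illustrative, not the authors' implementation.

```python
import torch

def maml_meta_loss(theta, tasks, loss_fn, inner_lr=0.01):
    """Minimal MAML sketch. `theta` is a list of parameter tensors;
    `tasks` yields (support, query) batches; `loss_fn(params, batch)`
    is assumed differentiable. Interfaces are illustrative."""
    meta_loss = 0.0
    for support, query in tasks:
        # Inner step: one gradient step on the support set. create_graph=True
        # keeps this step differentiable, which is what makes it MAML.
        grads = torch.autograd.grad(loss_fn(theta, support), theta,
                                    create_graph=True)
        theta_adapted = [p - inner_lr * g for p, g in zip(theta, grads)]
        # Outer objective: loss of the adapted parameters on the query set.
        meta_loss = meta_loss + loss_fn(theta_adapted, query)
    return meta_loss  # differentiate w.r.t. theta for the meta-update
```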

Bio: Sergey Levine received a BS and MS in Computer Science from Stanford University in 2009, and a Ph.D. in Computer Science from Stanford University in 2014. He joined the faculty of the Department of Electrical Engineering and Computer Sciences at UC Berkeley in fall 2016. His work focuses on machine learning for decision making and control, with an emphasis on deep learning and reinforcement learning algorithms. Applications of his work include autonomous robots and vehicles, as well as computer vision and graphics. His research includes developing algorithms for end-to-end training of deep neural network policies that combine perception and control, scalable algorithms for inverse reinforcement learning, deep reinforcement learning algorithms, and more.

Praneeth Netrapalli (MSR India)

Title: How to escape saddle points efficiently?

Abstract: Non-convex optimization is ubiquitous in modern machine learning applications. Gradient-descent-based algorithms (such as gradient descent or Nesterov's accelerated gradient descent) are widely used in practice since they have been observed to perform well on these problems. From a theoretical standpoint, however, this is quite surprising, since these nonconvex problems are teeming with highly suboptimal saddle points, and it is well known that gradient descent (and its variants) can get stuck in such saddle points. A key question that arises in resolving this apparent paradox is whether gradient-descent-based algorithms can efficiently escape saddle points (where the Hessian has negative eigenvalues) and converge to points where the Hessian is positive semidefinite (such points are called second-order stationary points). We answer this question in the affirmative by showing the following:

(1) Perturbed gradient descent (sketched below) converges to second-order stationary points in a number of iterations which depends only poly-logarithmically on dimension (i.e., it is almost “dimension-free”). The convergence rate of this procedure matches the well-known convergence rate of gradient descent to first-order stationary points, up to log factors, and

(2) A variant of Nesterov’s accelerated gradient descent converges to second-order stationary points at a faster rate than perturbed gradient descent.

The key technical components behind these results are (1) understanding geometry around saddle points and (2) designing a potential function for Nesterov’s accelerated gradient descent for non-convex problems.

Based on joint works with Chi Jin, Rong Ge, Sham M. Kakade and Mike Jordan.
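
The perturbation mechanism in result (1) is simple to state: run gradient descent, and whenever the gradient is small (so the iterate may be near a saddle) occasionally add a small random perturbation, which lets negative-curvature directions take over. The NumPy sketch below is ours; the constants are illustrative placeholders, whereas the paper sets them from the smoothness and Hessian-Lipschitz parameters.

```python
import numpy as np

def perturbed_gradient_descent(grad, x0, eta=0.01, radius=0.1,
                               g_thresh=1e-3, t_wait=10, n_iters=10_000,
                               seed=0):
    """Gradient descent plus occasional random perturbations near
    suspected saddles. All constants are illustrative placeholders."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    last_perturbed = -t_wait
    for t in range(n_iters):
        g = grad(x)
        if np.linalg.norm(g) <= g_thresh and t - last_perturbed >= t_wait:
            xi = rng.standard_normal(x.shape)
            x = x + radius * xi / np.linalg.norm(xi)  # random point on a small sphere
            last_perturbed = t
            g = grad(x)
        x = x - eta * g
    return x
```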

Bio: Praneeth Netrapalli is currently a researcher at Microsoft Research India, Bangalore. His main research agenda is to design efficient and provable algorithms for machine learning. Two approaches that have proved effective in this pursuit are nonconvex optimization and streaming/stochastic optimization. His initial work was one of the first to shed light on the performance of alternating minimization (a meta-heuristic widely used across several fields) on some well-studied machine learning problems, and it subsequently led to the design of faster algorithms for several other problems as well. His recent work is focused on designing faster algorithms for general nonconvex optimization as well as stochastic optimization.

Praneeth obtained his B.Tech. from IIT Bombay in 2007, and his MS and PhD from The University of Texas at Austin in 2011 and 2014 respectively, all in Electrical Engineering. From 2014 to 2016, he was a postdoctoral researcher at Microsoft Research New England in Cambridge, MA. From 2007 to 2009, he worked as a quantitative analyst at Goldman Sachs, Bangalore, where his work focused on evaluating prices and risk of financial derivatives.