Beyond first-order methods in ML Systems

Schedule

Date: Saturday, July 24th 2021 - Time Zone: PDT

Schedule

07:00 AM - 07:10 AM: Introductory remarks

07:10 AM - 07:55 AM: Talk by Stefania Bellavia: Recent trends in regularization methods with adaptive accuracy requirements

07:55 AM - 08:05 AM: Q&A with Stefania Bellavia

08:05 AM - 08:50 AM: Talk by Clement Royer: Conjugate gradient techniques for nonconvex optimization

08:50 AM - 09:00 AM: Q&A with Clement Royer

09:00 AM - 09:20 AM: Break

09:20 AM - 10:05 AM: Talk by Frank Curtis: Algorithms for Deterministically Constrained Stochastic Optimization

10:05 AM - 10:15 AM: Q&A with Frank Curtis

10:15 AM - 11:00 AM: Talk by Courtney Paquette: SGD in the Large: Average-case Analysis, Asymptotics, and Stepsize Criticality

11:00 AM - 11:10 AM: Q&A with Courtney Paquette

11:10 AM - 11:20 AM: Spotlight 1: Convergence Analysis and Implicit Regularization of Feedback Alignment for Deep Linear Networks

11:20 AM - 11:30 AM: Spotlight 2: Computing the Newton-step faster than Hessian accumulation

11:30 AM - 01:00 PM: Break

01:00 PM - 01:45 PM: Talk by Ashia Wilson: Descent method framework in optimization

01:45 PM - 01:55 PM: Q&A with Ashia Wilson

01:55 PM - 02:40 PM: Talk by Jelena Diakonikolas: Faster Empirical Risk Minimization

02:40 PM - 02:50 PM: Q&A with Jelena Diakonikolas

02:50 PM - 03:15 PM: Break

03:15 PM - 03:25 PM: Spotlight 3: Regularized Newton Method with Global O(1/k^2) Convergence

03:25 PM - 03:35 PM: Spotlight 4: Structured second-order methods via natural-gradient descent

03:35 PM - 03:45 PM: Spotlight 5: Implicit Regularization in Overparameterized Bilevel Optimization

03:45 PM - 04:30 PM: Talk by Quanquan Gu: Stochastic Variance-Reduced High-order Optimization for Nonconvex Optimization

04:30 PM - 04:40 PM: Q&A with Quanquan Gu

04:40 PM - : Closing remarks

Abstracts

Speaker: Stefania Bellavia

Title: Recent trends in regularization methods with adaptive accuracy requirements

Abstract: In this talk we present regularization methods employing inexact function and derivatives evaluations. They feature a very flexible adaptive mechanism for determining the inexactness which is allowed, at each iteration, when computing objective function and derivatives, in order to preserve the complexity results of their exact counterpart. The complexity analysis covers arbitrary optimality order and arbitrary degree of available approximate derivatives.

In most applications the accuracy requirements can be satisfied only within a certain probability. As a first step to cope with this non-deterministic aspect, complexity results for the case of exact functions and randomly perturbed derivatives are provided. We also analyse the key computational aspects related to an efficient implementation of the method of this class employing a cubic regularized second-order model and provide some numerical results showing the behaviour of the method.

Authors: Stefania Bellavia, Gianmarco Gurioli, Benedetta Morini, Philippe Toint

Speaker: Clement Royer

Title: Conjugate gradient techniques for nonconvex optimization

Abstract: Nonconvex optimization formulations have recently gained popularity due to their prevalence in deep learning applications, but they also arise in numerous problems from data analysis and robust statistics. In this setting, second-order stationary points often correspond to global optima, which motivates the development of second-order methods with fast convergence capabilities. Meanwhile, the study of acceleration phenomena, initially proposed in convex optimization, has expanded to nonconvex formulations, and lead to revisiting of popular large-scale nonlinear optimization frameworks to equip them with complexity guarantees while maintaining their practical appeal.

In this talk, we will review recent advances in employing Krylov methods to accelerate optimization procedures in a nonconvex setting. We will particularly focus on linear conjugate gradient and its variations, and we will show that this method is naturally built to detect nonconvexity on quadratic problems. We will then move to a general nonconvex nonlinear optimization setting, and describe a Newton-Conjugate Gradient method with best known complexity guarantees. Numerical illustrations of the performance of this method and extensions will be provided, with a focus on nonconvex instances in data science.

Speaker: Frank Curtis

Title: Algorithms for Deterministically Constrained Stochastic Optimization

Abstract: We discuss algorithms for solving optimization problems in which the objective function is defined by an expectation and the constraints are deterministic. Numerous algorithms exist for solving such problems when the constraints define a convex feasible region. By contrast, we propose algorithms for situations in which the constraints are nonlinear equations and/or inequalities that define nonconvex feasible regions. Our algorithms, for which convergence and worst-case complexity guarantees are offered in expectation, vastly outperform naive extensions of first-order methods to this constrained setting.

Speaker: Courtney Paquette

Title: SGD in the Large: Average-case Analysis, Asymptotics, and Stepsize Criticality

Abstract: In this talk, I will present a framework, inspired by random matrix theory, for analyzing the dynamics of stochastic gradient descent (SGD) when both the number of samples and dimensions are large. Using this new framework, we show that the dynamics of SGD on a least squares problem with random data become deterministic in the large sample and dimensional limit. Furthermore, the limiting dynamics are governed by a Volterra integral equation. This model predicts that SGD undergoes a phase transition at an explicitly given critical stepsize that ultimately affects its convergence rate, which we also verify experimentally. Finally, when input data is isotropic, we provide explicit expressions for the dynamics and average-case convergence rates. These rates show significant improvement over the worst-case complexities.

Speaker: Ashia Wilson

Title: Descent method framework in optimization

Abstract: We present a framework for going beyond first order methods we refer to as the “descent method framework.” We construct a family of algorithms using intuition from dynamical systems, and show how they obtain fast non-asymptotic convergence for various classes of smooth functions. We also present a framework for “accelerating” or adding momentum descent algorithms with better theoretical complexity guarantees. We end by illustrating how this framework can be extended to manifolds and possibly lead to new techniques and analyses that incorporate geometric properties of various types of parameter spaces.

Speaker: Jelena Diakonikolas

Title: Faster Empirical Risk Minimization

Abstract: Empirical Risk Minimization (ERM) problems are central to machine learning, and their efficient optimization has been studied from different perspectives, often taking advantage of the finite sum structure present in typical problem formulations. In particular, tight oracle complexity bounds have been obtained under fairly general assumptions about the loss functions. In this talk, I will present a rather surprising and general result that takes advantage of the separability of nonsmooth convex loss functions with efficiently computable proximal operators -- such as, e.g., the hinge loss and the sum of absolute errors -- to obtain an algorithm that exhibits significantly lower complexity than what is predicted by the lower bounds for general nonsmooth convex losses. Crucial to this result is the use of proximal operators, which goes beyond the simple gradient oracle access to the function. The talk is based on joint work with Chaobing Song and Stephen Wright.

Speaker: Quanquan Gu

Title: Stochastic Variance-Reduced High-order Optimization for Nonconvex Optimization

Abstract: High-order optimization methods, such as cubic regularization methods, have attracted great interest in recent years due to their power to better leverage the optimization landscape. To apply it to large-scale optimization in machine learning, it is of great interest to extend it to stochastic optimization. In this talk, I will introduce a stochastic variance-reduced p-th-order method for finding first-order stationary points in nonconvex finite-sum optimization. Our algorithm enjoys state-of-the-art complexities under several complexity measures including gradient and Hessian sample complexities. I will also introduce corresponding lower bound results to suggest the near-optimality of our algorithm under some specific circumstances.

Page updated

Google Sites

Report abuse