Schedule

Friday, July 17th - Timezone: PDT

(Tentative) Schedule

08:00 AM - 08:15 AM: Introductory remarks

08:15 AM - 09:00 AM: Talk by Peter Richtarik: Fast linear convergence of randomized BFGS

09:00 AM - 09:10 AM: Q&A with Peter Richtarik

09:10 AM - 09:55 AM: Talk by Francis Bach: Second Order Strikes Back - Globally convergent Newton methods for ill-conditioned generalized self-concordant Losses

09:55 AM - 10:05 AM: Q&A with Francis Bach

10:05 AM - 10:30 AM: Break

10:30 AM - 10:40 AM: Spotlight 1: A Second-Order Optimization Algorithm for Solving Problems Involving Group Sparse Regularization

10:40 AM - 10:50 AM: Spotlight 2: Ridge Riding: Finding diverse solutions by following eigenvectors of the Hessian

10:50 AM - 11:00 AM: Spotlight 3: PyHessian: Neural Networks Through the Lens of the Hessian

11:00 AM - 11:45 AM: Talk by Coralia Cartis: Dimensionality reduction techniques for large-scale optimization problems

11:45 AM - 11:55 AM: Q&A with Coralia Cartis

11:55 AM - 01:30 PM: Break

01:30 PM - 01:40 PM: Spotlight 4: MomentumRNN: Integrating Momentum into Recurrent Neural Networks

01:40 PM - 01:50 PM: Spotlight 5: Step-size Adaptation Using Exponentiated Gradient Updates

01:50 PM - 02:00 PM: Spotlight 6: Competitive Mirror Descent

02:00 PM - 02:15 PM: Industry Panel - Talk by Boris Ginsburg: Large scale deep learning: new trends and optimization challenges

02:15 PM - 02:30 PM: Industry Panel - Talk by Jonathan Hseu: ML Models in Production

02:30 PM - 02:45 PM: Industry Panel - Talk by Andres Rodriguez: Shifting the DL industry to 2nd order methods

02:45 PM - 03:00 PM: Industry Panel - Talk by Lin Xiao: Statistical Adaptive Stochastic Gradient Methods

03:00 PM - 03:30 PM: Industry panel Q&A

03:30 PM - 04:15 PM: Talk by Rachel Ward: Weighted Optimization: better generalization by smoother interpolation

04:15 PM - 04:25 PM: Q&A with Rachel Ward

04:25 PM - 05:00 PM: Break

05:00 PM - 05:45 PM: Talk by Rio Yokota: Degree of Approximation and Overhead of Computing Curvature, Information, and Noise Matrices

05:45 PM - 05:55 PM: Q&A with Rio Yokota

05:55 PM - 06:00 PM: Closing remarks

Abstracts

Speaker: Francis Bach

Title: Second Order Strikes Back - Globally convergent Newton methods for ill-conditioned generalized self-concordant Losses

Abstract: We will study large-scale convex optimization algorithms based on the Newton method applied to regularized generalized self-concordant losses, which include logistic regression and softmax regression. We first prove that our new simple scheme, based on a sequence of problems with decreasing regularization parameters, is globally convergent and that this convergence is linear, with a constant factor that scales only logarithmically with the condition number. In the parametric setting, we obtain an algorithm with the same scaling as regular first-order methods but with improved behavior, in particular on ill-conditioned problems. Second, in the non-parametric machine learning setting, we provide an explicit algorithm combining the previous scheme with Nyström projection techniques, and prove that it achieves optimal generalization bounds with a time complexity of order O(n\sqrt{n}), a memory complexity of order O(n), and no dependence on the condition number, generalizing the results known for least-squares regression. (Joint work with Ulysse Marteau-Ferey and Alessandro Rudi, https://arxiv.org/abs/1907.01771.)
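
For intuition, here is a minimal NumPy sketch of the warm-started, decreasing-regularization idea for l2-regularized logistic regression (labels in {0,1}). It only illustrates the general scheme, not the exact algorithm or the Nyström-projected variant from the paper; the step counts and schedule below are arbitrary choices.

```python
# Minimal sketch: Newton's method warm-started along a path of shrinking
# regularization parameters (illustrative only, not the authors' exact algorithm).
import numpy as np

def newton_logreg(X, y, lam, w, steps=5):
    """A few Newton steps on the l2-regularized logistic loss (labels in {0,1})."""
    n = X.shape[0]
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))              # predicted probabilities
        grad = X.T @ (p - y) / n + lam * w
        D = p * (1.0 - p)                             # per-sample Hessian weights
        H = (X.T * D) @ X / n + lam * np.eye(X.shape[1])
        w = w - np.linalg.solve(H, grad)
    return w

def decreasing_reg_path(X, y, lam_start=1.0, lam_target=1e-6, factor=0.5):
    """Warm-started Newton along a sequence of decreasing regularization parameters."""
    w = np.zeros(X.shape[1])
    lam = lam_start
    while lam > lam_target:
        w = newton_logreg(X, y, lam, w)
        lam *= factor                                 # shrink the regularization
    return newton_logreg(X, y, lam_target, w)
```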


Speaker: Coralia Cartis

Title: Dimensionality reduction techniques for large-scale optimization problems

Abstract: Known by many names, sketching techniques allow random projections of data from high to low dimensions while preserving pairwise distances. This talk explores ways to use sketching so as to improve the scalability of algorithms for diverse classes of optimization problems and applications, from linear to nonlinear, local to global, derivative-based to derivative-free. Regression problems and Gauss-Newton techniques will receive particular attention. Numerical illustrations on standard optimization test problems as well as on some machine learning set-ups will be presented. This work is joint with Jan Fiala (NAG Ltd), Jaroslav Fowkes (Oxford), Estelle Massart (Oxford and NPL), Adilet Otemissov (Oxford and Turing), Alex Puiu (Oxford), Lindon Roberts (Australian National University, Canberra), Zhen Shao (Oxford).
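
As a toy illustration of sketch-and-solve for regression (assuming a Gaussian sketching matrix and plain least squares, which is far simpler than the settings covered in the talk):

```python
# Sketch-and-solve least squares: project the data to k << n rows with a random
# Gaussian matrix and solve the much smaller regression problem.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 10_000, 50, 500                  # many rows, few columns, sketch size k

A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.01 * rng.standard_normal(n)

# Random projection to k dimensions (approximately preserves pairwise distances).
S = rng.standard_normal((k, n)) / np.sqrt(k)

# Solve the sketched problem instead of the full one.
x_sketched, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
x_full, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.linalg.norm(x_sketched - x_full))   # small, despite the 20x row reduction
```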


Speaker: Boris Ginsburg

Title: Large scale deep learning: new trends and optimization challenges

Abstract: I will discuss two major trends in deep learning. The first trend is the exponential growth in model size: from 340M parameters (BERT-large) in 2018 to 175B (GPT-3) in 2020. We need new, more memory-efficient algorithms to train such huge models. The second trend is the “BERT approach”, in which a model is first pre-trained in an unsupervised or self-supervised manner on a large unlabeled dataset and then fine-tuned for another task on a smaller labeled dataset. This trend raises new theoretical problems. Next, I will discuss the practical need for theoretical foundations for the regularization methods used in deep learning practice: data augmentation, dropout, label smoothing, etc. Finally, I will describe an application-driven design of new optimization methods, using NovoGrad as an example.
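
For reference, here is a simplified NumPy sketch of a NovoGrad-style update, i.e., layer-wise normalized gradient moments with decoupled weight decay. The hyperparameters, initialization, and bookkeeping details are illustrative and may differ from the reference implementation.

```python
# Simplified NovoGrad-style step: per-layer second moment of the gradient *norm*,
# gradient normalization by that moment, and decoupled weight decay.
import numpy as np

def novograd_step(params, grads, state, lr=0.01, beta1=0.95, beta2=0.98,
                  weight_decay=0.001, eps=1e-8):
    """One update over a dict of per-layer parameter arrays."""
    for name, w in params.items():
        g = grads[name]
        g_norm_sq = float(np.sum(g * g))          # squared gradient norm for this layer
        if name not in state:
            state[name] = {"v": g_norm_sq, "m": np.zeros_like(w)}
        s = state[name]
        s["v"] = beta2 * s["v"] + (1.0 - beta2) * g_norm_sq
        normalized = g / (np.sqrt(s["v"]) + eps) + weight_decay * w
        s["m"] = beta1 * s["m"] + normalized      # momentum on the normalized gradient
        params[name] = w - lr * s["m"]
    return params, state
```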


Speaker: Jonathan Hseu

Title: ML Models in Production

Abstract: We discuss the difficulties of training and improving ML models on large datasets in production. We also go through the process of an engineer working on ML models and the challenges of trading off cost, model quality, and performance. Finally, we present a wishlist of optimization improvements that could streamline the workflow of engineers working on these models.


Speaker: Peter Richtarik

Title: Fast linear convergence of randomized BFGS

Abstract: Since the late 1950s, when quasi-Newton methods first appeared, they have become one of the most widely used and efficient algorithmic paradigms for unconstrained optimization. Despite their immense practical success, there is little theory that shows why these methods are so efficient. We provide a semi-local rate of convergence for the randomized BFGS method which can be significantly better than that of gradient descent, finally giving theoretical evidence supporting the superior empirical performance of the method.
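
To convey the flavor of randomized quasi-Newton methods, here is a minimal sketch in which the BFGS curvature pair comes from a Hessian-vector product along a randomly drawn direction. This is an illustration only and not necessarily the exact method analyzed in the talk.

```python
# BFGS-style inverse-Hessian update driven by random directions: the curvature
# pair (s, y) uses y = H(x) s for a random s, rather than differences of iterates.
import numpy as np

def randomized_bfgs(grad, hess, x, iters=100, seed=0):
    """grad(x) returns the gradient vector; hess(x) returns the Hessian matrix."""
    rng = np.random.default_rng(seed)
    d = x.size
    H = np.eye(d)                                  # inverse-Hessian estimate
    for _ in range(iters):
        s = rng.standard_normal(d)                 # random sketching direction
        y = hess(x) @ s                            # Hessian action along s
        rho = 1.0 / (y @ s)                        # positive for strongly convex f
        V = np.eye(d) - rho * np.outer(s, y)
        H = V @ H @ V.T + rho * np.outer(s, s)     # standard BFGS inverse update
        x = x - H @ grad(x)                        # quasi-Newton step
    return x
```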


Speaker: Andres Rodriguez

Title: Shifting the DL industry to 2nd order methods

Abstract: In this talk, we review the topology design process used by data scientists and explain why 1st order methods are computationally expensive in the design process. We explore the benefits of 2nd order methods to reduce the topology design cost and highlight recent work that approximates the inverse Hessian. We conclude with recommendations to accelerate the adoption of these methods in the DL ecosystem.


Speaker: Rachel Ward

Title: Weighted Optimization: better generalization by smoother interpolation

Abstract: We provide a rigorous analysis of how implicit bias towards smooth interpolations leads to low generalization error in the overparameterized setting. We provide the first case study of this connection through a random Fourier series model and weighted least squares. We then argue through this model and numerical experiments that normalization methods in deep learning such as weight normalization improve generalization in overparameterized neural networks by implicitly encouraging smooth interpolants. This is work with Yuege (Gail) Xie, Holger Rauhut, and Hung-Hsu Chou.
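
As a toy illustration of the kind of model studied (the setup below reflects my own assumptions, not the paper's exact construction): an overparameterized Fourier feature model fit by minimum weighted-norm interpolation, where putting heavier weights on high frequencies yields a smoother interpolant.

```python
# Fit 2K+1 Fourier features to n << 2K+1 samples by minimum weighted-norm
# interpolation; frequency-growing weights bias the fit toward smooth solutions.
import numpy as np

rng = np.random.default_rng(0)
n, K = 10, 100
t = np.sort(rng.uniform(0.0, 1.0, n))
y = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal(n)

freqs = np.arange(1, K + 1)
Phi = np.hstack([np.ones((n, 1)),
                 np.cos(2 * np.pi * t[:, None] * freqs),
                 np.sin(2 * np.pi * t[:, None] * freqs)])   # shape (n, 2K+1)

def min_weighted_norm_interpolant(Phi, y, weights):
    """argmin ||diag(weights) c||_2 subject to Phi c = y, via c = z / weights."""
    W_inv = 1.0 / weights
    A = Phi * W_inv                                # Phi @ diag(1/weights)
    z = A.T @ np.linalg.solve(A @ A.T, y)          # minimum-norm solution in z
    return W_inv * z

flat_weights = np.ones(2 * K + 1)
smooth_weights = np.concatenate([[1.0], freqs, freqs])  # penalize high frequencies

c_flat = min_weighted_norm_interpolant(Phi, y, flat_weights)
c_smooth = min_weighted_norm_interpolant(Phi, y, smooth_weights)
# Both interpolate the data; c_smooth puts far less energy on high frequencies.
```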


Speaker: Lin Xiao

Title: Statistical Adaptive Stochastic Gradient Methods

Abstract: Stochastic gradient descent (SGD) and its many variants serve as the workhorses of deep learning. One of the foremost pain points in using these methods in practice is hyperparameter tuning, especially the learning rate (step size). We propose a statistical adaptive procedure called SALSA to automatically schedule the learning rate for a broad family of stochastic gradient methods. SALSA first uses a smoothed line-search procedure to find a good initial learning rate, then automatically switches to a statistical method, which detects stationarity of the learning process under a fixed learning rate, and drops the learning rate by a constant factor whenever stationarity is detected. The combined procedure is highly robust and autonomous, and it matches the performance of the best hand-tuned methods in several popular deep learning tasks.
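
A rough sketch of the "detect stationarity, then drop the learning rate" idea follows; the window-based loss comparison below is a simplified stand-in, not the statistical test on the SGD dynamics or the smoothed line search that SALSA actually uses.

```python
# Simplified drop-on-stationarity scheduler: compare two consecutive windows of
# training losses and cut the learning rate when no significant decrease is seen.
from collections import deque
import numpy as np

class DropOnStationarity:
    def __init__(self, lr, drop_factor=0.1, window=500, tol=1e-3):
        self.lr = lr
        self.drop_factor = drop_factor
        self.tol = tol
        self.old = deque(maxlen=window)    # older window of losses
        self.new = deque(maxlen=window)    # most recent window of losses

    def update(self, loss):
        """Record one training loss; return the (possibly updated) learning rate."""
        if len(self.new) == self.new.maxlen:
            self.old.append(self.new.popleft())
        self.new.append(loss)
        if len(self.old) == self.old.maxlen:
            # Treat "no significant decrease between windows" as stationarity.
            if np.mean(self.old) - np.mean(self.new) < self.tol:
                self.lr *= self.drop_factor
                self.old.clear()
                self.new.clear()
        return self.lr
```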


Speaker: Rio Yokota

Title: Degree of Approximation and Overhead of Computing Curvature, Information, and Noise Matrices

Abstract: Hessian, Fisher, and Covariance matrices are not only used for preconditioning optimizers, but also in generalization metrics, predicting hyperparameters, and Bayesian inference. These matrices contain valuable information that can advance theory in statistical learning, but they are very expensive to compute exactly for modern deep neural networks with billions of parameters. We make use of a highly optimized implementation for computing these matrices with various degrees of approximation to close the gap between theory and practice. We are able to significantly reduce the overhead of computing these matrices through a hybrid data-parallel + model-parallel approach.
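
For concreteness, here is a minimal NumPy sketch of two of these matrices, the Hessian and the empirical Fisher, computed exactly for a tiny logistic regression. Forming them exactly is feasible only at this scale, which is precisely why the approximations and hybrid parallelism discussed in the talk matter.

```python
# Exact Hessian and empirical Fisher (uncentered gradient covariance) for a
# tiny logistic regression model.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = (rng.uniform(size=n) < 0.5).astype(float)
w = rng.standard_normal(d)

p = 1.0 / (1.0 + np.exp(-X @ w))                 # model probabilities

# Per-sample gradients of the logistic loss, shape (n, d).
per_sample_grads = (p - y)[:, None] * X

# Exact Hessian of the average loss: X^T diag(p(1-p)) X / n.
hessian = (X.T * (p * (1.0 - p))) @ X / n

# Empirical Fisher: average outer product of per-sample gradients.
# (The true Fisher would sample labels from the model instead of using y.)
fisher = per_sample_grads.T @ per_sample_grads / n

print(np.linalg.norm(hessian - fisher))
```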