Schedule (Tentative)
Friday, July 17th - Timezone: PDT
08:00 AM - 08:15 AM: Introductory remarks
08:15 AM - 09:00 AM: Talk by Peter Richtarik: Fast linear convergence of randomized BFGS
09:00 AM - 09:10 AM: Q&A with Peter Richtarik
09:10 AM - 09:55 AM: Talk by Francis Bach: Second Order Strikes Back - Globally convergent Newton methods for ill-conditioned generalized self-concordant Losses
09:55 AM - 10:05 AM: Q&A with Francis Bach
10:05 AM - 10:30 AM: Break
10:30 AM - 10:40 AM: Spotlight 1: A Second-Order Optimization Algorithm for Solving Problems Involving Group Sparse Regularization
10:40 AM - 10:50 AM: Spotlight 2: Ridge Riding: Finding diverse solutions by following eigenvectors of the Hessian
10:50 AM - 11:00 AM: Spotlight 3: PyHessian: Neural Networks Through the Lens of the Hessian
11:00 AM - 11:45 AM: Talk by Coralia Cartis: Dimensionality reduction techniques for large-scale optimization problems
11:45 AM - 11:55 AM: Q&A with Coralia Cartis
11:55 AM - 01:30 PM: Break
01:30 PM - 01:40 PM: Spotlight 4: MomentumRNN: Integrating Momentum into Recurrent Neural Networks
01:40 PM - 01:50 PM: Spotlight 5: Step-size Adaptation Using Exponentiated Gradient Updates
01:50 PM - 02:00 PM: Spotlight 6: Competitive Mirror Descent
02:00 PM - 02:15 PM: Industry Panel - Talk by Boris Ginsburg: Large scale deep learning: new trends and optimization challenges
02:15 PM - 02:30 PM: Industry Panel - Talk by Jonathan Hseu: ML Models in Production
02:30 PM - 02:45 PM: Industry Panel - Talk by Andres Rodriguez: Shifting the DL industry to 2nd order methods
02:45 PM - 03:00 PM: Industry Panel - Talk by Lin Xiao: Statistical Adaptive Stochastic Gradient Methods
03:00 PM - 03:30 PM: Industry panel Q&A
03:30 PM - 04:15 PM: Talk by Rachel Ward: Weighted Optimization: better generalization by smoother interpolation
04:15 PM - 04:25 PM: Q&A with Rachel Ward
04:25 PM - 05:00 PM: Break
05:00 PM - 05:45 PM: Talk by Rio Yokota: Degree of Approximation and Overhead of Computing Curvature, Information, and Noise Matrices
05:45 PM - 05:55 PM: Q&A with Rio Yokota
05:55 PM - 06:00 PM: Closing remarks
Abstracts
Speaker: Francis Bach
Title: Second Order Strikes Back - Globally convergent Newton methods for ill-conditioned generalized self-concordant Losses
Abstract: We will study large-scale convex optimization algorithms based on the Newton method applied to regularized generalized self-concordant losses, which include logistic regression and softmax regression. We first prove that our new simple scheme, based on a sequence of problems with decreasing regularization parameters, is provably globally convergent, and that this convergence is linear with a constant factor that scales only logarithmically with the condition number. In the parametric setting, we obtain an algorithm with the same scaling as regular first-order methods but with improved behavior, in particular on ill-conditioned problems. Second, in the non-parametric machine learning setting, we provide an explicit algorithm combining the previous scheme with Nyström projection techniques, and prove that it achieves optimal generalization bounds with a time complexity of order O(n\sqrt{n}), a memory complexity of order O(n), and no dependence on the condition number, generalizing the results known for least-squares regression. (Joint work with Ulysse Marteau-Ferey and Alessandro Rudi, https://arxiv.org/abs/1907.01771)
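The decreasing-regularization scheme can be pictured as a warm-started Newton path. Below is a minimal numpy sketch of that idea for an l2-regularized logistic loss; it is illustrative only (the halving factor, step counts, and function names are assumptions of this sketch) and omits the precise schedule, damping, and Nyström approximation used in the paper's actual method.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logreg(X, y, lam, w, n_steps=5):
    """A few Newton steps on the lam-regularized logistic loss (y in {0,1})."""
    n, d = X.shape
    for _ in range(n_steps):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / n + lam * w
        S = p * (1 - p)                              # per-sample Hessian weights
        H = (X.T * S) @ X / n + lam * np.eye(d)      # X^T diag(S) X / n + lam I
        w = w - np.linalg.solve(H, grad)
    return w

def decreasing_reg_path(X, y, lam_target, lam_start=1.0, factor=0.5):
    """Solve a sequence of problems with decreasing regularization,
    warm-starting each Newton solve from the previous solution."""
    w = np.zeros(X.shape[1])
    lam = lam_start
    while lam > lam_target:
        w = newton_logreg(X, y, lam, w, n_steps=3)
        lam = max(lam * factor, lam_target)
    return newton_logreg(X, y, lam_target, w, n_steps=5)
```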
Speaker: Coralia Cartis
Title: Dimensionality reduction techniques for large-scale optimization problems
Abstract: Known by many names, sketching techniques allow random projections of data from high to low dimensions while preserving pairwise distances. This talk explores ways to use sketching so as to improve the scalability of algorithms for diverse classes of optimization problems and applications, from linear to nonlinear, local to global, derivative-based to derivative-free. Regression problems and Gauss-Newton techniques will receive particular attention. Numerical illustrations on standard optimization test problems as well as on some machine learning set-ups will be presented. This work is joint with Jan Fiala (NAG Ltd), Jaroslav Fowkes (Oxford), Estelle Massart (Oxford and NPL), Adilet Otemissov (Oxford and Turing), Alex Puiu (Oxford), Lindon Roberts (Australian National University, Canberra), Zhen Shao (Oxford).
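As a concrete instance of the sketching idea, the following minimal numpy sketch applies a Gaussian random projection to an overdetermined least-squares problem and solves the smaller sketched problem. The talk covers far more (other sketch ensembles, Gauss-Newton, global optimization); all names and sizes here are illustrative assumptions.

```python
import numpy as np

def sketched_least_squares(A, b, sketch_size, seed=None):
    """Solve min_x ||S(Ax - b)||_2 with a Gaussian sketch S, as an
    approximation to the full least-squares problem min_x ||Ax - b||_2."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    S = rng.standard_normal((sketch_size, n)) / np.sqrt(sketch_size)
    SA, Sb = S @ A, S @ b                      # project data to low dimension
    x, *_ = np.linalg.lstsq(SA, Sb, rcond=None)
    return x

# Usage: with n >> d, a sketch of size O(d) often yields a good
# approximate solution at much lower cost than the full solve.
rng = np.random.default_rng(0)
A = rng.standard_normal((10000, 50))
b = A @ np.ones(50) + 0.01 * rng.standard_normal(10000)
x_hat = sketched_least_squares(A, b, sketch_size=500, seed=1)
```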
Speaker: Boris Ginsburg
Title: Large scale deep learning: new trends and optimization challenges
Abstract: I will discuss two major trends in deep learning. The first trend is the exponential growth in model size: from 340M parameters (BERT-large) in 2018 to 175B (GPT-3) in 2020. We need new, more memory-efficient algorithms to train such huge models. The second trend is the “BERT approach”, in which a model is first pre-trained in an unsupervised or self-supervised manner on a large unlabeled dataset and then fine-tuned for another task using a smaller labeled dataset. This trend poses new theoretical problems. Next, I will discuss the practical need for theoretical foundations for regularization methods used in deep learning practice: data augmentation, dropout, label smoothing, etc. Finally, I will describe an application-driven design of new optimization methods, using NovoGrad as an example.
Speaker: Jonathan Hseu
Title: ML Models in Production
Abstract: We discuss the difficulties of training and improving ML models on large datasets in production. We also walk through the process of an engineer working on ML models and the challenges of trading off cost, model quality, and performance. Finally, we present a wishlist of optimization advances that could improve the workflow of engineers working on these models.
Speaker: Peter Richtarik
Title: Fast linear convergence of randomized BFGS
Abstract: Since the late 1950s, when quasi-Newton methods first appeared, they have become one of the most widely used and efficient algorithmic paradigms for unconstrained optimization. Despite their immense practical success, there is little theory that explains why these methods are so efficient. We provide a semi-local rate of convergence for the randomized BFGS method which can be significantly better than that of gradient descent, finally giving theoretical evidence supporting the superior empirical performance of the method.
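For reference, here is a minimal numpy sketch of the classical BFGS update of the inverse-Hessian estimate that the randomized method builds on. The randomized BFGS analyzed in the talk replaces the deterministic curvature pair with randomly sketched directions, which this sketch does not reproduce.

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """Classical BFGS update of the inverse-Hessian estimate H from a
    step s = x_{k+1} - x_k and gradient change y = g_{k+1} - g_k:
    H+ = (I - rho s y^T) H (I - rho y s^T) + rho s s^T."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)
```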
Speaker: Andres Rodriguez
Title: Shifting the DL industry to 2nd order methods
Abstract: In this talk, we review the topology design process used by data scientists and explain why 1st order methods are computationally expensive in this process. We explore the benefits of 2nd order methods to reduce the topology design cost and highlight recent work that approximates the inverse Hessian. We conclude with recommendations to accelerate the adoption of these methods in the DL ecosystem.
Speaker: Rachel Ward
Title: Weighted Optimization: better generalization by smoother interpolation
Abstract: We provide a rigorous analysis of how implicit bias towards smooth interpolations leads to low generalization error in the overparameterized setting. We provide the first case study of this connection through a random Fourier series model and weighted least squares. We then argue through this model and numerical experiments that normalization methods in deep learning such as weight normalization improve generalization in overparameterized neural networks by implicitly encouraging smooth interpolants. This is work with Yuege (Gail) Xie, Holger Rauhut, and Hung-Hsu Chou.
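As a toy illustration of the weighted-optimization viewpoint, the sketch below computes a weighted minimum-norm interpolant in an overparameterized Fourier feature model, with weights that grow with frequency so the interpolant is smoother. The specific weights, the feature construction, and all names are assumptions of this sketch, not the exact model analyzed in the paper.

```python
import numpy as np

def fourier_features(x, K):
    """Real Fourier features up to frequency K for points x in [0, 1)."""
    ks = np.arange(1, K + 1)
    return np.hstack([np.ones((len(x), 1)),
                      np.cos(2 * np.pi * np.outer(x, ks)),
                      np.sin(2 * np.pi * np.outer(x, ks))])

def weighted_min_norm_fit(x, y, K, alpha=1.0):
    """Among all coefficient vectors c with F c = y (overparameterized
    when 2K+1 > n), pick the one minimizing ||W c||_2, where the weights
    grow with frequency; larger alpha penalizes high frequencies more,
    giving a smoother interpolant."""
    F = fourier_features(x, K)                       # n x (2K+1)
    ks = np.arange(1, K + 1)
    w = np.concatenate([[1.0], (1 + ks) ** alpha, (1 + ks) ** alpha])
    Fw = F / w                                       # substitute c = u / w
    u = Fw.T @ np.linalg.solve(Fw @ Fw.T, y)         # min-norm solution in u
    return u / w
```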
Speaker: Lin Xiao
Title: Statistical Adaptive Stochastic Gradient Methods
Abstract: Stochastic gradient descent (SGD) and its many variants serve as the workhorses of deep learning. One of the foremost pain points in using these methods in practice is hyperparameter tuning, especially the learning rate (step size). We propose a statistical adaptive procedure called SALSA to automatically schedule the learning rate for a broad family of stochastic gradient methods. SALSA first uses a smoothed line-search procedure to find a good initial learning rate, then automatically switches to a statistical method, which detects stationarity of the learning process under a fixed learning rate, and drops the learning rate by a constant factor whenever stationarity is detected. The combined procedure is highly robust and autonomous, and it matches the performance of the best hand-tuned methods in several popular deep learning tasks.
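Schematically, the two-phase procedure described above can be sketched as follows. The smoothed line search and the statistical stationarity test are the substance of SALSA and appear here only as placeholders; all function names are hypothetical.

```python
def two_phase_lr_schedule(init_lr, step_stats, smoothed_line_search,
                          stationarity_test, drop_factor=0.1):
    """Schematic two-phase learning-rate schedule: phase 1 finds a good
    initial rate with a (smoothed) line search; phase 2 keeps it fixed
    and drops it by a constant factor whenever the statistical test
    declares the optimization stationary at the current rate."""
    lr = smoothed_line_search(init_lr)          # phase 1: initial learning rate
    for step, stats in enumerate(step_stats):   # stats: per-step training statistics
        if stationarity_test(stats, lr):        # phase 2: stationarity detected ...
            lr *= drop_factor                   # ... so drop the rate
        yield step, lr
```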
Speaker: Rio Yokota
Title: Degree of Approximation and Overhead of Computing Curvature, Information, and Noise Matrices
Abstract: Hessian, Fisher, and Covariance matrices are not only used for preconditioning optimizers, but also in generalization metrics, predicting hyperparameters, and Bayesian inference. These matrices contain valuable information that can advance theory in statistical learning, but they are very expensive to compute exactly for modern deep neural networks with billions of parameters. We make use of a highly optimized implementation for computing these matrices with various degrees of approximation to close the gap between theory and practice. We are able to significantly reduce the overhead of computing these matrices through a hybrid data-parallel + model-parallel approach.
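A common building block behind such curvature computations is the Hessian-vector product, which automatic differentiation provides exactly without materializing the matrix. Below is a minimal PyTorch sketch of the standard double-backprop construction; it is not the optimized data-parallel + model-parallel implementation discussed in the talk.

```python
import torch

def hessian_vector_product(loss, params, vec):
    """Exact Hessian-vector product via double backprop.
    `params` is a list of model parameters and `vec` is a flat tensor
    with one entry per parameter; costs one extra backward pass instead
    of forming the (huge) Hessian explicitly."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    gv = (flat_grad * vec).sum()                 # scalar g^T v, still differentiable
    hv = torch.autograd.grad(gv, params)         # d(g^T v)/dw = H v
    return torch.cat([h.reshape(-1) for h in hv])
```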