Fall / Automne 2018

Thursday, December 13, 2018

Video Recording: https://bluejeans.com/s/jFLr7

Praneeth (MSR India)

On momentum methods and acceleration in stochastic optimization

It is well known that momentum gradient methods (e.g., Polyak's heavy ball, Nesterov's acceleration) yield significant improvements over vanilla gradient descent in deterministic optimization (i.e., where we have access to the exact gradient of the function to be minimized). However, there is widespread sentiment that these momentum methods are not effective for the purposes of stochastic optimization due to their instability and error accumulation. Numerous works have attempted to quantify these instabilities in the face of either statistical or non-statistical errors (Paige, 1971; Proakis, 1974; Polyak, 1987; Greenbaum, 1989; Roy and Shynk, 1990; Sharma et al., 1998; d’Aspremont, 2008; Devolder et al., 2013, 2014; Yuan et al., 2016), but a precise understanding is lacking. This work considers these issues for the special case of stochastic approximation for the linear least squares regression problem, and shows that:

1. classical momentum methods (heavy ball and Nesterov's acceleration) indeed do not offer any improvement over stochastic gradient descent, and

2. an accelerated stochastic gradient method, introduced in this work, provably achieves the minimax optimal statistical risk faster than stochastic gradient descent (and classical momentum methods).

Critical to the analysis is a sharp characterization of accelerated stochastic gradient descent as a stochastic process. While the results are rigorously established for the special case of linear least squares regression, experiments suggest that the conclusions hold for the training of deep neural networks.
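
As a concrete companion to this setting, here is a minimal NumPy sketch of stochastic approximation for linear least squares, comparing plain SGD with a heavy-ball (momentum) variant; the synthetic data, step size, and momentum value are illustrative placeholders, not the tuned schedules analyzed in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star + 0.1 * rng.normal(size=n)            # noisy linear model

def sgd(momentum=0.0, steps=5000, lr=0.005):
    """One-sample SGD on least squares, with optional heavy-ball momentum."""
    w = np.zeros(d)
    v = np.zeros(d)                                   # momentum buffer
    for _ in range(steps):
        i = rng.integers(n)
        grad = (X[i] @ w - y[i]) * X[i]               # gradient of 0.5 * (x_i^T w - y_i)^2
        v = momentum * v - lr * grad                  # heavy-ball update direction
        w = w + v
    return 0.5 * np.mean((X @ w - y) ** 2)            # empirical risk

print("SGD        final risk:", sgd(momentum=0.0))
print("Heavy ball final risk:", sgd(momentum=0.9))
```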

Tuesday, December 11, 2018

Video Recording: https://bluejeans.com/s/fGERa

Aapo Hyvärinen (University College London + University of Helsinki)

Nonlinear independent component analysis: A principled framework for disentanglement

Unsupervised learning, in particular learning general nonlinear representations with disentanglement, is one of the deepest problems in machine learning. Estimating latent quantities in a generative model provides a principled framework, and has been successfully used in the linear case, e.g. with independent component analysis (ICA) and sparse coding. However, extending ICA to the nonlinear case has proven to be extremely difficult: a straightforward extension is unidentifiable, i.e. it is not possible to recover those latent components that actually generated the data. Here, we show that this problem can be solved by using additional information either in the form of temporal structure or an additional, auxiliary variable. We start by formulating two generative models in which the data is an arbitrary but invertible nonlinear transformation of time series (components) which are statistically independent of each other. Drawing from the theory of linear ICA, we formulate two distinct classes of temporal structure of the components which enable identification, i.e. recovery of the original independent components. We show that in both cases, the actual learning can be performed by ordinary neural network training where only the input is defined in an unconventional manner, making software implementations simple. We further generalize the framework to the case where instead of temporal structure, an additional auxiliary variable is observed (e.g. audio in addition to video). Our methods are closely related to "self-supervised" methods heuristically proposed in computer vision, and also provide a theoretical foundation for such methods.
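
As a rough illustration of "ordinary neural network training where only the input is defined in an unconventional manner", the sketch below builds a contrastive dataset that distinguishes real consecutive time points from temporally shuffled ones and trains an ordinary classifier on it; the toy data, pairing scheme, and architecture are assumptions in the spirit of such methods, not the exact estimators from the papers.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy observed data: an unknown (invertible) nonlinear mixture of temporally dependent sources.
T, d = 5000, 2
s = torch.cumsum(0.1 * torch.randn(T, d), dim=0)                    # sources with temporal structure
x = torch.tanh(s @ torch.randn(d, d)) + 0.05 * torch.randn(T, d)    # toy nonlinear mixing plus noise

# Contrastive dataset: real consecutive pairs (label 1) vs. pairs whose second
# element comes from a random time point (label 0), destroying the temporal link.
x_t, x_prev = x[1:], x[:-1]
perm = torch.randperm(T - 1)
pairs = torch.cat([torch.cat([x_t, x_prev], dim=1),
                   torch.cat([x_t, x_prev[perm]], dim=1)])
labels = torch.cat([torch.ones(T - 1), torch.zeros(T - 1)])

# Ordinary supervised training; the learned hidden features play the role of the
# estimated components in this family of methods.
net = nn.Sequential(nn.Linear(2 * d, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
for _ in range(300):
    opt.zero_grad()
    loss = loss_fn(net(pairs).squeeze(-1), labels)
    loss.backward()
    opt.step()
print("contrastive loss:", loss.item())
```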

Friday, November 23, 2018

Video Recording: https://bluejeans.com/s/MZCfM

Slides: Here

Sarath Chandar (Mila + Brain)

RNNs, Long-term Dependencies, and Lifelong Learning

Part 1: Towards Non-saturating Recurrent Units for Modelling Long-term Dependencies. Modelling long-term dependencies is a challenge for recurrent neural networks, primarily because gradients vanish during training as the sequence length increases. Gradients can be attenuated by transition operators and are attenuated or dropped by activation functions. Canonical architectures like the LSTM alleviate this issue by skipping information through a memory mechanism. We propose a new recurrent architecture (Non-saturating Recurrent Unit; NRU) that relies on a memory mechanism but forgoes both saturating activation functions and saturating gates, in order to further alleviate vanishing gradients. In a series of synthetic and real-world tasks, we demonstrate that the proposed model is the only one that places among the top two models across all tasks, with and without long-term dependencies, when compared against a range of other architectures.

Part 2: Training Recurrent Neural Networks for Lifelong Learning. Capacity saturation and catastrophic forgetting are the central challenges of any parametric lifelong learning system. In this work, we study these challenges in the context of sequential supervised learning, with an emphasis on recurrent neural networks. To evaluate models in the lifelong learning setting, we propose a simple, intuitive, curriculum-based benchmark in which models are trained on tasks of increasing difficulty. As a step towards developing true lifelong learning systems, we unify Gradient Episodic Memory (an approach for alleviating catastrophic forgetting) and Net2Net (a capacity expansion approach). Evaluation on the proposed benchmark shows that the unified model is more suitable for the lifelong learning setting than either of its constituents.
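
Relating to Part 1, the toy cell below shows what "a memory mechanism without saturating activations or gates" can look like in code; this is only a caricature under my own assumptions, not the actual NRU architecture, whose write/erase mechanism is more elaborate.

```python
import torch
import torch.nn as nn

class NonSaturatingCell(nn.Module):
    """Caricature of a memory-augmented recurrent cell that uses only ReLU,
    i.e. no saturating activations or gates. Illustrative only."""

    def __init__(self, input_size, hidden_size, memory_size):
        super().__init__()
        self.hidden = nn.Linear(input_size + hidden_size + memory_size, hidden_size)
        self.write = nn.Linear(hidden_size, memory_size)

    def forward(self, x, h, m):
        h = torch.relu(self.hidden(torch.cat([x, h, m], dim=-1)))  # no tanh/sigmoid
        m = m + self.write(h)                                      # additive, unbounded memory
        return h, m

cell = NonSaturatingCell(input_size=8, hidden_size=32, memory_size=16)
x_seq = torch.randn(100, 4, 8)                     # (time, batch, features)
h, m = torch.zeros(4, 32), torch.zeros(4, 16)
for x in x_seq:                                    # unrolled over a long sequence
    h, m = cell(x, h, m)
print(h.shape, m.shape)
```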

Friday, November 16, 2018

Video Recording: https://bluejeans.com/s/XpdQC

Nicolas Loizou (FAIR)

Momentum and Stochastic Momentum for Stochastic Gradient, Newton, Proximal Point and Subspace Descent Methods

In this paper we study several classes of stochastic optimization algorithms enriched with heavy ball momentum. Among the methods studied are stochastic gradient descent, stochastic Newton, stochastic proximal point and stochastic dual subspace ascent. This is the first time momentum variants of several of these methods are studied. We choose to perform our analysis in a setting in which all of the above methods are equivalent. We prove global non-asymptotic linear convergence rates for all methods and various measures of success, including primal function values, primal iterates (in the L2 sense), and dual function values. We also show that the primal iterates converge at an accelerated linear rate in the L1 sense. This is the first time a linear rate is shown for the stochastic heavy ball method (i.e., stochastic gradient descent with momentum). Under somewhat weaker conditions, we establish a sublinear convergence rate for Cesàro averages of primal iterates. Moreover, we propose a novel concept, which we call stochastic momentum, aimed at decreasing the cost of performing the momentum step. We prove linear convergence of several stochastic methods with stochastic momentum, and show that in some sparse data regimes and for sufficiently small momentum parameters, these methods enjoy better overall complexity than methods with deterministic momentum. Finally, we perform extensive numerical testing on artificial and real datasets, including data coming from average consensus problems.
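
To make the object of analysis concrete, here is a small NumPy sketch on a consistent linear system, the setting in which the methods above coincide, with heavy-ball momentum and a single-coordinate variant meant to evoke the cheaper stochastic momentum step; the parameter values and the exact form of the stochastic momentum update are illustrative guesses, not the ones analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 50
A = rng.normal(size=(n, d))
x_star = rng.normal(size=d)
b = A @ x_star                                       # consistent linear system

def heavy_ball(beta=0.5, stochastic_momentum=False, steps=20000, lr=0.5):
    x = np.zeros(d)
    x_prev = np.zeros(d)
    for _ in range(steps):
        i = rng.integers(n)
        a = A[i]
        step = (a @ x - b[i]) / (a @ a) * a          # row step; SGD, proximal point and
                                                     # Newton coincide in this setting
        direction = x - x_prev                       # heavy-ball momentum term
        if stochastic_momentum:
            j = rng.integers(d)                      # touch only one random coordinate,
            sparse = np.zeros(d)                     # making the momentum step O(1)
            sparse[j] = direction[j]                 # (exact scaling in the paper may differ)
            direction = sparse
        x_prev, x = x, x - lr * step + beta * direction
    return np.linalg.norm(x - x_star)

print("heavy ball          error:", heavy_ball(stochastic_momentum=False))
print("stochastic momentum error:", heavy_ball(stochastic_momentum=True))
```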

Friday, November 9, 2018

Video Recording: https://bluejeans.com/s/dlayJ

Slides: Here

Hugo Larochelle (Mila + Brain)

Few-Shot Learning with Meta-Learning: Progress Made and Challenges Ahead

A lot of the recent progress on many AI tasks was enabled in part by the availability of large quantities of labeled data. Yet, humans are able to learn concepts from as little as a handful of examples. Meta-learning is a very promising framework for addressing the problem of generalizing from small amounts of data, known as few-shot learning. In meta-learning, our model is itself a learning algorithm: it takes as input a training set and outputs a classifier. For few-shot learning, it is (meta-)trained directly to produce classifiers with good generalization performance for problems with very little labeled data. In this talk, I'll present an overview of the recent research that has made exciting progress on this topic (including my own) and will discuss the challenges as well as research opportunities that remain.
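
To ground the phrase "a model that takes a training set as input and outputs a classifier", here is a toy episodic training loop in the style of prototypical networks, one member of the family such overviews cover rather than the speaker's own method; the synthetic episode generator and embedding network are placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))
opt = torch.optim.Adam(embed.parameters(), lr=1e-3)

def sample_episode(n_way=5, k_shot=1, n_query=5, dim=32):
    """Fake few-shot episode: each class is a Gaussian blob (placeholder data)."""
    centers = torch.randn(n_way, dim)
    support = centers.unsqueeze(1) + 0.1 * torch.randn(n_way, k_shot, dim)
    query = centers.unsqueeze(1) + 0.1 * torch.randn(n_way, n_query, dim)
    return support, query

for step in range(200):                       # meta-training loop over episodes
    support, query = sample_episode()
    n_way, n_query = query.shape[0], query.shape[1]
    protos = embed(support).mean(dim=1)       # the "output classifier": one prototype per class
    q = embed(query.reshape(-1, 32))
    logits = -torch.cdist(q, protos)          # classify queries by distance to prototypes
    labels = torch.arange(n_way).repeat_interleave(n_query)
    loss = nn.functional.cross_entropy(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
print("final episode loss:", loss.item())
```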

Friday, October 26, 2018

Video Recording: Not available

Slides: Here

Gauthier Gidel (Mila)

A Variational Inequality Perspective on Generative Adversarial Networks

Generative adversarial networks (GANs) form a generative modeling approach known for producing appealing samples, but they are notably difficult to train. One common way to tackle this issue has been to propose new formulations of the GAN objective. Yet, surprisingly few studies have looked at optimization methods designed for this adversarial training. In this work, we cast GAN optimization problems in the general variational inequality framework. Tapping into the mathematical programming literature, we counter some common misconceptions about the difficulties of saddle point optimization and propose to extend methods designed for variational inequalities to the training of GANs. We apply averaging, extrapolation and a novel computationally cheaper variant that we call extrapolation from the past to the stochastic gradient method (SGD) and Adam.
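
For intuition, the NumPy sketch below runs the classic bilinear toy problem min_x max_y xy, where simultaneous gradient steps diverge while extrapolation (extragradient) converges; the last function is my reading of extrapolation from the past, which reuses the previous look-ahead gradient so only one fresh gradient is computed per iteration. Step sizes and iteration counts are illustrative.

```python
import numpy as np

def simultaneous(steps=500, lr=0.1):
    """Plain simultaneous descent/ascent on f(x, y) = x * y (spirals outward)."""
    x, y = 1.0, 1.0
    for _ in range(steps):
        x, y = x - lr * y, y + lr * x              # grad_x f = y, grad_y f = x
    return np.hypot(x, y)

def extrapolation(steps=500, lr=0.1):
    """Extragradient: look-ahead step, then update using the look-ahead gradient."""
    x, y = 1.0, 1.0
    for _ in range(steps):
        x_h, y_h = x - lr * y, y + lr * x          # extrapolation (look-ahead) step
        x, y = x - lr * y_h, y + lr * x_h          # update from look-ahead gradients
    return np.hypot(x, y)

def extrapolation_from_past(steps=500, lr=0.1):
    """Reuse the previous look-ahead gradient: one fresh gradient per iteration."""
    x, y = 1.0, 1.0
    gx, gy = y, x                                  # stored gradients from the last look-ahead
    for _ in range(steps):
        x_h, y_h = x - lr * gx, y + lr * gy        # extrapolate using the *past* gradient
        gx, gy = y_h, x_h                          # single fresh gradient evaluation
        x, y = x - lr * gx, y + lr * gy
    return np.hypot(x, y)

print("simultaneous       distance to optimum:", simultaneous())
print("extrapolation      distance to optimum:", extrapolation())
print("extrap. from past  distance to optimum:", extrapolation_from_past())
```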

Friday, October 19, 2018

Video Recording: Not available

Slides: Here

Nick Pawlowski (Imperial College London + FAIR)

Bayesian Deep Learning and Applications to Medical Imaging

Deep learning has revolutionised the way we approach computer vision and medical image analysis. Despite improved accuracy scores and other metrics, deep learning methods tend to be overconfident on unseen data, or even when predicting the wrong label. Bayesian deep learning offers a framework for alleviating some of these concerns by modelling the uncertainty over the weights that generate those predictions. This talk will review some previous achievements of the field and introduce Bayes by Hypernet (BbH). BbH uses neural networks to parametrise the variational approximation of the distribution over the parameters. We present more complex parameter distributions, better robustness to adversarial examples, and improved uncertainty estimates. Lastly, we present the use of Bayesian neural networks for outlier detection in the medical imaging domain, in particular for brain lesion detection.
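
As a rough sketch of the "hypernetwork as variational posterior" idea behind BbH (illustrative only, under my own assumptions; the actual objective includes a regularisation term and a different architecture), the generator below maps noise to the weights of a small predictive network, yielding an implicit distribution over parameters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
IN, HID, OUT = 4, 16, 1
n_weights = IN * HID + HID + HID * OUT + OUT      # parameter count of the target network

# Hypernetwork: maps a noise vector to one sample of the target network's weights,
# i.e. an implicit variational distribution over parameters.
hyper = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, n_weights))

def forward_with_sampled_weights(x, z):
    """Run the target network using weights produced by the hypernetwork."""
    w = hyper(z)
    i = 0
    W1 = w[i:i + IN * HID].view(HID, IN); i += IN * HID
    b1 = w[i:i + HID]; i += HID
    W2 = w[i:i + HID * OUT].view(OUT, HID); i += HID * OUT
    b2 = w[i:i + OUT]
    h = torch.relu(nn.functional.linear(x, W1, b1))
    return nn.functional.linear(h, W2, b2)

# Toy regression data and a purely likelihood-based objective (the KL term of the
# full variational objective is omitted in this sketch).
x = torch.randn(128, IN)
y = x.sum(dim=1, keepdim=True) + 0.1 * torch.randn(128, 1)
opt = torch.optim.Adam(hyper.parameters(), lr=1e-3)
for _ in range(200):
    z = torch.randn(8)                            # one weight sample per step
    loss = ((forward_with_sampled_weights(x, z) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Predictive uncertainty: spread over several sampled weight configurations.
preds = torch.stack([forward_with_sampled_weights(x, torch.randn(8)) for _ in range(20)])
print("predictive std (first point):", preds.std(dim=0)[0].item())
```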