Methods

We now briefly describe the seven CL methods evaluated on our benchmark. Most of them were developed in the supervised learning (SL) context, and in some cases non-trivial adaptation to the RL setting was required. We aimed to cover different families of methods; following [survey], we consider three classes: regularization-based, parameter isolation, and replay methods.

Regularization-based Methods. This family builds on the observation that forgetting can be reduced by protecting parameters that are important for previous tasks. The most basic approach, often dubbed L2 [ewc], simply adds an L2 penalty that regularizes the network not to stray far from the previously learned weights; every parameter is treated as equally important. Elastic Weight Consolidation (EWC) [ewc] instead uses the Fisher information matrix to approximate the importance of each weight.
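Both penalties share the same quadratic form and differ only in the per-parameter importance weights. The sketch below (our illustration, not the paper's implementation) shows this with a flat parameter vector: L2 uses uniform importance, while EWC would plug in a diagonal Fisher estimate.

```python
import numpy as np

def quadratic_penalty(params, old_params, importance, coef=1.0):
    """Generic quadratic regularizer: coef * sum_i F_i * (theta_i - theta*_i)^2.

    For plain L2, `importance` is all ones (every parameter equally
    protected); for EWC it holds the diagonal of the Fisher information
    matrix estimated on past tasks.
    """
    return coef * np.sum(importance * (params - old_params) ** 2)

theta = np.array([1.0, 2.0, 3.0])       # current parameters
theta_old = np.array([0.0, 2.0, 2.0])   # parameters after the previous task

# L2: uniform importance.
l2 = quadratic_penalty(theta, theta_old, np.ones_like(theta))       # 2.0

# EWC: Fisher-weighted importance (values here are made up for illustration).
fisher = np.array([10.0, 0.1, 1.0])
ewc = quadratic_penalty(theta, theta_old, fisher)                   # 11.0
```

In training, this penalty is simply added to the task loss, so gradient descent trades off new-task performance against drift on important weights.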

Memory-Aware Synapses (MAS) [mas] also uses a weighted penalty, but obtains the importance of each parameter by approximating its impact on the output of the network. Variational Continual Learning (VCL) [vcl] follows a similar path but uses variational inference to minimize the Kullback-Leibler divergence between the current distribution over parameters (the posterior) and the distribution from previous tasks (the prior).
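The MAS importance can be sketched concretely for a linear layer, where the gradient of the squared output norm has a closed form. This is our illustrative example, not the authors' code:

```python
import numpy as np

def mas_importance(W, inputs):
    """MAS importance for a linear layer y = W @ x (sketch).

    MAS measures how sensitive the squared L2 norm of the output is to
    each parameter: Omega_ij = mean over x of |d ||y||^2 / d W_ij|.
    For y = W @ x this gradient is 2 * y * x^T, so no autograd is needed.
    """
    omega = np.zeros_like(W)
    for x in inputs:
        y = W @ x
        omega += np.abs(2.0 * np.outer(y, x))
    return omega / len(inputs)

# Toy example: identity weights, a single unlabeled input.
W = np.eye(2)
omega = mas_importance(W, [np.array([1.0, 2.0])])  # per-weight importance
```

Note that only unlabeled inputs are needed, which is what lets MAS estimate importance from data without ground-truth targets.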

Parameter Isolation Methods. This family (also called modularity-based) forbids any changes to parameters that are important for previous tasks; it can be seen as a ``hard'' counterpart of regularization-based methods. PackNet [packnet] ``packs'' multiple tasks into a single network by iteratively pruning, freezing, and retraining parts of the network at each task change.
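The pruning step at a task boundary can be sketched as magnitude-based selection among the weights not yet claimed by earlier tasks; the kept weights are then frozen. This is a simplified illustration of the mechanism, not the PackNet reference implementation:

```python
import numpy as np

def packnet_prune(weights, free_mask, prune_fraction=0.5):
    """One PackNet-style pruning step (sketch).

    Among the still-free weights (`free_mask` marks weights not claimed
    by earlier tasks), keep the largest-magnitude fraction for the
    current task and release the rest for future tasks. The returned
    boolean mask marks the weights assigned (and later frozen) for the
    current task.
    """
    free_idx = np.flatnonzero(free_mask)
    magnitudes = np.abs(weights[free_idx])
    n_keep = int(np.ceil((1.0 - prune_fraction) * len(free_idx)))
    keep = free_idx[np.argsort(magnitudes)[-n_keep:]]
    task_mask = np.zeros_like(free_mask)
    task_mask[keep] = True
    return task_mask

# Toy example: 4 free weights, prune half, keep the 2 largest by magnitude.
weights = np.array([0.1, -0.9, 0.5, 0.05])
mask = packnet_prune(weights, np.ones(4, dtype=bool), prune_fraction=0.5)
```

After pruning, the released weights are retrained for the current task, and the procedure repeats at the next task change with `free_mask` shrunk accordingly.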

Replay Methods. Methods in this family keep some samples from previous tasks and use them for training or as constraints to reduce forgetting. We use a Perfect Memory baseline, a modification of our setting that remembers all samples from the past (i.e., the buffer is not reset at task change). We also implemented Averaged Gradient Episodic Memory (A-GEM) [agem], which projects gradients from new samples so as not to interfere with previous tasks. We find that A-GEM does not perform well on our benchmark.
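The A-GEM projection itself is a one-line vector operation: if the new-task gradient conflicts with a reference gradient computed on replayed samples, it is projected onto the non-conflicting half-space. A minimal sketch on flattened gradient vectors:

```python
import numpy as np

def agem_project(g, g_ref):
    """A-GEM gradient projection (sketch).

    g     -- gradient on the current task's batch (flattened).
    g_ref -- reference gradient on a batch of replayed samples.
    If g . g_ref < 0 (the update would increase the replay loss),
    project g so that the result has non-negative dot product with
    g_ref; otherwise g is used unchanged.
    """
    dot = g @ g_ref
    if dot >= 0:
        return g
    return g - (dot / (g_ref @ g_ref)) * g_ref

# Toy example: g conflicts with g_ref, so its component along g_ref is removed.
g = agem_project(np.array([1.0, -1.0]), np.array([0.0, 1.0]))
```

Because only a single averaged reference gradient is used (rather than one constraint per past task, as in the original GEM), the projection stays cheap at every step.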


Papers:

[survey] A continual learning survey: Defying forgetting in classification tasks - M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, T. Tuytelaars

[ewc] Overcoming catastrophic forgetting in neural networks - J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, R. Hadsell

[mas] Memory aware synapses: Learning what (not) to forget - R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, T. Tuytelaars

[vcl] Variational Continual Learning - C. Nguyen, Y. Li, T. Bui, R. Turner

[packnet] PackNet: Adding multiple tasks to a single network by iterative pruning - A. Mallya, S. Lazebnik

[agem] Efficient lifelong learning with A-GEM - A. Chaudhry, M. Ranzato, M. Rohrbach, M. Elhoseiny