Heinke Hihn and Daniel A. Braun
Institute of Neural Information Processing, Ulm University
See the full paper at the arXiv pre-print. Find our code on GitHub.
Accepted at the ICLR 2022 Workshop on Agent Learning in Open-Endedness
Abstract
One weakness of machine learning algorithms is the poor ability of models to solve new problems without forgetting previously acquired knowledge. The Continual Learning (CL) paradigm has emerged as a protocol to systematically investigate settings where the model sequentially observes samples generated by a series of tasks. In this work, we take a task-agnostic view of continual learning and develop a hierarchical information-theoretic optimality principle that facilitates a trade-off between learning and forgetting. We discuss this principle from a Bayesian perspective and show its connections to previous approaches to CL. Based on this principle, we propose a neural network layer, called the Mixture-of-Variational-Experts layer, that alleviates forgetting by creating a set of information processing paths through the network which is governed by a gating policy. Due to the general formulation based on generic utility functions, we can apply this optimality principle to a large variety of learning problems, including supervised learning, reinforcement learning, and generative modeling. We demonstrate the competitive performance of our method in continual supervised learning and in continual reinforcement learning.
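The exact objective and its derivation are given in the full paper. Schematically, and with notation introduced here purely for illustration, such a learning-versus-forgetting trade-off can be written as expected-utility maximization penalized by information-theoretic costs for deviating from prior behavior:

$$\max_{\pi,\,\{q_m\}}\;\mathbb{E}\big[\mathbf{U}\big]\;-\;\beta_{1}\,\mathbb{E}_{x}\!\left[D_{\mathrm{KL}}\big(\pi(m \mid x)\,\|\,\pi(m)\big)\right]\;-\;\beta_{2}\sum_{m} D_{\mathrm{KL}}\big(q_m(\theta)\,\|\,p_m(\theta)\big),$$

where $\pi$ is the gating policy over experts $m$, $q_m$ is expert $m$'s parameter posterior, $p_m$ its prior (e.g., the posterior retained from earlier data), and $\beta_1, \beta_2$ control the strength of the two divergence penalties.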
We introduce a novel modular neural network layer composed of a set of experts combined by a learned gating policy (right), as opposed to a multi-head architecture with deterministic heads (left). Experts are stochastic and maintain a distribution over their parameters. Information-theoretic constraints ensure that they stay close to solutions learned for previous tasks.
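The following is a minimal sketch of such a layer, assuming PyTorch and diagonal-Gaussian posteriors over expert weights; class names, initializations, and hyperparameters are ours for illustration and not the authors' exact implementation.

```python
# Sketch: a gating network softly assigns each input to one of several
# stochastic experts; each expert keeps a factorized Gaussian posterior
# over its weights and can be regularized (via KL) towards a prior,
# e.g. the posterior learned on earlier data.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VariationalExpert(nn.Module):
    """Linear expert with a diagonal-Gaussian posterior over its weights."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.w_mu = nn.Parameter(0.05 * torch.randn(d_out, d_in))
        self.w_logvar = nn.Parameter(torch.full((d_out, d_in), -5.0))

    def forward(self, x):
        # Reparameterized weight sample: w = mu + sigma * eps.
        eps = torch.randn_like(self.w_mu)
        w = self.w_mu + torch.exp(0.5 * self.w_logvar) * eps
        return F.linear(x, w)

    def kl_to(self, prior_mu, prior_logvar):
        # KL(q || p) between diagonal Gaussians: keeps the expert close
        # to the solution found on previous tasks.
        var, prior_var = torch.exp(self.w_logvar), torch.exp(prior_logvar)
        kl = 0.5 * (prior_logvar - self.w_logvar
                    + (var + (self.w_mu - prior_mu) ** 2) / prior_var - 1.0)
        return kl.sum()


class MoVELayer(nn.Module):
    """Gating network mixes the outputs of K variational experts."""

    def __init__(self, d_in, d_out, n_experts=3):
        super().__init__()
        self.gate = nn.Linear(d_in, n_experts)
        self.experts = nn.ModuleList(
            [VariationalExpert(d_in, d_out) for _ in range(n_experts)])

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)             # (B, K)
        outs = torch.stack([e(x) for e in self.experts], dim=1)   # (B, K, D)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)          # (B, D)
```

Stacking several such layers yields the set of information-processing paths described above, with the per-expert KL terms acting as the information-theoretic constraints.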
Current machine learning methods struggle with continual learning and suffer from catastrophic forgetting, which has given rise to dedicated continual learning algorithms. Most of these methods are task-aware and specific to either supervised learning or reinforcement learning scenarios.
To improve expert specialization, we introduce a novel DPP-based diversity objective that aims to maximize the pairwise distance between expert posteriors. To this end, we use Wasserstein exponential kernels and maximize the determinant of the kernel matrix for each layer, as sketched below.
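Below is a hedged sketch of such a diversity term, assuming diagonal-Gaussian expert posteriors: pairwise squared 2-Wasserstein distances are turned into an exponential kernel whose log-determinant is maximized so that expert posteriors are pushed apart. The bandwidth `length_scale` and the jitter term are assumed hyperparameters, not values from the paper.

```python
# Sketch: DPP-style diversity via a Wasserstein exponential kernel.
import torch


def wasserstein2_sq(mu1, logvar1, mu2, logvar2):
    """Squared 2-Wasserstein distance between two diagonal Gaussians."""
    s1, s2 = torch.exp(0.5 * logvar1), torch.exp(0.5 * logvar2)
    return ((mu1 - mu2) ** 2).sum() + ((s1 - s2) ** 2).sum()


def dpp_diversity(mus, logvars, length_scale=1.0, jitter=1e-6):
    """Log-determinant of the Wasserstein exponential kernel over experts."""
    k = len(mus)
    rows = []
    for i in range(k):
        row = [torch.exp(-wasserstein2_sq(mus[i], logvars[i],
                                          mus[j], logvars[j])
                         / (2.0 * length_scale ** 2))
               for j in range(k)]
        rows.append(torch.stack(row))
    K = torch.stack(rows) + jitter * torch.eye(k)  # jitter for stability
    return torch.logdet(K)  # larger log-det => more diverse experts
```

Adding this log-determinant as a bonus to the training objective (one term per layer) rewards configurations in which the experts' posteriors are mutually distant, which is the repulsive behavior DPPs are designed to capture.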
We evaluate on split MNIST, permuted MNIST, split CIFAR-10, and split CIFAR-100, all under domain-incremental conditions. Our method (HVCL and HVCL w/ GR) shows competitive performance across all benchmarks. We compare against both task-agnostic and task-aware methods.
We evaluate on a series of continuous-control reinforcement learning environments and compare against EWC and UCL, which are both task-aware. Our method outperforms the baseline (simple SAC) as well as EWC and UCL, despite being task-agnostic and thus having access to less information.