Heinke Hihn and Daniel A. Braun
Institute of Neural Information Processing, Ulm University
See the full paper at the arXiv pre-print. Find our code on GitHub.
Accepted at the ICLR 2022 Workshop on Agent Learning in Open-Endedness
Abstract
One weakness of machine learning algorithms is the poor ability of models to solve new problems without forgetting previously acquired knowledge. The Continual Learning (CL) paradigm has emerged as a protocol to systematically investigate settings where the model sequentially observes samples generated by a series of tasks. In this work, we take a task-agnostic view of continual learning and develop a hierarchical information-theoretic optimality principle that facilitates a trade-off between learning and forgetting. We discuss this principle from a Bayesian perspective and show its connections to previous approaches to CL. Based on this principle, we propose a neural network layer, called the Mixture-of-Variational-Experts layer, that alleviates forgetting by creating a set of information processing paths through the network which is governed by a gating policy. Due to the general formulation based on generic utility functions, we can apply this optimality principle to a large variety of learning problems, including supervised learning, reinforcement learning, and generative modeling. We demonstrate the competitive performance of our method in continual supervised learning and in continual reinforcement learning.
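The exact objective and its derivation are given in the full paper. Schematically, and with notation introduced here purely for illustration, such a learning-versus-forgetting trade-off can be written as expected-utility maximization penalized by information-theoretic costs for deviating from prior behavior:

$$\max_{\pi,\,\{q_m\}}\;\mathbb{E}\big[\mathbf{U}\big]\;-\;\beta_{1}\,\mathbb{E}_{x}\!\left[D_{\mathrm{KL}}\big(\pi(m \mid x)\,\|\,\pi(m)\big)\right]\;-\;\beta_{2}\sum_{m} D_{\mathrm{KL}}\big(q_m(\theta)\,\|\,p_m(\theta)\big),$$

where $\pi$ is the gating policy over experts $m$, $q_m$ is expert $m$'s parameter posterior, $p_m$ its prior (e.g., the posterior retained from earlier data), and $\beta_1, \beta_2$ control the strength of the two divergence penalties.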
We introduce a novel modular neural network layer composed of a set of experts combined by a learned gating policy (right), as opposed to a multi-head architecture with deterministic heads (left). Experts are stochastic and maintain a distribution over their parameters. Information-theoretic constraints ensure that they stay close to solutions learned for previous tasks.
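The following is a minimal sketch of such a layer, assuming PyTorch and diagonal-Gaussian posteriors over expert weights; class names, initializations, and hyperparameters are ours for illustration and not the authors' exact implementation.

```python
# Sketch: a gating network softly assigns each input to one of several
# stochastic experts; each expert keeps a factorized Gaussian posterior
# over its weights and can be regularized (via KL) towards a prior,
# e.g. the posterior learned on earlier data.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VariationalExpert(nn.Module):
    """Linear expert with a diagonal-Gaussian posterior over its weights."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.w_mu = nn.Parameter(0.05 * torch.randn(d_out, d_in))
        self.w_logvar = nn.Parameter(torch.full((d_out, d_in), -5.0))

    def forward(self, x):
        # Reparameterized weight sample: w = mu + sigma * eps.
        eps = torch.randn_like(self.w_mu)
        w = self.w_mu + torch.exp(0.5 * self.w_logvar) * eps
        return F.linear(x, w)

    def kl_to(self, prior_mu, prior_logvar):
        # KL(q || p) between diagonal Gaussians: keeps the expert close
        # to the solution found on previous tasks.
        var, prior_var = torch.exp(self.w_logvar), torch.exp(prior_logvar)
        kl = 0.5 * (prior_logvar - self.w_logvar
                    + (var + (self.w_mu - prior_mu) ** 2) / prior_var - 1.0)
        return kl.sum()


class MoVELayer(nn.Module):
    """Gating network mixes the outputs of K variational experts."""

    def __init__(self, d_in, d_out, n_experts=3):
        super().__init__()
        self.gate = nn.Linear(d_in, n_experts)
        self.experts = nn.ModuleList(
            [VariationalExpert(d_in, d_out) for _ in range(n_experts)])

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)             # (B, K)
        outs = torch.stack([e(x) for e in self.experts], dim=1)   # (B, K, D)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)          # (B, D)
```

Stacking several such layers yields the set of information-processing paths described above, with the per-expert KL terms acting as the information-theoretic constraints.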
Current machine learning methods struggle with continual learning and suffer from catastrophic forgetting, which has given rise to dedicated continual learning algorithms. Most of these methods are task-aware and specific to either supervised learning or reinforcement learning scenarios.
To improve expert specialization, we introduce a novel DPP-based diversity objective that aims to maximize the pairwise distance between expert posteriors. To this end, we use Wasserstein exponential kernels and maximize the determinant of the kernel matrix for each layer, as sketched below.
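Below is a hedged sketch of such a diversity term, assuming diagonal-Gaussian expert posteriors: pairwise squared 2-Wasserstein distances are turned into an exponential kernel whose log-determinant is maximized so that expert posteriors are pushed apart. The bandwidth `length_scale` and the jitter term are assumed hyperparameters, not values from the paper.

```python
# Sketch: DPP-style diversity via a Wasserstein exponential kernel.
import torch


def wasserstein2_sq(mu1, logvar1, mu2, logvar2):
    """Squared 2-Wasserstein distance between two diagonal Gaussians."""
    s1, s2 = torch.exp(0.5 * logvar1), torch.exp(0.5 * logvar2)
    return ((mu1 - mu2) ** 2).sum() + ((s1 - s2) ** 2).sum()


def dpp_diversity(mus, logvars, length_scale=1.0, jitter=1e-6):
    """Log-determinant of the Wasserstein exponential kernel over experts."""
    k = len(mus)
    rows = []
    for i in range(k):
        row = [torch.exp(-wasserstein2_sq(mus[i], logvars[i],
                                          mus[j], logvars[j])
                         / (2.0 * length_scale ** 2))
               for j in range(k)]
        rows.append(torch.stack(row))
    K = torch.stack(rows) + jitter * torch.eye(k)  # jitter for stability
    return torch.logdet(K)  # larger log-det => more diverse experts
```

Adding this log-determinant as a bonus to the training objective (one term per layer) rewards configurations in which the experts' posteriors are mutually distant, which is the repulsive behavior DPPs are designed to capture.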
We evaluate on split MNIST, permuted MNIST, split CIFAR-10, and split CIFAR-100, all under domain-incremental conditions. Our method (HVCL and HVCL w/ GR) shows competitive performance across all benchmarks. We compare against both task-agnostic and task-aware methods.
We evaluate on a series of continuous-control reinforcement learning environments and compare against EWC and UCL, which are both task-aware. Our method outperforms the baseline (simple SAC) as well as EWC and UCL, despite being task-agnostic and thus having access to less information.