Continual World
Overview
Continual learning (CL) --- the ability to continuously learn, building on previously acquired knowledge --- is a natural requirement for long-lived autonomous reinforcement learning (RL) agents. While building such agents, one needs to balance opposing desiderata, such as constraints on capacity and compute, the ability to not catastrophically forget, and to exhibit positive transfer on new tasks. Understanding the right trade-off is conceptually and computationally challenging, which we argue has led the community to overly focus on catastrophic forgetting. In response to these issues, we advocate for the need to prioritize forward transfer and propose Continual World, a benchmark consisting of realistic and meaningfully diverse robotic tasks built on top of Meta-World as a testbed. Following an in-depth empirical evaluation of existing CL methods, we pinpoint their limitations and highlight unique algorithmic challenges in the RL setting. Our benchmark aims to provide a meaningful and computationally inexpensive challenge for the community and thus help better understand the performance of existing and future solutions.
CW10 and CW20 sequences
The core of our benchmark is the CW20 sequence. Out of the 50 tasks defined in Meta-World, we picked those that are neither too easy nor too hard within the assumed sample budget of 1M steps per task. The tasks and their ordering were chosen based on the transfer matrix so that there is high variation of forward transfers (both across the whole list and locally). We refer to these ordered tasks as CW10; CW20 is CW10 repeated twice. We recommend using CW20 for final evaluation; however, CW10 is already very informative in most cases. The proposed CW10 sequence consists of: hammer-v1, push-wall-v1, faucet-close-v1, push-back-v1, stick-pull-v1, handle-press-side-v1, push-v1, shelf-place-v1, window-close-v1, peg-unplug-side-v1.
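For concreteness, below is a minimal sketch of a training loop over the sequence; `make_env` and `agent.train` are hypothetical placeholders standing in for a Meta-World environment constructor and an RL training routine, not part of the benchmark's API.

```python
# CW10 task names as listed above; CW20 is the same sequence twice.
CW10 = [
    "hammer-v1", "push-wall-v1", "faucet-close-v1", "push-back-v1",
    "stick-pull-v1", "handle-press-side-v1", "push-v1", "shelf-place-v1",
    "window-close-v1", "peg-unplug-side-v1",
]
CW20 = CW10 * 2

STEPS_PER_TASK = 1_000_000  # assumed 1M-step budget per task

def run_sequence(tasks, agent, make_env):
    # Train on each task in order; the agent sees tasks strictly
    # sequentially and never revisits past data unless its CL method
    # explicitly stores it.
    for task_id, name in enumerate(tasks):
        env = make_env(name)  # hypothetical env constructor
        agent.train(env, steps=STEPS_PER_TASK, task_id=task_id)
```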
Transfer matrix. Generally, the relationship between tasks and its impact on the learning dynamics of neural networks is hard to quantify. To this end, we consider a minimal setting in which we fine-tune a model pretrained on task t_1 on a second task t_2, using the same protocol as the benchmark. This provides a neural-network-centric insight into the relationship between tasks and allows us to measure low-level transfer between them, i.e., the ability of the model to reuse previously acquired features.
Figure: transfer matrix. Each cell shows the forward transfer from the first task (pretraining) to the second (fine-tuning). Cells for which 0 belongs to the 90% confidence interval are shaded.
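Forward transfer in this setting can be measured as the normalized area between the success-rate training curve of the evaluated run and that of a from-scratch baseline on the same task. A minimal sketch of that computation, assuming both curves are logged at evenly spaced points of the task's training budget:

```python
import numpy as np

def forward_transfer(curve, baseline_curve):
    # Normalized area between the success-rate curve of the evaluated
    # run and that of a from-scratch reference run on the same task.
    # A value of 1.0 means the task is solved from the very start;
    # negative values indicate interference from pretraining.
    auc = float(np.mean(curve))            # normalized area under curve
    auc_baseline = float(np.mean(baseline_curve))
    return (auc - auc_baseline) / (1.0 - auc_baseline)
```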
Results and conclusions
Figure: average success rate over the course of training, together with performance curves on single tasks.
We evaluated a set of 7 representative CL methods on our Continual World benchmark. We focus on forgetting and transfer while keeping fixed constraints on computation, memory, the number of samples, and the neural network architecture. Our main empirical contributions are the experiments on the long CW20 sequence and the following high-level conclusions.
Performance. Performance averaged over tasks is the typical metric for the CL setting. PackNet seems to outperform the other methods, approaching 0.8 out of the maximum of 1.0. The remaining methods perform considerably worse, with A-GEM and Perfect Memory struggling.
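For reference, a minimal sketch of this metric, assuming the final per-task success rates are available as an array:

```python
import numpy as np

def average_performance(final_success_rates):
    # Mean final success rate over all tasks in the sequence;
    # 1.0 would mean every task is solved at the end of training.
    return float(np.mean(final_success_rates))
```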
Forgetting. We observe that most CL methods are quite effective at mitigating forgetting. However, we did not notice any boost when a task was revisited. Even though a separate output head is used for each task, relearning the shared internal representation should have had a visible impact, unless that representation changed considerably when the task was revisited. Additionally, we found A-GEM difficult to tune; consequently, even with the best hyperparameter settings it behaves relatively similarly to the baseline fine-tuning method.
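A minimal sketch of how forgetting can be measured, assuming per-task success rates are logged right after each task's own training phase and again at the end of the whole sequence:

```python
import numpy as np

def forgetting(success_after_task, success_at_end):
    # Per-task drop in success rate between the moment the task's own
    # training phase ends and the end of the whole sequence; positive
    # values mean the task was (partially) forgotten.
    drops = np.asarray(success_after_task) - np.asarray(success_at_end)
    return float(np.mean(drops))
```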
Transfers. For all methods, the forward transfer on the second ten tasks (when the same tasks are revisited) drops compared to the first ten. This is in stark contrast to forgetting, which seems to be well under control. Among all methods, only fine-tuning and PackNet achieve positive forward transfer on the whole sequence (0.20 and 0.18, respectively), as well as on the first (0.32 and 0.22) and the second (0.08 and 0.14) halves of the tasks. However, these values are considerably smaller than the reference transfer RT = 0.46, which should be reached by a model that remembers all meaningful aspects of previously seen tasks and which in principle can even be exceeded. These results paint a rather grim picture: we would expect an improvement, not a deterioration, in performance when revisiting previously seen tasks. There could be multiple reasons for this state of affairs. It could be attributed to a loss of plasticity. Another reason could be interference between the CL mechanisms or setting and RL, for instance, hindered exploration.
PackNet. PackNet stands out in our evaluations, and we conjecture that developing related methods might be a promising research direction. Besides further increasing performance, one could mitigate PackNet's limitations. First, PackNet relies on knowing the task identity during evaluation. While this assumption is met in our benchmark, developing methods that cope without task identity is an interesting topic for future research. Another nuisance is that PackNet assigns a fixed fraction of parameters to each task, which necessitates knowing the length of the sequence in advance. Additionally, when the second ten tasks of CW20 start, PackNet's performance degrades, suggesting a potentially inefficient use of capacity and past knowledge, given that the second ten tasks are identical to the first ten and hence no additional capacity is needed.
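To illustrate the parameter-isolation mechanism behind PackNet, here is a minimal PyTorch sketch of its per-task pruning step; the names `free_mask` and `keep_fraction` are illustrative, and the full method additionally retrains the kept weights after pruning and freezes them for all subsequent tasks.

```python
import torch

def packnet_prune(weights, free_mask, keep_fraction):
    # Among the parameters not yet claimed by earlier tasks, keep the
    # `keep_fraction` with the largest magnitude for the current task
    # and release the rest as capacity for future tasks.
    free_magnitudes = weights[free_mask].abs()
    k = int(keep_fraction * free_magnitudes.numel())
    threshold = free_magnitudes.topk(k).values.min() if k > 0 else float("inf")
    task_mask = free_mask & (weights.abs() >= threshold)
    new_free_mask = free_mask & ~task_mask  # remaining free capacity
    return task_mask, new_free_mask
```

At evaluation time, only the masks of tasks up to the queried one are applied, which is why PackNet needs the task identity at test time.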
In a broader context, we speculate that parameter isolation methods might be a promising direction towards better CL methods.
Backward transfer. In our experiments, we did not observe any substantial cases of backward transfer, even though the benchmark is well suited to study this question due to the revisiting of tasks.