Continual Vision-based Reinforcement Learning with Group Symmetries

Abstract

Continual reinforcement learning aims to sequentially learn a variety of tasks, retaining the ability to perform previously encountered tasks while developing policies for novel ones. However, current continual RL approaches overlook the fact that certain tasks are identical under basic group operations such as rotations or translations, especially with visual inputs. They may unnecessarily learn and maintain a new policy for each similar task, leading to poor sample efficiency and weak generalization. To address this, we introduce COVERS, a Continual Vision-based Reinforcement Learning method that recognizes Group Symmetries and cultivates a policy for each group of equivalent tasks rather than for each individual task. COVERS employs a proximal policy optimization (PPO)-based RL algorithm with an equivariant feature extractor and a novel task grouping mechanism that relies on the extracted invariant features. We evaluate COVERS on sequences of table-top manipulation tasks that incorporate image observations and robot proprioceptive information, both in simulation and on real robot platforms. Our results show that COVERS accurately assigns tasks to their respective groups and significantly outperforms existing methods in terms of generalization capability.

Video

covers_video.mp4

COVERS

COVERS grows a PPO-based policy with an equivariant feature extractor for each group of equivalent tasks, rather than for each individual task, enabling it to solve unseen tasks from previously seen groups in a zero-shot manner. It relies on a novel unsupervised task grouping mechanism that automatically detects group boundaries based on the 1-Wasserstein distance in the invariant feature space. Moreover, COVERS handles inputs with multiple modalities, including images and robot proprioceptive states.

Fig. 1. Policy Architecture.
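
For intuition, the sketch below shows one simple way to obtain rotation-invariant image features by averaging a shared encoder's outputs over the four 90-degree rotations (the C4 group). This is a minimal illustration, not the extractor used in COVERS: the encoder layers, feature dimension, and the omission of proprioceptive-state fusion are all simplifying assumptions.

```python
import torch
import torch.nn as nn

class C4InvariantEncoder(nn.Module):
    """Toy C4-invariant image encoder (illustrative only, not COVERS's
    architecture): average a shared CNN's features over the four
    90-degree rotations of the input image."""

    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        # img: (B, 3, H, W). Rotating the input by 90 degrees only permutes
        # the set of rotated copies below, so the averaged feature is
        # unchanged, i.e., invariant under the C4 group.
        feats = [self.cnn(torch.rot90(img, k, dims=(2, 3))) for k in range(4)]
        return torch.stack(feats, dim=0).mean(dim=0)
```

Invariant features of this kind are what the grouping mechanism compares, while the policy itself consumes the equivariant features.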

Fig. 2. Calculation of 1-Wasserstein Distance.
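
To make the grouping rule concrete, here is a rough sketch of how an incoming batch of invariant features could be compared against each existing group's feature buffer using the 1-Wasserstein distance, spawning a new group (and a new policy) when every distance exceeds a threshold. The per-dimension averaging, SciPy solver, and threshold value are assumptions for illustration, not the exact procedure in the paper.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def group_distance(new_feats: np.ndarray, group_feats: np.ndarray) -> float:
    """Crude 1-Wasserstein distance between two feature batches of shape
    (num_samples, feat_dim): average the 1D distances over dimensions."""
    dims = new_feats.shape[1]
    return float(np.mean([
        wasserstein_distance(new_feats[:, d], group_feats[:, d])
        for d in range(dims)
    ]))

def assign_group(new_feats: np.ndarray, group_buffers: list,
                 threshold: float = 1.0) -> int:
    """Return the index of the closest existing group, or len(group_buffers)
    to signal that a new group should be created. `threshold` is a
    hypothetical hyperparameter."""
    if not group_buffers:
        return 0
    dists = [group_distance(new_feats, buf) for buf in group_buffers]
    best = int(np.argmin(dists))
    return best if dists[best] < threshold else len(group_buffers)
```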

Simulation Experiments

We validate COVERS's performance in non-stationary table-top manipulation environments. We show that (a) the group-symmetric information from the equivariant feature extractor promotes the algorithm's adaptivity by maximizing positive interference within each group, and (b) the task grouping mechanism recovers the ground-truth group indices, which helps minimize negative interference among different groups.

Fig. 3. Simulation Environment Setup.

Fig. 4. Training curves.

Real-world Validation

The real-world experimental environments: Plate Slide, Button Press, Drawer Close, and Goal Reach.