Lifelong Robotic Reinforcement Learning via Retaining Experiences

Annie Xie and Chelsea Finn

Paper: https://arxiv.org/abs/2109.09180

Abstract: Multi-task learning ideally allows robots to acquire a diverse repertoire of useful skills. However, many multi-task reinforcement learning efforts assume the robot can collect data from all tasks at all times. In reality, the tasks that the robot learns arrive sequentially, depending on the user and the robot's current environment. In this work, we formalize a sequential multi-task RL problem motivated by the practical constraints of physical robotic systems. We analyze several of the design decisions for algorithms in such settings, and derive an approach that effectively leverages the data and policies learned for previous tasks to cumulatively grow the robot's skill-set. In a series of simulated robotic manipulation experiments, we find this approach accelerates learning of each additional task, requiring less than half the number of samples needed to learn each task from scratch, while avoiding impractical round-robin data collection. On a Franka Emika Panda robot arm, our approach can incrementally learn a policy for 6 challenging tasks, including bottle capping and block insertion.

In our framework, we perform two steps for each new task. First, we pre-train on the prior experience retained from earlier tasks. To align this data with the new objective, we relabel it with the underlying reward function of the upcoming task before pre-training. Second, we learn online in the robot's physical environment, gathering new data to continuously improve the robot's policy until the task is solved.
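The sketch below is a minimal illustration of this two-step procedure, not the authors' implementation. The `ReplayBuffer` and `Transition` classes and the `agent`/`env` interfaces (`agent.act`, `agent.update`, `env.reset`, `env.step`) are hypothetical stand-ins for a generic off-policy RL setup (e.g., SAC), and the step and episode counts are placeholder values.

```python
import random
from collections import namedtuple

Transition = namedtuple("Transition", "obs action reward next_obs done")

class ReplayBuffer:
    """Simple list-backed replay buffer (hypothetical stand-in)."""
    def __init__(self):
        self.data = []

    def add(self, transition):
        self.data.append(transition)

    def sample(self, batch_size):
        return random.sample(self.data, min(batch_size, len(self.data)))

def relabel(prior_buffers, reward_fn):
    """Rewrite the rewards of all retained transitions with the new
    task's reward function, aligning old experience to the new objective."""
    out = ReplayBuffer()
    for buf in prior_buffers:
        for t in buf.data:
            out.add(t._replace(reward=reward_fn(t.obs, t.action, t.next_obs)))
    return out

def learn_new_task(agent, env, prior_buffers, reward_fn,
                   pretrain_steps=50_000, online_episodes=200):
    # Step 1: pre-train offline on relabeled experience from earlier tasks.
    buffer = relabel(prior_buffers, reward_fn)
    for _ in range(pretrain_steps):
        agent.update(buffer.sample(256))

    # Step 2: collect new data online and keep improving the policy.
    for _ in range(online_episodes):
        obs, done = env.reset(), False
        while not done:
            action = agent.act(obs)
            next_obs, done = env.step(action)
            buffer.add(Transition(obs, action,
                                  reward_fn(obs, action, next_obs),
                                  next_obs, done))
            agent.update(buffer.sample(256))
            obs = next_obs

    prior_buffers.append(buffer)  # retain this task's data for future tasks
    return agent
```

The key design choice captured here is that old experience is never discarded: each task's buffer is kept and relabeled under every subsequent task's reward, which is what allows pre-training without round-robin data collection.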

Experimental Results

A Franka Emika robot arm learns a sequence of manipulation tasks, including object insertion and bottle capping, by retaining experience from previously solved tasks. Our algorithm can learn a sequence of tasks with different physical setups and objectives using fewer samples than learning each task from scratch.

Task 2: Insert marker #1
Task 3: Insert eraser
Task 4: Cap bottle #1
Task 5: Cap bottle #2
Task 6: Insert block #1
Task 7: Insert block #2
Task 8: Insert block #3
Task 9: Cap bottle #3
Task 10: Insert marker #2

Quantitative Results

The learning curve averaged across tasks on the physical robot. For each data point, we average the distance to the goal at the final episode time step across the 10 trials for each of the 5 tasks. Error bars represent 95% confidence intervals.
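For reference, a hypothetical sketch of the aggregation behind this plot. The exact interval computation used in the paper is not specified here; a normal-approximation interval is assumed, and the function name and array layout are placeholders.

```python
import numpy as np

def learning_curve_with_ci(final_distances):
    """final_distances: array of shape (checkpoints, tasks, trials),
    the distance to goal at the final episode time step
    (here 5 tasks x 10 trials per checkpoint)."""
    flat = final_distances.reshape(final_distances.shape[0], -1)
    mean = flat.mean(axis=1)
    # Normal-approximation 95% CI half-width from the standard error.
    sem = flat.std(axis=1, ddof=1) / np.sqrt(flat.shape[1])
    return mean, 1.96 * sem
```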