Training a robot to accomplish a new task is challenging and labor-intensive. In this work, we demonstrate that utilizing large prior datasets spanning many diverse tasks simultaneously addresses several key problems in real-world robotic learning:
Sample-inefficiency: Reinforcement learning (RL) algorithms typically require a substantial amount of data, which may be time-consuming to collect on hardware.
Need for manual resets: Returning an environment to its initial state requires laborious human intervention and limits the ability to continually collect data.
Failure to generalize: Robotic RL policies often fail when deployed beyond the carefully controlled setting in which they were learned.
Our robotic policies map high-dimensional RGB images and robot states to a 7-dimensional action space. Rewards are provided in a sparse (+1/0) fashion by an object detection network.
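To make this interface concrete, the sketch below shows one way such a policy could be structured: a convolutional encoder for the RGB image, an MLP head that fuses image features with the robot state and outputs a 7-dimensional action, and a sparse +1/0 reward driven by a success detector. The image resolution, state dimension, layer sizes, and the `detect_success` helper are illustrative assumptions, not the exact architecture or detector used in this work.

```python
# Minimal sketch of an image + state policy with a sparse detector-based reward.
# Shapes, layer sizes, and detect_success are assumptions for illustration only.
import torch
import torch.nn as nn


class ImageStatePolicy(nn.Module):
    """Maps an RGB image and a robot state vector to a 7-dimensional action."""

    def __init__(self, state_dim: int = 10, action_dim: int = 7):
        super().__init__()
        # Small convolutional encoder for the RGB observation (assumed 64x64 input).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():
            feat_dim = self.encoder(torch.zeros(1, 3, 64, 64)).shape[1]
        # MLP head that fuses image features with the proprioceptive state.
        self.head = nn.Sequential(
            nn.Linear(feat_dim + state_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),  # bounded 7-dim action
        )

    def forward(self, image: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        features = self.encoder(image)
        return self.head(torch.cat([features, state], dim=-1))


def sparse_reward(image, detect_success) -> float:
    """+1/0 reward, where detect_success stands in for the object detection network."""
    return 1.0 if detect_success(image) else 0.0
```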
In the offline stage (Phase 1), we train a multi-task policy that captures prior knowledge from an offline dataset of previously experienced tasks. Optionally, we collect a small number (~40) of human demonstrations of the downstream new task we want to learn (Phase 0.5) and combine this data with the offline dataset when training our policy in Phase 1.
Then, in the online stage (Phase 2), this multi-task policy is used to initialize learning for a new task, providing both a forward policy and a backward (reset) skill, and improving learning speed and generalization.
This approach leads to sample-efficient learning of generalizable policies with a significant reduction in the need for manual interventions (i.e., environment resets).
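The schematic training loop below summarizes the two phases: offline pre-training on the multi-task prior data (optionally including the new-task demonstrations), followed by online fine-tuning that alternates the forward task with the backward (reset) skill. The function names, the `env`/`buffer` interfaces, and the strict forward/reset alternation are assumptions made for illustration; they sketch the structure of the pipeline rather than the exact algorithm or hyperparameters.

```python
# Schematic of the two-phase pipeline. Interfaces (env.observe, env.step,
# buffer.add, update_fn) and the alternation schedule are illustrative assumptions.

def offline_pretrain(policy, prior_dataset, demos, update_fn, num_steps=100_000):
    """Phase 1: fit a multi-task policy to the prior dataset, optionally
    combined with a handful of demonstrations of the new downstream task."""
    data = list(prior_dataset) + list(demos)  # demos may be an empty list
    for _ in range(num_steps):
        update_fn(policy, data)  # e.g. one gradient step on a sampled batch
    return policy


def online_finetune(policy, env, buffer, update_fn, num_trials=600):
    """Phase 2: alternate the forward task with a backward (reset) skill so the
    robot keeps practicing with few manual environment resets."""
    for trial in range(num_trials):
        task = "forward" if trial % 2 == 0 else "reset"  # backward skill undoes the task
        obs, done = env.observe(), False
        while not done:
            action = policy(obs, task)               # task-conditioned action
            obs, reward, done = env.step(action)     # sparse +1/0 reward from the detector
            buffer.add(obs, action, reward, done, task)
        update_fn(policy, buffer)                    # off-policy update after each trial
    return policy
```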
We collected data using scripted policies that pick up objects and place them into containers, as well as policies that open a drawer and place an object inside. The photo shows the diverse objects and containers we used to construct the container pick-and-place tasks in our experiments; a sketch of such a scripted routine follows the caption below.
Upper: containers and objects used in the offline data for pre-training.
Lower: test-time containers and objects used as part of new tasks for online fine-tuning.
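A scripted collection policy of this kind can be written as a short sequence of primitive motions. The `env.move_to` helper, the waypoint offsets, and the noise magnitude below are hypothetical and only illustrate the general structure of such a script, not the exact one used to collect our dataset.

```python
# Illustrative scripted pick-and-place routine for data collection.
# The env interface and waypoint values are assumptions, not the actual script.
import numpy as np


def scripted_pick_and_place(env, object_pos, container_pos, noise=0.01):
    """Hypothetical scripted routine: grasp an object and drop it into a container.
    Small waypoint noise makes the collected trajectories more varied."""
    object_pos = np.asarray(object_pos, dtype=float)
    container_pos = np.asarray(container_pos, dtype=float)

    def noisy(p):
        return p + np.random.uniform(-noise, noise, size=3)

    # (target position, gripper command) waypoints for a simple pick-and-place motion.
    waypoints = [
        (noisy(object_pos + [0, 0, 0.10]),    "open"),   # hover above the object
        (noisy(object_pos),                   "open"),   # descend to the object
        (noisy(object_pos),                   "close"),  # grasp
        (noisy(object_pos + [0, 0, 0.15]),    "close"),  # lift
        (noisy(container_pos + [0, 0, 0.15]), "close"),  # carry over the container
        (noisy(container_pos + [0, 0, 0.15]), "open"),   # release into the container
    ]

    trajectory = []
    for target, gripper in waypoints:
        obs, action = env.move_to(target, gripper)  # assumed helper on the env wrapper
        trajectory.append((obs, action))
    return trajectory
```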
Policy rollouts shown after 0, 100, and 600 trials of online fine-tuning.
Policies initialized with multi-task data (ARIEL) generalize zero-shot to objects not seen in the prior dataset, whereas policies trained only on single-task data do not.