Real World Offline Reinforcement Learning with Realistic Data Source
Gaoyue Zhou*, Liyiming Ke*, Siddhartha Srinivasa, Abhinav Gupta, Aravind Rajeswaran, Vikash Kumar
Published as a conference paper at ICRA (2023)
Summary
Real-robot Offline RL (ORL) should focus on using multi-task, near-optimal data.
From our real-robot experiments with 6,500 trajectories, 800 robot hours, and 270 hours of human labor:
ORL generalizes better than Behavior Cloning on
(1) task-space regions with low data support
(2) dynamic tasks
Feeding ORL more data is not guaranteed to boost performance.
Some ORL algorithms can surpass Behavior Cloning even in settings that traditionally favor imitation learning.
Task & Dataset
The dataset we collected is available for download via the Dataset link.
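As a rough illustration of how such an offline trajectory dataset might be consumed, the sketch below assumes trajectories are stored as a pickled list of dictionaries with "observations", "actions", and "rewards" keys; the file name and field names are assumptions for illustration, not the documented format of the released dataset.

```python
import pickle
import numpy as np

# Minimal sketch of loading an offline trajectory dataset.
# The file name and per-trajectory keys below are assumptions,
# not the documented format of the released dataset.
with open("trajectories.pkl", "rb") as f:
    trajectories = pickle.load(f)  # assumed: list of dicts

for traj in trajectories:
    obs = np.asarray(traj["observations"])  # (T, obs_dim)
    acts = np.asarray(traj["actions"])      # (T, act_dim)
    rews = np.asarray(traj["rewards"])      # (T,)
    print(f"length={len(obs)}, return={rews.sum():.2f}")
```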
Highlighted Results
A. Implicit Q-Learning is a strong baseline even for in-domain tasks
In the traditional imitation-learning setting, the agent is given abundant, high-quality task data.
Surprisingly, Implicit Q-Learning (IQL) achieved the highest scores on 2 out of 4 tasks.
It wins on more tasks than even Behavior Cloning.
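For context, IQL fits a value function with expectile regression over dataset Q-values and extracts a policy with advantage-weighted behavior cloning, so it never queries out-of-distribution actions. The sketch below is a minimal, hedged rendering of those two losses; tensor names and hyperparameters are assumptions, not the exact setup used in the paper.

```python
import torch

def expectile_loss(diff: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    # Asymmetric L2 loss: residuals where Q exceeds V are weighted by tau,
    # the rest by (1 - tau), so V tracks an upper expectile of Q.
    weight = torch.where(diff > 0,
                         torch.full_like(diff, tau),
                         torch.full_like(diff, 1.0 - tau))
    return (weight * diff.pow(2)).mean()

def iql_losses(q_target, v, log_prob, beta: float = 3.0, tau: float = 0.7):
    # q_target: Q(s, a) from the target critic on dataset (s, a) pairs
    # v:        V(s) from the value network
    # log_prob: log pi(a | s) of the dataset action under the current policy
    adv = q_target - v
    value_loss = expectile_loss(adv, tau)
    # Advantage-weighted behavior cloning for policy extraction.
    weights = torch.exp(beta * adv.detach()).clamp(max=100.0)
    policy_loss = -(weights * log_prob).mean()
    return value_loss, policy_loss
```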
B. ORL generalizes better than BC in task-space regions that lack data support
If we have trained an agent to wipe a small table, can we ask it to generalize to wipe a larger table?
We remove a subset of trajectories so that the demonstration data has less support for the "center region".
We found that all ORL agents outperformed BC in this region of relatively low data support.
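A hypothetical sketch of how such a low-support dataset could be constructed is shown below: drop most trajectories whose end-effector path enters a held-out "center region". The region bounds, field names, and keep fraction are assumptions for illustration, not the paper's exact protocol.

```python
import numpy as np

# Assumed bounds of the held-out "center region" in table x, y coordinates.
CENTER_MIN, CENTER_MAX = np.array([0.4, -0.1]), np.array([0.6, 0.1])

def enters_center(traj) -> bool:
    # Assumed field: per-step end-effector positions, shape (T, 3).
    xy = np.asarray(traj["ee_positions"])[:, :2]
    inside = np.all((xy >= CENTER_MIN) & (xy <= CENTER_MAX), axis=1)
    return bool(inside.any())

def subsample_support(trajectories, keep_fraction=0.1, seed=0):
    # Keep only a small fraction of trajectories that pass through the center.
    rng = np.random.default_rng(seed)
    kept = []
    for traj in trajectories:
        if not enters_center(traj) or rng.random() < keep_fraction:
            kept.append(traj)
    return kept
```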
C. ORL > BC at reusing offline data with static goals for a task with dynamic goals
We ask the agent to follow a moving goal vector at test time, which can be viewed as a simplified version of everyday tasks such as drawing and wiping.
BC suffered from covariate shift and had trouble finishing the curves.
ORL algorithms mostly completed the ideal trajectories.
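As a hedged sketch of what a dynamic-goal evaluation could look like, the snippet below generates waypoints along a target curve (a circle here) and feeds the agent the current goal at each step. The curve, the goal-conditioned observation format, and the env/policy interfaces are assumptions for illustration.

```python
import numpy as np

def circle_waypoints(center=(0.5, 0.0), radius=0.1, steps=100):
    # Goal points sampled along a circle, standing in for a traced shape.
    theta = np.linspace(0.0, 2 * np.pi, steps)
    return np.stack([center[0] + radius * np.cos(theta),
                     center[1] + radius * np.sin(theta)], axis=1)

def rollout_with_moving_goal(env, policy, waypoints):
    # At each step the observation is augmented with the current goal,
    # so the goal the agent sees moves along the curve during the rollout.
    obs = env.reset()
    for goal in waypoints:
        obs_with_goal = np.concatenate([obs, goal])  # assumed goal-conditioned input
        action = policy(obs_with_goal)
        obs, _, done, _ = env.step(action)
        if done:
            break
```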
D. Reusing offline, multi-task data
A realistic application of ORL would use a multi-task dataset and, ideally, make progress on a new target task.
(1) All ORL agents could improve performance on some tasks, but the gains vary by task and by dataset.
(2) Leveraging offline data enabled ORL agents to achieve better performance, even surpassing the best in-domain agents.
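One hypothetical way to set up such multi-task reuse is sketched below: pool trajectories from several source tasks and relabel their rewards with the target task's reward function before ORL training. The reward-relabeling scheme, function names, and trajectory fields are assumptions for illustration, not the paper's pipeline.

```python
import numpy as np

def relabel_for_target(trajectories, target_reward_fn):
    # Replace each trajectory's rewards with rewards computed under the
    # target task's (assumed) reward function.
    relabeled = []
    for traj in trajectories:
        new_traj = dict(traj)
        new_traj["rewards"] = np.array([
            target_reward_fn(obs, act)
            for obs, act in zip(traj["observations"], traj["actions"])
        ])
        relabeled.append(new_traj)
    return relabeled

def build_multitask_dataset(per_task_data, target_reward_fn):
    # per_task_data: dict mapping task name -> list of trajectories.
    pooled = [t for task_trajs in per_task_data.values() for t in task_trajs]
    return relabel_for_target(pooled, target_reward_fn)
```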
Representative Agents
BC, Pick-n-Place
MOReL, Lift
AWAC, Pick-n-Place
IQL, Slide
BC, Trace Shape "3"
MOReL, Trace "0"
AWAC, Trace "5"
IQL, Trace "8"