Real World Offline Reinforcement Learning with Realistic Data Source

Gaoyue Zhou*, Liyiming Ke*,  Siddhartha Srinivasa, Abhinav Gupta, Aravind Rajeswaran, Vikash Kumar

Published as a conference paper at ICRA (2023)

|             ArXiv            |             Dataset ZIP            |              Github            |


Real-robot Offline RL (ORL) should focus on using multi-task, near-optimal data.

From our real-robot experiments with 6,500 trajectories, 800 robot hours, and 270 hours of human labor:


Task & Dataset

The dataset we collected is available for download at Dataset

Highlighted Results

A. Implicit Q-Learning is a strong baseline even for in-domain tasks

In the traditional imitation-learning setting, the agent is given abundant, high-quality task data.
Surprisingly, Implicit Q-Learning (IQL) achieved the highest scores on 2 out of 4 tasks,
winning on more tasks than even Behavior Cloning (BC).
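The distinguishing ingredient of IQL is that it fits a value function with an asymmetric (expectile) regression loss, so the value estimate approximates an in-support maximum of Q without ever querying out-of-distribution actions. As a rough illustration (not the paper's implementation; the function name and `tau` default are ours), the loss looks like:

```python
import numpy as np

def expectile_loss(diff, tau=0.7):
    """Asymmetric L2 loss used in IQL's value update.

    diff = Q(s, a) - V(s).  With tau > 0.5, positive errors are
    weighted more heavily, pushing V(s) toward an upper expectile
    of Q over the dataset's actions.
    """
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return weight * diff ** 2

# Overestimation (diff > 0) is penalized more than underestimation:
print(expectile_loss(np.array([1.0, -1.0]), tau=0.7))
```

With tau = 0.5 this reduces to ordinary mean-squared-error regression; as tau approaches 1 it approaches a max over actions seen in the data.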

B. ORL generalizes better than BC in task-space regions that lack data support

If we have trained an agent to wipe a small table, can we ask it to generalize to wipe a larger table?
We removed a subset of trajectories so that the demonstration data has less support for the "center region".
We found that all ORL agents performed better than BC in this region of relatively low data support.

C. ORL > BC at reusing offline data with a static goal for a task with a dynamic goal

We ask the agent to follow a moving goal vector at test time, which can be viewed as a simplified version of daily tasks such as drawing and wiping.

BC suffered from covariate shift and had trouble finishing the curves,
whereas the ORL algorithms mostly completed the ideal trajectories.

D. Reusing offline, multi-task data

Realistic applications of ORL should reuse multi-task datasets to make progress on a new target task. We found that:
(1) All ORL agents could improve performance on some tasks, but the changes vary by task and by dataset.
(2) Leveraging offline data enabled ORL agents to achieve better performance, even surpassing the best in-domain agents.

Representative Agents

BC, Pick-n-Place

MOReL, Lift

AWAC, Pick-n-Place

IQL, Slide

BC, Trace Shape "3"

MOReL, Trace "0"

AWAC, Trace "5"

IQL, Trace "8"