Abstract
Offline reinforcement learning algorithms hold the promise of enabling data-driven RL methods that do not require costly or dangerous real-world exploration and benefit from large pre-collected datasets. This in turn can facilitate real-world applications, as well as a more standardized approach to RL research. Furthermore, offline RL methods can provide effective initializations for online fine-tuning to overcome challenges with exploration. However, evaluating progress on offline RL algorithms requires effective and challenging benchmarks that capture properties of real-world tasks, provide a range of task difficulties, and cover a range of challenges both in terms of the parameters of the domain (e.g., length of the horizon, sparsity of rewards) and the parameters of the data (e.g., narrow demonstration data or broad exploratory data). While considerable progress in offline RL in recent years has been enabled by simpler benchmark tasks, the most widely used datasets are increasingly saturating in performance and may fail to reflect properties of realistic tasks. We propose a new benchmark for offline RL that focuses on realistic simulations of robotic manipulation and locomotion environments, based on models of real-world robotic systems, and comprising a variety of data sources, including scripted data, play-style data collected by human teleoperators, and other sources. Our proposed benchmark covers state-based and image-based domains, and supports both offline RL and online fine-tuning evaluation, with some of the tasks specifically designed to require both pre-training and fine-tuning. We hope that our proposed benchmark will facilitate further progress on both offline RL and fine-tuning algorithms.
Why D5RL
The most widely used datasets for offline RL and pre-training are increasingly saturating in performance and do not reflect the challenges of deploying reinforcement learning on realistic systems. The design of D5RL was driven by the following considerations:
Help drive the development of RL algorithms that scale to realistic systems.
Provide realistic datasets, reflective of real-world data distributions, such as human teleoperation, sub-optimal/partial execution, and play data.
Reflect realistic challenges of real-world deployment, including a variety of distribution shifts.
Tackle realistic reinforcement learning optimization challenges, such as narrow data, long horizons, and multi-modal, heteroskedastic distributions.
Provide a testbed that is accessible and allows for quick iteration and experimentation, while still addressing the above issues.
We have also provided a unified, highly performant library of offline RL algorithms, which can drive the next stage of benchmarking, evaluation, and development of offline RL and pre-training algorithms.
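To make the intended interface concrete, here is a minimal sketch of the kind of unified agent API such a library might expose. The `BCAgent` class, `sample_batch` helper, and dict-of-arrays batch format are illustrative assumptions for this sketch, not the actual D5RL library API.

```python
import numpy as np

# Illustrative assumption: a unified offline RL library exposes agents with
# an `update(batch) -> metrics` method and an `eval_actions(obs)` method,
# and represents datasets as dictionaries of aligned numpy arrays.
class BCAgent:
    """Toy behavioral-cloning agent (linear policy), for illustration only."""

    def __init__(self, obs_dim: int, action_dim: int, lr: float = 1e-3):
        self.W = np.zeros((obs_dim, action_dim))
        self.lr = lr

    def update(self, batch: dict) -> dict:
        obs, actions = batch["observations"], batch["actions"]
        pred = obs @ self.W
        grad = obs.T @ (pred - actions) / len(obs)  # gradient of MSE loss
        self.W -= self.lr * grad
        return {"bc_loss": float(np.mean((pred - actions) ** 2))}

    def eval_actions(self, obs: np.ndarray) -> np.ndarray:
        return obs @ self.W


def sample_batch(dataset: dict, batch_size: int, rng: np.random.Generator) -> dict:
    idx = rng.integers(len(dataset["observations"]), size=batch_size)
    return {k: v[idx] for k, v in dataset.items()}


# Offline training loop over a (fake, randomly generated) transition dataset.
rng = np.random.default_rng(0)
dataset = {
    "observations": rng.normal(size=(1000, 8)).astype(np.float32),
    "actions": rng.normal(size=(1000, 2)).astype(np.float32),
}
agent = BCAgent(obs_dim=8, action_dim=2)
for step in range(1000):
    metrics = agent.update(sample_batch(dataset, 256, rng))
```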
Tasks
Legged Locomotion: The goal of the legged locomotion tasks is to study the efficacy of offline RL methods in handling low-level control problems with complex dynamics. We set up these tasks on a simulated Unitree A1 robot platform; policies must be learned from low-dimensional proprioceptive observations.
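As a usage sketch, the locomotion tasks follow the standard Gymnasium interaction pattern; the environment ID below is hypothetical and stands in for whichever IDs the benchmark registers.

```python
import gymnasium as gym

# Hypothetical environment ID: we assume the benchmark package registers its
# A1 locomotion tasks with Gymnasium under its own names.
env = gym.make("A1-Walk-v0")

obs, info = env.reset(seed=0)
print(obs.shape)  # low-dimensional proprioceptive state (joint angles, velocities, ...)

episode_return = 0.0
for _ in range(1000):
    action = env.action_space.sample()  # placeholder for a learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    if terminated or truncated:
        obs, info = env.reset()
```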
Franka Kitchen Environment: This task builds on the Franka Kitchen environment introduced by Relay Policy Learning. The objective in this environment is to manipulate a set of 4 pre-specified objects. We modify the task to utilize multiple image observations rather than ground-truth object locations, thus providing an observation space that more realistically reflects robotic manipulation scenarios.
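Because observations here are multiple camera images rather than object poses, a policy typically folds the views into a single encoder input. The dictionary keys and image sizes below are illustrative assumptions, not the environment's actual observation spec.

```python
import numpy as np

# Illustrative assumption: each observation is a dict of camera views,
# e.g. a fixed external camera and a wrist camera, as HxWxC uint8 images.
obs = {
    "camera_0": np.zeros((128, 128, 3), dtype=np.uint8),
    "camera_1": np.zeros((128, 128, 3), dtype=np.uint8),
}

def preprocess(obs: dict) -> np.ndarray:
    """Stack camera views along the channel axis and normalize to [0, 1]."""
    views = [obs[k].astype(np.float32) / 255.0 for k in sorted(obs)]
    return np.concatenate(views, axis=-1)  # (128, 128, 6) encoder input

encoder_input = preprocess(obs)
```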
Randomized Kitchen Environment: This environment is based on the Kitchen Shift design. The environment features a large amount of domain randomization. This level of variability introduces a significant challenge in terms of robustness and representation learning, reflecting challenges likely to be seen in the real world. We also collected a new dataset of over 20 hours of human tele-operated play data. This distribution presents significant challenges due to its multimodality, varying quality, and long horizons.
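Play data is not segmented into task episodes, so a common first step is to split the raw stream into trajectories before filtering or relabeling. The flat dict-of-arrays layout with a `terminals` field below is an assumption (in the style of D4RL datasets), not a guaranteed D5RL data format.

```python
import numpy as np

# Assumed D4RL-style flat layout: aligned arrays plus episode-boundary flags.
rng = np.random.default_rng(0)
n = 500
dataset = {
    "observations": rng.normal(size=(n, 10)).astype(np.float32),
    "actions": rng.normal(size=(n, 4)).astype(np.float32),
    "terminals": np.zeros(n, dtype=bool),
}
dataset["terminals"][[99, 249, 399, 499]] = True  # fake episode boundaries

def split_episodes(dataset: dict) -> list:
    """Split a flat transition stream into per-episode dictionaries."""
    ends = np.flatnonzero(dataset["terminals"]) + 1
    starts = np.concatenate([[0], ends[:-1]])
    return [
        {k: v[s:e] for k, v in dataset.items()}
        for s, e in zip(starts, ends)
    ]

episodes = split_episodes(dataset)
print(len(episodes), [len(ep["actions"]) for ep in episodes])
```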
WidowX Sorting Environment: The simulated robot is a 6-DOF WidowX arm placed in front of two identical white bins with 2 objects to sort. The goal of this task is to study composition of suboptimal trajectories to solve longer-horizon tasks, incorporate visual observations, and handle data from weak scripted policies. These ingredients reflect problems that are often encountered in offline robotic RL, where we might want to compose longer-horizon behaviors out of datasets depicting individual primitive skills.
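One simple way to study this kind of composition is to relabel per-primitive data with the composite task's sparse reward, so that value learning can stitch partial successes together. The success predicate and observation layout below are illustrative assumptions for this sketch.

```python
import numpy as np

# Illustrative assumption: the last two observation dimensions encode each
# object's bin status (1.0 once that object has been sorted, else 0.0).
def sorted_count(obs: np.ndarray) -> np.ndarray:
    return obs[..., -2:].sum(axis=-1)

def relabel_with_task_reward(episode: dict) -> dict:
    """Replace per-primitive rewards with the composite sorting reward:
    +1 only on steps where both objects are in their bins."""
    obs = episode["observations"]
    episode = dict(episode)
    episode["rewards"] = (sorted_count(obs) >= 2.0).astype(np.float32)
    return episode

# Example: a short fake episode where the second object is sorted at step 3.
obs = np.zeros((5, 12), dtype=np.float32)
obs[:, -2] = 1.0   # first object already sorted throughout
obs[3:, -1] = 1.0  # second object sorted from step 3 onward
episode = {"observations": obs, "rewards": np.ones(5, dtype=np.float32)}
print(relabel_with_task_reward(episode)["rewards"])  # [0. 0. 0. 1. 1.]
```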