Offline Reinforcement Learning at Multiple Frequencies

Kaylee Burns1, Tianhe Yu1,2, Chelsea Finn1,2, Karol Hausman1,2

1 Stanford University, 2 Robotics at Google

Paper | Code

Abstract

To leverage many sources of offline robot data, robots must grapple with the heterogeneity of such data. In this paper, we focus on one particular aspect of this challenge: learning from offline data collected at different control frequencies. Across labs, the discretization of controllers, sampling rates of sensors, and demands of a task of interest may differ, giving rise to a mixture of frequencies in an aggregated dataset. We study how well offline reinforcement learning (RL) algorithms can accommodate data with a mixture of frequencies during training. We observe that the Q-value propagates at different rates for different discretizations, leading to a number of learning challenges for off-the-shelf offline RL algorithms. We present a simple yet effective solution that enforces consistency in the rate of Q-value updates to stabilize learning. By scaling the value of N in N-step returns with the discretization size, we effectively balance Q-value propagation, leading to more stable convergence. On three simulated robotic control problems, we empirically find that this simple approach significantly outperforms naive mixing both in terms of absolute performance and training stability, while also improving over using only the data from a single control frequency.

Approach

When we mix data with multiple discretizations together, the Q-value propagates through the state space at different rates. The resulting regression targets are inconsistent and difficult to learn from.

We focus on the setting of mixing data with multiple different discretizations using offline RL. The goal of our method is to "align" the Bellman update rate across discretizations so that Q-values are updated at a similar pace, creating more consistent value targets. Intuitively, we aim to mimic the "update stride" of coarser discretizations when training on more fine-grained discretizations to bring consistency to the Q-value updates. The core idea behind our approach is to use N-step returns as a tool to accelerate the propagation of Q-values. In particular, we calculate Q-targets with adaptive N-step returns, where we scale N by the discretization dt. Specifically, we use the next N/dt steps in computing the target for the Q-value:
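With horizon N/dt, the target takes the standard N-step bootstrapped form (a reconstruction; γ denotes the discount factor and Q̄ the target network, neither of which is spelled out on this page, so the paper's exact discount handling may differ):

$$
y_t \;=\; \sum_{k=0}^{N/dt - 1} \gamma^{k}\, r_{t+k} \;+\; \gamma^{N/dt}\, \bar{Q}\big(s_{t+N/dt},\, a_{t+N/dt}\big)
$$

A minimal sketch of this computation, assuming each sampled trajectory segment carries the discretization dt of the buffer it came from (function name and defaults are ours, not the released implementation):

```python
import numpy as np

def adaptive_nstep_target(rewards, bootstrap_value, dt, base_n, gamma=0.99):
    """N-step Q-target whose horizon scales inversely with the discretization.

    rewards:         per-step rewards r_t, ..., r_{t+H-1} from one trajectory segment
    bootstrap_value: target-network Q-value at the state reached after H steps
    dt:              discretization of the buffer this segment was drawn from
    base_n:          the shared N that is scaled to a horizon of H = N / dt
    """
    horizon = int(round(base_n / dt))             # finer dt -> longer horizon
    assert len(rewards) >= horizon, "segment shorter than the adaptive horizon"
    discounts = gamma ** np.arange(horizon)       # gamma^0, ..., gamma^{H-1}
    return float(discounts @ np.asarray(rewards[:horizon])
                 + gamma ** horizon * bootstrap_value)
```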

Environments

We validate our approach on three simulated control environments. We generate data at different frame rates for offline training by storing the replay buffers from off-policy training runs. For the Pendulum environment, we focus on the swing-up task and collect data with discretizations of 0.005, 0.01, and 0.02. For the Meta-World environment, we focus on the door-open task with frame skips of 1, 2, 5, and 10. For the FrankaKitchen environment, we focus on the subgoals of opening a microwave, moving a kettle, turning on a light, and opening a drawer. We mix expert data collected at a frame skip of 40 with replay data collected at a frame skip of 30.
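Since the adaptive horizon above depends on the discretization of the transition being updated, one simple way to organize such a mixture is to tag every transition with the dt of the buffer it came from before pooling. The sketch below is illustrative only; the names and layout are ours, not the released data pipeline:

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

@dataclass
class Transition:
    obs: Sequence[float]
    action: Sequence[float]
    reward: float
    next_obs: Sequence[float]
    dt: float  # discretization of the source buffer, kept per transition

def mix_buffers(buffers: Dict[float, List[Tuple]]) -> List[Transition]:
    """Pool replay buffers collected at different control frequencies.

    `buffers` maps a discretization (e.g. 0.005, 0.01, 0.02 for Pendulum)
    to its list of (obs, action, reward, next_obs) tuples.
    """
    mixed: List[Transition] = []
    for dt, transitions in buffers.items():
        for obs, action, reward, next_obs in transitions:
            mixed.append(Transition(obs, action, reward, next_obs, dt))
    return mixed
```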

Pendulum Swing Up

Meta-World Door Open

FrankaKitchen

Experimental Results

We compare our method against three baselines:

  • Individual Training: Train a separate policy for each discretization, with no data mixing

  • Naive Mixing: Mix data from different discretizations together under a single shared policy with discretization-dependent scalings

  • Max N-Step: Use N-step returns with the same number of steps, N/min(dt), for every discretization. This ablates the effect of synchronizing the value of N (see the sketch after this list).
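To make the contrast between the two N-step variants concrete, the snippet below computes the bootstrapping horizon each one would use per discretization. The numbers are purely illustrative (base_n is chosen only for the example) and the variable names are ours:

```python
def bootstrap_horizon(dt, dts_in_dataset, base_n, adaptive=True):
    """Number of steps summed before bootstrapping the Q-target.

    adaptive=True  -> Adaptive N-Step: horizon N/dt, so coarser data uses fewer steps.
    adaptive=False -> Max N-Step: one shared horizon N/min(dt) for all discretizations.
    """
    return int(round(base_n / (dt if adaptive else min(dts_in_dataset))))

# Illustrative values using the Pendulum discretizations, with base_n = 0.02:
dts = [0.005, 0.01, 0.02]
print([bootstrap_horizon(dt, dts, base_n=0.02) for dt in dts])                  # [4, 2, 1]
print([bootstrap_horizon(dt, dts, base_n=0.02, adaptive=False) for dt in dts])  # [4, 4, 4]
```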

Within every environment, the average performance of Adaptive N-Step exceeds that of Individual Training, which demonstrates that policy learning benefits from the mixed data. On Pendulum and FrankaKitchen, Adaptive N-Step improves over Naive Mixing and Max N-Step, suggesting that synchronizing the Q-value updates is important for these environments. On the Meta-World door-open task, Adaptive N-Step improves the quality of the learned policy on the smallest discretization without hindering learning on the other discretizations.

Citation

@article{burns2022offlineRLAMF,
  title={Offline Reinforcement Learning at Multiple Frequencies},
  author={Kaylee Burns and Tianhe Yu and Chelsea Finn and Karol Hausman},
  journal={arXiv preprint arXiv:2207.13082},
  year={2022}
}

Acknowledgements

We thank Google for their support and the anonymous reviewers for their feedback. We're grateful to Evan Liu and Kyle Hsu for reviewing drafts of our submission, and to Annie Xie for providing references to implementations.