Ryan Sander, Wilko Schwarting, Tim Seyde, Igor Gilitschenski, Sertac Karaman, Daniela Rus
MIT CSAIL, University of Toronto, Toyota Research Institute, MIT LIDS
NMER is a novel replay buffer technique for continuous control tasks that linearly recombines the previous experiences of deep reinforcement learning agents via a simple geometric heuristic.
Abstract
Experience replay plays a crucial role in improving the sample efficiency of deep reinforcement learning agents. Recent advances in experience replay propose the use of Mixup to further improve sample efficiency via synthetic sample generation. We build upon this idea with Neighborhood Mixup Experience Replay (NMER), a modular replay buffer that interpolates transitions with their closest neighbors in normalized state-action space. NMER preserves a locally linear approximation of the transition manifold by only performing Mixup between transitions with similar state-action features. Under NMER, a given transition’s set of state-action neighbors is dynamic and episode agnostic, in turn encouraging greater policy generalizability via cross-episode interpolation. We combine our approach with recent off-policy deep reinforcement learning algorithms and evaluate on several continuous control environments. We observe that NMER improves sample efficiency by an average 87% (TD3) and 29% (SAC) over baseline replay buffers, enabling agents to effectively recombine previous experiences and learn from limited data.
Approach
NMER trains off-policy, model-free deep reinforcement learning (MF-DRL) agents on convex combinations of an agent's existing, proximal experiences, effectively creating a locally linear model centered around each transition in the replay buffer. NMER consists of two steps:
Update Step: When a new environment interaction is added to the replay buffer, update the nearest-neighbor data structures and state-action standardization statistics as needed.
Sampling Step: First, we sample a batch of "sample transitions" uniformly from the replay buffer. Next, we query the nearest neighbors of each transition in this sampled batch. Finally, for each sampled transition, we draw one of its neighbors uniformly at random and apply Mixup to linearly interpolate the sample-neighbor pair (see the sketch below).
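Concretely, for a sampled transition (s_i, a_i, r_i, s'_i) and a selected neighbor (s_j, a_j, r_j, s'_j), Mixup produces the convex combination λ (s_i, a_i, r_i, s'_i) + (1 − λ) (s_j, a_j, r_j, s'_j). The Python sketch below illustrates the sampling step under some simplifying assumptions: the replay buffer is a dictionary of NumPy arrays, the neighbor index is rebuilt with scikit-learn's NearestNeighbors on every call (rather than maintained incrementally in the update step), and the Mixup coefficient λ is drawn from a Beta(α, α) distribution. The function name nmer_sample and these details are illustrative rather than taken from the released implementation.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def nmer_sample(buffer, batch_size, k_neighbors=10, alpha=1.0, rng=None):
    """Draw a Mixup-interpolated training batch from a replay buffer.

    `buffer` is assumed to be a dict of 2-D NumPy arrays keyed by
    'state', 'action', 'reward', and 'next_state' (rewards of shape (n, 1)).
    """
    rng = rng or np.random.default_rng()
    n = len(buffer["state"])

    # Standardize state-action features so neighbor distances are scale-invariant.
    sa = np.concatenate([buffer["state"], buffer["action"]], axis=1)
    sa = (sa - sa.mean(axis=0)) / (sa.std(axis=0) + 1e-8)

    # In the full method this index is maintained during the update step;
    # it is rebuilt here on every call purely for clarity.
    nn_index = NearestNeighbors(n_neighbors=k_neighbors + 1).fit(sa)

    # 1) Sample transitions uniformly from the replay buffer.
    idx_i = rng.integers(0, n, size=batch_size)

    # 2) Query each sampled transition's nearest neighbors in normalized
    #    state-action space and pick one uniformly (index 0 is the point itself).
    _, neighbor_ids = nn_index.kneighbors(sa[idx_i])
    idx_j = np.array([rng.choice(row[1:]) for row in neighbor_ids])

    # 3) Mixup: convex combination of each sample/neighbor pair, with a
    #    per-pair coefficient lambda ~ Beta(alpha, alpha).
    lam = rng.beta(alpha, alpha, size=(batch_size, 1))
    mix = lambda key: lam * buffer[key][idx_i] + (1.0 - lam) * buffer[key][idx_j]

    return {k: mix(k) for k in ("state", "action", "reward", "next_state")}

The interpolated batch is then used in place of the uniformly sampled batch when computing the agent's actor and critic updates.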
Empirical Results
The results of this continuous control evaluation study indicate that NMER, combined with SAC (Soft Actor-Critic) or TD3 (Twin Delayed Deep Deterministic Policy Gradient), frequently achieves better sample efficiency than the baseline replay buffers used in this study, as well as other baseline DRL algorithms.
TD3 Results
SAC Results
Trained Agents
Below are videos of our trained agents, for both TD3 and SAC.
TD3
SAC
Acknowledgements
This research was supported by the Toyota Research Institute (TRI). This article solely reflects the opinions and conclusions of its authors and not TRI, Toyota, or any other entity. We thank TRI for their support. The authors thank the MIT SuperCloud and Lincoln Laboratory Supercomputing Center for providing HPC and consultation resources that have contributed to the research results reported within this publication.
Reference
If you find our paper useful, please consider citing it using the BibTeX below.
@inproceedings{sander2022neighborhood,
  title={Neighborhood Mixup Experience Replay: Local Convex Interpolation for Improved Sample Efficiency in Continuous Control Tasks},
  author={Sander, Ryan and Schwarting, Wilko and Seyde, Tim and Gilitschenski, Igor and Karaman, Sertac and Rus, Daniela},
  booktitle={Learning for Dynamics and Control Conference},
  pages={954--967},
  year={2022},
  organization={PMLR}
}