Learning Dexterous Manipulation from Suboptimal Experts

Abstract

Learning dexterous manipulation in high-dimensional state-action spaces is an important open challenge with exploration presenting a major bottleneck. Although in many cases the learning process could be guided by demonstrations or other suboptimal experts, current RL algorithms for continuous action spaces often fail to effectively utilize combinations of highly off-policy expert data and on-policy exploration data. As a solution, we introduce Relative Entropy Q-Learning (REQ), a simple policy iteration algorithm that combines ideas from successful offline and conventional RL algorithms. It represents the optimal policy via importance sampling from a learned prior and is well-suited to take advantage of mixed data distributions. We demonstrate experimentally that REQ outperforms several strong baselines on robotic manipulation tasks for which suboptimal experts are available. We show how suboptimal experts can be constructed effectively by composing simple waypoint tracking controllers, and we also show how learned primitives can be combined with waypoint controllers to obtain reference behaviors to bootstrap a complex manipulation task on a simulated bimanual robot with human-like hands. Finally, we show that REQ is also effective for general off-policy RL, offline RL, and RL from demonstrations.

Left: Bimanual Shadow Hand LEGO stacking task with example waypoints. Middle: Motions are concatenated sequentially to construct a useful suboptimal expert. The robot arm end effector poses are controlled with waypoint tracking controllers, while the dexterous human-like hands are controlled with learned primitives. Right: The suboptimal experts’ actions can be intertwined with the current policy’s actions to achieve rich exploration. Note that it is also possible to record the actions that the suboptimal expert would have taken, shown as dotted lines.
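The sketch below illustrates the mixed rollout described in the caption above, assuming a generic `env`, `policy`, and `expert` interface (all names are illustrative, not the paper's code): with some probability the suboptimal expert's action is executed instead of the policy's, and the action the expert would have taken is recorded with every transition.

```python
import random

def mixed_rollout(env, policy, expert, expert_prob=0.5, max_steps=1000):
    """Intertwine the suboptimal expert's actions with the current policy's actions.

    Hypothetical interfaces: policy(obs) and expert(obs) return an action;
    env.step(action) returns (next_obs, reward, done).
    """
    transitions = []
    obs = env.reset()
    for _ in range(max_steps):
        policy_action = policy(obs)
        expert_action = expert(obs)  # the expert's (possibly counterfactual) action is always recorded
        executed = expert_action if random.random() < expert_prob else policy_action
        next_obs, reward, done = env.step(executed)
        transitions.append(dict(obs=obs, action=executed,
                                expert_action=expert_action,
                                reward=reward, next_obs=next_obs, done=done))
        obs = next_obs
        if done:
            break
    return transitions
```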


Environments

Single Arm Stacking (State Size: 69, Action Size: 5)

Bimanual Insertion (State Size: 86, Action Size: 14)

Bimanual Cleanup (State Size: 129, Action Size: 14)

Bimanual Shadow (State Size: 176, Action Size: 52)


Relative Entropy Q-Learning
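As described in the abstract, REQ represents the (approximately) optimal policy via importance sampling from a learned prior. The following is a minimal, hedged sketch of that idea, not the paper's implementation: candidate actions are drawn from the learned prior, scored with the learned critic, and resampled with softmax weights under a temperature; `prior_sample`, `q_fn`, and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

def req_action(obs, prior_sample, q_fn, num_samples=20, temperature=1.0):
    """Sample an action from a prior-reweighted policy (illustrative sketch)."""
    # Draw candidate actions from the learned prior pi_prior(a | s).
    actions = np.stack([prior_sample(obs) for _ in range(num_samples)])
    # Score candidates with the learned critic Q(s, a).
    q_values = np.array([q_fn(obs, a) for a in actions])
    # Softmax importance weights (temperature controls the relative-entropy trade-off).
    logits = q_values / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Resample one candidate action in proportion to its weight.
    idx = np.random.choice(num_samples, p=weights)
    return actions[idx]
```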

Learning from Suboptimal Experts

RLfD (RL from Demonstrations) and RLfSE (RL from Suboptimal Experts) results on the tasks shown above. The dotted line indicates the performance of the suboptimal expert, and the solid line shows the performance of the best RLfD agent among CRRfD, MPOfD, REQfD, and DDPGfD. On both the bimanual insertion and cleanup tasks, the RLfD agents achieve zero performance. For the bimanual Shadow LEGO task, we evaluate only REQfSE, as this task is significantly harder than the first three. Additional results for RLfSE and RLfD are available in the Appendix.

Off-policy RL Benchmark

Online off-policy RL results on the DeepMind Control Suite [34]. The MPO and SAC results match those reported in [36] and [37], respectively. Note that SAC sometimes becomes unstable when run for longer.


[36] M. Hoffman, B. Shahriari, J. Aslanides, G. Barth-Maron, F. Behbahani, T. Norman, A. Abdolmaleki, A. Cassirer, F. Yang, K. Baumli, S. Henderson, A. Novikov, S. G. Colmenarejo, S. Cabi, C. Gulcehre, T. L. Paine, A. Cowie, Z. Wang, B. Piot, and N. de Freitas. Acme: A research framework for distributed reinforcement learning, 2020.

[37] D. Yarats and I. Kostrikov. Soft Actor-Critic (SAC) implementation in PyTorch. https://github.com/denisyarats/pytorch_sac, 2020.

[34] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. de Las Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, T. Lillicrap, and M. Riedmiller. DeepMind Control Suite, 2018.

Offline RL Benchmark

Offline RL results on the DeepMind Control Suite [34]. The baseline numbers are taken from Novikov et al. [3].


[3] A. Novikov, Z. Wang, K. Zolna, J. T. Springenberg, S. Reed, B. Shahriari, N. Siegel, C. Gulcehre, N. Heess, and N. de Freitas. Critic Regularized Regression. In submission, 2020.

[34] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. de Las Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, T. Lillicrap, and M. Riedmiller. DeepMind Control Suite, 2018.

Citation

@misc{jeong2020learning,
  title={Learning Dexterous Manipulation from Suboptimal Experts},
  author={Rae Jeong and Jost Tobias Springenberg and Jackie Kay and Daniel Zheng and Yuxiang Zhou and Alexandre Galashov and Nicolas Heess and Francesco Nori},
  year={2020},
  eprint={2010.08587},
  archivePrefix={arXiv},
  primaryClass={cs.RO}
}