Dynamics-Aware Unsupervised Discovery of Skills

International Conference on Learning Representations (ICLR), 2020

OpenReview, arXiv, code, Google AI Blog, long talk

Abstract

Conventionally, model-based reinforcement learning (MBRL) aims to learn a global model for the dynamics of the environment. A good model can potentially enable planning algorithms to generate a large variety of behaviors and solve diverse tasks. However, learning an accurate model for complex dynamical systems is difficult, and even then, the model might not generalize well outside the distribution of states on which it was trained. In this work, we propose to combine model-based learning with model-free learning of primitives that make model-based planning easy. To that end, we aim to answer the question: how can we discover skills whose outcomes are easy to predict? We propose an unsupervised learning algorithm, Dynamics-Aware Discovery of Skills (DADS), which simultaneously discovers predictable behaviors and learns their dynamics. Our method can leverage continuous skill spaces, theoretically allowing us to learn infinitely many behaviors even for high-dimensional state-spaces. We demonstrate that zero-shot planning in the learned latent space significantly outperforms standard MBRL and model-free goal-conditioned RL, can handle sparse-reward tasks, and substantially improves over prior hierarchical RL methods for unsupervised skill discovery.
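For concreteness, the sketch below illustrates the intrinsic reward that drives skill discovery: a skill-dynamics model q(s'|s, z) is trained to predict transitions, and the policy is rewarded for transitions that are predictable under its own skill but not under skills resampled from the prior. This is a minimal PyTorch sketch, not the released implementation: the unit-variance Gaussian, the network sizes, and all names here are illustrative simplifications (the paper's model, for instance, predicts state deltas with a mixture of Gaussians).

```python
import torch
import torch.nn as nn

class SkillDynamics(nn.Module):
    """Minimal Gaussian skill-dynamics model q(s' | s, z) (illustrative)."""
    def __init__(self, state_dim, skill_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + skill_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),  # mean of the state delta s' - s
        )

    def log_prob(self, s, z, s_next):
        mu = self.net(torch.cat([s, z], dim=-1))
        # Unit-variance Gaussian on the state delta, for simplicity.
        return torch.distributions.Normal(mu, 1.0).log_prob(s_next - s).sum(-1)

def dads_intrinsic_reward(model, s, z, s_next, prior, L=100):
    """r(s, z, s') = log q(s'|s, z) - log (1/L) sum_i q(s'|s, z_i), z_i ~ p(z).
    High when the transition is predictable under the chosen skill but hard to
    predict under random skills, i.e. the skill is predictable AND distinct."""
    log_q = model.log_prob(s, z, s_next)                  # [batch]
    z_alt = prior.sample((L, s.shape[0]))                 # [L, batch, skill_dim]
    s_rep = s.unsqueeze(0).expand(L, -1, -1)
    sn_rep = s_next.unsqueeze(0).expand(L, -1, -1)
    log_q_alt = model.log_prob(s_rep, z_alt, sn_rep)      # [L, batch]
    log_marginal = torch.logsumexp(log_q_alt, dim=0) - torch.log(
        torch.tensor(float(L)))
    return log_q - log_marginal
```

A uniform prior over the skill space, e.g. `prior = torch.distributions.Uniform(-torch.ones(skill_dim), torch.ones(skill_dim))`, matches the continuous-skill setting described above.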

Skill Discovery using DADS

We demonstrate how DADS can serve as a general-purpose skill-discovery algorithm. We use the MuJoCo environments from OpenAI Gym as our testbed, where we show that DADS can learn diverse (yet predictable) skills without any rewards. We show some of the skills randomly sampled from the latent space learnt via DADS for different agents (a minimal rollout sketch follows the examples):

Half Cheetah

Ant

Humanoid
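Each clip above is generated by sampling a single skill from the prior and holding it fixed while the skill-conditioned policy acts. A minimal sketch, assuming a Gym-style `env` and a `policy` callable (both placeholders):

```python
import numpy as np

def rollout_skill(env, policy, skill_dim, horizon=200, rng=np.random):
    """Sample one skill z from a uniform prior on [-1, 1]^d and execute the
    skill-conditioned policy pi(a | s, z) with z held fixed for the episode."""
    z = rng.uniform(-1.0, 1.0, size=skill_dim)   # fixed for the whole rollout
    s = env.reset()
    states = [s]
    for _ in range(horizon):
        a = policy(np.concatenate([s, z]))       # policy sees state and skill
        s, _, done, _ = env.step(a)
        states.append(s)
        if done:
            break
    return np.array(states)
```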

Online Sequence of Goals

We demonstrate how these skills can be leveraged for downstream tasks. In particular, we choose a challenging locomotion task of following an online sequence of goals, where the agent only observes the current goal. The goal is updated when the agent reaches within an epsilon-ball of the current goal. We leverage the learnt skill-dynamics model to compose skills using model-predictive control, allowing the agent to navigate without any training on the downstream task.
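A rough sketch of this zero-shot planner, using simple random shooting in skill space (the paper uses an MPPI-style planner). Here `skill_dynamics_step` is an assumed helper that uses the learnt skill-dynamics to predict where executing a skill for a fixed number of steps takes the agent, and the first two state coordinates are assumed to be the (x, y) position:

```python
import numpy as np

def plan_skill(skill_dynamics_step, s, goal, skill_dim,
               n_candidates=64, plan_len=4, rng=np.random):
    """Pick the skill to execute next by rolling candidate skill sequences
    through the learnt skill-dynamics and scoring distance to the goal."""
    best_cost, best_z = np.inf, None
    for _ in range(n_candidates):
        zs = rng.uniform(-1.0, 1.0, size=(plan_len, skill_dim))
        s_pred, cost = s, 0.0
        for z in zs:                                  # simulate in the model
            s_pred = skill_dynamics_step(s_pred, z)
            cost += np.linalg.norm(s_pred[:2] - goal)  # (x, y) distance to goal
        if cost < best_cost:
            best_cost, best_z = cost, zs[0]
    return best_z  # execute the first skill, then replan (receding horizon)
```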

Goals (green ellipsoids on the left and 'x' on the right) are updated online, and the agent only sees the current goal. There is no training on the task; that is, the agent solves it zero-shot.

A similar demonstration for the Humanoid agent, which composes its learnt skills to follow the sequence of goals. The set of feasible goal sequences is more restricted than for the Ant; however, skill composition using planning can still be leveraged. The video has been sped up 2x.

Citation

@article{sharma2019dynamics,
  title={Dynamics-aware unsupervised discovery of skills},
  author={Sharma, Archit and Gu, Shixiang and Levine, Sergey and Kumar, Vikash and Hausman, Karol},
  journal={arXiv preprint arXiv:1907.01657},
  year={2019}
}