Policy Continuation with Hindsight Inverse Dynamics

Hao Sun*, Zhizhong Li, Xiaotong Liu, Dahua Lin, Bolei Zhou

* sh018@ie.cuhk.edu.hk

The Chinese University of Hong Kong

Peking University

Abstract

Solving goal-oriented tasks is an important but challenging problem in reinforcement learning (RL). For such tasks, the rewards are often sparse, making it difficult to learn a policy effectively. To tackle this difficulty, we propose a new approach called Policy Continuation with Hindsight Inverse Dynamics (PCHID). This approach learns from Hindsight Inverse Dynamics, which builds on Hindsight Experience Replay, enabling the policy to learn in a self-imitating manner and thus to be trained with supervised learning. This work also extends it to multi-step settings with Policy Continuation. The proposed method is general and can work in isolation or be combined with other on-policy and off-policy algorithms. On two multi-goal tasks, GridWorld and FetchReach, PCHID significantly improves both sample efficiency and final performance.

Key Insights

The key insight behind PCHID lies in the self-imitation and curriculum learning abilities of humans, and most importantly, the ability to learn success from failure, as first introduced by HER. Unlike HER, PCHID provides a supervised learning approach for utilizing hindsight knowledge.
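To make this concrete, the sketch below illustrates one-step Hindsight Inverse Dynamics as described above: each transition in a rollout is relabeled with the goal actually achieved at the next state, and a goal-conditioned policy is fit to the executed action by supervised learning. This is a minimal sketch, not the released implementation; the network architecture, the achieved_goal mapping, and the regression loss (continuous actions) are assumptions made for illustration.

# Minimal sketch of one-step Hindsight Inverse Dynamics (illustrative only).
# Assumes flat continuous state/goal/action vectors and an `achieved_goal`
# function mapping a state to the goal it represents; for discrete actions
# (e.g. GridWorld) a cross-entropy loss would replace the MSE loss.

import torch
import torch.nn as nn

class HindsightInverseDynamics(nn.Module):
    """pi(a | s_t, g), trained on hindsight-relabeled goals g = achieved_goal(s_{t+1})."""
    def __init__(self, state_dim, goal_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, goal):
        return self.net(torch.cat([state, goal], dim=-1))

def hid_update(policy, optimizer, trajectory, achieved_goal):
    """One supervised update from a single rollout.

    trajectory: list of (state, action, next_state) tuples from any policy.
    achieved_goal: maps a state to its goal representation.
    """
    states  = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s, _, _ in trajectory])
    actions = torch.stack([torch.as_tensor(a, dtype=torch.float32) for _, a, _ in trajectory])
    # Hindsight relabeling: pretend the goal was the state actually reached next.
    goals   = torch.stack([torch.as_tensor(achieved_goal(s1), dtype=torch.float32)
                           for _, _, s1 in trajectory])

    pred = policy(states, goals)           # action that "should" reach the relabeled goal
    loss = nn.functional.mse_loss(pred, actions)  # behavior-cloning-style supervised loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

In the multi-step setting, Policy Continuation applies the same relabeling to k-step pairs while keeping the learned policy consistent with its lower-step sub-policies.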

Extension and Future Directions

We attribute the success of PCHID to the extrapolation ability of sub-policies: once you can solve the Tower of Hanoi with 3 disks, it is not hard to learn to solve it with 4. Mastering simple skills makes the agent more aware of the goal during learning and therefore reduces the exploration required. In this sense, PCHID can be interpreted as a form of curriculum learning.

We further investigated the possibility of training PCHID with a synchronous improvement, namely Policy Evolution with Hindsight Inverse Dynamics (PEHID). We also reveal some interesting relations between PEHID and the Ornstein–Uhlenbeck process. That work has been accepted by the OptRL workshop at NeurIPS 2019; the paper and code will be released soon.

Demo Video: Success Cases

Demo Video: Failure Cases