Learn Goal-Conditioned Policy with Intrinsic Motivation for Deep Reinforcement Learning

Abstract

It is important for an agent to autonomously explore its environment and learn a widely applicable, general-purpose goal-conditioned policy that can achieve diverse goals, including images and text descriptions. For such perceptually-specified goals, one natural approach is to reward the agent with a prior non-parametric distance over the embedding spaces of states and goals. However, this may be infeasible in some situations, either because it is unclear how to choose a suitable metric, or because embedding (heterogeneous) goals and states is non-trivial. The key insight of this work is that we introduce a latent-conditioned policy to provide goals and intrinsic rewards for learning the goal-conditioned policy. As opposed to directly scoring current states with regard to goals, we obtain rewards by scoring current states with their associated latent variables. We theoretically characterize the connection between our unsupervised objective and the multi-goal setting, and empirically demonstrate the effectiveness and efficiency of our proposed method, which substantially outperforms prior techniques on a variety of robotic tasks.
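To make the core idea concrete, the sketch below shows one common way such a latent-scoring intrinsic reward can be computed: a discriminator q(z|s) scores the current state against its associated latent variable, and the reward is the discriminator's log-probability relative to a uniform prior over latents. This is a minimal illustrative sketch under our own assumptions (the class names, the linear discriminator, and the uniform prior are all hypothetical), not the authors' implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class LatentDiscriminator:
    """Toy linear classifier q(z | s) scoring states against K latent variables.

    In practice this would be a trained neural network; here it is a
    randomly initialized linear map, purely for illustration.
    """
    def __init__(self, state_dim, num_latents, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(state_dim, num_latents))

    def log_prob(self, state, z):
        # log q(z | s) under a softmax over latent logits.
        logits = state @ self.W
        return float(np.log(softmax(logits)[z] + 1e-8))

def intrinsic_reward(disc, state, z, num_latents):
    # r(s, z) = log q(z | s) - log p(z), with a uniform prior p(z) = 1/K.
    # The reward is high when the state is recognizably produced by latent z.
    return disc.log_prob(state, z) - np.log(1.0 / num_latents)
```

With a uniform prior, the reward is positive exactly when the discriminator assigns latent z more than chance probability given the state, which pushes the policy toward states that are distinguishable per latent goal.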


Illustrative example

Gridworld (learning process). The agent's goal is to navigate from the center to the given colored room.

The y-axis shows the percentage of episodes in which the agent reaches each colored room.

Learned goal-conditioned behaviors:

2D Navigation (discrete and continuous)

The open circle denotes the goal, and the line with a small solid circle at its end indicates a trajectory.






Object Manipulation

The line with a small solid circle at the end indicates the trajectory.

Push blue circle to [6.7, 8.0].

Push green square to [5.3, 3.9].

Push blue triangle to [6.3, 5.8].

Push red square to [8.7, 9.2].

State Imitation:

Swimmer (Left: goal behavior; Middle: GPIM behavior; Right: stacked view of goal and GPIM behaviors.)

HalfCheetah (Left: goal behavior; Middle: GPIM behavior; Right: stacked view of goal and GPIM behaviors.)

Robot (Left: goal behavior; Middle: GPIM behavior; Right: stacked view of goal and GPIM behaviors.)

Montezuma's Revenge (Left: goal behavior; Middle: GPIM behavior; Right: stacked view of goal and GPIM behaviors.)

Seaquest (Left: goal behavior; Middle: GPIM behavior; Right: stacked view of goal and GPIM behaviors.)

Berzerk (Left: goal behavior; Middle: GPIM behavior; Right: stacked view of goal and GPIM behaviors.)

ADDITIONAL EXPERIMENT on temporally-extended tasks (imitating trajectories of a hand)