Visit the camera-ready site (preview below) here: https://penn-pal-lab.github.io/peg/

Below is the website used during the rebuttal.

(Left) Our method, PEG, efficiently explores the maze (blue dots are visited states) by setting goals (red dots) that maximize exploration. (Right) PEG explores more efficiently than similar goal-setting baselines by planning goals with a world model.

Abstract

Dropped into an unknown environment, what should an agent do to quickly learn about the environment and how to accomplish diverse tasks within it? We address this question within the goal-conditioned reinforcement learning paradigm, by identifying how the agent should set its goals at training time to maximize exploration. We propose "planning exploratory goals" (PEG), a method that sets goals for each training episode to directly optimize an intrinsic exploration reward. PEG first chooses goal commands such that the agent's goal-conditioned policy, at its current level of training, will end up in states with high exploration potential. It then launches an exploration policy starting at those promising states. To enable this direct optimization, PEG learns world models and adapts sampling-based planning algorithms to "plan goal commands". In challenging simulated robotics environments including a multi-legged ant robot in a maze, and a robot arm on a cluttered tabletop, PEG exploration enables more efficient and effective training of goal-conditioned policies relative to baselines and ablations. Our ant successfully navigates a long maze, and the robot arm successfully builds a stack of three blocks upon command.
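To make the goal-planning step concrete, below is a minimal NumPy sketch of the idea: a sampling-based optimizer (a cross-entropy-style loop standing in for the paper's MPPI-based planner) scores candidate goal commands by the exploration value of the states that the goal-conditioned policy is imagined to reach under the world model. The toy world model, the toy novelty reward, and all names and shapes are illustrative assumptions, not the actual PEG implementation.

# Minimal sketch of "planning exploratory goals" with a cross-entropy-style
# sampler. The toy world model and toy novelty reward are illustrative
# stand-ins, not the released PEG code.
import numpy as np

GOAL_DIM, STATE_DIM, HORIZON = 2, 4, 15
rng = np.random.default_rng(0)

def rollout_goal_policy(state, goal, steps=HORIZON):
    """Toy stand-in for imagining the goal-conditioned policy in a world model:
    the state drifts toward the commanded goal with some noise."""
    states = []
    for _ in range(steps):
        state = state + 0.1 * (np.pad(goal, (0, STATE_DIM - GOAL_DIM)) - state)
        state = state + 0.01 * rng.standard_normal(STATE_DIM)
        states.append(state)
    return np.stack(states)

def exploration_value(states):
    """Toy novelty score (standing in for, e.g., ensemble disagreement):
    here, imagined states far from the start region score higher."""
    return np.linalg.norm(states, axis=-1).sum()

def plan_exploratory_goal(init_state, iters=5, pop=256, elites=32):
    """Sampling-based optimization over goal commands: keep the goals whose
    imagined goal-conditioned rollouts score highest on the exploration reward."""
    mean, std = np.zeros(GOAL_DIM), np.ones(GOAL_DIM)
    for _ in range(iters):
        goals = mean + std * rng.standard_normal((pop, GOAL_DIM))
        scores = np.array([exploration_value(rollout_goal_policy(init_state, g))
                           for g in goals])
        elite = goals[np.argsort(scores)[-elites:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mean  # goal command for the next training episode

print(plan_exploratory_goal(np.zeros(STATE_DIM)))

The highest-scoring goal command would then be issued for the next training episode, after which an exploration policy takes over from wherever the goal-conditioned policy ends up.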

Block Stacking

Only PEG is able to make meaningful progress in the block stacking task, as seen in the training curves.

Below, we evaluate PEG and other methods on different 3-block stacking goals (colored orbs). While PEG and the baselines can pick up blocks, only PEG is able to complete the task.

PEG Eval #1

PEG Eval #2

PEG Eval #3

PEG Eval #4

Baseline Eval #1

Baseline Eval #2

Baseline Eval #3

Baseline Eval #4

PEG Exploration

Next, we analyze PEG's training-time exploration episodes. We are especially interested in episodes in which 3-block towers appear, so we visualize those episodes.

Right GIF: PEG proposes a goal state in which the red and blue blocks are in the air. The goal-conditioned policy holds both blocks in midair between the red and blue orbs, which results in a 3-block stack.

Below, we visualize some more training episodes with 3-block states. 

Note that the robot is rendered invisible for visualization purposes.

PEG Train #1

PEG Train #2

PEG Train #3

PEG Train #4

Ant Maze

The agent must control a high-dimensional (30-D) ant robot through a maze. The episode length is 500 timesteps, making it a long-horizon task. As seen on the right, PEG outperforms other methods in both learning speed and the optimality of the learned goal-conditioned policy.


Below, we visualize goal-conditioned policy behavior of PEG at convergence: it is able to reliably reach all 8 goals.

Top Row: Goal states; we evaluate methods on their ability to reach goals varying in position and orientation.

Bottom row: PEG's goal-conditioned policy's behavior for the corresponding goal (shown above in the same column). We show a black image once the agent achieves the goal.

Exploration Visualization

Below, we visualize the goals chosen by various methods (red dots) and the states they explored (green dots) halfway through training. PEG explores the deepest part of the maze, whereas other methods barely reach the middle.

Why does PEG propose goals near the start (bottom left)? This figure is only a 2D projection of the exploration onto the XY plane. The state space of the Ant Maze is actually 29-dimensional, and when we inspect the goals sampled in the bottom left, we find that they are in fact very interesting in the ant's joint positions: PEG proposes goals that make the ant hover, flip sideways or upside down, or clip into the ground.

Point Maze  & Walker

PEG compares favorably to baselines in 2D point-maze navigation and Walker locomotion.

Below, we visualize the goal-conditioned policies trained by PEG.

Point Maze

Walker

Point Maze: PEG is able to reach the top-right corner, the hardest goal to reach from the bottom-left start.

Walker: (Top row) Goal poses set ±6 and ±13 meters away from the walker's initial position. (Bottom row) PEG's goal-conditioned policy conditioned on the corresponding goal.

Exploration Visualization

Similar to the Ant Maze visualization, we plot the explored states (blue / green dots) and chosen goals (red dots) for PEG and MEGA. PEG explores the environment more quickly than the baseline, as seen in the top-right region of the maze and on the left- and right-hand sides of Walker.

State Sampling Experiment

We ran an experiment in the Ant Maze to test the effect of state-space sampling versus replay-buffer sampling.


First, we create a variant of MEGA that samples goals from the full state space; we call this variant MEGA-Unrestricted. Next, we create a variant of PEG that is restricted to sampling from the replay buffer to initialize the MPPI optimizer. We plot the mean and standard error over 5 seeds.
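As a rough sketch of what this ablation varies, the snippet below contrasts the two ways candidate goals can be proposed to a sampling-based optimizer: drawn from anywhere in the goal space, or drawn only from states already stored in the replay buffer. The dimensions, bounds, and buffer contents are hypothetical placeholders, not the actual experiment code.

# Sketch of the two candidate-goal proposal schemes compared in this ablation;
# the bounds, buffer, and shapes are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(0)
GOAL_DIM, N_CANDIDATES = 29, 512

# Hypothetical per-dimension goal-space bounds and a replay buffer of visited goals.
goal_low, goal_high = -np.ones(GOAL_DIM), np.ones(GOAL_DIM)
replay_buffer_goals = 0.1 * rng.standard_normal((10_000, GOAL_DIM))  # clustered near the start

def init_unrestricted():
    """Unrestricted variant: seed the optimizer anywhere in the goal space."""
    return rng.uniform(goal_low, goal_high, size=(N_CANDIDATES, GOAL_DIM))

def init_buffer_restricted():
    """Buffer-restricted variant: seed the optimizer only with previously visited goals."""
    idx = rng.integers(len(replay_buffer_goals), size=N_CANDIDATES)
    return replay_buffer_goals[idx]

print(init_unrestricted().shape, init_buffer_restricted().shape)

In this toy setup, the buffer-restricted proposals stay near the start region by construction, while the unrestricted proposals cover the whole goal space.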


We can see that MEGA-Unrestricted performs much worse than the original MEGA, validating that this method (representative of others in this vein, including Skew-Fit) does not support unrestricted optimization over the goal space, unlike our approach PEG.