Implicit Subgoal Planning with Variational Autoencoders for Long-Horizon Sparse Reward Robotic Tasks

Fangyuan Wang, Anqing Duan, Peng Zhou, Shengzeng Huo, Guodong Guo, Chenguang Yang, and David Navarro-Alarcon

Abstract

The challenges inherent to long-horizon robotic tasks persist largely due to the inefficient exploration and sparse rewards typical of traditional reinforcement learning approaches. To alleviate these challenges, we introduce a novel algorithm, Variational Autoencoder-based Subgoal Inference (VAESI), which accomplishes long-horizon tasks through a divide-and-conquer approach. VAESI consists of three components: a Variational Autoencoder (VAE)-based Subgoal Generator, a Hindsight Sampler, and a Value Selector. The VAE-based Subgoal Generator is inspired by the observation that humans can not only infer subgoals to decompose a long-horizon task but also reason about the final result given those subgoals. It contains an explicit encoder model for generating subgoals and an implicit decoder model for improving the quality of the generated subgoals. The Hindsight Sampler draws valid subgoals from an offline dataset to enhance the feasibility of the generated subgoals, and the Value Selector uses the value function from reinforcement learning to filter the optimal subgoals from the candidates. We evaluate our approach on several long-horizon tasks, including one locomotion task and three manipulation tasks, in both simulation and the real world. Quantitative and qualitative results indicate that our approach achieves promising performance, in terms of both subgoal effectiveness and overall task success, compared to baseline methods.
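To make the pipeline concrete, below is a minimal, hypothetical PyTorch sketch of the VAE-based Subgoal Generator and the Value Selector. All class, function, and parameter names (SubgoalVAE, select_subgoal, state_dim, n_candidates, etc.) are illustrative assumptions, not the authors' implementation, and the training-time Hindsight Sampler (which draws valid subgoals from an offline dataset) is omitted.

```python
# Hypothetical sketch of VAESI's subgoal generation and selection.
# All names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class SubgoalVAE(nn.Module):
    """Explicit encoder infers a subgoal from (state, goal);
    implicit decoder reconstructs the goal from (state, subgoal)."""
    def __init__(self, state_dim, goal_dim, latent_dim=16, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),           # mean and log-variance
        )
        self.subgoal_head = nn.Sequential(
            nn.Linear(latent_dim + state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, goal_dim),                  # subgoal lives in goal space
        )
        self.decoder = nn.Sequential(                     # implicit model: subgoal -> goal
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, goal_dim),
        )

    def forward(self, state, goal):
        mu, log_var = self.encoder(torch.cat([state, goal], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterization
        subgoal = self.subgoal_head(torch.cat([z, state], -1))
        goal_rec = self.decoder(torch.cat([state, subgoal], -1))
        return subgoal, goal_rec, mu, log_var

def select_subgoal(vae, value_fn, state, goal, n_candidates=64):
    """Value Selector: sample candidate subgoals from the generator and
    keep the one scored highest by a learned value function V(state, subgoal).
    Expects state and goal as (1, dim) tensors."""
    s = state.expand(n_candidates, -1)
    g = goal.expand(n_candidates, -1)
    with torch.no_grad():
        candidates, _, _, _ = vae(s, g)
        values = value_fn(s, candidates).squeeze(-1)
    return candidates[values.argmax()]
```

In this reading, the decoder's reconstruction of the final goal from the sampled subgoal is what lets the generator "reason about the final result given the subgoals": the reconstruction loss pushes subgoals toward states that remain consistent with the desired goal.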

(Video: vaesi4.mp4)

Method

Qualitative Results

This section demonstrates the feasibility and optimality of the inferred subgoals through qualitative results.

Feasibility

Subgoal sequences generated by the option policy from the initial state to the desired goal on AntMaze, PushOverObstacle, Stack, and Push4. The red sphere in AntMaze and the transparent cubes in the manipulation tasks mark the subgoals at the current timestep.

Optimality

Subgoal heatmap of the high-dimensional subgoal space, visualized with the t-SNE algorithm. Colors are assigned according to the magnitude of the V value; darker colors indicate better subgoals. The subgoals generated by the option policy (marked with green crosses) concentrate in regions with high V values.
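For readers who want to reproduce this kind of heatmap, the sketch below shows the general recipe using scikit-learn's TSNE and matplotlib. The subgoals and v_values arrays here are random placeholders standing in for sampled subgoals and their V values, not data from the paper.

```python
# Hypothetical recipe for the subgoal heatmap: embed high-dimensional
# subgoals with t-SNE and color each point by its V value.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
subgoals = rng.normal(size=(500, 10))         # placeholder high-dim subgoals
v_values = -np.linalg.norm(subgoals, axis=1)  # placeholder V(s, subgoal) scores

# Project subgoals to 2D for plotting.
xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(subgoals)
plt.scatter(xy[:, 0], xy[:, 1], c=v_values, cmap="viridis", s=8)
plt.colorbar(label="V value")

# Overlay the subgoals chosen by the policy (here: simply the top-scoring ones).
chosen = np.argsort(v_values)[-20:]
plt.scatter(xy[chosen, 0], xy[chosen, 1], marker="x", color="green")
plt.title("t-SNE of subgoal space colored by V value")
plt.show()
```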

Evolution of the V values of subgoals in the PushOverObstacle task, from the initial state to the desired goal. The last image aggregates the entire evolution of the highest V value in the subgoal space.

Simulation

In this section, we show trajectories of subgoals in four tasks: AntMaze, PushOverObstacle, Stack, and Push4.

AntMaze

AntMaze is a locomotion task set in a U-shaped maze, in which an 8-degree-of-freedom ant learns through repeated trials to reach the final goal (the green sphere).

Trajectory 1

Trajectory 2

Trajectory 3

Trajectory 4

PushOverObstacle

The PushOverObstacle task is an extension of the Push task, in which the initial state (green square) and the desired goal (green sphere) are uniformly spawned on opposite sides of an obstacle. The obstacle is a static, immovable wall marked in red. The robot must push the block around the obstacle to reach the desired position.

Trajectory 1

Trajectory 2

Trajectory 3

Trajectory 4

Stack

The Stack task consists of picking up three blocks and stacking them in sequence at the target location.

Trajectory 1

Trajectory 2

Trajectory 3

Trajectory 4

Push4

The Push4 task combines four block-pushing tasks, requiring the robot to push four blocks to their respective target positions.

Trajectory 1

Trajectory 2

Trajectory 3

Trajectory 4

Real World

This section shows full episodes (at 8x speed) of policies trained with VAESI running on real robots. The videos are grouped by manipulation task, and each video within a group shows a different initial state and final goal. Green spherical shadows represent the final goal, and transparent squares represent the subgoals generated during execution.

PushOverObstacle

Push 1 block over obstacle from top

Push 1 block over obstacle from bottom

Stack

Stack 3 blocks: case 1

Stack 3 blocks: case 2

Stack 3 blocks: case 3

Stack 3 blocks: case 4

Stack 3 blocks: case 5

Stack 3 blocks: case 6

Push4

Push 4 blocks into the shape "\"

Push 4 blocks into the shape "/"

Push 4 blocks into the shape "\_/"

Push 4 blocks into the shape "\/\"

Push 4 blocks into the shape "---"

Push 4 blocks to four corners: case 1

Push 4 blocks to four corners: case 2

Push 4 blocks to four corners: case 3

Push 4 blocks to four corners: case 4

Push 4 blocks to four corners: case 5

Push 4 blocks to four corners: case 6