Fangyuan Wang, Anqing Duan, Peng Zhou, Shengzeng Huo, Guodong Guo, Chenguang Yang, and David Navarro-Alarcon
The challenges inherent in long-horizon robotic tasks persist due to the inefficient exploration and sparse rewards typical of traditional reinforcement learning approaches. To alleviate these challenges, we develop a novel algorithm, Variational Autoencoder-based Subgoal Inference (VAESI), which tackles long-horizon tasks with a divide-and-conquer approach. VAESI aims to generate a sequence of subgoals that enable the robot to achieve the final goal, and we employ two criteria, feasibility and optimality, to guarantee the quality of the generated subgoals. VAESI consists of three components: a Variational Autoencoder (VAE)-based Subgoal Generator, a Hindsight Sampler, and a Value Selector. The VAE-based Subgoal Generator uses an explicit model to infer subgoals and an implicit model to predict the final goal, inspired by the way humans infer subgoals from the current state and final goal while also reasoning about the final goal conditioned on the current state and a given subgoal. The Hindsight Sampler selects valid subgoals from an offline dataset to enhance the feasibility of the generated subgoals, while the Value Selector uses the value function from reinforcement learning to filter the optimal subgoals from the candidates.
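For concreteness, the following is a minimal sketch of how the three components could fit together at inference time, assuming a simple MLP-based generator and a learned goal-conditioned value function. All layer sizes, the candidate count, and the function and class names are illustrative assumptions, not the released implementation; only the overall explicit/implicit/value-selection structure follows the description above.

```python
# Minimal sketch of VAESI-style subgoal selection (illustrative, not the authors' code).
import torch
import torch.nn as nn

STATE_DIM, GOAL_DIM, LATENT_DIM, K = 42, 3, 8, 16  # assumed dimensions


class ExplicitSubgoalGenerator(nn.Module):
    """Explicit model: infer a subgoal from (state, final goal) and a latent sample z."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + GOAL_DIM + LATENT_DIM, 128), nn.ReLU(),
            nn.Linear(128, GOAL_DIM))

    def forward(self, state, goal, z):
        return self.net(torch.cat([state, goal, z], dim=-1))


class ImplicitGoalPredictor(nn.Module):
    """Implicit model: predict the final goal from the current state and a subgoal."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + GOAL_DIM, 128), nn.ReLU(),
            nn.Linear(128, GOAL_DIM))

    def forward(self, state, subgoal):
        return self.net(torch.cat([state, subgoal], dim=-1))


def select_subgoal(state, goal, generator, value_fn):
    """Sample K candidate subgoals and keep the one preferred by the value function.
    The Hindsight Sampler would additionally restrict candidates to subgoals observed
    in the offline dataset (feasibility); only the Value Selector step (optimality)
    is shown here."""
    s, g = state.expand(K, -1), goal.expand(K, -1)
    z = torch.randn(K, LATENT_DIM)          # latent samples yield diverse candidates
    candidates = generator(s, g, z)
    values = value_fn(s, candidates)        # V(state, subgoal) for each candidate
    return candidates[values.argmax()]


# Toy usage with an untrained generator and a stand-in value function.
gen = ExplicitSubgoalGenerator()
vf = lambda s, sg: -((sg - s[:, :GOAL_DIM]) ** 2).sum(dim=-1)
subgoal = select_subgoal(torch.zeros(1, STATE_DIM), torch.ones(1, GOAL_DIM), gen, vf)
```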
This section demonstrates the feasibility and optimality of the inferred subgoals through qualitative results.
Subgoal sequences generated by the option policy from the initial state to the desired goal on AntMaze, PushOverObstacle, Stack, and Push4. The red sphere in AntMaze and the transparent cubes in the manipulation tasks mark the subgoals at the current timestep.
Subgoal sequences generated by the option policy in state-based environments. The transparent cubes in the Stack and Push tasks, as well as the red and blue cubes in the OpenDrawer and Store tasks, represent subgoals generated by VAESI.
Subgoal sequences generated by the option policy in image-based environments. The images outlined with yellow dashed rectangles (bottom) are the subgoals corresponding to the current observations (top).
Subgoal heatmap of the high-dimensional subgoal space visualized with the t-SNE algorithm. Colors are assigned according to the magnitude of the V value, with darker colors indicating better subgoals. The subgoals generated by the option policy (marked with green crosses) are distributed in regions with high V values.
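A heatmap of this kind can be produced with standard tooling. The sketch below, which uses placeholder arrays in place of the actual subgoal candidates, their V values, and the option policy's outputs, projects the subgoals with scikit-learn's t-SNE and colors each point by its value; the array names and plotting choices are assumptions.

```python
# Illustrative t-SNE heatmap of a subgoal space, colored by V value.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder data standing in for real subgoal candidates and their values.
subgoals = np.random.randn(500, 42)                       # high-dimensional subgoals
v_values = np.random.rand(500)                            # V(s, subgoal) estimates
generated_idx = np.random.choice(500, size=20, replace=False)  # option-policy subgoals

xy = TSNE(n_components=2, perplexity=30).fit_transform(subgoals)

plt.scatter(xy[:, 0], xy[:, 1], c=v_values, cmap="viridis", s=10)
plt.colorbar(label="V value")
plt.scatter(xy[generated_idx, 0], xy[generated_idx, 1],
            marker="x", color="green", label="subgoals from option policy")
plt.legend()
plt.show()
```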
In this section, we show the trajectories of subgoals in four tasks: AntMaze, PushOverObstacle, Stack, and Push4.
The Stack task consists of picking up three blocks and placing them, in sequence, at the target location.
The Push4 task combines four block-pushing subtasks, requiring the robot to push four blocks to their target positions.
The robot needs to open the drawer by first removing the block and then pulling the handle. We evaluate VAESI in both state-based and image-based environments. In the state-based environment, the observation space is 42-dimensional and contains information about the drawer, the handle, the block, and the gripper. In the image-based environment, the observation is a 3×48×48 image. The task is challenging because it involves multiple skills (pushing the block and pulling the handle) as well as task dependencies.
This task requires the robot to first open the drawer, then pick up the block and place it into the drawer, and finally close the drawer. Multiple skills, such as pushing, picking, placing, and grasping, are involved. As with the OpenDrawer task, we conduct experiments in both state-based and image-based environments, and the observation spaces are the same as in the OpenDrawer task.
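For reference, the two observation spaces described above could be declared as in the following sketch, which uses Gymnasium's Box spaces; the bounds, dtypes, and variable names are assumptions rather than the environments' actual definitions.

```python
# Illustrative declaration of the state-based and image-based observation spaces.
import numpy as np
from gymnasium import spaces

# 42-dimensional state observation (drawer, handle, block, and gripper information).
state_obs_space = spaces.Box(low=-np.inf, high=np.inf, shape=(42,), dtype=np.float32)

# 3x48x48 image observation for the image-based setting.
image_obs_space = spaces.Box(low=0, high=255, shape=(3, 48, 48), dtype=np.uint8)
```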
This section shows full episodes (8x speed) of policies trained with VAESI running on real robots. The videos are grouped by manipulation task, and each video within a group shows a different initial state and final goal. The green spherical shadows and transparent squares represent the final goal and the subgoals generated during execution, respectively.
Stack 3 blocks: case 1
Stack 3 blocks: case 3
Stack 3 blocks: case 5
Stack 3 blocks: case 2
Stack 3 blocks: case 4
Stack 3 blocks: case 6
Push 4 blocks to shape of "\"
Push 4 blocks to shape of "/"
Push 4 blocks to shape of "\_/"
Push 4 blocks to shape of "\/\"
Push 4 blocks to shape of "---"
Push 4 blocks to four corners: case 1
Push 4 blocks to four corners: case 3
Push 4 blocks to four corners: case 5
Push 4 blocks to four corners: case 2
Push 4 blocks to four corners: case 4
Push 4 blocks to four corners: case 6