HRL4IN: Hierarchical Reinforcement Learning for Interactive Navigation with Mobile Manipulators

Chengshu (Eric) Li, Fei Xia, Roberto Martín-Martín, Silvio Savarese

Stanford University

{chengshu, feixia, robertom, ssilvio}

CoRL 2019


In recent years, reinforcement learning has achieved a new level of success in solving robotics tasks. However, such success has mostly been restricted to either navigation or manipulation with stationary arms. We believe that many common tasks in real human environments require an agent capable of both navigation and manipulation, and sometimes a simultaneous combination of the two.

A common case is robot navigation in realistic indoor environments. This task requires the agent to interact with the environment, for example by opening doors, pressing buttons, or pushing obstacles away. We call this family of navigation tasks that require interactions Interactive Navigation. Solving Interactive Navigation requires a mobile base with interactive capabilities, a so-called mobile manipulator. Such an agent can change its location in the environment through navigation and change the environment's configuration through manipulation.

Interactive Navigation tasks are intrinsically multi-phase or heterogeneous. In some phases, the agent may only need to navigate using its base alone, or manipulate using its arm alone. In other phases, however, the agent may need to move its base and arm in coordination for a subtask (say, opening a door). A naive use of the entire embodiment for every phase of the task can lead to energy inefficiency and undesired collisions.


In this work, we present HRL4IN, a hierarchical reinforcement learning solution for Interactive Navigation that learns an efficient use of pure navigation, pure manipulation and their combination in different phases of the task.

Our solution is composed of two levels of hierarchy (depicted below).

The high-level (HL) policy operates at a coarser time scale than the low-level (LL) policy. It receives the observation from the environment and generates a high-level action, conditioned on the final goal. This high-level action includes a subgoal, which represents a desired change of certain components of the observation (e.g. the position of the agent's end-effector), and an embodiment selector, which commands which parts of the embodiment to use for this subgoal: navigation (base-only), manipulation (arm-only), or both. The high-level action is fixed for the next T time steps.
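The coarser time scale of the high-level policy can be sketched as follows. This is a minimal sketch with our own names (`HighLevelAction`, `rollout_sketch`, the embodiment constants), not the paper's code: the HL policy emits a subgoal and an embodiment selector, which are then held fixed for T low-level steps.

```python
from dataclasses import dataclass

import numpy as np

# Embodiment selector values (illustrative encoding, ours).
BASE_ONLY, ARM_ONLY, BASE_ARM = 0, 1, 2

@dataclass
class HighLevelAction:
    subgoal: np.ndarray   # desired change of certain observation components
    embodiment: int       # BASE_ONLY, ARM_ONLY, or BASE_ARM

def rollout_sketch(env, hl_policy, ll_policy, T=10, horizon=100):
    """One episode: the HL policy acts every T steps, the LL policy every step."""
    obs = env.reset()
    hl_action = None
    for t in range(horizon):
        if t % T == 0:                  # HL operates at a coarser time scale
            hl_action = hl_policy(obs)  # conditioned on the final goal
        obs, done = env.step(ll_policy(obs, hl_action))
        if done:
            break
    return obs
```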

The low-level (LL) policy receives the same observation and generates a robot command (e.g. joint velocity), conditioned on the last high-level action. The low-level policy is rewarded by the high-level policy for getting closer to the subgoal with what we call the intrinsic reward. The embodiment selector plays two crucial roles for the low-level policy. First, it is used to compute an action mask that deactivates components of the embodiment that are not needed for the current subgoal (e.g. if base-only is selected, the joint velocity of the end-effector will be set to 0). Second, it is used to compute a subgoal mask that ignores certain dimensions of the subgoal when computing the intrinsic reward and deciding whether the low-level policy has converged to the subgoal (e.g. if base-only is selected and the subgoal is a desired position of the end-effector, the z-dimension of the subgoal will be ignored because the robot is not able to change the height of its end-effector).
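The two masks can be illustrated with a minimal sketch. The command layout `[v_base, w_base, arm joint velocities...]`, the function names, and the 3D end-effector subgoal are our illustrative assumptions; only the base-only/arm-only/base+arm logic follows the description above.

```python
import numpy as np

# Embodiment selector values (illustrative encoding, ours).
BASE_ONLY, ARM_ONLY, BASE_ARM = 0, 1, 2

def action_mask(cmd, embodiment, n_base=2):
    """Zero out command components that the selected embodiment does not use."""
    cmd = np.asarray(cmd, dtype=float)
    mask = np.ones_like(cmd)
    if embodiment == BASE_ONLY:
        mask[n_base:] = 0.0   # deactivate arm joint velocities
    elif embodiment == ARM_ONLY:
        mask[:n_base] = 0.0   # deactivate base velocities
    return cmd * mask

def intrinsic_reward(ee_pos, prev_ee_pos, subgoal, embodiment):
    """Reward progress toward an end-effector subgoal [x, y, z]."""
    # With base-only, the z-dimension is masked out: the base alone
    # cannot change the end-effector height.
    dims = np.array([1.0, 1.0, 0.0]) if embodiment == BASE_ONLY else np.ones(3)
    prev_d = np.linalg.norm((np.asarray(prev_ee_pos) - subgoal) * dims)
    d = np.linalg.norm((np.asarray(ee_pos) - subgoal) * dims)
    return prev_d - d
```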

Environment Setup

We tested our algorithm in two environments.

Interactive ToyEnv is a grid-world environment with discrete state and action space. It consists of k x k cells, where each cell can be free space, wall, door, or occupied by the agent. The agent is a simplification of a mobile manipulator that can navigate across different cells and open the door if it is positioned at the cell in front of the door and facing it.
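The door-opening rule can be sketched as a toy function. The cell encoding, direction table, and function name below are our own, not the paper's implementation; the rule itself (the agent must occupy the cell in front of the door and face it) follows the description above.

```python
# Illustrative cell types for the grid world (encoding is ours).
FREE, WALL, DOOR = 0, 1, 2
DIRS = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

def try_open_door(grid, agent_pos, agent_dir):
    """Open the door iff the agent faces it from the adjacent cell."""
    r, c = agent_pos
    dr, dc = DIRS[agent_dir]
    fr, fc = r + dr, c + dc                       # cell the agent is facing
    in_bounds = 0 <= fr < len(grid) and 0 <= fc < len(grid[0])
    if in_bounds and grid[fr][fc] == DOOR:
        grid[fr][fc] = FREE                       # the open door is traversable
        return True
    return False
```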

Interactive GibsonEnv is a photo-realistic, physically simulated 3D environment built upon the Gibson Environment, with continuous state and action spaces. The room size is approximately 6m x 9m. The agent is a mobile manipulator embodied as JackRabbot, composed of a 6-DoF Kinova arm mounted on a non-holonomic, two-wheeled Segway mobile base. To simplify the contact physics, we assume the agent grabs the door handle when the end-effector is close enough, by creating a ball joint between the two bodies.
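The simplified grasping rule can be sketched as a proximity check. Here `create_ball_joint` is a hypothetical stand-in for the simulator call that attaches the two bodies (e.g. a point-to-point constraint in a physics engine), and the threshold value is illustrative, not from the paper.

```python
import numpy as np

GRAB_RADIUS = 0.05  # meters; illustrative threshold (our assumption)

def maybe_grab(ee_pos, handle_pos, create_ball_joint):
    """Attach the end-effector to the handle once it is close enough."""
    dist = np.linalg.norm(np.asarray(ee_pos) - np.asarray(handle_pos))
    if dist < GRAB_RADIUS:
        create_ball_joint()   # simulator-side attachment between the two bodies
        return True
    return False
```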

The figures below depict both environments and JackRabbot. Initial positions (red), goal positions (blue) and subgoals (yellow) are visualized.

Interactive ToyEnv

Interactive GibsonEnv



We use PPO for both the high-level and the low-level policies of HRL4IN. We compare HRL4IN with flat PPO and Hierarchical Actor-Critic (HAC) and show that

  1. HRL4IN consistently achieves higher reward than its baselines for the Interactive Navigation task.

  2. HRL4IN uses different parts of the embodiment in different phases of the task. The embodiment selection matches human intuition and helps the agent save energy and avoid unnecessary collisions.
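The split of experience between the two PPO learners can be sketched as follows. This is our simplification with hypothetical names: the low-level policy is trained on the per-step intrinsic rewards, while the high-level policy is trained on the environment reward accumulated over each T-step subgoal interval.

```python
def split_experience(transitions, T):
    """transitions: list of (env_reward, intrinsic_reward) tuples, one per step.

    Returns per-step rewards for the LL learner and per-interval rewards
    for the HL learner; each would feed its own PPO update.
    """
    ll_rewards = [r_in for _, r_in in transitions]
    hl_rewards = [sum(r_env for r_env, _ in transitions[i:i + T])
                  for i in range(0, len(transitions), T)]
    return ll_rewards, hl_rewards
```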

In the following figures, we showcase

  1. reward over time for HRL4IN compared against its baselines

  2. sample trajectories of successfully trained HRL4IN agents

  3. embodiment selection at different positions of the environments

Interactive ToyEnv

Reward over time for HRL4IN and flat PPO

Set the subgoal (yellow) near the door front
Set the subgoal right on the other side of the door
Set the subgoal at the final goal

Probability of using different parts of the embodiment; the high-level policy learns to set base-only subgoals everywhere except when the agent is near the door, where it sets base+arm subgoals. Arm-only usage is excluded because it's overly restrictive and never used.

Interactive GibsonEnv

Reward over time for HRL4IN, flat PPO and HAC

Success rate over time for HRL4IN, flat PPO and HAC



Scatter plot of embodiment selections at different locations along 100 trajectories; the high-level policy learns to set base-only subgoals at most locations to save energy, and base+arm subgoals in front of the door to grasp the door handle.

Navigate closer to the door (base-only)

Grab the door handle (base + arm)

Move backward to open the door (base-only)

Head to the final goal (base-only)

Policy Visualization