Hierarchical Reinforcement Learning under Mixed Observability

Hai Nguyen*, Zhihan Yang*, Andrea Baisero, Xiao Ma, Robert Platt, Christopher Amato

International Workshop on the Algorithmic Foundations of Robotics (WAFR), 2022

Email: nguyen.hai1@northeastern.edu

Abstract

The framework of mixed observable Markov decision processes (MOMDP) models many robotic domains in which some state variables are fully observable while others are not. In this work, we identify a significant subclass of MOMDPs defined by how actions influence the fully observable components of the state and how those, in turn, influence the partially observable components and the rewards. This unique property allows for a two-level hierarchical approach we call HIerarchical Reinforcement Learning under Mixed Observability (HILMO), which restricts partial observability to the top level while the bottom level remains fully observable, enabling higher learning efficiency. The top level produces desired goals to be reached by the bottom level until the task is solved. We further develop theoretical guarantees to show that our approach can achieve optimal and quasi-optimal behavior under mild assumptions. Empirical results on long-horizon continuous control tasks demonstrate the efficacy and efficiency of our approach in terms of improved success rate, sample efficiency, and wall-clock training time. We also deploy policies learned in simulation on a real robot.
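
For reference, the MOMDP state factorization, together with one hedged reading of the subclass property described above, can be written as follows (the notation is ours, not taken verbatim from the paper):

```latex
% MOMDP state: a fully observable part x and a partially observable part y.
\[ s = (x, y), \qquad x \in \mathcal{X}\ \text{(observed)}, \qquad y \in \mathcal{Y}\ \text{(hidden)} \]
% One reading of the subclass property: actions drive x, which in turn drives y and the reward.
\[ x_{t+1} \sim T_x(\cdot \mid x_t, a_t), \qquad y_{t+1} \sim T_y(\cdot \mid y_t, x_{t+1}), \qquad r_t = R(x_t, y_t, a_t) \]
```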

Links: Slides | arXiv (.00898) | Video


Memory-based Top-Level Policy

  • Consumes a high-level observation constructed from the k previous primitive observations emitted while the bottom level acts; the high-level observation is produced by a Summarizer module (a minimal sketch follows this list)

  • Trained on stationary episodes of transitions in which goals are always met, obtained by modifying the transitions generated by the bottom-level policy

  • Outputs a goal for the bottom-level policy to achieve
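
A minimal sketch of how a Summarizer and a memory-based top-level policy might be wired together; the GRU choice, layer sizes, and module names are illustrative assumptions rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class Summarizer(nn.Module):
    """Compresses the k primitive observations emitted while the bottom level
    acts into a single high-level observation (illustrative choice: a GRU)."""
    def __init__(self, obs_dim: int, summary_dim: int = 128):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, summary_dim, batch_first=True)

    def forward(self, primitive_obs_seq: torch.Tensor) -> torch.Tensor:
        # primitive_obs_seq: (batch, k, obs_dim)
        _, last_hidden = self.rnn(primitive_obs_seq)
        return last_hidden.squeeze(0)            # (batch, summary_dim)

class TopLevelPolicy(nn.Module):
    """Maps the summarized high-level observation to a goal for the bottom level."""
    def __init__(self, summary_dim: int, goal_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(summary_dim, 256), nn.ReLU(),
            nn.Linear(256, goal_dim), nn.Tanh(),  # assumes goals live in a bounded box
        )

    def forward(self, high_level_obs: torch.Tensor) -> torch.Tensor:
        return self.net(high_level_obs)
```

At decision time, the top level would summarize the most recent k primitive observations into a high-level observation, emit a goal, and then let the bottom level act for up to k steps before the cycle repeats.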

Memory-less Bottom-Level Policy

  • Achieves any goal commanded by the top-level policy within k timesteps

  • Trained on its own transitions using goal relabeling (Hindsight Experience Replay); a minimal relabeling sketch follows this list
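
A minimal sketch of hindsight goal relabeling for the goal-conditioned bottom level; the "future" relabeling strategy, the transition-tuple layout, and reward_fn are illustrative assumptions:

```python
import numpy as np

def relabel_with_hindsight(episode, reward_fn, k_future=4, rng=np.random):
    """Hindsight Experience Replay, 'future' strategy: for each transition,
    replace its original goal with a state actually reached later in the same
    episode and recompute the reward for that relabeled goal.

    episode: list of (obs, action, reward, next_obs, goal) tuples.
    reward_fn(next_obs, goal): task reward, e.g. 0 if the goal is reached, -1 otherwise.
    """
    relabeled = []
    for t, (obs, action, _, next_obs, _) in enumerate(episode):
        for _ in range(k_future):
            future_t = rng.randint(t, len(episode))   # sample a later step (high is exclusive)
            new_goal = episode[future_t][3]           # treat its next_obs as the achieved goal
            new_reward = reward_fn(next_obs, new_goal)
            relabeled.append((obs, action, new_reward, next_obs, new_goal))
    return relabeled
```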

Roll-out over time

Domains with Information Gathering and Memorization

Two-Boxes

A finger is velocity-controlled on a 1D track to check whether two boxes have the same size. Since the finger is always compliant, it is deflected from the vertical axis when it glides over a box. The agent observes the finger's position and angle but not the positions of the two boxes. An optimal agent must therefore localize both boxes and determine their sizes from the history of angles and positions. If the two boxes have the same size, the agent must move to the right end of the track to receive a non-zero reward; otherwise, it must move to the left end.
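
For concreteness, the reward rule described above can be written as a small check; the variable names and tolerance are illustrative, not taken from the paper's implementation:

```python
def two_boxes_reward(finger_x, boxes_same_size, left_end=0.0, right_end=1.0, tol=0.05):
    """Non-zero reward only at the correct end of the track: the right end if
    the two boxes have the same size, the left end otherwise."""
    if boxes_same_size and abs(finger_x - right_end) < tol:
        return 1.0
    if not boxes_same_size and abs(finger_x - left_end) < tol:
        return 1.0
    return 0.0
```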

Ant-Heaven-Hell

A four-legged ant moving in a 2D T-shaped world receives a non-zero reward for reaching a green area (heaven), which can be at either the left or the right corner of the junction, and a penalty for entering a red area (hell). Only while it is inside the blue ball can it observe which side heaven is on (left/right; null elsewhere). The ant starts at a random position near the bottom corner. An optimal agent must first visit the blue region to observe heaven's side, memorize that side, and then navigate to heaven.
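
The observation rule for heaven's side can be sketched as follows; the +1/-1/0 encoding, radius, and names are illustrative assumptions:

```python
import numpy as np

def heaven_side_observation(ant_xy, blue_xy, heaven_side, blue_radius=1.0):
    """Reveal which side heaven is on (+1 right, -1 left) only while the ant
    is inside the blue region; otherwise return the null observation 0."""
    inside_blue = np.linalg.norm(np.asarray(ant_xy) - np.asarray(blue_xy)) < blue_radius
    return float(heaven_side) if inside_blue else 0.0
```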

Ant-Tag

The ant now has to search for and "tag" a moving opponent by bringing the opponent inside the green area centered at the ant. Both start at random positions, but not too close to each other. The opponent follows a fixed stochastic policy: 75% of the time it moves a constant distance away from the ant, and otherwise it stays in place. The observation includes the joint angles and velocities of the four legs and the opponent's 2D coordinates, which contain the opponent's position only when it is inside the visibility (blue) area centered at the ant.
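
The opponent's fixed stochastic policy and the masked opponent observation can be sketched as follows; the step size, visibility radius, and function names are illustrative assumptions:

```python
import numpy as np

def opponent_step(opponent_xy, ant_xy, step_size=0.3, p_move=0.75, rng=np.random):
    """With probability 0.75, move a constant distance directly away from the
    ant; otherwise stay in place."""
    opponent_xy, ant_xy = np.asarray(opponent_xy), np.asarray(ant_xy)
    if rng.random() < p_move:
        away = opponent_xy - ant_xy
        away = away / (np.linalg.norm(away) + 1e-8)
        return opponent_xy + step_size * away
    return opponent_xy

def observed_opponent(opponent_xy, ant_xy, visibility_radius=1.0):
    """Return the opponent's 2D coordinates only when it is inside the
    visibility area centered at the ant; otherwise a placeholder (zeros)."""
    opponent_xy, ant_xy = np.asarray(opponent_xy), np.asarray(ant_xy)
    if np.linalg.norm(opponent_xy - ant_xy) < visibility_radius:
        return opponent_xy
    return np.zeros(2)
```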

Door-Push

A 3-DoF gripper in 3D must successfully push a door to receive a non-zero reward. The door, however, can only be pushed in one direction (front-to-back or back-to-front), and the correct push direction is unknown to the agent. The agent observes the gripper's joint angles and velocities and the door's angle. At the start of each episode, the door is presented to the gripper, initialized with a random pose. An optimal agent must infer the correct push direction from its history of observations.
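
The reward can be sketched as a threshold check on the door angle; the threshold, the ±1 encoding of the hidden direction, and the names are illustrative assumptions:

```python
def door_push_reward(door_angle, correct_direction, angle_threshold=0.3):
    """Non-zero reward only when the door has rotated past a threshold in the
    hidden correct direction (encoded as +1 or -1)."""
    return 1.0 if correct_direction * door_angle > angle_threshold else 0.0
```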

Results: More Efficient Learning

Train Faster & Explore Better


HILMO trains faster because the top level operates on shorter episodes and the two levels are trained in parallel. HILMO also explores better (the figure above shows the agent's visited positions in Ant-Heaven-Hell).

Sim-to-Real Transfer & Policy Visualization (Two-Boxes)