Leveraging Fully Observable Policies for Learning under Partial Observability

Hai Nguyen, Andrea Baisero, Dian Wang, Christopher Amato, Robert Platt

Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA

Email: nguyen.hai1@northeastern.edu

Abstract

Reinforcement learning in partially observable domains is challenging due to the lack of observable state information. Fortunately, learning offline in a simulator with such state information is often possible. In particular, we propose a method for partially observable reinforcement learning that uses a fully observable policy (which we call a state expert) during offline training to improve online performance. Based on Soft Actor-Critic (SAC), our agent balances performing actions similar to the state expert and getting high returns under partial observability. Our approach can leverage the fully observable policy for exploration and for parts of the domain that are fully observable, while still being able to learn under partial observability. On six robotics domains, our method outperforms pure imitation, pure reinforcement learning, the sequential or parallel combination of both types of learning, and a recent state-of-the-art method in the same setting. A successful policy transfer to a physical robot in a manipulation task from pixels shows our approach's practicality in learning interesting policies under partial observability.

OpenReview | Poster | Code

Video

Motivations

Example

To reach the correct goal object, a fully observable expert takes the red path directly, while a partially observable agent must first take the green path to identify the correct goal object and only then take the red path. Although the expert's behavior is sub-optimal under partial observability, it can still provide successful trajectories for training a partially observable policy.

Offline Training | Online Execution

Assumptions

Cross-Observability Soft Imitation Learning (COSIL)
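As described in the abstract, COSIL builds on SAC and balances imitating the state expert's actions against maximizing return under partial observability. The sketch below is a rough illustration of that trade-off, not the paper's implementation: it assumes Gaussian policies, a history-conditioned agent policy, a frozen state-conditioned expert available only during offline training, and a fixed trade-off weight alpha_d; all function and parameter names are hypothetical.

```python
# Minimal sketch (not the authors' code): a SAC-style actor update that trades off
# return under partial observability against staying close to a state expert.
import torch

def actor_loss(agent_policy, expert_policy, q1, q2, history, state, alpha_d):
    """One imitation-regularized actor step.

    agent_policy(history) -> torch.distributions.Distribution over actions (trainable)
    expert_policy(state)  -> torch.distributions.Distribution over actions (frozen)
    q1, q2                -> twin critics taking (history, action)
    alpha_d               -> fixed weight on the divergence term (assumption)
    """
    dist = agent_policy(history)                  # pi(. | h), conditioned on history only
    action = dist.rsample()                       # reparameterized sample, as in SAC
    q = torch.min(q1(history, action), q2(history, action))

    with torch.no_grad():
        expert_dist = expert_policy(state)        # pi_E(. | s), uses privileged state

    # The divergence term pulls the agent toward the expert's action distribution;
    # the Q term keeps it optimizing return under partial observability.
    divergence = torch.distributions.kl_divergence(dist, expert_dist).mean()
    return alpha_d * divergence - q.mean()
```

A natural refinement, analogous to SAC's automatic entropy-temperature tuning, would be to adapt alpha_d online toward a target divergence instead of keeping it fixed; the sketch holds it constant for brevity.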

Domains Requiring Information Gathering and Memorization

Learning curves of all methods. All agents (except Random and State Expert) are memory-based.

Diverse Baselines

Sim2Real in Block-Picking

Learned Policies