Leveraging Fully Observable Policies for Learning under Partial Observability
Hai Nguyen, Andrea Baisero, Dian Wang, Christopher Amato, Robert Platt
Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA
Email: nguyen.hai1@northeastern.edu
Abstract
Reinforcement learning in partially observable domains is challenging due to the lack of observable state information. Thankfully, learning offline in a simulator with such state information is often possible. In particular, we propose a method for partially observable reinforcement learning that uses a fully observable policy (which we call a state expert) during offline training to improve online performance. Based on Soft Actor-Critic (SAC), our agent balances performing actions similar to the state expert and getting high returns under partial observability. Our approach can leverage the fully-observable policy for exploration and parts of the domain that are fully observable while still being able to learn under partial observability. On six robotics domains, our method outperforms pure imitation, pure reinforcement learning, the sequential or parallel combination of both types, and a recent state-of-the-art method in the same setting. A successful policy transfer to a physical robot in a manipulation task from pixels shows our approach's practicality in learning interesting policies under partial observability.
Motivations
Partially observable (PO) experts are hard to obtain: they require impractical computations, e.g., a sufficient statistic of the entire history such as the belief state, which in turn requires the true environment dynamics
Fully observable (FO) experts (State Expert) are easier to obtain, given access to states during training
Fully observable experts can be useful for training PO policies, e.g., by providing guided exploration or by serving as part of an optimal PO policy
Example
To reach the correct goal object, a fully observable expert takes the red path directly, while a partially observable agent must first take the green path to identify the correct goal object and then take the red path. Although the expert is sub-optimal under partial observability, it can still provide successful trajectories for training a partially observable policy.
Offline Training, Online Execution
A successful RL framework in which an agent can use "privileged" information, e.g., states, during offline training
The resulting policy can be deployed online without needing the privileged information
Has been used in MDPs, partially observable MDPs (POMDPs), and multi-agent settings
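As a concrete illustration of this framework, the sketch below shows one common instantiation, an asymmetric actor-critic, in which the critic consumes the privileged state during offline training while the actor consumes only the observation-action history and can therefore be deployed without state access. The class and layer choices are illustrative assumptions, not the exact architecture used in our method.

    import torch
    import torch.nn as nn

    class AsymmetricActorCritic(nn.Module):
        # Illustrative asymmetric setup: the critic sees the privileged state
        # (training only), while the actor sees only an encoding of the
        # observation-action history and is usable at deployment time.
        def __init__(self, obs_dim, state_dim, act_dim, hidden=128):
            super().__init__()
            self.history_encoder = nn.GRU(obs_dim + act_dim, hidden, batch_first=True)
            self.actor = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                       nn.Linear(hidden, act_dim))
            self.critic = nn.Sequential(nn.Linear(state_dim + act_dim, hidden), nn.ReLU(),
                                        nn.Linear(hidden, 1))

        def act(self, obs_act_history):
            # Online execution needs only the observation-action history.
            _, h = self.history_encoder(obs_act_history)
            return torch.tanh(self.actor(h[-1]))

        def q_value(self, state, action):
            # Offline training may additionally use the privileged state.
            return self.critic(torch.cat([state, action], dim=-1))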
Assumptions
During training, a fully observable expert is given
During training, we can query the action of the expert given a state
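A minimal sketch of what these assumptions amount to in a data-collection loop is given below. All names (StateExpert, env.state(), agent.act(...)) are hypothetical illustrations rather than the paper's interface; the agent executes its own action, and the expert's action is only recorded so that a later imitation term can use it.

    class StateExpert:
        # Fully observable expert, available only during offline training,
        # e.g., a scripted controller or a policy pre-trained on states.
        def act(self, state):
            raise NotImplementedError

    def collect_transition(env, agent, expert, history):
        state = env.state()                # privileged, simulator-only
        expert_action = expert.act(state)  # Assumption 2: queryable at any visited state
        action = agent.act(history)        # the agent acts from its own history
        next_obs, reward, done, info = env.step(action)
        return dict(obs=next_obs, action=action, reward=reward,
                    done=done, expert_action=expert_action)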
Cross-Observability Soft Imitation Learning (COSIL)
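The abstract describes the agent, built on SAC, as balancing two objectives: acting like the state expert and obtaining high returns under partial observability. The sketch below conveys only the flavor of such an actor loss; the specific divergence measure, the automatic adaptation of the imitation weight, and the network details used by COSIL are not reproduced here.

    import torch

    def soft_imitation_actor_loss(policy, q_network, history, expert_action, alpha, beta):
        # alpha: SAC entropy temperature.
        # beta:  imitation weight; how COSIL sets or adapts this weight is not
        #        reproduced here, so a fixed value is used for simplicity.
        dist = policy(history)                  # action distribution given the history
        action = dist.rsample()                 # reparameterized sample, as in SAC
        sac_term = alpha * dist.log_prob(action).sum(-1) - q_network(history, action)
        imitation_term = -dist.log_prob(expert_action).sum(-1)  # pull mass toward the expert action
        return (sac_term + beta * imitation_term).mean()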
Domains Requiring Information Gathering and Memorization
Bumps-1D/Bumps-2D/Car-Flag: The location of the target object (the blue bump, the red bump, or the green flag, respectively) is unknown
Minigrid-Memory: The matching object's type and location are unknown
LunarLander-P/V: Only the position (P) or only the velocity (V) of the agent is observed
Block-Picking: Only one block (the blue one) can be picked; the two blocks are visually identical
Learning curves of all methods. All agents (except Random and State Expert) are memory-based.
Diverse Baselines
ADV-Off: An off-policy version of ADVISOR
DAgger: A common imitation learning baseline
SAC: A recurrent version of Soft Actor-Critic
TD3: A recurrent version of Twin Delayed DDPG
VRM: A state-of-the-art model-based method for POMDPs that uses variational recurrent models
BC2SAC: Pre-trained with a behavior cloning (BC) loss and then trained with SAC losses
BC+SAC: Trained with BC and SAC losses jointly
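For the last two baselines, the difference is only in how the two losses are scheduled. The sketch below illustrates this with hypothetical names (bc_loss, sac_losses, warmup_steps, lam) and loss forms that are assumptions rather than the baselines' exact settings.

    def bc_loss(policy, history, expert_action):
        # Behavior cloning: maximize the likelihood of the expert's action.
        return -policy(history).log_prob(expert_action).sum(-1).mean()

    def bc2sac_update(step, warmup_steps, sac_losses, policy, batch):
        # BC2SAC: pre-train with the BC loss alone, then switch to the SAC losses.
        if step < warmup_steps:
            return bc_loss(policy, batch["history"], batch["expert_action"])
        return sac_losses(policy, batch)

    def bc_plus_sac_update(sac_losses, policy, batch, lam=0.1):
        # BC+SAC: optimize both losses jointly with a fixed mixing weight lam.
        return sac_losses(policy, batch) + lam * bc_loss(policy, batch["history"], batch["expert_action"])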