Deep Reinforcement Learning amidst Lifelong Non-Stationarity

Abstract: As humans, our goals and our environment are persistently changing throughout our lifetime based on our experiences, actions, and internal and external drives. In contrast, typical reinforcement learning problem set-ups consider decision processes that are stationary across episodes. Can we develop reinforcement learning algorithms that cope with the persistent change present in these more realistic problem settings? While on-policy algorithms such as policy gradients can in principle be extended to non-stationary settings, the same cannot be said for more efficient off-policy algorithms, which replay past experiences when learning. In this work, we formalize this problem setting and draw upon ideas from the online learning and probabilistic inference literature to derive an off-policy RL algorithm that can reason about and tackle such lifelong non-stationarity. Our method leverages latent variable models to learn a representation of the environment from current and past experiences, and performs off-policy RL with this representation. We further introduce several simulation environments that exhibit lifelong non-stationarity, and empirically find that our approach substantially outperforms approaches that do not reason about environment shift.
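Off-policy replay is the crux of the difficulty described above: once the environment shifts between episodes, a stored transition is only meaningful together with some description of the MDP it came from. Below is a minimal sketch of that idea, assuming a hypothetical LatentReplayBuffer that tags each transition with the latent embedding inferred for its episode; the class and field names are illustrative and not taken from the paper's implementation.

```python
import random
from collections import namedtuple

# A transition tagged with the latent embedding inferred for its episode.
# Storing z alongside (s, a, r, s') lets an off-policy learner replay data
# gathered under a different MDP without confusing the critic.
Transition = namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "done", "z"]
)


class LatentReplayBuffer:
    """Illustrative replay buffer for non-stationary, latent-conditioned RL."""

    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.storage = []

    def add(self, state, action, reward, next_state, done, z):
        if len(self.storage) >= self.capacity:
            self.storage.pop(0)  # drop the oldest transition
        self.storage.append(Transition(state, action, reward, next_state, done, z))

    def sample(self, batch_size):
        # Sampling across episodes mixes data from many different MDPs;
        # the stored z is what keeps that mixture coherent for the learner.
        return random.sample(self.storage, batch_size)
```

An off-policy learner that samples from such a buffer can condition its actor and critic on the stored latent, so that data gathered under one MDP remains usable after the environment has moved on.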

Left panel: Standard RL-as-Inference. Right panel: Dynamic Parameter MDP (DP-MDP).

Left: The graphical model for the RL-as-Inference framework consists of states s, actions a, and optimality variables O. By incorporating rewards through the optimality variables, learning an RL policy amounts to performing inference in this model. Right: The graphical model for the DP-MDP. Each episode presents a new task, or MDP, determined by latent variables z. The MDPs are further sequentially related through a transition function p(z'|z).
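To make the DP-MDP structure concrete, the sketch below builds a toy instance: a hidden latent z is resampled once per episode from a transition function p(z'|z), here assumed to be a simple Gaussian random walk, and z in turn shifts the reward for the same state and action from one episode to the next. The environment class and its parameters are hypothetical choices made only for illustration.

```python
import numpy as np


class ToyDPMDP:
    """A toy dynamic-parameter MDP: the hidden latent z drifts between episodes."""

    def __init__(self, sigma=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.sigma = sigma          # scale of the latent random walk p(z'|z)
        self.z = self.rng.normal()  # initial latent task parameter
        self.state = 0.0

    def reset(self):
        # Each new episode is a new MDP: z' ~ p(z'|z), here a Gaussian random
        # walk. The agent never observes z directly.
        self.z = self.z + self.sigma * self.rng.normal()
        self.state = 0.0
        return self.state

    def step(self, action):
        # Dynamics and reward depend on the hidden latent: the reward peaks
        # when the chosen action matches the current value of z.
        self.state += action
        reward = -(action - self.z) ** 2
        done = abs(self.state) > 10.0
        return self.state, reward, done
```

Because z changes only between episodes and follows p(z'|z), the sequence of MDPs is related in a predictable way even though no single episode reveals z directly; this is exactly the structure an inference model can exploit.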

An overview of our network architecture. Our method consists of an actor, a critic, an inference network, a decoder network, and a learned prior over latent embeddings, each implemented as a neural network. At execution time, the actor and critic both take as input the latent variables from the learned prior. The reconstruction loss and the learned prior provide additional learning supervision for the inference network.
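As a rough sketch of how these five components might fit together, the PyTorch code below wires up an inference network, a decoder, a learned latent prior, and latent-conditioned actor and critic heads. The layer sizes, the Gaussian parameterization of the latent distributions, and all names are assumptions made for illustration rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn


def mlp(in_dim, out_dim, hidden=128):
    """Two-layer MLP used for every component in this sketch."""
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))


class LatentConditionedAgent(nn.Module):
    """Illustrative wiring of the five components described above."""

    def __init__(self, obs_dim, act_dim, latent_dim=8):
        super().__init__()
        # Inference network q(z | experience): maps a (state, action, reward)
        # summary of recent experience to a Gaussian over the latent z.
        self.inference_net = mlp(obs_dim + act_dim + 1, 2 * latent_dim)
        # Decoder: reconstructs the reward from (state, action, z), giving the
        # reconstruction loss that helps supervise the inference network.
        self.decoder = mlp(obs_dim + act_dim + latent_dim, 1)
        # Learned prior p(z' | z): predicts how the latent shifts between episodes.
        self.latent_prior = mlp(latent_dim, 2 * latent_dim)
        # Actor and critic both condition on the latent in addition to the state.
        self.actor = mlp(obs_dim + latent_dim, act_dim)
        self.critic = mlp(obs_dim + act_dim + latent_dim, 1)

    def infer_latent(self, obs, act, rew):
        mean, log_std = self.inference_net(torch.cat([obs, act, rew], dim=-1)).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.exp())

    def prior_over_next_latent(self, z):
        mean, log_std = self.latent_prior(z).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.exp())

    def act(self, obs, z):
        # At execution time the policy consumes a latent sampled from the
        # learned prior, as described in the overview above.
        return torch.tanh(self.actor(torch.cat([obs, z], dim=-1)))
```

In a full training loop, the inference network's output would be regularized toward the learned prior and supervised by the decoder's reconstruction loss, while the actor and critic are trained with a standard off-policy objective on latent-augmented inputs.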

Experimental Results

Environments

The Sawyer robot has to reach for the goal, which moves between episodes and is unobserved.

The half cheetah agent, which experiences varying wind forces, has to run at a different target velocity in each episode.

The minitaur agent must carry variable payloads and move at a fixed target speed.