INFOrmation Prioritization through EmPOWERment in Visual Model-based RL

Homanga Bharadhwaj, Mohammad Babaeizadeh, Dumitru Erhan, Sergey Levine

Carnegie Mellon University · Google Research, Brain Team · University of California, Berkeley

Abstract

Model-based reinforcement learning (RL) algorithms designed for handling complex visual observations typically learn some form of latent state representation, either explicitly or implicitly. Standard methods of this sort do not distinguish between functionally relevant aspects of the state and irrelevant distractors, instead aiming to represent all available information equally. We propose a modified objective for model-based RL that, in combination with mutual information maximization, allows us to learn representations and dynamics for visual model-based RL without reconstruction in a way that explicitly prioritizes functionally relevant factors. The key principle behind our design is to integrate a term inspired by variational empowerment into a state-space learning model based on mutual information. This term prioritizes information that is correlated with action, thus ensuring that functionally relevant factors are captured first. Furthermore, the same empowerment term also promotes faster exploration during the RL process, especially for sparse-reward tasks where the reward signal is insufficient to drive exploration in the early stages of learning. We evaluate the approach on a suite of vision-based robot control tasks with natural video backgrounds, and show that the proposed prioritized information objective outperforms state-of-the-art model-based RL approaches in both sample efficiency and episodic returns.
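To make the key principle concrete, a schematic form of such a prioritized objective is sketched below. The notation is assumed here for illustration ($z_t$ for the latent state, $o_t$ for the image observation, $a_t$ for the action, $r_t$ for the reward, $\phi$ for the encoder and model parameters), and the exact mutual-information decomposition is a simplified reading rather than the paper's precise formulation: an empowerment-style term tying latent transitions to actions is added alongside the mutual-information terms of a reconstruction-free state-space model.

```latex
\max_{\phi}\;
\underbrace{I(z_{t+1}; a_t \mid z_t)}_{\text{empowerment: prioritizes controllable factors}}
\;+\; \underbrace{I(z_{t+1}; z_t, a_t)}_{\text{forward dynamics}}
\;+\; \underbrace{I(z_t; r_t)}_{\text{reward prediction}}
\;+\; \underbrace{I(z_t; o_t)}_{\text{contrastive encoding (no reconstruction)}}
```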


Information Prioritization for the Latent State-Space Model


Overview of InfoPower. The empowerment objective prioritizes encoding controllable representations in the latent states. The forward information objective helps learn a latent forward dynamics model, so that future latents can be predicted from the current latent state and the current action. The reward objective helps learn a reward prediction model, so that the agent can learn a plan (a sequence of actions) through latent rollouts. The contrastive learning objective facilitates learning an encoder that maps from image observations to latent states. Together, this combination of terms produces a latent state-space model for MBRL that captures all necessary information at convergence, while prioritizing the most functionally relevant factors via the empowerment term. A sketch of how these terms could be combined into a single training loss follows below.
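The following is a minimal PyTorch-style sketch of one way the four terms above could be combined into a single training loss, not the authors' implementation. The network architectures, dimensions, loss weights, and the specific bounds used (a variational inverse-dynamics model as a lower bound on the empowerment term, a mean-squared forward prediction loss, a reward-regression head, and an InfoNCE contrastive loss) are illustrative assumptions.

```python
# Hypothetical sketch of an InfoPower-style combined loss; names and bounds are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT, ACTION, OBS_EMB = 32, 4, 64

encoder     = nn.Sequential(nn.Linear(OBS_EMB, 128), nn.ELU(), nn.Linear(128, LATENT))
forward_mu  = nn.Linear(LATENT + ACTION, LATENT)   # latent forward dynamics (mean prediction)
inverse     = nn.Linear(2 * LATENT, ACTION)        # variational inverse model for the empowerment bound
reward_head = nn.Linear(LATENT, 1)                 # reward prediction from the latent state
proj        = nn.Linear(LATENT, LATENT)            # projection for contrastive (InfoNCE) scores

def infopower_style_loss(obs_emb_t, obs_emb_tp1, action_t, reward_t):
    """obs_emb_*: (B, OBS_EMB) image features; action_t: (B, ACTION); reward_t: (B, 1)."""
    z_t   = encoder(obs_emb_t)
    z_tp1 = encoder(obs_emb_tp1)

    # 1) Empowerment-style term: lower-bound I(a_t; z_{t+1} | z_t) with a variational
    #    inverse-dynamics model that reconstructs the action from consecutive latents.
    a_pred = inverse(torch.cat([z_t, z_tp1], dim=-1))
    empowerment_loss = F.mse_loss(a_pred, action_t)

    # 2) Forward-information term: predict the next latent from (z_t, a_t).
    z_pred = forward_mu(torch.cat([z_t, action_t], dim=-1))
    forward_loss = F.mse_loss(z_pred, z_tp1.detach())

    # 3) Reward term: regress the reward from the latent state.
    reward_loss = F.mse_loss(reward_head(z_t), reward_t)

    # 4) Contrastive term (InfoNCE): score each predicted latent against all next
    #    latents in the batch, treating the matching pair as the positive.
    logits = proj(z_pred) @ z_tp1.t()               # (B, B) similarity matrix
    labels = torch.arange(logits.shape[0])
    contrastive_loss = F.cross_entropy(logits, labels)

    # Equal weights here are purely illustrative; relative weighting is a design choice.
    return empowerment_loss + forward_loss + reward_loss + contrastive_loss

# Usage on a random batch of transitions:
B = 16
loss = infopower_style_loss(torch.randn(B, OBS_EMB), torch.randn(B, OBS_EMB),
                            torch.randn(B, ACTION), torch.randn(B, 1))
loss.backward()
```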

Qualitative Results

t-SNE plot of latent states with visualizations of the three nearest neighbors for two randomly sampled points (in red frames). We see that the state of the agent is similar in each set for InfoPower, whereas for Dreamer and the most competitive baseline, C-Dreamer, the nearest-neighbor frames have significantly different agent configurations.


Visualizations of runs for InfoPower with video background distractors in different environments after 2M environment interactions. The environments, in order, are Finger Spin, Cheetah Run, Quadruped Walk, Quadruped Run, Walker Run, and Hopper Hop.