Deep Bisimulation for Control


We study how representation learning can accelerate reinforcement learning from rich observations, such as images, without relying on either domain knowledge or pixel reconstruction. Our goal is to learn representations that support effective downstream control and are invariant to task-irrelevant details. Bisimulation metrics quantify behavioral similarity between states in continuous MDPs, and we propose using them to learn robust latent representations that encode only the task-relevant information from observations. Our method trains encoders such that distances in latent space equal bisimulation distances in state space. We demonstrate the effectiveness of our method at disregarding task-irrelevant information using modified visual MuJoCo tasks, where the background is replaced with moving distractors and natural videos, while achieving state-of-the-art performance. We also test a first-person highway driving task, where our method learns invariance to clouds, weather, and time of day. Finally, we provide generalization results drawn from properties of bisimulation metrics, and links to causal inference.
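To make the training objective concrete, the following minimal Python sketch shows one way the idea of matching latent distances to bisimulation distances could be written for a single pair of states. It assumes a particular form that this summary does not spell out: latent distance is measured with an L1 norm, the bisimulation target is the absolute reward difference plus a discounted 2-Wasserstein distance between predicted next-latent distributions, and those distributions are diagonal Gaussians, so the Wasserstein term has a closed form. All function names, the discount `gamma`, and the toy numbers are hypothetical.

```python
import math


def l1_distance(u, v):
    """L1 distance between two latent vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))


def w2_diag_gaussian(mu_i, sigma_i, mu_j, sigma_j):
    """Closed-form 2-Wasserstein distance between two diagonal Gaussians.

    Assumption: the learned dynamics model outputs a mean and a standard
    deviation per latent dimension.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_i, mu_j))
    std_term = sum((a - b) ** 2 for a, b in zip(sigma_i, sigma_j))
    return math.sqrt(mean_term + std_term)


def bisim_target(r_i, r_j, mu_i, sigma_i, mu_j, sigma_j, gamma):
    """Assumed bisimulation target: reward gap plus discounted dynamics gap."""
    return abs(r_i - r_j) + gamma * w2_diag_gaussian(mu_i, sigma_i, mu_j, sigma_j)


def bisim_loss(z_i, z_j, target):
    """Squared error between the latent distance and the bisimulation target."""
    return (l1_distance(z_i, z_j) - target) ** 2


# Toy pair of states (hypothetical numbers chosen for easy arithmetic).
z_i, z_j = [0.0, 0.0], [1.0, 1.0]        # encoder outputs
r_i, r_j = 1.0, 0.5                      # rewards
mu_i, mu_j = [0.0, 0.0], [3.0, 4.0]      # predicted next-latent means
sigma_i, sigma_j = [1.0, 1.0], [1.0, 1.0]  # predicted next-latent std devs

target = bisim_target(r_i, r_j, mu_i, sigma_i, mu_j, sigma_j, gamma=0.5)
loss = bisim_loss(z_i, z_j, target)
print(target, loss)  # 3.0 1.0
```

In this toy case the latent distance (2.0) is smaller than the target (3.0), so minimizing the loss would push the two encodings further apart. In a gradient-based implementation one would typically treat the target as a fixed quantity when updating the encoder; that choice is also an assumption here.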


Consider driving down a road. The visual scene contains task-relevant and task-irrelevant details. Robust representations of the visual scene should be insensitive to irrelevant objects (e.g. clouds) or details (e.g. car types), and encode two observations equivalently if their relevant details are equal (e.g. road direction and locations of other cars).

Can we still learn control when the usual background (top row) is replaced with "distractors" such as moving dots or natural videos?

Is weather important to an observation's representation? Our method learns that variation in weather is irrelevant when driving, and encodes observations similarly.

Control with Background Distraction

Representation Space

Panels: Bisimulation Representation Space (left), Autoencoder Representation Space (right)

A t-SNE of latent spaces learned with a bisimulation metric (left t-SNE) and a VAE (right t-SNE) after training has completed, color-coded with predicted state values (higher value yellow, lower value purple). Neighboring points in the embedding space learned with a bisimulation metric have similar states and correspond to observations with the same task-related information (depicted as pairs of images with their corresponding embeddings), whereas no such structure is seen in the embedding space learned by the VAE, where the same image pairs are mapped far away from each other. On the left are 3 examples of 10 neighboring points, averaged.


Left observations: Pixel observations in DeepMind Control in the default setting (top row) of the finger spin (left column), cheetah (middle column), and walker (right column), and with natural video distractors (bottom row). Right training curves: Results comparing our DBC method to baselines on 10 seeds with 1 standard error shaded in the default setting. The grid location of each graph corresponds to the grid location of each observation.
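As a reading aid for the shaded regions, the sketch below shows one common way a mean and a band of 1 standard error could be computed across seeds at a single training step. It assumes the standard error is the sample standard deviation divided by the square root of the number of seeds; the function name and the returns are hypothetical placeholders, not results from these experiments.

```python
import math
import statistics


def mean_and_standard_error(returns_per_seed):
    """Return (mean, standard error) of episode returns across seeds.

    Assumption: standard error = sample standard deviation / sqrt(n).
    """
    n = len(returns_per_seed)
    mean = statistics.fmean(returns_per_seed)
    sem = statistics.stdev(returns_per_seed) / math.sqrt(n)
    return mean, sem


# Hypothetical returns from 5 seeds at one evaluation step.
toy_returns = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, sem = mean_and_standard_error(toy_returns)

# The shaded band would span [mean - sem, mean + sem] at this step.
band = (mean - sem, mean + sem)
print(mean, band)
```

Repeating this at every evaluation step gives the curve (the means) and the shaded region (the bands).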

Autonomous Driving with Visual Redundancy

Representation Space

A t-SNE diagram of encoded first-person driving observations after 10,000 training steps, color-coded by value. Top: the learned representation identifies an obstacle on the right side. Whether that obstacle is a dark wall, bright car, or truck is task-irrelevant: these states are behaviorally equivalent. Left: the ego vehicle has flipped onto its left side. The different wall colors, due to a setting sun, are irrelevant: all states are equally stuck and low-value (purple t-SNE color). Right: clear highway driving. Clouds, sun position, and shadows are irrelevant to the driving task.


Table: Driving metrics, averaged over 100 episodes, after 100k training steps. Standard error shown. Arrow direction indicates if we desire the metric larger or smaller.

Figure: Performance comparison with 3 seeds on the driving tasks. Our DBC method (red) performs better than DeepMDP (purple) or learning straight from pixels without a representation (SAC, green), and much better than using contrastive losses (blue). The final performance of our method is 47% better than the next best baseline (SAC).