Environment Agnostic Reinforcement Learning
Generalization capability is indispensable for vision-based deep reinforcement learning (RL) to cope with the dynamic environment changes that appear in visual observations. The high-dimensional visual input space, however, makes it challenging to adapt an agent to unseen environments. In this work, we propose Environment Agnostic Reinforcement learning (EAR), a compact framework for domain generalization of visual deep RL. Environment-agnostic features (EAFs) are consistently disentangled by leveraging three novel objectives based on feature factorization, reconstruction, and state shift constraints, so that policy learning is accomplished only with vital features. EAR is a simple single-stage method with low model complexity and fast inference time, ensuring high reproducibility, while attaining state-of-the-art performance on the DeepMind Control Suite and DrawerWorld benchmarks.
Self-Supervised Learning for RL
Self-supervised learning attempts to extract meaningful features from unlabeled images alone by defining a pretext task (e.g., rotation prediction, jigsaw puzzle solving, and context prediction) or by leveraging contrastive learning. In recent years, self-supervised learning has been actively adopted in RL. Inspired by standard self-supervised learning that employs auxiliary tasks without supervision, PAD jointly minimizes the RL and self-supervised objectives to adapt a pretrained policy to an unseen environment without rewards. VAI extracts the visual foreground through unsupervised keypoint detection and visual attention to deliver invariant visual features to the RL policy learner. Unlike these approaches, which leverage self-supervised learning for domain adaptation or foreground extraction, we explicitly learn environment-agnostic features from an image with self-supervised feature factorization, so that policy learning is accomplished only with the vital features.
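To make the joint training scheme concrete, the sketch below shows one way an RL loss and a self-supervised inverse-dynamics loss can share an encoder, in the spirit of PAD. This is a minimal illustration, not the cited methods' code: the module names, observation shape (stacked 84x84 frames), and loss forms are our own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of joint RL + self-supervised training on a shared encoder
# (PAD-style inverse dynamics). Modules, shapes, and the RL loss hook are
# illustrative assumptions.
class SharedEncoder(nn.Module):
    def __init__(self, obs_channels=9, feat_dim=50):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(obs_channels, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 20 * 20, feat_dim),  # assumes 84x84 observations
        )

    def forward(self, obs):
        return self.net(obs)

def joint_objective(encoder, inv_head, rl_loss_fn, obs, next_obs, action, ss_weight=1.0):
    """Sum of the RL loss and an inverse-dynamics self-supervised loss,
    both back-propagated through the shared encoder."""
    z, z_next = encoder(obs), encoder(next_obs)
    pred_action = inv_head(torch.cat([z, z_next], dim=-1))  # predict a_t from (s_t, s_t+1)
    ss_loss = F.mse_loss(pred_action, action)
    return rl_loss_fn(z) + ss_weight * ss_loss
```

Here `inv_head` could be, e.g., a small MLP mapping the concatenated features to an action prediction. At deployment, only the self-supervised term can still be optimized, since rewards are unavailable in the unseen environment.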
Domain Generalization for RL
Domain generalization for RL has received significant attention over the past few years. One approach is to enhance the robustness of policies against visual changes. RARL learns a robust policy against extra disturbances by modeling differences between training and test scenarios, while SEPT proposes a general framework for single-episode transfer that rapidly infers latent variables and exploits them as input to a universal policy. Several works explore data augmentation techniques to improve the generalization capacity of the policy. For example, RAD achieves significant improvements via random translation and random amplitude scaling, while DrAC automatically finds the most effective augmentation with regularization terms for the policy and value function. Rather than learning policies solely from augmented data, SODA decouples augmentation from policy learning by using non-augmented data for policy learning and augmented data for auxiliary representation learning. More recently, SVEA designs a stabilized Q-value estimation framework to address the instability caused by data augmentation in off-policy RL. Another promising approach is adaptation to the test domain. PAD adapts a self-supervised task to obtain a free training signal during deployment, while VAI extracts a universal visual foreground mask to feed invariant observations to RL. For a similar purpose, but in a simpler and more effective way, our work focuses on generating a universal representation that is invariant to distribution shifts through feature separation.
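For readers unfamiliar with the augmentation-based line of work, the snippet below sketches the pad-and-crop random translation popularized by RAD and DrQ. It is an illustrative re-implementation under our own assumptions (replicate padding, 4-pixel shift), not the code released with any of the cited methods.

```python
import torch
import torch.nn.functional as F

def random_shift(imgs, pad=4):
    """Randomly translate a batch of images by up to `pad` pixels
    via replicate-padding followed by random cropping (RAD/DrQ-style)."""
    n, c, h, w = imgs.shape
    padded = F.pad(imgs, (pad, pad, pad, pad), mode='replicate')
    out = torch.empty_like(imgs)
    for i in range(n):
        top = torch.randint(0, 2 * pad + 1, (1,)).item()
        left = torch.randint(0, 2 * pad + 1, (1,)).item()
        out[i] = padded[i, :, top:top + h, left:left + w]
    return out

# Usage: augment pixel observations before feeding them to the policy/critic,
# e.g. obs_aug = random_shift(obs) with obs of shape (batch, channels, 84, 84).
```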
Overall Framework
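The sketch below illustrates one possible wiring of the three objectives named in the abstract (feature factorization, reconstruction, and a state shift constraint) around a shared encoder. It is only a conceptual aid: the concrete loss forms, module interfaces, and weights are assumptions and do not reproduce the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch only: one way to combine factorization, reconstruction,
# and state-shift objectives. Loss forms and weights are assumptions, not the
# paper's definitions.
def ear_style_losses(encoder, decoder, obs, next_obs,
                     w_fact=1.0, w_rec=1.0, w_shift=1.0):
    # Factorize each observation into an environment-agnostic part (z_a)
    # and an environment-specific part (z_e).
    z_a, z_e = encoder(obs)
    z_a_next, z_e_next = encoder(next_obs)

    # (1) Factorization: discourage overlap between the two factors
    #     (here via a simple cosine-similarity penalty).
    fact = F.cosine_similarity(z_a, z_e, dim=-1).pow(2).mean()

    # (2) Reconstruction: both factors together should reconstruct the input.
    rec = F.mse_loss(decoder(torch.cat([z_a, z_e], dim=-1)), obs)

    # (3) State shift: consecutive frames share the same environment factor,
    #     so the temporal change should be carried by the agnostic factor only.
    shift = F.mse_loss(z_e, z_e_next)

    return w_fact * fact + w_rec * rec + w_shift * shift
```

In this sketch the policy learner would consume only the environment-agnostic factor z_a, matching the single-stage design described in the abstract.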
Generalization Capability
Quantitative evaluation of episode return on the DeepMind Control Generalization Benchmark randomized-color tests. EAR significantly outperforms existing state-of-the-art methods without any test-time adaptation or additional training stages. We report the mean and standard deviation over 10 random seeds at 500K time steps. The best result on each task is in bold.
Sample efficiency
Learning curves of EAR compared with state-of-the-art methods, including RAD and DrQ, on the DeepMind Control Suite. We report the mean (line) and standard deviation (shaded area) over 5 random seeds. Aside from improving generalization capability, EAR competes favorably with both methods in terms of sample efficiency.
Test environments
The figure above shows samples from the test environments used in our experiments, including the DeepMind Control Generalization Benchmark, the Distracting Control Suite, and DrawerWorld robotic manipulation tasks. The DeepMind Control Generalization Benchmark provides two distinct benchmarks for visual generalization: (a) randomized colors and (b) video backgrounds. Environments (a) randomize the colors of the floor and background, while environments (b) replace the background with videos of real-life scenarios. For environments (c)-(g), we use the Distracting Control Suite of the DeepMind Control Generalization Benchmark, where the camera pose, background, and colors continually change throughout an episode. The intensity indicates the degree of variation, and we provide sample images with different intensities I = {0.1, 0.2, 0.3, 0.4, 0.5}. In particular, at intensity I = 0.5, the camera pose changes so significantly that the camera is often positioned vertically above the agent, as in the third example of (g). Under such strong intensity changes, most methods fail to work well.
Code
Environments
Ubuntu 18.04
PyTorch 1.7
CUDA 10.2
Single Titan RTX GPU
Download code and model parameters: EAR