ViSaRL: Visual Reinforcement Learning Guided by Human Saliency

arXiv / GitHub

Can human visual attention help agents perform visual control tasks?

Abstract

Training autonomous agents to perform complex control tasks from high-dimensional pixel input using reinforcement learning (RL) is challenging and sample-inefficient. When performing a task, people visually attend to task-relevant objects and areas. By contrast, pixel observations in visual RL consist primarily of task-irrelevant information. To bridge that gap, we introduce Visual Saliency-Guided Reinforcement Learning (ViSaRL). Using ViSaRL to learn visual scene encodings improves the success rate of an RL agent on four challenging visual robot control tasks in the Meta-World benchmark and on a real-robot setup. This finding holds with both CNN and Transformer-based visual encoder backbones, with absolute gains of 13% and 18% in average success rate, respectively. The Transformer-based visual encoder achieves a 10% absolute gain in success rate even when saliency is only available during pretraining.

ViSaRL

We present Visual Saliency-Guided Reinforcement Learning (ViSaRL, pronounced like "visceral"), a general approach for incorporating human-annotated saliency maps into learned visual representations, thereby improving performance on downstream tasks. The key idea of ViSaRL is to train a multimodal autoencoder that reconstructs both RGB and saliency inputs, and then to train an RL policy on top of the frozen encoder. The masked reconstruction objective encourages the learned representations to encode useful visual invariances and to attend to the regions most salient for downstream task learning. To circumvent the manual labor of annotating saliency maps, we train a state-of-the-art saliency predictor from only a few human-annotated examples and use it to augment RGB observations with saliency.
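As a concrete but necessarily simplified picture of this pipeline, the sketch below pretrains a MultiMAE-style masked autoencoder on RGB and saliency patch tokens; an RL policy would then consume features from the frozen encoder. All class names, dimensions, and architectural details here are illustrative assumptions rather than the exact implementation, and positional/modality embeddings are omitted for brevity.

```python
# Minimal sketch of ViSaRL-style multimodal masked-autoencoder pretraining.
# Names, sizes, and architecture are illustrative assumptions, not the
# authors' exact implementation.
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Split an image-like input into non-overlapping patch tokens."""

    def __init__(self, in_ch: int, dim: int, patch: int = 16):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                               # x: (B, C, H, W)
        return self.proj(x).flatten(2).transpose(1, 2)  # (B, N, dim)


class MultiModalMAE(nn.Module):
    """Masked autoencoder over concatenated RGB + saliency patch tokens."""

    def __init__(self, dim: int = 256, depth: int = 4, heads: int = 8, patch: int = 16):
        super().__init__()
        self.rgb_embed = PatchEmbed(3, dim, patch)   # RGB patches
        self.sal_embed = PatchEmbed(1, dim, patch)   # saliency patches
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        dec_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, depth)
        self.decoder = nn.TransformerEncoder(dec_layer, 2)
        self.rgb_head = nn.Linear(dim, patch * patch * 3)  # reconstruct RGB patch
        self.sal_head = nn.Linear(dim, patch * patch * 1)  # reconstruct saliency patch
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, rgb, sal, keep_ratio: float = 0.25):
        # Tokenize both modalities and concatenate along the token dimension.
        tokens = torch.cat([self.rgb_embed(rgb), self.sal_embed(sal)], dim=1)
        B, N, D = tokens.shape
        # Keep a random subset (1/4) of all patches; the rest are masked out.
        # (Positional/modality embeddings are omitted for brevity.)
        idx = torch.rand(B, N, device=tokens.device).argsort(dim=1)
        keep = idx[:, : int(N * keep_ratio)]
        visible = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, D))
        latent = self.encoder(visible)
        # Scatter encoded tokens back into a full-length sequence of mask tokens.
        full = self.mask_token.expand(B, N, D).clone()
        full.scatter_(1, keep.unsqueeze(-1).expand(-1, -1, D), latent)
        decoded = self.decoder(full)
        n_rgb = N // 2                    # assumes equal patch counts per modality
        return self.rgb_head(decoded[:, :n_rgb]), self.sal_head(decoded[:, n_rgb:])
```

After pretraining with a reconstruction loss on the masked patches of both modalities, the encoder is frozen and its latent features serve as the observation representation for the downstream RL policy; as noted in the abstract, the Transformer-based encoder still benefits even when saliency is only available during this pretraining stage.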


Experimental Results 

We show quantitative results of our approach with two different encoder backbones, CNN and MultiMAE, across four challenging Meta-World benchmark tasks. The figure below summarizes our main findings. Incorporating saliency input substantially improves downstream task success rate irrespective of the encoder backbone. Additionally, our proposed approach of fusing saliency annotations through a MultiMAE objective yields the best overall task performance among all baseline methods.

Real-Robot Results

Video Trajectories

Pick Up Apple

Pick Up Red Block with Distractor Objects

Put Bread on Plate

Put Apple in Bowl with Distractor Objects

Masked Reconstruction with MultiMAE

MultiMAE predictions for different random masks. We visualize the masked predictions for RGB observations from each of the four tasks. For each input image, we randomly sample three different masks, splitting the visible patches between the RGB and saliency modalities uniformly at random. Only 1/4 of the total patches are unmasked. Even when only a few unmasked patches come from one modality, the reconstructions remain accurate thanks to cross-modal interaction. Saliency maps are shown in color for visualization purposes.
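As a rough illustration of this masking scheme, the sketch below fixes the budget of visible patches at 1/4 of the total and splits it between the RGB and saliency modalities with a uniformly drawn share. The function name, patch counts, and exact sampling scheme are illustrative assumptions, not the exact implementation.

```python
# Hedged sketch of the mask-sampling step described above: 1/4 of all patches
# stay visible, and their split across RGB and saliency is drawn uniformly.
import torch


def sample_multimodal_mask(n_rgb: int, n_sal: int, keep_ratio: float = 0.25):
    """Return boolean visibility masks (True = patch is kept / unmasked)."""
    n_keep = int((n_rgb + n_sal) * keep_ratio)          # 1/4 of all patches
    share = torch.rand(1).item()                        # RGB share of visible patches
    keep_rgb = min(n_rgb, round(share * n_keep))
    keep_sal = min(n_sal, n_keep - keep_rgb)
    rgb_mask = torch.zeros(n_rgb, dtype=torch.bool)
    sal_mask = torch.zeros(n_sal, dtype=torch.bool)
    rgb_mask[torch.randperm(n_rgb)[:keep_rgb]] = True   # visible RGB patches
    sal_mask[torch.randperm(n_sal)[:keep_sal]] = True   # visible saliency patches
    return rgb_mask, sal_mask


# Example: a 224x224 image with 16x16 patches gives 196 patches per modality.
rgb_mask, sal_mask = sample_multimodal_mask(196, 196)
print(rgb_mask.sum().item(), sal_mask.sum().item())     # together 98 visible patches
```

Drawing the modality split at random for each image sometimes leaves one modality with very few visible patches, which is consistent with the cross-modal reconstructions highlighted in the figure above.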