Unlocking Pixels for Reinforcement Learning via Implicit Attention

There has recently been significant interest in training reinforcement learning (RL) agents in vision-based environments. This poses many challenges, such as high dimensionality and the potential for observational overfitting through spurious correlations. A promising approach to solving both of these problems is an attention bottleneck, which provides a simple and effective framework for learning high-performing policies, even in the presence of distractions. However, due to the poor scalability of attention architectures, these methods cannot be applied beyond low-resolution visual inputs and rely on large patches (and thus small attention matrices). In this paper we make use of new efficient attention algorithms, recently shown to be highly effective for Transformers, and demonstrate that these techniques can be successfully adopted for the RL setting. This allows our attention-based controllers to scale to larger visual inputs and facilitates the use of smaller patches, even individual pixels, improving generalization. We show this on a range of tasks, from the Distracting Control Suite to vision-based quadruped robot locomotion. We also provide a rigorous theoretical analysis of the proposed algorithm.

Visualization of Implicit Attention for Pixels (IAP): An input RGB(D) image is represented as a union of (not necessarily disjoint) patches (in principle even individual pixels). Each patch is projected via learned matrices to obtain key and query matrices. This is followed by a set of (potentially randomized) projections, which in turn is followed by a nonlinear mapping defining the attention type. The resulting transformed key and query tensors define an attention matrix which is never explicitly materialized. Instead, the transformed query and key tensors are efficiently multiplied with the value vector. The output is a score vector in the case of IAP-rank and an embedding in the case of IAP-trans. The algorithm can in principle use a multi-head mechanism, although we do not apply it in our experiments. In the above diagram, same-color lines indicate axes with the same number of dimensions.
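To make the pipeline above concrete, below is a minimal NumPy sketch of implicit attention over patches using Performer-style positive random features: the attention matrix is never materialized and the cost is linear in the number of patches. All names (patchify, random_feature_map, omega, the weight matrices) are illustrative and do not correspond to the paper's actual implementation.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an HxWxC image into flattened patches (patch_size=1 gives individual pixels)."""
    H, W, C = image.shape
    p = patch_size
    patches = image.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * C)              # (L, d_patch), L = number of patches

def random_feature_map(x, omega):
    """Positive random features approximating the softmax kernel (FAVOR+-style)."""
    d, m = omega.shape
    x = x / d ** 0.25                                   # split the usual 1/sqrt(d) scaling between q and k
    return np.exp(x @ omega - 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)) / np.sqrt(m)

def implicit_attention(patches, Wq, Wk, Wv, omega):
    """Approximate softmax attention without materializing the L x L attention matrix."""
    Q, K, V = patches @ Wq, patches @ Wk, patches @ Wv
    Qp, Kp = random_feature_map(Q, omega), random_feature_map(K, omega)   # (L, m)
    kv = Kp.T @ V                                       # (m, d_v) key-value summary, linear in L
    z = Qp @ Kp.sum(axis=0)                             # (L,) normalizers
    return (Qp @ kv) / z[:, None]                       # (L, d_v) attention output
```

With L patches and m random features, the cost is O(L·m·d) rather than the O(L²·d) of explicit attention, which is what makes very small patches, down to individual pixels, feasible.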

Visualization of locomotion behavior learned with IAP policies

Avoiding Obstacles

p1dt_fast.mp4
p16dt_fast.mp4

The input camera image is shown in the top-left corner of each video. The red part of the camera image is the area selected by self-attention. With patch size 1 (top video), the policy finely detects the boundaries of the obstacles, which helps with navigation. With patch size 16 (bottom video), only a single patch is selected, covering one fourth of the whole camera image. The policy identifies the general walking direction, but fine-grained visual information is lost.

Uneven Terrains

We test the ability of IAP policies to walk on the following types of randomized uneven terrain:

mpc_ss.mp4

Step-stones: The ground is made of a series of step-stones with gaps in between. The step-stone widths are fixed at 50 cm, the lengths are sampled from [50, 80] cm, and the gap sizes between adjacent stones lie in [10, 20] cm. The robot perceives the ground through 2 depth cameras attached to its body, one on the front and the other on the belly facing downwards. The input grey-scale depth images from both cameras are shown in the top-left corner of the video. The red part of the camera images is the area selected by self-attention. Notice that IAP selects areas corresponding to the step-stones while ignoring the gaps; these are the areas that are safe to step on.

Based on this selection, the policy picks a favorable foot placement location, and the MPC-based low-level controller adjusts the step length to reach the desired position, thus avoiding falling into the gaps. An interesting observation about this policy is that the robot consistently uses its front right leg to cross each gap first.

mpc_stairs.mp4

Stairs: The robot needs to climb up a flight of stairs. The depth of each step is uniformly randomized in [25, 33] cm and the height in [16, 19] cm. IAP successfully climbs up the flight of stairs by selecting the safe horizontal area of the next step.

mpc_poles_new.mp4

Grid: The ground is a grid of small square step-stones of size 15×15 cm. They are separated by [13, 17] cm from each other in both the x and y directions. At the beginning of each episode, we also randomly rotate the entire grid by an angle sampled in [−0.1, 0.1] radians. IAP learns to carefully walk on the grid by attending to the stones.
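For reference, the randomization ranges of the three terrain types above can be collected into a single configuration. The dictionary below merely restates the numbers from this section; the key names are hypothetical and not taken from our code.

```python
# Randomization ranges for the uneven-terrain tasks described above.
# Key names are hypothetical; lengths are in cm, rotations in radians.
TERRAIN_RANDOMIZATION = {
    "step_stones": {
        "stone_width": 50,              # fixed
        "stone_length": (50, 80),       # sampled per stone
        "gap_size": (10, 20),           # gap between adjacent stones
    },
    "stairs": {
        "step_depth": (25, 33),
        "step_height": (16, 19),
    },
    "grid": {
        "stone_size": (15, 15),         # small square stones
        "stone_spacing": (13, 17),      # in both x and y directions
        "grid_rotation": (-0.1, 0.1),   # applied once per episode
    },
}
```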

Navigating Indoor Environment

video_gibson.mp4

This navigation environment has realistic visuals from the Gibson dataset. The robot observes the environment with a front-facing depth camera at a resolution of 16×16. We set the IAP patch size to 4, and the top 4 patches are selected by self-attention. The robot successfully passes through a narrow gate with the help of vision while navigating the environment. The policy is learned entirely from scratch in an end-to-end manner with ES methods: both legged locomotion and high-level navigation based on vision input are learned together. The output of the image attention module is highly interpretable, as the selected patches can be visualized to show how the policy focuses on obstacles and pathways in the scene.
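As an illustration of the selection step, the sketch below ranks patches by scores derived from the implicit attention matrix (its column sums) and keeps the top k, reusing the patchify and random_feature_map helpers from the earlier sketch. The scoring rule and all names are assumptions made for illustration, not the exact IAP-rank procedure.

```python
import numpy as np

def select_top_patches(depth_image, Wq, Wk, omega, patch_size=4, k=4):
    """Rank patches by attention-derived scores and keep the k most attended ones."""
    # A 16x16 depth image with patch_size=4 yields a 4x4 grid of 16 patches.
    patches = patchify(depth_image[..., None], patch_size)
    Q, K = patches @ Wq, patches @ Wk
    Qp, Kp = random_feature_map(Q, omega), random_feature_map(K, omega)   # (L, m)
    # Column sums of the (never materialized) attention logits Qp @ Kp.T
    # give one importance score per patch.
    scores = Qp.sum(axis=0) @ Kp.T                      # (L,)
    return np.argsort(scores)[-k:]                      # indices of the top-k patches
```

The retained patch indices can then be mapped back to image coordinates, which is how the red highlights in the videos above are produced.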