ROLL: Visual Self-Supervised Reinforcement Learning with Object Reasoning

Yufei Wang*, Gautham Narayan Narasimhan*, Xingyu Lin, Brian Okorn, David Held

Conference on Robot Learning (CoRL), 2020

While reinforcement learning has made great progress on many robotics control tasks, current image-based RL algorithms typically operate on the whole image without performing object-level reasoning, which leads to inefficient goal sampling and ineffective reward functions. In this paper, we improve upon previous visual self-supervised RL by incorporating object-level reasoning and occlusion reasoning. Specifically, we propose unknown object segmentation to ignore distractors in the scene for better reward computation and goal generation; we further enable occlusion reasoning by employing a novel auxiliary loss and training scheme. We demonstrate that our proposed algorithm, namely ROLL (Reinforcement learning with Object Level Learning), learns dramatically faster and achieves better final performance compared with previous methods in several simulated visual control tasks.

Summary Video


ROLL: Reinforcement Learning with Object-level Learning

This work improves upon previous visual self-supervised reinforcement learning algorithms by incorporating object reasoning. We use unknown object segmentation to remove the static background and the robot arm, and learn a latent embedding for reward computation and goal conditioning from the segmented object images. We further employ an LSTM and design a novel matching loss and training scheme to make the method robust to object occlusions.
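The idea of computing rewards from segmented object images can be illustrated with a minimal sketch. It is not the ROLL implementation: the background-differencing rule, the `threshold` value, the given `robot_mask`, and the random linear `encode` function (standing in for a trained object-VAE encoder) are all assumptions made for illustration. The reward is taken as the negative Euclidean distance between latent embeddings, a common choice in latent-space goal-conditioned RL.

```python
import numpy as np


def segment_object(obs, background, robot_mask, threshold=0.1):
    """Keep pixels that differ from the static background and are not robot.

    Assumption: simple per-pixel differencing; a real system would use a
    learned or more robust segmentation method.
    """
    diff = np.abs(obs.astype(np.float32) - background.astype(np.float32)).max(axis=-1)
    object_mask = (diff > threshold) & ~robot_mask
    return obs * object_mask[..., None]


def latent_reward(encode, seg_obs, seg_goal):
    """Negative Euclidean distance between the two latent embeddings."""
    return -float(np.linalg.norm(encode(seg_obs) - encode(seg_goal)))


# Toy 8x8 scene: empty background, a 2x2 "puck", and a robot arm strip.
H, W = 8, 8
background = np.zeros((H, W, 3), dtype=np.float32)

obs = background.copy()
obs[2:4, 2:4] = 1.0          # puck at its current position
obs[6:8, :] = 0.5            # robot arm pixels
robot_mask = np.zeros((H, W), dtype=bool)
robot_mask[6:8, :] = True

goal = background.copy()
goal[2:4, 5:7] = 1.0         # puck at the goal position
no_robot = np.zeros((H, W), dtype=bool)

# Hypothetical stand-in for a trained encoder: a fixed random projection.
rng = np.random.default_rng(0)
projection = rng.normal(size=(H * W * 3, 4))


def encode(image):
    return image.reshape(-1) @ projection


seg_obs = segment_object(obs, background, robot_mask)
seg_goal = segment_object(goal, background, no_robot)
reward = latent_reward(encode, seg_obs, seg_goal)
```

Because the arm pixels are masked out before encoding, moving the arm alone cannot change this reward; only the object's position does.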

Results on 5 MuJoCo Visual Tasks

Puck Pushing

Hurdle-Bottom Puck Pushing

Hurdle-Top Puck Pushing

Door Opening

Object Pickup

Learning Curves

Policy Video

We see that ROLL always successfully aligns the target object, while Skew-Fit often aligns the arm but fails to align the target object.


Correlation between Object Distance and Object VAE latent distance

To analyze why our method performs so well, we check whether the reward function derived from the latent space of the object-VAE is better than the one derived from the scene-VAE. For a better reward function, the distance in the latent space should more closely approximate the real object distance, e.g., the puck/ball distance in the pushing/pickup tasks and the door angle distance in the door opening task. In the figure below, we plot the object distance along the x-axis and the latent distance along the y-axis, where each distance is measured between one of a set of observation images and a single goal image. A good latent distance should scale roughly linearly with the real object distance. As can be seen, the latent distance from the object-VAE is much more accurate and stable in approximating the real object distance in all five tasks.
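The linearity check described above can be sketched numerically. The following is a minimal illustration, not the analysis code used for the figure: the function name `distance_correlation`, the use of the Pearson coefficient as a summary statistic, and the synthetic positions and latents are all assumptions.

```python
import numpy as np


def distance_correlation(object_positions, goal_position, latents, goal_latent):
    """Compare real object distance with latent distance to a single goal.

    Returns both distance arrays and their Pearson correlation coefficient.
    """
    object_dist = np.linalg.norm(object_positions - goal_position, axis=1)
    latent_dist = np.linalg.norm(latents - goal_latent, axis=1)
    r = np.corrcoef(object_dist, latent_dist)[0, 1]
    return object_dist, latent_dist, r


rng = np.random.default_rng(0)
positions = rng.uniform(-1.0, 1.0, size=(200, 2))  # synthetic puck positions
goal = np.array([0.5, 0.5])

# Idealized embedding: latent distance is exactly proportional to object distance.
_, _, r_ideal = distance_correlation(positions, goal, 3.0 * positions, 3.0 * goal)

# Uninformative embedding: latents carry no information about object position.
random_latents = rng.normal(size=(200, 2))
_, _, r_random = distance_correlation(positions, goal, random_latents, np.zeros(2))
```

In this toy setting `r_ideal` is close to 1, while `r_random` is much lower. A latent space whose distances track the real object distance, as reported for the object-VAE, would sit near the idealized case.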

Ablation Study

To test whether each component of our method is necessary, we perform ablations of our method on the Hurdle-Top Puck Pushing task, which has large occlusions on the optimal path. We test three variants of our method: ROLL without matching loss, which does not add the matching loss to the LSTM output or the object-VAE latent embedding; ROLL without LSTM and matching loss, which does not add an LSTM after the object-VAE and uses no matching loss; and ROLL without object-VAE, which replaces the object-VAE in ROLL with the scene-VAE but still uses an LSTM and the matching loss. As can be seen, both the LSTM and the matching loss are essential to the good performance of ROLL.


This work was supported by the United States Air Force and DARPA under Contract No. FA8750-18-C-0092, the National Science Foundation under Grant No. IIS-1849154, and LG Electronics. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, or United States Air Force and DARPA.