Visuomotor Control in Multi-Object Scenes Using Object-Aware Representations

Negin Heravi, Ayzaan Wahid, Corey Lynch, Pete Florence, Travis Armstrong, Jonathan Tompson, Pierre Sermanet, Jeannette Bohg, Debidatta Dwibedi

Abstract: Perceptual understanding of the scene and the relationships between its different components is important for the successful completion of robotic tasks. Representation learning has been shown to be a powerful technique for this, but most current methodologies learn task-specific representations that do not necessarily transfer well to other tasks. Furthermore, representations learned by supervised methods require large labeled datasets for each task, which are expensive to collect in the real world. Using self-supervised learning to obtain representations from unlabeled data can mitigate this problem. However, current self-supervised representation learning methods are mostly object-agnostic, and we demonstrate that the resulting representations are insufficient for general-purpose robotics tasks, as they fail to capture the complexity of scenes with many components. In this paper, we show the effectiveness of using object-aware representation learning techniques for robotic tasks. Our self-supervised representations are learned by observing the agent freely interacting with different parts of the environment and are queried in two different settings: (i) policy learning and (ii) object location prediction. We show that our model learns control policies in a sample-efficient manner and outperforms state-of-the-art object-agnostic techniques as well as methods trained on raw RGB images. Our results show a 20% increase in performance in low-data regimes (1000 trajectories) when training policies with implicit behavioral cloning (IBC). Furthermore, our method outperforms the baselines on the task of object localization in multi-object scenes.

Paper

International Conference on Robotics and Automation (ICRA) 2023

arXiv: https://arxiv.org/abs/2205.06333

Qualitative Examples of Masks Learned by Slot Attention on Real-world Data

Figure: an input image followed by the attention masks for each of the 16 slots (Slot 1 through Slot 16) learned by Slot Attention.
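For reference, the sketch below is a minimal PyTorch implementation of the Slot Attention iteration (Locatello et al., 2020) that produces masks like those above: slots compete for input features via a softmax over slots, and each slot is updated with a GRU. The feature dimension and iteration count are illustrative assumptions; only the 16-slot count matches the figure above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotAttention(nn.Module):
    # Minimal Slot Attention sketch (Locatello et al., 2020). num_slots=16
    # matches the figure above; dim and iters are illustrative assumptions.
    def __init__(self, num_slots=16, dim=64, iters=3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        # Slots are initialized by sampling from a learned Gaussian.
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_logsigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm_in = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)
        self.norm_mlp = nn.LayerNorm(dim)

    def forward(self, inputs):  # inputs: (B, N, dim) flattened CNN features
        B, N, D = inputs.shape
        inputs = self.norm_in(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        slots = self.slots_mu + self.slots_logsigma.exp() * torch.randn(
            B, self.num_slots, D, device=inputs.device)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            # Softmax over slots makes slots compete for each input location;
            # the per-slot attention maps are the masks visualized above.
            attn = F.softmax(torch.einsum('bnd,bkd->bnk', k, q) * self.scale, dim=-1)
            attn = attn / (attn.sum(dim=1, keepdim=True) + 1e-8)  # weighted mean over inputs
            updates = torch.einsum('bnk,bnd->bkd', attn, v)
            slots = self.gru(updates.reshape(-1, D), slots.reshape(-1, D)).view(B, -1, D)
            slots = slots + self.mlp(self.norm_mlp(slots))
        return slots  # (B, num_slots, dim): one representation per object/region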

Quantitative Results 

 We show the effectiveness of using object-aware representation learning techniques for robotic tasks. Our self-supervised representations are learned by observing the agent freely interacting with different parts of the environment and are queried in two different settings: (i) policy learning and (ii) object location prediction.

(i) Policy learning:  Our model learns control policies in a sample-efficient manner and outperforms state-of-the-art object-agnostic techniques as well as methods trained on raw RGB images. Our results show a 20% increase in performance in low-data regimes (1000 trajectories) when training policies with implicit behavioral cloning (IBC).

(a) Validation performance of different methods during IBC training in the low-data regime (1000 episodes). Using Slot Attention leads to a 20% performance increase; as an upper bound, using ground-truth segmentation masks yields roughly a 40% improvement. Solid lines show the mean across 4 seeds, and the shaded area indicates one standard deviation on each side.  (b) Policy-learning performance as a function of the number of episodes in the training data. Slot Attention-based representations provide a performance boost in low-data regimes.
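To make the policy-learning setup concrete, the sketch below shows the core of an IBC-style training objective (Florence et al., 2021): an energy model over (observation, action) pairs trained with an InfoNCE-style loss against uniformly sampled counterfactual actions. Network sizes and the number of negatives are illustrative assumptions, and in the setting above the observation features would plausibly be the flattened slot representations.

import torch
import torch.nn as nn
import torch.nn.functional as F

class EnergyModel(nn.Module):
    # E_theta(obs, action): low energy should mean "likely expert action".
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, obs, act):  # obs: (B, obs_dim), act: (B, M, act_dim)
        obs = obs.unsqueeze(1).expand(-1, act.shape[1], -1)
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)  # (B, M) energies

def ibc_loss(model, obs, expert_act, act_low, act_high, num_negatives=256):
    B, act_dim = expert_act.shape
    # Counterfactual actions drawn uniformly from the action bounds.
    negatives = act_low + (act_high - act_low) * torch.rand(B, num_negatives, act_dim)
    actions = torch.cat([expert_act.unsqueeze(1), negatives], dim=1)  # expert is index 0
    energies = model(obs, actions)
    # InfoNCE: classify the expert action (lowest energy) among the negatives.
    targets = torch.zeros(B, dtype=torch.long, device=obs.device)
    return F.cross_entropy(-energies, targets)

At test time, IBC selects the action that (approximately) minimizes the learned energy, e.g. via derivative-free sampling over the action bounds; that inference step is omitted from this sketch.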

(ii) Object localization:  Our method outperforms the baselines for the task of object localization in multi-object scenes.

Performance comparison on object localization as a function of the number of blocks in the scene, using Probability of Correct Keypoint (PCK) as the evaluation metric. A higher percentage indicates better performance. The baselines learn the location of only one object in the scene (the end-effector), resulting in low average performance that decreases as the number of blocks increases. Slot Attention is able to localize multiple objects but sometimes struggles with objects of the same color but different shapes in the 8-block scenario.
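For clarity, PCK can be computed as in the sketch below: a predicted keypoint counts as correct if it falls within a threshold distance of the ground truth. The threshold fraction alpha is an illustrative assumption, not necessarily the value used in the paper.

import numpy as np

def pck(pred, gt, image_size, alpha=0.05):
    """pred, gt: (N, 2) pixel coordinates. A keypoint is 'correct' if its
    prediction lies within alpha * image_size of the ground truth."""
    dists = np.linalg.norm(pred - gt, axis=-1)
    return float(np.mean(dists <= alpha * image_size))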

Qualitative example of the model’s performance on object location prediction. As a baseline, the performance of a representation learned using MoCo is shown in the middle column. In a perfect prediction, each shape would overlap the matching circle of its object; this is the case for Slot Attention in the 1- and 4-block cases. In the 8-block case, the predicted slot locations are closer to the ground-truth locations than those of the MoCo baseline.
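One subtlety in this evaluation is that slots carry no fixed object identity, so predictions must first be put in correspondence with the ground-truth objects. A common recipe, assumed here rather than taken from the paper, is minimum-cost bipartite matching with the Hungarian algorithm:

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_slots_to_objects(pred, gt):
    """pred: (K, 2) predicted slot locations; gt: (N, 2) object locations, K >= N.
    Returns pred reordered to align with gt under minimum total distance."""
    cost = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # (K, N)
    rows, cols = linear_sum_assignment(cost)  # optimal slot-to-object assignment
    matched = np.empty_like(gt)
    matched[cols] = pred[rows]
    return matched

The matched predictions can then be scored with a PCK-style metric as above.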

Citation 

@INPROCEEDINGS{HeraviSlots4Robots,
  author={Heravi, Negin and Wahid, Ayzaan and Lynch, Corey and Florence, Pete and Armstrong, Travis and Tompson, Jonathan and Sermanet, Pierre and Bohg, Jeannette and Dwibedi, Debidatta},
  booktitle={2023 IEEE International Conference on Robotics and Automation (ICRA)},
  title={Visuomotor Control in Multi-Object Scenes Using Object-Aware Representations},
  year={2023}
}

Presentation Video

Acknowledgements

Work done during an internship at Google. Toyota Research Institute ("TRI") provided partial funding to assist the author with this research, but this article solely reflects the opinions and conclusions of its authors and not those of TRI or any other Toyota entity.