Spatial and Temporal Features Unified Self-Supervised Representation Learning Networks

Authors: Rahul Choudhary, Rahee Walambe, Ketan Kotecha

Abstract: Robot manipulation tasks can be carried out effectively, provided the state representation is sufficiently detailed. Embodiment, viewpoint, and domain differences are some of the challenges in learning from human demonstration. This work proposes a self-supervised, multi-viewpoint representation learning method that unifies spatial and temporal features. The algorithm consists of two components: (a) a spatial component, which learns the layout of the environment, i.e., which pixels to focus on most to obtain the best representation of the image regardless of viewpoint, and (b) a temporal component, which learns how snapshots captured simultaneously from multiple viewpoints (i.e., at the same time-step but from different viewpoints) are similar, and how they differ from snapshots taken from the same viewpoint at a different time-step. These representations are then integrated with a Reinforcement Learning (RL) framework to learn accurate behaviors from videos of humans performing the manipulation task. The effectiveness of this approach is illustrated by training robots to learn various manipulation tasks, i.e., (a) grabbing objects, (b) lifting objects, and (c) opening and closing drawers, from expert demonstrations provided by humans. The algorithm succeeds across all of these manipulation tasks. The robot learns to pick up objects of various shapes, sizes, and colors with different orientations and placements on the table, and it also learns to open and close drawers. The method is highly sample-efficient and addresses the challenges of embodiment, viewpoint, and domain differences.
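To make the temporal component concrete, below is a minimal sketch of a multi-viewpoint time-contrastive triplet objective of the kind described in the abstract: frames from different viewpoints at the same time-step are pulled together, and frames from the same viewpoint at a different time-step are pushed apart. This is an illustrative assumption of how such a loss can be written in PyTorch; the network architecture, names (EmbeddingNet, time_contrastive_loss), and margin value are not taken from the paper.

```python
# Hedged sketch of a multi-view time-contrastive triplet loss (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmbeddingNet(nn.Module):
    """Small CNN mapping an RGB frame to a unit-norm embedding (stand-in encoder)."""

    def __init__(self, embed_dim: int = 32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.conv(x).flatten(1)
        return F.normalize(self.fc(h), dim=1)


def time_contrastive_loss(net, anchor, positive, negative, margin: float = 0.5):
    """Triplet loss: anchor and positive are the same time-step seen from
    different viewpoints; negative is the same viewpoint at a different time-step."""
    za, zp, zn = net(anchor), net(positive), net(negative)
    return F.triplet_margin_loss(za, zp, zn, margin=margin)


# Usage with random stand-in frames (batch of 8 RGB images, 64x64).
net = EmbeddingNet()
frames = lambda: torch.rand(8, 3, 64, 64)
loss = time_contrastive_loss(net, frames(), frames(), frames())
loss.backward()
```

The resulting embeddings can then serve as the state representation for the downstream RL policy, which is where the spatial and temporal features feed into behavior learning.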

Lift Task Final Policy Videos:


Drawer Task Final Policy Videos: