Towards Learning to Imitate from a Single Video Demonstration

Enabling imitation learning from a single video demonstration via recurrent Siamese networks

Abstract

Agents that can learn to imitate from video observation alone, without direct access to state or action information, are more applicable to learning in the natural world. However, formulating a reinforcement learning (RL) agent that facilitates this goal remains a significant challenge. We approach this challenge using contrastive training to learn a reward function that compares an agent's behaviour with a single demonstration. We use a Siamese recurrent neural network architecture to learn rewards in space and time between motion clips while training an RL policy to minimize this distance. Through experimentation, we also find that including multi-task data and additional image encoding losses improves the temporal consistency of the learned rewards and, as a result, significantly improves policy learning. We demonstrate our approach on simulated humanoid, dog, and raptor agents in 2D and a quadruped and a humanoid in 3D. We show that our method outperforms current state-of-the-art techniques in these environments and can learn to imitate from a single video demonstration.
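To make the reward structure concrete, below is a minimal PyTorch sketch of a recurrent Siamese encoder and a distance-based imitation reward. It is an illustration under our own simplifying assumptions, not the paper's implementation: the class names, network sizes, and the exp(-distance) reward shaping are hypothetical choices.

```python
import torch
import torch.nn as nn

class RecurrentSiameseEncoder(nn.Module):
    """Shared encoder: per-frame conv features fed through a GRU to get a clip embedding.
    (Illustrative architecture; layer sizes are placeholders, not the paper's.)"""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.proj = nn.LazyLinear(embed_dim)                        # per-frame (spatial) embedding
        self.rnn = nn.GRU(embed_dim, embed_dim, batch_first=True)   # sequence (temporal) embedding

    def forward(self, frames):
        # frames: (batch, time, channels, height, width)
        b, t = frames.shape[:2]
        z = self.proj(self.conv(frames.flatten(0, 1))).view(b, t, -1)
        _, h = self.rnn(z)
        return z, h[-1]   # per-frame embeddings and final clip embedding


def imitation_reward(encoder, agent_clip, demo_clip):
    """Reward = a decreasing function of the embedding distance between the agent's
    rendered clip and the demonstration clip (both passed through the same encoder)."""
    with torch.no_grad():
        _, z_agent = encoder(agent_clip)
        _, z_demo = encoder(demo_clip)
        dist = torch.norm(z_agent - z_demo, dim=-1)
    return (-dist).exp()  # assumed shaping: maps distance to a positive, bounded reward
```

In use, the agent's rendered frames and the demonstration video are both passed through the shared encoder, and the resulting per-step reward is handed to a standard RL algorithm, so that maximizing reward corresponds to minimizing the learned distance to the demonstration.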

Pupper

Imitation result for the simulated Stanford Quadruped robot (Pupper). Video behaviour on the left, learned agent on the right.


pupper_mocap.mp4

Behaviour to imitate from video observation

pupper.mp4

Learned motion from ViRL

Dog2D

Imitation result for the simulated 2D dog. Three trajectories shown side by side, with the agent in gray and the kinematically controlled demonstration in blue on the right.

dog2d.mp4


Raptor2D

Imitation result for the simulated 2D raptor. Three trajectories shown side by side, with the agent in gray and the kinematically controlled demonstration in blue on the right.

raptor.mp4