Reward specification is a notoriously difficult problem in reinforcement learning, requiring extensive expert supervision to design robust reward functions. Imitation learning (IL) methods circumvent this by using expert demonstrations instead of an extrinsic reward function, but they typically require a large number of in-domain expert demonstrations. Inspired by advances in the field of Video-and-Language Models (VLMs), we present RoboCLIP, an online imitation learning method that uses a single demonstration, in the form of a video or a textual description of the task, to generate rewards without manual reward function design, overcoming the large data requirement. RoboCLIP can also utilize out-of-domain demonstrations, such as videos of humans solving the task, for reward generation, circumventing the need for the demonstration and deployment domains to match. RoboCLIP uses pretrained VLMs without any finetuning for reward generation. Reinforcement learning agents trained with RoboCLIP rewards achieve 2-3 times higher zero-shot performance than competing imitation learning methods on downstream robot manipulation tasks, using only one video or text demonstration.

RoboCLIP Overview. A pretrained Video-and-Language Model is used to generate rewards via the similarity score between the encoding z_v of an episode of the agent's interaction with its environment and the encoding z_d of a task specifier, such as a textual description of the task or a video demonstrating a successful trajectory. This similarity score between the latent vectors is provided to the agent as a sparse reward at the end of the episode.
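The reward computation can be summarized in the short sketch below. It assumes a hypothetical `video_encoder` handle for the frozen VLM's video tower and a precomputed `task_latent` for z_d (names are illustrative, not the paper's code), and uses cosine similarity as a stand-in for the VLM similarity score.

```python
# Minimal sketch of the RoboCLIP reward (illustrative, not the official code).
# `video_encoder` stands in for the frozen pretrained VLM's video tower and
# `task_latent` for the precomputed encoding z_d of the task specifier.
import torch
import torch.nn.functional as F

def roboclip_reward(episode_frames: torch.Tensor,
                    task_latent: torch.Tensor,
                    video_encoder: torch.nn.Module) -> float:
    """Similarity between the episode encoding z_v and the task encoding z_d."""
    with torch.no_grad():
        z_v = video_encoder(episode_frames)      # (d,) latent for the full episode
    # Cosine similarity used here for illustration of the similarity score.
    return F.cosine_similarity(z_v, task_latent, dim=-1).item()
```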

Using Language and In-Domain Videos to Generate Rewards

We pretrain a policy using the reward function r_RoboCLIP defined above. We then measure and report the zero-shot task rewards achieved by the agents below. We find that pretraining on this learned reward results in high zero-shot task success.
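Concretely, the sparse end-of-episode reward can be attached to training with an environment wrapper along the following lines. This is a sketch assuming a gymnasium-style API with rgb_array rendering and a `reward_fn` already bound to the VLM encoder (e.g., via functools.partial over the sketch above); it is not the paper's implementation.

```python
# Sketch of a sparse-reward wrapper: frames are buffered over the episode and
# the RoboCLIP similarity is paid out only on the final step.
import numpy as np
import gymnasium as gym

class RoboCLIPRewardWrapper(gym.Wrapper):
    def __init__(self, env, reward_fn, task_latent):
        super().__init__(env)             # env created with render_mode="rgb_array"
        self.reward_fn = reward_fn        # callable: (frames, task_latent) -> float
        self.task_latent = task_latent    # precomputed z_d for the task specifier
        self.frames = []

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self.frames = [self.env.render()]
        return obs, info

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        self.frames.append(self.env.render())
        reward = 0.0
        if terminated or truncated:       # sparse reward at the end of the episode
            reward = self.reward_fn(np.stack(self.frames), self.task_latent)
        return obs, reward, terminated, truncated, info
```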

Resultant Language Conditioned Policy generated using

"Robot pushing red button"

Resultant Video Conditioned Policy generated using expert video demonstration

Button-Press: Results on Using Video and Language Demonstrations in Button Press. The above gifs visualize the zero-shot policies obtained by training on RoboCLIP rewards.

Resultant Language Conditioned Policy generated using

"Robot closing black box"

Resultant Video Conditioned Policy generated using expert video demonstration

Door-Close: Results on Using Video and Language Demonstrations in Door Close. The language-conditioned policy moves towards the door but fails to finish the task. The video-conditioned policy completes the task successfully zero-shot.

Resultant Language Conditioned Policy generated using

"Robot closing green drawer"

Resultant Video Conditioned Policy generated using expert video demonstration

Drawer-Close: Results on Using Video and Language Demonstrations in Drawer Close. Both the video- and language-conditioned policies complete the task successfully zero-shot.

Using Out-of-Domain Videos to Generate Rewards

We pretrain a policy using the reward function r_RoboCLIP by conditioning on videos from out-of-domain sources, such as gifs of animated characters demonstrating a task or a human demonstrating the same task in their own environment. We find that these rewards are somewhat noisier and that policies obtained from this kind of pretraining move towards the correct object and nearly complete the task.
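Only the source of the task-specifier latent changes when moving to out-of-domain videos. The sketch below shows how z_d could be obtained from either text or demonstration frames; the `encode_text` / `encode_video` handles are assumed names for the VLM's two towers, not the paper's API.

```python
# Sketch: z_d can come from a textual description, an in-domain robot video,
# a human demonstration, or an animated gif; the reward pipeline is unchanged.
import torch

def make_task_latent(vlm, text=None, demo_frames=None):
    """Return the task-specifier encoding z_d from text or demonstration frames."""
    with torch.no_grad():
        if text is not None:
            return vlm.encode_text(text)        # e.g. "Robot pushing red button"
        return vlm.encode_video(demo_frames)    # robot, human, or animated demo
```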

Demonstration Video

Resultant Video Conditioned Policy

Button-Press Human: The rewards generated from human videos are somewhat noisier. The robot moves towards the correct object but suffers from perceptual errors, missing the red button despite moving close to it. This policy is subsequently fine-tuned using a single demonstration below.

Demonstration Video

Resultant Video Conditioned Policy

Drawer-Open Human: The rewards generated from human videos are somewhat noisier. The robot moves to and grasps the correct object but fails to fully open the drawer.

Demonstration Video

Resultant Video Conditioned Policy

Door-Open Human: The rewards generated from human videos are somewhat noisier. Here too, the robot moves to and grasps the handle but fails to fully open the door.

Style Transfer

We pretrain a policy using the reward function r_RoboCLIP by conditioning on in-domain videos and find that the resultant policies imitate the style of the demonstration policy.

Hinge Cabinet Zero-Shot Rewards

Demonstration Video

Resultant Video Conditioned Policy

Slide Cabinet Zero-Shot Rewards

Demonstration Video

Resultant Video Conditioned Policy

Finetuning with One Demonstration

As can be seen above, utilizing language rewards during pretraining results in policies that vary in their success rates. For example, the robot moves towards the target object but sometimes fails to complete the task. We therefore finetune these imperfect policies using a single demonstration, as sketched below.
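A minimal behavior-cloning (BC) finetuning sketch: starting from the RoboCLIP-pretrained policy, its actions are regressed onto the (state, action) pairs of a single expert demonstration. The policy interface (a deterministic action head) and the demonstration format are assumptions for illustration.

```python
# Sketch of supervised finetuning (BC) from one demonstration, starting from
# the policy pretrained on RoboCLIP rewards.
import torch
import torch.nn.functional as F

def finetune_bc(policy, demo_states, demo_actions, epochs=100, lr=1e-4):
    """demo_states: (T, obs_dim); demo_actions: (T, act_dim) from a single demo."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        pred_actions = policy(demo_states)             # deterministic action head assumed
        loss = F.mse_loss(pred_actions, demo_actions)  # supervised imitation loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return policy
```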

Policy after OOD Video Pretraining


Supervised Finetuning (BC) using One Demonstration

Finetuned Policy using 1 Demonstration

Policy after OOD Video Pretraining


Supervised Finetuning (BC) using One Demonstration

Finetuned Policy using 1 Demonstration

Policy after OOD Video Pretraining


Supervised Finetuning (BC) using One Demonstration

Finetuned Policy using 1 Demonstration

Policy after Language Pretraining


Supervised Finetuning (BC) using One Demonstration

Finetuned Policy using 1 Demonstration

Policy after Language Pretraining


Supervised Finetuning (BC) using One Demonstration

Finetuned Policy using 1 Demonstration

Policy after Language Pretraining


Supervised Finetuning (BC) using One Demonstration

Finetuned Policy using 1 Demonstration