Reinforcement Learning with Action-Free Pre-training from Videos

Younggyo Seo, Kimin Lee, Stephen James, Pieter Abbeel

KAIST, UC Berkeley

[Paper] [Code]


Recent unsupervised pre-training methods have been shown to be effective on language and vision domains by learning useful representations for multiple downstream tasks. In this paper, we investigate whether such unsupervised pre-training methods can also be effective for vision-based reinforcement learning (RL). To this end, we introduce a framework that learns representations useful for understanding the dynamics via generative pre-training on videos. Our framework consists of two phases: we pre-train an action-free latent video prediction model, and then utilize the pre-trained representations for efficiently learning action-conditional world models on unseen environments. To incorporate additional action inputs during fine-tuning, we introduce a new architecture that stacks an action-conditional latent prediction model on top of the pre-trained action-free prediction model. Moreover, for better exploration, we propose a video-based intrinsic bonus that leverages pre-trained representations. We demonstrate that our framework significantly improves both the final performance and the sample-efficiency of vision-based RL in a variety of manipulation and locomotion tasks.

Action-free Pre-training from Videos (APV)

To capture rich dynamics information from videos, we pre-train an action-free latent video prediction model. Since our goal is to learn representations from readily available videos that can be transferred to various downstream tasks, our framework does not require the videos to be collected in the same domain as the downstream tasks, nor does it assume that the datasets contain action information.

To utilize the pre-trained model for RL, we propose (i) a stacked latent prediction model that learns an action-conditional dynamics model used for planning and policy learning, and (ii) a video-based intrinsic bonus that encourages exploration by exploiting the rich dynamics information in the pre-trained model.
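As a rough illustration of the stacking idea in (i), the sketch below runs a toy action-free recurrent step and feeds its state, together with the action, into a second action-conditional step. It is not the model from the paper: the single tanh layers, the dimensions, and the names `action_free_step`, `action_conditional_step`, and `rollout` are all hypothetical, and the random weights merely stand in for parameters that would be pre-trained (lower layer) or learned during fine-tuning (upper layer).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
OBS_DIM, ACT_DIM, FREE_DIM, COND_DIM = 8, 2, 16, 16


def init_linear(in_dim, out_dim):
    """Small random weight matrix; a stand-in for a learned layer."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim))


# Lower layer: action-free model (in practice, loaded from video pre-training).
W_free = init_linear(FREE_DIM + OBS_DIM, FREE_DIM)
# Upper layer: action-conditional model (assumed newly initialised for fine-tuning).
W_cond = init_linear(COND_DIM + FREE_DIM + ACT_DIM, COND_DIM)


def action_free_step(z_prev, obs_feat):
    """Update the action-free state from its previous value and observation features."""
    return np.tanh(np.concatenate([z_prev, obs_feat]) @ W_free)


def action_conditional_step(s_prev, z_t, action):
    """Update the action-conditional state from the action-free state and the action."""
    return np.tanh(np.concatenate([s_prev, z_t, action]) @ W_cond)


def rollout(obs_feats, actions):
    """Run both layers over a sequence and return the action-conditional states."""
    z = np.zeros(FREE_DIM)
    s = np.zeros(COND_DIM)
    states = []
    for obs_feat, action in zip(obs_feats, actions):
        z = action_free_step(z, obs_feat)          # never sees the action
        s = action_conditional_step(s, z, action)  # stacked on top of z
        states.append(s)
    return np.stack(states)
```

The point of the layout is that the lower layer never receives actions, so it can be trained on action-free videos, while only the upper layer has to learn how actions change the dynamics.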

(a) Action-free Pre-training

(b) Stacked Latent Prediction Model & Video-based Intrinsic Bonus
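The text above does not specify how the video-based intrinsic bonus is computed. One common way to turn representations into an exploration signal is a k-nearest-neighbour novelty score, and the sketch below assumes that form purely for illustration: `reps` stands for a batch of representations taken from the pre-trained model, and the function name `knn_intrinsic_bonus` and the choice of `k` are hypothetical.

```python
import numpy as np


def knn_intrinsic_bonus(reps, k=3):
    """Return, for each row of `reps`, the distance to its k-th nearest neighbour.

    reps: array of shape (N, D), one representation per row (assumed to come
          from the pre-trained action-free model).
    A larger value means the representation lies in a sparsely visited region.
    """
    reps = np.asarray(reps, dtype=float)
    n = reps.shape[0]
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of representations")
    # Pairwise Euclidean distances, shape (N, N).
    diffs = reps[:, None, :] - reps[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # After sorting each row, column 0 is the self-distance (zero),
    # so column k holds the distance to the k-th nearest neighbour.
    return np.sort(dists, axis=1)[:, k]
```

Under this assumption, the bonus would be scaled and added to the task reward during fine-tuning, so that states whose representations are far from previously seen ones are visited more often.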

Experimental Setups

We first pre-train our action-free prediction model on RLBench videos, and then consider the following two fine-tuning setups to evaluate our method:

  • Fine-tuning on a range of robotic manipulation tasks from Meta-world with a big domain gap from RLBench videos.

  • Fine-tuning on robotic locomotion tasks from DeepMind Control Suite, where both the visuals and task objectives significantly differ from RLBench videos.

Meta-world Experiments

We find that APV can improve the sample-efficiency of DreamerV2 on six robotic manipulation tasks from Meta-world. Specifically, APV achieves an aggregate success rate of 95.4%, while DreamerV2 achieves 67.9%.

DeepMind Control Suite Experiments

We also find that APV pre-trained on manipulation videos from RLBench consistently achieves better performance than DreamerV2. Furthermore, to investigate how the domain gap affects performance, we also report results when we pre-train APV using in-domain videos collected from a similar domain, i.e., the Triped environment, where APV's performance is further improved. This shows that addressing the domain gap between pre-training and fine-tuning is important.