Abstract
Reward function specification, which requires considerable human effort and iteration, remains a major impediment for learning behaviors through deep reinforcement learning. In contrast, providing visual demonstrations of desired behaviors presents an easier and more natural way to teach agents. We consider a setting where an agent is provided a fixed dataset of visual demonstrations illustrating how to perform a task, and must learn to solve the task using the provided demonstrations and unsupervised environment interactions. This setting presents a number of challenges, including representation learning for visual observations, sample complexity due to high-dimensional spaces, and learning instability due to the lack of a fixed reward or learning signal. Towards addressing these challenges, we develop a variational model-based adversarial imitation learning (V-MAIL) algorithm. The model-based approach provides a strong signal for representation learning, enables sample efficiency, and improves the stability of adversarial training by enabling on-policy learning. Through experiments involving several vision-based locomotion and manipulation tasks, we find that V-MAIL learns successful visuomotor policies in a sample-efficient manner, has better stability compared to prior work, and also achieves higher asymptotic performance. We further find that by transferring the learned models, V-MAIL can learn new tasks from visual demonstrations without any additional environment interactions.
Model
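To make the approach concrete, below is a minimal PyTorch sketch of one V-MAIL-style update. It is a simplified sketch under stated assumptions, not the paper's implementation: the latent model is a small deterministic MLP trained with a reconstruction loss rather than the variational sequence model used in the paper, the policy is optimized without a learned value function, and all module names, sizes, and hyperparameters are illustrative.

```python
# Minimal V-MAIL-style update (simplified; names and sizes are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

OBS_DIM, ACT_DIM, LATENT_DIM, HORIZON, GAMMA = 64 * 64 * 3, 4, 32, 5, 0.99

def mlp(inp, out):
    return nn.Sequential(nn.Linear(inp, 256), nn.ELU(), nn.Linear(256, out))

encoder = mlp(OBS_DIM, LATENT_DIM)                # o_t -> z_t
decoder = mlp(LATENT_DIM, OBS_DIM)                # z_t -> o_t (reconstruction)
dynamics = mlp(LATENT_DIM + ACT_DIM, LATENT_DIM)  # (z_t, a_t) -> z_{t+1}
discriminator = mlp(LATENT_DIM + ACT_DIM, 1)      # expert vs. policy logit
policy = nn.Sequential(mlp(LATENT_DIM, ACT_DIM), nn.Tanh())

model_opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters(),
                              *dynamics.parameters()], lr=3e-4)
disc_opt = torch.optim.Adam(discriminator.parameters(), lr=3e-4)
policy_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def train_step(agent_obs, agent_act, agent_next_obs, expert_obs, expert_act):
    # 1) Model learning: reconstruct observations and predict the next latent
    #    from replayed environment interactions (a stand-in for the ELBO).
    z, z_next = encoder(agent_obs), encoder(agent_next_obs)
    recon_loss = F.mse_loss(decoder(z), agent_obs)
    dyn_loss = F.mse_loss(dynamics(torch.cat([z, agent_act], -1)), z_next.detach())
    model_opt.zero_grad(); (recon_loss + dyn_loss).backward(); model_opt.step()

    # 2) On-policy "imagination": roll the current policy forward inside the
    #    learned latent dynamics, starting from encoded replay observations.
    latents, actions = [encoder(agent_obs).detach()], []
    for _ in range(HORIZON):
        a = policy(latents[-1])
        actions.append(a)
        latents.append(dynamics(torch.cat([latents[-1], a], -1)))

    # 3) Discriminator: distinguish expert (latent, action) pairs from the
    #    imagined on-policy pairs; this provides the adversarial reward.
    expert_in = torch.cat([encoder(expert_obs).detach(), expert_act], -1)
    policy_in = torch.cat([torch.cat(latents[:-1]), torch.cat(actions)], -1)
    d_loss = (F.binary_cross_entropy_with_logits(
                  discriminator(expert_in), torch.ones(expert_in.shape[0], 1))
              + F.binary_cross_entropy_with_logits(
                  discriminator(policy_in.detach()),
                  torch.zeros(policy_in.shape[0], 1)))
    disc_opt.zero_grad(); d_loss.backward(); disc_opt.step()

    # 4) Policy: maximize the discounted discriminator reward along the
    #    imagined rollout by backpropagating through the learned dynamics
    #    (the full method additionally uses a learned value function).
    reward = F.logsigmoid(discriminator(policy_in))
    discounts = (GAMMA ** torch.arange(HORIZON).float()
                 ).repeat_interleave(agent_obs.shape[0]).unsqueeze(-1)
    policy_loss = -(discounts * reward).mean()
    policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()

# Example invocation with random stand-in batches of size 8.
B = 8
train_step(torch.rand(B, OBS_DIM), torch.rand(B, ACT_DIM), torch.rand(B, OBS_DIM),
           torch.rand(B, OBS_DIM), torch.rand(B, ACT_DIM))
```

The structural point the sketch illustrates is that both the discriminator and the policy are trained on latent rollouts generated inside the learned model, which is what enables stable, on-policy adversarial training without additional environment interaction.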
Results
Figure 2: Learning curves showing ground-truth reward versus number of environment steps for V-MAIL (ours), prior model-free imitation learning approaches, and behavior cloning on five visual imitation tasks. V-MAIL consistently outperforms prior methods in terms of sample efficiency, final performance, and stability, and reaches near-expert performance on the first four environments. On the most challenging task, visual Baoding Balls, which is difficult even with ground-truth state, all methods struggle, and only V-MAIL makes some progress. Confidence intervals denote one standard deviation over 3 runs.
Ablation Experiments
Figure 3: Effect of the number of demonstrations on agent performance. A higher number of demonstrations corresponds to higher returns and more stable training. Even with a single expert trajectory, V-MAIL still significantly outperforms behavior cloning trained with 10 demonstrations.
Figure 4: Ablation experiment evaluating the model-based policy training component of V-MAIL. The Variational DAC (V-DAC) baseline uses the same model as V-MAIL, but only for representation learning: it trains its policy with the Discriminator Actor-Critic algorithm in the learned latent space. While V-DAC reaches the same asymptotic return, it requires 40% more data, which confirms the importance of V-MAIL's model-based policy rollouts for the algorithm's sample efficiency.
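To make this distinction concrete, the sketch below contrasts where the policy's training batches come from in the two variants; the function and argument names are illustrative and not taken from any released implementation.

```python
# Schematic contrast of the policy's data source in V-MAIL vs. the V-DAC
# ablation (illustrative names only).
import torch

def policy_batch_vmail(encoder, dynamics, policy, obs_batch, horizon=5):
    """V-MAIL: on-policy latent rollouts generated inside the learned model."""
    z = encoder(obs_batch).detach()
    latents, actions = [], []
    for _ in range(horizon):
        a = policy(z)
        latents.append(z)
        actions.append(a)
        z = dynamics(torch.cat([z, a], -1))
    return torch.cat(latents), torch.cat(actions)

def policy_batch_vdac(encoder, replay_obs, replay_act):
    """V-DAC: off-policy replay transitions, merely re-encoded into the latent
    space; the model is used for representation only, with no rollouts."""
    return encoder(replay_obs).detach(), replay_act
```

Generating the policy's training data from imagined rollouts inside the model keeps the adversarial objective on-policy without any additional environment interaction, which is the property this ablation isolates.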
Real Robot Experiments
We deployed V-MAIL on a real-robot task involving opening a cardboard box. With only 40 additional rollouts, the robot learns to manipulate the lid into the correct position.
Zero-Shot Imitation with V-MAIL
Figure 5: Domains for zero-shot imitation learning (panels show train tasks and test tasks). Since we train the discriminator and policy entirely within the model, we can transfer the trained model to qualitatively different tasks and train high-quality policies without the need for additional data.
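As a rough illustration of the transfer step, the sketch below freezes the pretrained world model and re-initializes only the discriminator and policy for the new task; all names and sizes are assumptions. The subsequent training then reuses the imagination, discriminator, and policy updates from the sketch in the Model section, with no model update and no new environment data.

```python
# Hedged sketch of the zero-shot transfer wiring (illustrative names/sizes).
import torch.nn as nn

LATENT_DIM, ACT_DIM = 32, 4  # must match the pretrained world model

def mlp(inp, out):
    return nn.Sequential(nn.Linear(inp, 256), nn.ELU(), nn.Linear(256, out))

def prepare_transfer(encoder, dynamics):
    # Freeze the pretrained model: it only supplies representations and
    # imagined rollouts for the new task.
    for p in [*encoder.parameters(), *dynamics.parameters()]:
        p.requires_grad_(False)
    # Fresh adversarial reward model and policy, trained entirely inside the
    # frozen model against the new task's demonstrations.
    discriminator = mlp(LATENT_DIM + ACT_DIM, 1)
    policy = nn.Sequential(mlp(LATENT_DIM, ACT_DIM), nn.Tanh())
    return discriminator, policy
```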
Model Training
Figure 6: First row: ground-truth sequences; second row: action-conditioned model predictions; third row: pixel difference.