Abstract
Reward function specification, which requires considerable human effort and iteration, remains a major impediment for learning behaviors through deep reinforcement learning. In contrast, providing visual demonstrations of desired behaviors presents an easier and more natural way to teach agents. We consider a setting where an agent is provided a fixed dataset of visual demonstrations illustrating how to perform a task, and must learn to solve the task using the provided demonstrations and unsupervised environment interactions. This setting presents a number of challenges, including representation learning for visual observations, sample complexity due to high-dimensional spaces, and learning instability due to the lack of a fixed reward or learning signal. Towards addressing these challenges, we develop a variational model-based adversarial imitation learning (V-MAIL) algorithm. The model-based approach provides a strong signal for representation learning, enables sample efficiency, and improves the stability of adversarial training by enabling on-policy learning. Through experiments involving several vision-based locomotion and manipulation tasks, we find that V-MAIL learns successful visuomotor policies in a sample-efficient manner, has better stability compared to prior work, and also achieves higher asymptotic performance. We further find that by transferring the learned models, V-MAIL can learn new tasks from visual demonstrations without any additional environment interactions.
Model
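To make the approach concrete, below is a minimal PyTorch sketch of one V-MAIL-style update. It is a simplified sketch under stated assumptions, not the paper's implementation: the latent model is a small deterministic MLP trained with a reconstruction loss rather than the variational sequence model used in the paper, the policy is optimized without a learned value function, and all module names, sizes, and hyperparameters are illustrative.

```python
# Minimal V-MAIL-style update (simplified; names and sizes are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

OBS_DIM, ACT_DIM, LATENT_DIM, HORIZON, GAMMA = 64 * 64 * 3, 4, 32, 5, 0.99

def mlp(inp, out):
    return nn.Sequential(nn.Linear(inp, 256), nn.ELU(), nn.Linear(256, out))

encoder = mlp(OBS_DIM, LATENT_DIM)                # o_t -> z_t
decoder = mlp(LATENT_DIM, OBS_DIM)                # z_t -> o_t (reconstruction)
dynamics = mlp(LATENT_DIM + ACT_DIM, LATENT_DIM)  # (z_t, a_t) -> z_{t+1}
discriminator = mlp(LATENT_DIM + ACT_DIM, 1)      # expert vs. policy logit
policy = nn.Sequential(mlp(LATENT_DIM, ACT_DIM), nn.Tanh())

model_opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters(),
                              *dynamics.parameters()], lr=3e-4)
disc_opt = torch.optim.Adam(discriminator.parameters(), lr=3e-4)
policy_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def train_step(agent_obs, agent_act, agent_next_obs, expert_obs, expert_act):
    # 1) Model learning: reconstruct observations and predict the next latent
    #    from replayed environment interactions (a stand-in for the ELBO).
    z, z_next = encoder(agent_obs), encoder(agent_next_obs)
    recon_loss = F.mse_loss(decoder(z), agent_obs)
    dyn_loss = F.mse_loss(dynamics(torch.cat([z, agent_act], -1)), z_next.detach())
    model_opt.zero_grad(); (recon_loss + dyn_loss).backward(); model_opt.step()

    # 2) On-policy "imagination": roll the current policy forward inside the
    #    learned latent dynamics, starting from encoded replay observations.
    latents, actions = [encoder(agent_obs).detach()], []
    for _ in range(HORIZON):
        a = policy(latents[-1])
        actions.append(a)
        latents.append(dynamics(torch.cat([latents[-1], a], -1)))

    # 3) Discriminator: distinguish expert (latent, action) pairs from the
    #    imagined on-policy pairs; this provides the adversarial reward.
    expert_in = torch.cat([encoder(expert_obs).detach(), expert_act], -1)
    policy_in = torch.cat([torch.cat(latents[:-1]), torch.cat(actions)], -1)
    d_loss = (F.binary_cross_entropy_with_logits(
                  discriminator(expert_in), torch.ones(expert_in.shape[0], 1))
              + F.binary_cross_entropy_with_logits(
                  discriminator(policy_in.detach()),
                  torch.zeros(policy_in.shape[0], 1)))
    disc_opt.zero_grad(); d_loss.backward(); disc_opt.step()

    # 4) Policy: maximize the discounted discriminator reward along the
    #    imagined rollout by backpropagating through the learned dynamics
    #    (the full method additionally uses a learned value function).
    reward = F.logsigmoid(discriminator(policy_in))
    discounts = (GAMMA ** torch.arange(HORIZON).float()
                 ).repeat_interleave(agent_obs.shape[0]).unsqueeze(-1)
    policy_loss = -(discounts * reward).mean()
    policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()

# Example invocation with random stand-in batches of size 8.
B = 8
train_step(torch.rand(B, OBS_DIM), torch.rand(B, ACT_DIM), torch.rand(B, OBS_DIM),
           torch.rand(B, OBS_DIM), torch.rand(B, ACT_DIM))
```

The structural point the sketch illustrates is that both the discriminator and the policy are trained on latent rollouts generated inside the learned model, which is what enables stable, on-policy adversarial training without additional environment interaction.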
Results
Figure 2: Learning curves showing ground-truth reward versus number of environment steps for V-MAIL (ours), prior model-free imitation learning approaches, and behavior cloning on five visual imitation tasks. V-MAIL consistently outperforms prior methods in terms of sample efficiency, final performance, and stability, and reaches near-expert performance on the first four environments. On the most challenging task, visual Baoding Balls, which is difficult even with ground-truth state, all methods struggle, and only V-MAIL makes some progress. Confidence intervals denote one standard deviation over 3 runs.
Ablation Experiments
Figure 3: Effect of the number of demonstrations on agent performance. A higher number of demonstrations corresponds to higher returns and more stable training. Even with a single expert trajectory, V-MAIL still significantly outperforms behavior cloning trained with 10 demonstrations.
Figure 4: Ablation experiment evaluating the model-based policy training component of V-MAIL. The Variational DAC (V-DAC) baseline uses the same model as V-MAIL, but only for representation learning: it trains its policy with the Discriminator Actor-Critic algorithm in the learned latent space. While V-DAC reaches the same asymptotic return, it requires 40% more data, which confirms the importance of V-MAIL's model-based policy rollouts for the algorithm's sample efficiency.
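To make this distinction concrete, the sketch below contrasts where the policy's training batches come from in the two variants; the function and argument names are illustrative and not taken from any released implementation.

```python
# Schematic contrast of the policy's data source in V-MAIL vs. the V-DAC
# ablation (illustrative names only).
import torch

def policy_batch_vmail(encoder, dynamics, policy, obs_batch, horizon=5):
    """V-MAIL: on-policy latent rollouts generated inside the learned model."""
    z = encoder(obs_batch).detach()
    latents, actions = [], []
    for _ in range(horizon):
        a = policy(z)
        latents.append(z)
        actions.append(a)
        z = dynamics(torch.cat([z, a], -1))
    return torch.cat(latents), torch.cat(actions)

def policy_batch_vdac(encoder, replay_obs, replay_act):
    """V-DAC: off-policy replay transitions, merely re-encoded into the latent
    space; the model is used for representation only, with no rollouts."""
    return encoder(replay_obs).detach(), replay_act
```

Generating the policy's training data from imagined rollouts inside the model keeps the adversarial objective on-policy without any additional environment interaction, which is the property this ablation isolates.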
Real Robot Experiments
We deployed V-MAIL on a real-robot task involving opening a cardboard box. With only 40 additional rollouts, the robot learns to manipulate the lid into the correct position.
Zero-Shot Imitation with V-MAIL
Figure 5: Domains for zero-shot imitation learning (panels show train tasks and test tasks). Since we train the discriminator and policy entirely within the model, we can transfer the trained model to qualitatively different tasks and train high-quality policies without the need for additional data.
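As a rough illustration of the transfer step, the sketch below freezes the pretrained world model and re-initializes only the discriminator and policy for the new task; all names and sizes are assumptions. The subsequent training then reuses the imagination, discriminator, and policy updates from the sketch in the Model section, with no model update and no new environment data.

```python
# Hedged sketch of the zero-shot transfer wiring (illustrative names/sizes).
import torch.nn as nn

LATENT_DIM, ACT_DIM = 32, 4  # must match the pretrained world model

def mlp(inp, out):
    return nn.Sequential(nn.Linear(inp, 256), nn.ELU(), nn.Linear(256, out))

def prepare_transfer(encoder, dynamics):
    # Freeze the pretrained model: it only supplies representations and
    # imagined rollouts for the new task.
    for p in [*encoder.parameters(), *dynamics.parameters()]:
        p.requires_grad_(False)
    # Fresh adversarial reward model and policy, trained entirely inside the
    # frozen model against the new task's demonstrations.
    discriminator = mlp(LATENT_DIM + ACT_DIM, 1)
    policy = nn.Sequential(mlp(LATENT_DIM, ACT_DIM), nn.Tanh())
    return discriminator, policy
```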
Model Training
Figure 6: First row: ground-truth sequences; second row: action-conditioned model predictions; third row: pixel difference.