Out-of-Dynamics Imitation Learning

Yiwen Qiu, Jialong Wu, Zhangjie Cao, Mingsheng Long [OpenReview] [arXiv]

Abstract

Existing imitation learning works mainly assume that the demonstrator who collects demonstrations shares the same dynamics as the imitator. However, this assumption limits the use of imitation learning, especially when collecting demonstrations for the imitator is difficult. In this paper, we study out-of-dynamics imitation learning (OOD-IL), which relaxes the assumption so that the demonstrator and the imitator need only share the same state space, while their action spaces and dynamics may differ. OOD-IL enables imitation learning to utilize demonstrations from a wide range of demonstrators but introduces a new challenge: some demonstrations cannot be achieved by the imitator due to the different dynamics.

We develop a transferability measurement to tackle this new challenge. We first design a novel sequence-based contrastive clustering algorithm that groups demonstrations from the same mode, avoiding mutual interference between demonstrations from different modes, and then learn the transferability of each demonstration within each cluster with an adversarial-learning-based algorithm. Experimental results on several MuJoCo environments, a driving environment, and a simulated robot environment show that the proposed transferability measurement more accurately finds and down-weights non-transferable demonstrations and outperforms prior work in final imitation learning performance. The algorithm outline and videos of our experiments follow.
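To illustrate what per-cluster transferability weighting might look like, here is a minimal NumPy sketch. It is not the paper's algorithm: it assumes cluster ids and imitator rollout states are already available, and it uses only the discriminator half of an adversarial loop (a logistic-regression discriminator fit against fixed imitator states). The intuition is that demonstration states the imitator never reaches are easy to tell apart, so a demonstration whose states look "clearly demo-like" receives a lower weight. All function names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    # Numerically stable logistic function (no overflow for large |z|).
    return 0.5 * (1.0 + np.tanh(0.5 * z))


def train_discriminator(demo_states, imitator_states, lr=0.1, steps=500):
    """Fit a logistic-regression discriminator: label 1 = demo state, 0 = imitator state."""
    x = np.vstack([demo_states, imitator_states])
    y = np.concatenate([np.ones(len(demo_states)), np.zeros(len(imitator_states))])
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = sigmoid(x @ w + b)
        grad = p - y  # gradient of binary cross-entropy w.r.t. the logits
        w -= lr * x.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b


def transferability_weights(demos, cluster_ids, imitator_states_by_cluster):
    """Score each demo by how imitator-reachable its states look, normalized per cluster.

    demos: list of (T, state_dim) arrays; cluster_ids: int array, one id per demo;
    imitator_states_by_cluster: dict cluster id -> (M, state_dim) imitator states.
    """
    weights = np.zeros(len(demos))
    for c in np.unique(cluster_ids):
        idx = np.flatnonzero(cluster_ids == c)
        demo_states = np.vstack([demos[i] for i in idx])
        w, b = train_discriminator(demo_states, imitator_states_by_cluster[c])
        for i in idx:
            # High D(s) means "clearly a demo state, not one the imitator reaches".
            weights[i] = 1.0 - sigmoid(demos[i] @ w + b).mean()
        weights[idx] /= weights[idx].sum() + 1e-12
    return weights
```

For example, with imitator states all at 0, a demonstration staying at 0 receives a larger weight than one whose states sit at 5, which the imitator never visits.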

1. Out-of-dynamics Imitation Learning Algorithm

The figure shows the outline of our algorithm, which consists of two phases. The first phase is sequence-based contrastive clustering, in which we perform contrastive learning and clustering simultaneously. Positive pairs are created by subsampling different sub-trajectories from the same trajectory, and sub-trajectories from different trajectories serve as negative pairs. The second phase is transferability learning, in which we run an adversarial-learning-based algorithm within each cluster.
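The pair construction described above can be sketched as an InfoNCE-style loss: two random windows of the same trajectory form a positive pair, and windows from the other trajectories in the batch act as negatives. This is a hedged NumPy illustration, not the authors' code. The encoder (mean pooling followed by a fixed linear projection `proj`), the window length, and the temperature are assumptions, and the clustering objective that the paper trains jointly is omitted.

```python
import numpy as np


def sample_subtrajectory(traj, length, rng):
    """Take a random contiguous window of `length` steps from one trajectory."""
    start = rng.integers(0, len(traj) - length + 1)
    return traj[start:start + length]


def encode(sub_traj, proj):
    """Hypothetical encoder: mean-pool states over time, project, L2-normalize."""
    z = sub_traj.mean(axis=0) @ proj
    return z / np.linalg.norm(z)


def info_nce_loss(trajs, proj, length=5, temperature=0.1, seed=0):
    """Two windows of the same trajectory are positives; other trajectories are negatives."""
    rng = np.random.default_rng(seed)
    z1 = np.stack([encode(sample_subtrajectory(t, length, rng), proj) for t in trajs])
    z2 = np.stack([encode(sample_subtrajectory(t, length, rng), proj) for t in trajs])
    logits = z1 @ z2.T / temperature             # (N, N); diagonal entries are positives
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

When the trajectories are clearly distinct the loss approaches 0; when they are indistinguishable it equals log(N), the chance level for a batch of N trajectories.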

2. Videos for Experiments

Videos of our experiments are shown below.

a. Franka Panda Arm

The environment, implemented in PyBullet, simulates the Franka Panda robot arm with 7 degrees of freedom (DoF). The task is to push a box from one side of a desk to the other, and different dynamics are created by disabling different joints of the arm.
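The page does not say how joints are disabled. One simple possibility, sketched below under that assumption, is to zero the velocity commands of the disabled joints so that they stay fixed while the others move. The kinematic update and the names `joint_mask` and `step_joints` are hypothetical and do not use the PyBullet API.

```python
import numpy as np

NUM_JOINTS = 7  # the Franka Panda arm has 7 DoF


def joint_mask(disabled_joints):
    """1.0 for active joints, 0.0 for joints disabled to create a new dynamics."""
    mask = np.ones(NUM_JOINTS)
    for j in disabled_joints:
        mask[j] = 0.0
    return mask


def step_joints(q, qdot_cmd, disabled_joints, dt=0.01):
    """Hypothetical kinematic update: disabled joints ignore their velocity commands."""
    q = np.asarray(q, dtype=float)
    qdot_cmd = np.asarray(qdot_cmd, dtype=float)
    if q.shape != (NUM_JOINTS,) or qdot_cmd.shape != (NUM_JOINTS,):
        raise ValueError("expected one value per joint")
    return q + dt * joint_mask(disabled_joints) * qdot_cmd
```

Each choice of `disabled_joints` then defines a different demonstrator or imitator dynamics over the same state space.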

Videos: Ours, f-MDP, ID, ID-GAIL, Naive GAIL

b. Driving

As shown in the following videos, we create a task in which a car starts anywhere along the bottom side and drives to the top side. Two obstacles are placed at the center, and different dynamics are created by giving the obstacles different widths and the car different speeds.
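To make "different dynamics" concrete for this environment, the sketch below parameterizes each variant by obstacle width and car speed and enumerates a grid of them. The class `DrivingDynamics`, the helper `dynamics_grid`, and every numeric value are hypothetical; the page does not list the actual settings.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class DrivingDynamics:
    obstacle_width: float  # width of each central obstacle (hypothetical units)
    car_speed: float       # forward speed of the car (hypothetical units)


def dynamics_grid(widths, speeds):
    """Every combination of obstacle width and car speed is one dynamics variant."""
    return [DrivingDynamics(w, s) for w, s in product(widths, speeds)]


# Hypothetical values: demonstrators come from many variants, the imitator from one.
demonstrator_dynamics = dynamics_grid(widths=[1.0, 2.0, 3.0], speeds=[0.5, 1.0])
imitator_dynamics = DrivingDynamics(obstacle_width=3.0, car_speed=0.5)
```

Because the dataclass is frozen, variants are hashable and compare by value, so they can be used as keys when grouping demonstrations by the dynamics that produced them.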

Videos: Ours, f-MDP, ID, ID-GAIL

We also include the results of the ablation study here:

Ours w/o Cluster indicates removing the clustering step and learning transferability directly from the whole set of demonstrations. Ours w/o Cluster, Tran indicates removing both clustering and the transferability measurement, which amounts to directly imitating the whole set of demonstrations.
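The three ablation settings can be read as two switches on how per-demonstration imitation weights are computed. The sketch below is a hedged illustration: `cluster_fn` and `transfer_fn` are hypothetical callbacks standing in for the clustering and transferability phases, and normalizing the final weights to sum to 1 is an assumption.

```python
import numpy as np


def demo_weights(demos, use_cluster, use_transferability, cluster_fn, transfer_fn):
    """Per-demonstration imitation weights under the three ablation settings.

    cluster_fn(demos) -> one integer cluster id per demo (hypothetical callback)
    transfer_fn(group) -> one non-negative score per demo in the group (hypothetical)
    """
    n = len(demos)
    if not use_transferability:
        # "Ours w/o Cluster, Tran": imitate every demonstration with equal weight.
        return np.full(n, 1.0 / n)
    # "Ours w/o Cluster": the whole demonstration set is treated as one cluster.
    cluster_ids = np.asarray(cluster_fn(demos)) if use_cluster else np.zeros(n, dtype=int)
    weights = np.zeros(n)
    for c in np.unique(cluster_ids):
        idx = np.flatnonzero(cluster_ids == c)
        weights[idx] = transfer_fn([demos[i] for i in idx])
    return weights / weights.sum()
```

With toy callbacks that place three demonstrations into clusters of sizes 2 and 1 and score members of each cluster equally, the full method yields weights [0.25, 0.25, 0.5], while both ablations fall back to uniform weights of 1/3.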

Videos: Ours; Ours w/o Cluster, Tran; Ours w/o Cluster
