Out-of-Dynamics
Imitation Learning
Yiwen Qiu, Jialong Wu, Zhangjie Cao, Mingsheng Long [Openreview] [arXiv]
Existing imitation learning works mainly assume that the demonstrator who collects demonstrations shares the same dynamics as the imitator. However, this assumption limits the usage of imitation learning, especially when collecting demonstrations for the imitator is difficult. In this paper, we study out-of-dynamics imitation learning (OOD-IL), which relaxes the assumption so that the demonstrator and the imitator have the same state space but may have different action spaces and dynamics. OOD-IL enables imitation learning to utilize demonstrations from a wide range of demonstrators but introduces a new challenge: some demonstrations cannot be achieved by the imitator due to the different dynamics.
We develop a transferability measurement to tackle this newly emerged challenge. We first design a novel sequence-based contrastive clustering algorithm that groups demonstrations from the same mode, avoiding mutual interference between demonstrations from different modes, and then learn the transferability of each demonstration with an adversarial-learning-based algorithm within each cluster. Experimental results on several MuJoCo environments, a driving environment, and a simulated robot environment show that the proposed transferability measurement more accurately finds and down-weights non-transferable demonstrations and outperforms prior works on the final imitation learning performance. The algorithm outline and videos for our experiments follow.
1. Out-of-dynamics Imitation Learning Algorithm
The figure shows the outline of our whole algorithm, which can be divided into two phases. The first phase is sequence-based contrastive clustering, where we conduct contrastive learning and clustering simultaneously. We create positive pairs by subsampling different sub-trajectories from the same trajectory and use sub-trajectories from different trajectories as negative pairs. The second phase is learning transferability, where we run an adversarial-learning-based algorithm within each cluster.
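The pair construction in the first phase can be illustrated with a minimal sketch. It is not our released implementation: the function names, the fixed window length, and the InfoNCE-style loss are assumptions chosen for illustration, and the embeddings would come from a sequence encoder that is not shown here.

```python
import numpy as np

def sample_subtrajectory(traj, length, rng):
    """Draw a random contiguous window of `length` states from `traj` (T x state_dim)."""
    start = rng.integers(0, traj.shape[0] - length + 1)
    return traj[start:start + length]

def make_contrastive_batch(trajectories, length, rng):
    """Draw two sub-trajectories per trajectory.

    Row i of `anchors` and row i of `positives` come from the same trajectory
    (a positive pair); rows with different indices come from different
    trajectories and serve as negatives.
    """
    anchors = np.stack([sample_subtrajectory(t, length, rng) for t in trajectories])
    positives = np.stack([sample_subtrajectory(t, length, rng) for t in trajectories])
    return anchors, positives

def info_nce_loss(z_a, z_p, temperature=0.1):
    """InfoNCE-style loss on embeddings (assumed loss, for illustration only)."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_p = z_p / np.linalg.norm(z_p, axis=1, keepdims=True)
    logits = z_a @ z_p.T / temperature            # pairwise cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))            # matching indices are the positives
```

In this sketch, each sub-trajectory would be passed through an encoder to obtain `z_a` and `z_p` before the loss is computed; the clustering objective that is trained jointly with the contrastive objective is omitted.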
2. Videos for Experiments
We show the videos for our experiments as follows.
a. Franka Panda Arm
The environment simulates the Franka Panda robot arm with 7 degrees of freedom (DoF) and is implemented in PyBullet. We create a task of pushing a box from one side of the desk to the other and create different dynamics by disabling different joints of the robot arm.
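One simple way to picture a disabled joint is as a mask on the 7-dimensional joint command. The sketch below is an assumption for illustration only: the helper name `disable_joints` is hypothetical, and zeroing the command is just one possible way to disable a joint, not necessarily how the environment does it.

```python
import numpy as np

NUM_JOINTS = 7  # the Franka Panda arm has 7 DoF

def disable_joints(action, disabled):
    """Return a copy of `action` with the commands of the disabled joints set to zero.

    `disabled` is an iterable of joint indices in [0, NUM_JOINTS).
    Hypothetical helper: zeroing the command is an assumed way of disabling a joint.
    """
    masked = np.asarray(action, dtype=float).copy()
    if masked.shape != (NUM_JOINTS,):
        raise ValueError(f"expected an action of shape ({NUM_JOINTS},)")
    masked[list(disabled)] = 0.0
    return masked
```

Two demonstrators with different `disabled` sets then act under different dynamics even though they share the same state space.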
b. Driving
As shown in the following videos, we create a task where a car starts from anywhere along the bottom side and must reach the top side. Two obstacles are placed at the center, and we create different dynamics by using obstacles of different widths and setting different speeds for the car.
We also include the results of the ablation study here:
Ours w/o Cluster removes the clustering step and learns the transferability directly from the whole set of demonstrations. Ours w/o Cluster, Tran removes both the clustering and the transferability and performs imitation directly on the whole set of demonstrations.