# Learning from Imperfect Demonstrations via Adversarial Confidence Transfer

Zhangjie Cao*, Zihan Wang*, Dorsa Sadigh

Paper / Code / ICRA 2022 Talk

*denotes equal contribution

**Abstract**

Existing learning from demonstration algorithms usually assume access to expert demonstrations. However, this assumption is limiting in many real-world applications, since the collected demonstrations may be suboptimal or even consist of failure cases. We therefore study the problem of learning from imperfect demonstrations by learning a confidence predictor. Specifically, we rely on demonstrations, along with their confidence values, from a different but *correspondent* environment (the source environment) to learn a confidence predictor for the environment in which we aim to learn a policy (the target environment), *where we only have unlabeled demonstrations*. We learn a common latent space through adversarial distribution matching of multi-length partial trajectories to enable the transfer of confidence across the source and target environments. The learned confidence reweights the demonstrations, enabling the agent to learn more from informative demonstrations and discard irrelevant ones. Our experiments in three simulated environments and on a real robot reaching task demonstrate that our approach learns the policy with the highest expected return. We show videos of the real robot arm experiments here.

**Appendix with algorithm details and experimental details: here.**

We propose an approach that leverages the assumed correspondence between state-action pairs in the source and target environments, which implies the existence of a common latent space in which one can learn a shared confidence predictor for both environments.

Our approach aligns both the feature-level and confidence-level distributions of partial trajectories of different lengths across domains via a domain-adversarial loss. We design a multi-length partial trajectory matching scheme that preserves temporal relationships across consecutive states and actions and enables accurate matching with the corresponding trajectories in the target environment.

**Adversarial Confidence Transfer**

We aim to leverage the confidence annotations in the source environment and transfer this knowledge about confidence to the target environment. Our key insight is to map source and target state-action pairs to a common latent space, enforcing that the latent features of correspondent state-action pairs in the two environments be the same.
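Before detailing the transfer mechanism, it helps to see how a transferred confidence would be used downstream. Below is a minimal sketch of confidence-based reweighting of demonstrations; the thresholding and normalization scheme here is an illustrative assumption, not the paper's exact objective.

```python
import numpy as np

def reweight_demonstrations(confidences, threshold=0.1):
    """Turn per-transition confidence scores in [0, 1] into sampling
    weights, discarding transitions below `threshold`. (Illustrative
    scheme: the threshold and normalization are our assumptions.)"""
    c = np.asarray(confidences, dtype=float)
    w = np.where(c < threshold, 0.0, c)  # drop clearly irrelevant data
    total = w.sum()
    if total == 0:
        raise ValueError("all demonstrations were discarded")
    return w / total  # normalized sampling weights for imitation learning

weights = reweight_demonstrations([0.9, 0.05, 0.6, 0.45])
# The low-confidence transition receives zero weight; the remaining
# weights are renormalized to sum to 1.
```

Informative demonstrations then dominate the imitation objective, while failure cases contribute little or nothing.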

Here, the correspondent state-action pairs are (s_src, a_src) in the source environment and (s_tar, a_tar) in the target environment. Since the source and target environments have different state-action spaces, we learn two separate encoders, E_src and E_tar, that map source and target state-action pairs into a common latent space. We first train the source encoder and the decoder in the source environment; the decoder, which predicts confidence from latent features, is shared between the source and target environments.
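The structure above can be sketched with linear "encoders" standing in for the learned networks E_src and E_tar: the two environments have different state-action dimensions, but both encoders map into a latent space of the same dimension, so one shared decoder can score confidence for either. All dimensions and weights below are illustrative placeholders, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

D_SRC, D_TAR, D_LATENT = 4, 6, 3  # illustrative state-action / latent dims

E_src = rng.normal(size=(D_SRC, D_LATENT))  # source encoder (stand-in)
E_tar = rng.normal(size=(D_TAR, D_LATENT))  # target encoder (stand-in)
decoder = rng.normal(size=(D_LATENT,))      # shared confidence decoder

def confidence(x, encoder):
    """Encode a state-action vector into the common latent space, then
    decode a confidence value in (0, 1) with the shared decoder."""
    z = x @ encoder
    return 1.0 / (1.0 + np.exp(-(z @ decoder)))  # sigmoid output

c_src = confidence(rng.normal(size=D_SRC), E_src)
c_tar = confidence(rng.normal(size=D_TAR), E_tar)
```

Because the decoder only ever sees latent features, it cannot tell source from target inputs apart once the latent distributions are aligned, which is what makes the confidence transfer possible.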

The challenge now is how to learn the encoder parameters so that the common latent space satisfies the requirements introduced above. We learn the encoders by distribution matching between the latent features of source and target state-action pairs in the demonstrations. We use two types of matching: multi-length partial trajectory matching and feature-level/confidence-level matching.

**Multi-length Partial Trajectory Matching**

We develop a distribution matching objective to align the latent distributions of source and target state-action pairs. We propose a multi-length partial trajectory alignment that emphasizes learning the temporal relationship between states and actions. Specifically, we match the latent feature distributions of length-k (k = 1, 2, ..., K) partial trajectories. This preserves the temporal relationship between consecutive states and actions and makes the latent features of state-action pairs more likely to be aligned with respect to confidence.
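Extracting the length-k partial trajectories amounts to taking all sliding windows of k consecutive state-action pairs, for each k up to K. A minimal sketch (function and variable names are ours):

```python
def partial_trajectories(trajectory, K):
    """Return {k: list of length-k windows} for k = 1..K.

    `trajectory` is a list of (state, action) pairs; each window keeps
    consecutive pairs, preserving their temporal order."""
    windows = {}
    for k in range(1, K + 1):
        windows[k] = [tuple(trajectory[i:i + k])
                      for i in range(len(trajectory) - k + 1)]
    return windows

demo = [("s0", "a0"), ("s1", "a1"), ("s2", "a2"), ("s3", "a3")]
w = partial_trajectories(demo, K=3)
# A trajectory of T pairs yields T - k + 1 windows of length k:
# here 4 windows of length 1, 3 of length 2, and 2 of length 3.
```

The distribution matching then operates per length k, so that a length-3 window is only ever compared against other length-3 windows.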

**Feature-Level and Confidence-Level Matching**

We achieve the latent feature distribution alignment using a generative adversarial network. Since we only want partial trajectories of the same length to be matched, we adopt a discriminator D_k for each length k to match the latent distribution of length-k partial trajectories. We also adopt a confidence-level matching that aligns the distributions of confidence predictions and uses its loss signal to update the target encoder. Specifically, we use another discriminator D'_k to align the confidence distributions of length-k partial trajectories.
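The adversarial objective can be sketched with the standard binary cross-entropy losses of a GAN: each D_k is trained to output 1 on source features and 0 on target ones, while the target encoder is updated to fool it. The networks themselves are abstracted away here into precomputed discriminator outputs; this is a generic GAN loss sketch, not the paper's exact formulation.

```python
import numpy as np

def bce(pred, label, eps=1e-8):
    """Binary cross-entropy between predictions in (0, 1) and a label."""
    p = np.clip(np.asarray(pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(label * np.log(p) + (1 - label) * np.log(1 - p)))

def discriminator_loss(d_src, d_tar):
    # D_k wants source features -> 1 and target features -> 0.
    return bce(d_src, 1.0) + bce(d_tar, 0.0)

def encoder_loss(d_tar):
    # The target encoder wants D_k to mistake target features for source.
    return bce(d_tar, 1.0)

# Hypothetical discriminator outputs (probability of "source") on a batch.
d_on_source, d_on_target = [0.9, 0.8], [0.2, 0.1]
L_disc = discriminator_loss(d_on_source, d_on_target)
L_enc = encoder_loss(d_on_target)
# When the discriminator is confused (outputs near 0.5 everywhere),
# L_disc rises and L_enc falls: the adversarial equilibrium sought here.
```

The confidence-level discriminators D'_k take the same form, but operate on predicted confidence values rather than latent features.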

**Experimental Results**

We conduct experiments in two MuJoCo environments, on a simulated Franka Panda robot arm, and on a real Franka Panda arm. We compare our method (**Ours**) with standard imitation learning (**GAIL**), Dynamics Cycle-Consistency (**DCC**), and two variants of our method: **Ours-Single**, which uses only single state-action pair (length-1) matching and feature-level matching, and **Ours w/o L_con**, which adds multi-length partial trajectory matching to Ours-Single but still omits confidence-level matching. Finally, we report the results of using the ground-truth confidence to reweight the target demonstrations (**Oracle**).

**MuJoCo Environment**

We create 4 different MuJoCo environments: 1-joint reacher, 2-joint reacher, 4-leg ant and 5-leg ant. In our setup, 1-joint reacher and 4-leg ant are the source environments, where demonstrations are labeled with confidence scores. The 2-joint reacher and 5-leg ant are the target environments. The task for the reacher is to reach the red point from its initial configuration. The task for the ant is to move towards the right horizontally as fast as possible.

- 1-joint reacher (Source)
- 2-joint reacher (Target)
- 4-leg ant (Source)
- 5-leg ant (Target)

Here are the plots for the expected returns of the target environments:

- 2-joint Reacher Expected Return
- 5-leg Ant Expected Return

Here are the rollouts generated by three policies (GAIL, ours and oracle) for the 2-joint reacher:

GAIL · **Ours** · Oracle

Here are the rollouts generated by three policies (GAIL, ours and oracle) for the 5-leg ant:

GAIL · **Ours** · Oracle

**Simulated Robot**

**OT Setting (Learn from Optimal and Out-of-Time Demonstrations, 4 different initializations)**

Outcome of each method from each of the four initializations:

| Initialization | GAIL | DCC | Ours-Feature | Ours-Confidence | Ours | Oracle |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Out of Time | Out of Time | Out of Range | Out of Range | Success | Success |
| 2 | Out of Time | Success | Success | Success | Success | Success |
| 3 | Success | Success | Out of Range | Success | Success | Success |
| 4 | Out of Time | Out of Time | Out of Range | Out of Range | Success | Success |

**OSC Setting (Learn from Optimal, Suboptimal, and Collision Demonstrations, 4 different initializations)**

Outcome of each method from each of the four initializations:

| Initialization | GAIL | DCC | Ours-Feature | Ours-Confidence | Ours | Oracle |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Out of Time | Success | Success | Success | Success | Success |
| 2 | Out of Time | Success | Success | Success | Success | Collision |
| 3 | Collision | Out of Time | Collision | Collision | Collision | Success |
| 4 | Out of Time | Out of Time | Collision | Collision | Success | Success |

**Sim-to-Real Environment**

In the sim-to-real environment, we use a simulated Franka Panda arm as the source and a real Franka Panda arm as the target (both with 7 DoF). The task is to move the cube to the upper layer of the shelf. On the shelf, there is a large stack of books on the right and a small stack on the left, while the middle area is empty. There are thus three locations where the cube can be placed: the middle area (best), the left side (second best), and the right side.

**Demonstration Collection (Success)**

**Place cube on the left side**

**Place cube in the middle area**

**Place cube on the right side**

**Demonstration Collection (Failure)**

**Collide with the upper board**

**Collide with the lower board**

**Collide with the left pillar**

**Collide with the right book stack**

**Results**

GAIL · **Ours** · Oracle

**Conclusion**

We propose an algorithm for learning from imperfect demonstrations, where the demonstrations can be suboptimal or even fail at the task. We learn a confidence predictor by leveraging confidence labels and demonstrations in a different but correspondent source environment. We show that the policy learned by our method outperforms the baselines in a variety of environments.