Learning from Imperfect Demonstrations via Adversarial Confidence Transfer
Zhangjie Cao*, Zihan Wang*, Dorsa Sadigh
Paper / Code / ICRA 2022 Talk
*denotes equal contribution
Abstract
Existing learning from demonstration algorithms usually assume access to expert demonstrations. However, this assumption is limiting in many real-world applications, since collected demonstrations may be suboptimal or even consist of failure cases. We therefore study the problem of learning from imperfect demonstrations by learning a confidence predictor. Specifically, we rely on demonstrations along with their confidence values from a different but correspondent environment (the source environment) to learn a confidence predictor for the environment in which we aim to learn a policy (the target environment, where we only have unlabeled demonstrations). We learn a common latent space through adversarial distribution matching of multi-length partial trajectories to enable the transfer of confidence across the source and target environments. The learned confidence reweights the demonstrations, enabling the policy to learn more from informative demonstrations and to discard irrelevant ones. Our experiments in three simulated environments and on a real robot reaching task demonstrate that our approach learns the policy with the highest expected return. We show videos of the real robot arm experiments here.
The appendix with algorithm details and experimental details is available here.
We propose an approach that leverages the assumed correspondence between state-action pairs in the source and target environments, which implies the existence of a common latent space where one can learn a shared confidence predictor for both environments.
Our approach aligns both the feature-level and confidence-level distributions of partial trajectories of different lengths across domains using a domain-adversarial loss. We design multi-length partial trajectory matching, which preserves temporal relationships across consecutive states and actions, and enables accurate matching with the corresponding trajectory in the target environment.
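To make the reweighting idea above concrete, here is a minimal sketch of a confidence-weighted imitation loss. This is an illustrative toy, not the paper's implementation: the function name and the per-demonstration loss inputs are assumptions, and in practice the confidences come from the learned predictor rather than being given.

```python
def reweighted_il_loss(per_demo_losses, confidences, eps=1e-8):
    """Confidence-weighted imitation loss (illustrative sketch).

    Demonstrations with higher predicted confidence contribute more to
    the objective; a near-zero confidence effectively discards a
    demonstration, so the policy learns mostly from informative data.
    """
    total = sum(c * l for c, l in zip(confidences, per_demo_losses))
    return total / (sum(confidences) + eps)
```

With confidences [1.0, 0.0], the second (failed) demonstration is ignored entirely, which matches the intuition of discarding irrelevant demonstrations.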
Adversarial Confidence Transfer
We aim to leverage the confidence annotation in the source environment and transfer the knowledge about the confidence to the target environment. Our key insight is to map the source and target state-action pairs to a common latent space. We enforce the latent features for correspondent state-action pairs in the source and target environments to be the same.
Here, the correspondent state-action pairs are (ssrc, asrc) and (star, atar). Since the source and target environments have different state-action spaces, we learn two separate encoders, Esrc and Etar, to map source and target state-action pairs into a common latent space. We first train the source encoder and the decoder in the source environment; the decoder, which predicts confidence from latent features, is shared between the source and target environments.
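The encoder/decoder structure can be sketched as follows. This is a toy stand-in, assuming small linear encoders and made-up input dimensions (the real Esrc, Etar, and decoder are trained neural networks; the dimensions and weights here are arbitrary assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

def make_encoder(in_dim, latent_dim=8):
    # Toy linear encoder standing in for E_src / E_tar; the real
    # encoders are neural networks trained with the matching losses.
    W = rng.normal(size=(latent_dim, in_dim)) / np.sqrt(in_dim)
    return lambda sa: np.tanh(W @ sa)

def decoder(z):
    # Shared confidence decoder: maps a latent feature to a value in
    # (0, 1). A fixed average-then-sigmoid is used here only so the
    # sketch runs; in the paper this decoder is learned on source data.
    w = np.ones(z.shape[0]) / z.shape[0]
    return 1.0 / (1.0 + np.exp(-(w @ z)))

# Assumed input sizes: the source and target state-action spaces differ,
# but both encoders map into the same 8-dimensional latent space.
E_src = make_encoder(in_dim=4)
E_tar = make_encoder(in_dim=6)

z = E_src(np.zeros(4))
conf = decoder(z)  # confidence predicted from the shared decoder
```

Because the decoder only ever sees latent features, it can score target state-action pairs once Etar maps them into the same latent space.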
The challenge now is learning the parameters of the encoders so that the common latent space satisfies the requirements introduced above. We learn the encoders by matching the distributions of the latent features of source and target state-action pairs in the demonstrations. We introduce the two types of matching we use: multi-length partial trajectory matching and feature-level/confidence-level matching.
Multi-length Partial Trajectory Matching
We develop a distribution matching objective to align the latent distributions of source and target state-action pairs. We propose a multi-length partial trajectory alignment that emphasizes learning the temporal relationship between states and actions. Specifically, we match the latent feature distributions of length-k (k = 1, 2, ..., K) partial trajectories, which preserves the temporal relationship between consecutive states and actions and makes the latent features of state-action pairs more likely to be aligned with respect to confidence.
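Extracting the length-k partial trajectories that this matching operates on can be sketched in a few lines. Here a demonstration is assumed to be a list of (state, action) pairs; the function name is ours, not from the paper.

```python
def partial_trajectories(trajectory, k):
    """All length-k windows of consecutive (state, action) pairs from one
    demonstration. Matching the latent distributions of these windows for
    k = 1..K preserves temporal relationships that single-pair (k = 1)
    matching would miss.
    """
    return [trajectory[i:i + k] for i in range(len(trajectory) - k + 1)]
```

For example, a length-5 demonstration yields four length-2 windows, each of which keeps two consecutive steps together during distribution matching.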
Feature-Level and Confidence-Level Matching
We learn the latent feature distribution alignment using a generative adversarial network. Since we only want partial trajectories of the same length to be matched, for each partial-trajectory length k we adopt a discriminator Dk to match the latent distribution of length-k partial trajectories. We also adopt a confidence-level matching to align the distributions of confidence predictions, and use the loss signals from the confidence-level matching to update the target encoder. Specifically, we use another discriminator D'k to align the confidence distribution of length-k partial trajectories.
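The two-discriminator objective for a single length k can be sketched as a standard GAN-style binary cross-entropy. This is a simplified illustration under assumed conventions (source labeled 1, target labeled 0; inputs are the scalar discriminator outputs), not the paper's exact loss.

```python
import numpy as np

def bce(pred, label, eps=1e-8):
    # Binary cross-entropy on a discriminator's scalar output in (0, 1).
    return -(label * np.log(pred + eps) + (1 - label) * np.log(1 - pred + eps))

def matching_loss_k(d_feat_src, d_feat_tar, d_conf_src, d_conf_tar):
    """Discriminator objective for one length k (illustrative sketch).

    D_k classifies the domain of latent features of length-k partial
    trajectories; D'_k classifies the domain of the corresponding
    confidence predictions. The target encoder is updated adversarially
    to fool both discriminators (labels flipped for the encoder step).
    """
    feature_loss = bce(d_feat_src, 1.0) + bce(d_feat_tar, 0.0)
    confidence_loss = bce(d_conf_src, 1.0) + bce(d_conf_tar, 0.0)
    return feature_loss + confidence_loss
```

At convergence, both discriminators output roughly 0.5 on either domain, meaning the feature-level and confidence-level distributions are indistinguishable across source and target.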
Experimental Results
We conduct experiments in two MuJoCo environments, on a simulated Franka Panda robot arm, and on a real Franka Panda arm. We compare our method (Ours) with standard imitation learning (GAIL), Dynamics Cycle-Consistency (DCC), and two variants of our method, Ours w/o Lcon and Ours-Single: Ours-Single uses only single state-action pair (length-1) matching and feature-level matching, while Ours w/o Lcon adds multi-length partial trajectory matching to Ours-Single but still uses no confidence-level matching. Finally, we show the results of using the ground-truth confidence to reweight the target demonstrations (Oracle).
MuJoCo Environment
We create four different MuJoCo environments: a 1-joint reacher, a 2-joint reacher, a 4-leg ant, and a 5-leg ant. In our setup, the 1-joint reacher and the 4-leg ant are the source environments, where demonstrations are labeled with confidence scores; the 2-joint reacher and the 5-leg ant are the target environments. The task for the reacher is to reach the red point from its initial configuration. The task for the ant is to move to the right horizontally as fast as possible.
1-joint reacher
(Source)
2-joint reacher
(Target)
4-leg ant
(Source)
5-leg ant
(Target)
Here are the plots of the expected returns in the target environments:
2-joint Reacher Expected Return
5-leg Ant Expected Return
Here are the rollouts generated by three policies (GAIL, Ours, and Oracle) for the 2-joint reacher:
GAIL
Ours
Oracle
Here are the rollouts generated by three policies (GAIL, Ours, and Oracle) for the 5-leg ant:
GAIL
Ours
Oracle
Simulated Robot
OT Setting (Learn from Optimal and Out-of-Time Demonstrations, 4 different initializations)
Initialization 1: GAIL (Out of Time), DCC (Out of Time), Ours-Feature (Out of Range), Ours-Confidence (Out of Range), Ours (Success), Oracle (Success)
Initialization 2: GAIL (Out of Time), DCC (Success), Ours-Feature (Success), Ours-Confidence (Success), Ours (Success), Oracle (Success)
Initialization 3: GAIL (Success), DCC (Success), Ours-Feature (Out of Range), Ours-Confidence (Success), Ours (Success), Oracle (Success)
Initialization 4: GAIL (Out of Time), DCC (Out of Time), Ours-Feature (Out of Range), Ours-Confidence (Out of Range), Ours (Success), Oracle (Success)
OSC Setting (Learn from Optimal, Suboptimal and Collision Demonstrations, 4 different initializations)
Initialization 1: GAIL (Out of Time), DCC (Success), Ours-Feature (Success), Ours-Confidence (Success), Ours (Success), Oracle (Success)
Initialization 2: GAIL (Out of Time), DCC (Success), Ours-Feature (Success), Ours-Confidence (Success), Ours (Success), Oracle (Collision)
Initialization 3: GAIL (Collision), DCC (Out of Time), Ours-Feature (Collision), Ours-Confidence (Collision), Ours (Collision), Oracle (Success)
Initialization 4: GAIL (Out of Time), DCC (Out of Time), Ours-Feature (Collision), Ours-Confidence (Collision), Ours (Success), Oracle (Success)
Sim-to-Real Environment
In the sim-to-real environment, we use a simulated Franka Panda arm as the source and a real Franka Panda arm as the target (both with 7 DoF). The task is to move the cube to the upper layer of the shelf. On the shelf, there is a large stack of books on the right and a small stack on the left, while the middle area is empty. There are thus three locations where we could place the cube: the middle area (best), the left side (second best), and the right side.
Demonstrations Collection (Success)
Place cube on the left side
Place cube in the middle area
Place cube on the right side
Demonstrations Collection (Failure)
Collide with the top board
Collide with the bottom board
Collide with the left pillar
Collide with the right book stack
Results
GAIL
Ours
Oracle
Conclusion
We propose an algorithm to learn from imperfect demonstrations, where the demonstrations can be suboptimal or even fail at the task. We learn a confidence predictor by leveraging confidence labels and demonstrations in a different but correspondent source environment. We show that the policy learned by our method outperforms the baselines in various environments.