Practical Imitation Learning (IL) systems rely on large human demonstration datasets for successful policy learning. However, maintaining the quality of collected data is challenging, and suboptimal demonstrations can compromise the overall dataset quality and hence the learning outcome. Furthermore, the intrinsic heterogeneity of human behavior can produce equally successful but disparate demonstrations, further complicating the task of discerning demonstration quality. To address these challenges, this paper introduces Learning to Discern (L2D), an imitation learning framework for learning from demonstrations of diverse quality and style. Given a small batch of demonstrations with sparse quality labels, we learn a latent representation for temporally embedded trajectory segments. Preference learning in this latent space trains a quality evaluator that generalizes to new demonstrators exhibiting different styles. Empirically, we show that L2D can effectively assess and learn from demonstrations of varying quality, thereby improving policy performance across a range of tasks both in simulation and on a physical robot.
Our framework proceeds in three primary stages during training. First, we augment trajectory segments with temporal embeddings and employ contrastive learning to map these segments to a latent space. Next, we use preference learning in this latent space to train a quality critic on sparse preference labels. Finally, we train a Gaussian Mixture Model (GMM) on the critic's outputs, whose modes correspond to different tiers of demonstrator quality.
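To make the first stage concrete, the sketch below shows one plausible instantiation in PyTorch: each fixed-length segment is flattened, tagged with its normalized chronological position as a simple temporal embedding, and encoded to a unit-norm latent; an InfoNCE-style contrastive loss then shapes the latent space. The names (`SegmentEncoder`, `info_nce_loss`), the architecture, and the use of a scalar position feature are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentEncoder(nn.Module):
    """Maps a flattened trajectory segment, tagged with its temporal
    position, to a unit-norm latent vector. Hypothetical architecture."""
    def __init__(self, seg_dim, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(seg_dim + 1, 256),
            nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, segments, positions):
        # positions: chronological index of each segment, normalized to [0, 1]
        x = torch.cat([segments, positions.unsqueeze(-1)], dim=-1)
        return F.normalize(self.net(x), dim=-1)

def info_nce_loss(anchor_z, positive_z, temperature=0.1):
    """InfoNCE objective: each anchor should match its own positive,
    while every other segment in the batch serves as a negative."""
    logits = anchor_z @ positive_z.t() / temperature
    labels = torch.arange(anchor_z.size(0), device=anchor_z.device)
    return F.cross_entropy(logits, labels)
```

One natural choice of positive pair is two stochastic augmentations (e.g., additive noise or temporal cropping) of the same segment, though the pairing scheme is not specified here and may differ in the actual method.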
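The second and third stages admit a similarly minimal sketch: a small critic head scores the frozen latents and is trained with a Bradley-Terry-style preference loss, a standard objective for sparse pairwise labels, and scikit-learn's `GaussianMixture` is fit to per-trajectory score statistics. `QualityCritic`, `preference_loss`, and `fit_quality_gmm` are illustrative names, and three mixture components is an assumed hyperparameter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.mixture import GaussianMixture

class QualityCritic(nn.Module):
    """Scores a latent segment; higher scores mean higher inferred quality."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, z):
        return self.net(z).squeeze(-1)

def preference_loss(critic, z_preferred, z_other):
    """Bradley-Terry objective: push the critic to score the
    preferred segment above the non-preferred one."""
    logits = critic(z_preferred) - critic(z_other)
    return F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))

def fit_quality_gmm(traj_stats, n_components=3):
    """Fit a GMM whose modes stand in for demonstrator-quality tiers.
    traj_stats: (num_trajectories, 2) array holding the mean and
    variance of critic scores for each training trajectory."""
    return GaussianMixture(n_components=n_components).fit(traj_stats)
```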
When faced with unseen demonstrations, L2D partitions each trajectory into segments and augments each segment with its chronological position in the sequence. The segments are mapped into the latent space learned during training and ranked by the quality critic. After computing the mean and variance of the segment ranks over the full trajectory, the trained GMM predicts a preference label for the unseen demonstration.
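The inference procedure composes the pieces sketched above. The routine below reuses the hypothetical `SegmentEncoder`, `QualityCritic`, and fitted GMM, and assumes non-overlapping fixed-length windows; how GMM mode indices map to quality labels would have to be calibrated on the labeled training trajectories.

```python
import numpy as np
import torch

@torch.no_grad()
def label_demonstration(traj, encoder, critic, gmm, seg_len=16):
    """Predict a quality label for an unseen trajectory.
    traj: (T, feature_dim) tensor of state-action features."""
    # Partition into non-overlapping segments, dropping any remainder.
    n_segs = traj.size(0) // seg_len
    segments = traj[: n_segs * seg_len].reshape(n_segs, -1)
    # Chronological position of each segment, normalized to [0, 1].
    positions = torch.linspace(0.0, 1.0, n_segs)
    # Encode into the learned latent space and score with the critic.
    scores = critic(encoder(segments, positions))
    # Summarize the trajectory by the mean and variance of its segment
    # scores, then let the GMM assign a quality mode.
    stats = np.array([[scores.mean().item(), scores.var(unbiased=False).item()]])
    return gmm.predict(stats)[0]
```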
Policy learned with the demonstrations selected by the baseline approach.
Policy learned with the demonstrations selected by our method.