Learning to Manipulate from Passive Videos

Abstract

How do we use videos of human-object interaction, without any action labels, to train manipulation policies? The traditional approach is to estimate actions from the video stream (e.g. by mapping human behavior to robot actions) and learn a policy that imitates them (e.g. using behavior cloning). This action prediction approach has two serious issues. First, the estimated actions are noisy, and using them in behavior learning leads to brittle policies. Second, because actions are naturally multi-modal (i.e. multiple actions can lead to the same effect), learning policies that capture this multi-modality across diverse demonstrations is difficult. We provide an alternative approach -- instead of learning policies that directly predict the action needed to reach the desired goal, we learn a distance prediction function that estimates how far one will be from the goal after taking a candidate action. These distances parameterize a policy via simple greedy action selection (i.e. pick the action with the lowest predicted distance). A key advantage of our formulation is that it better exploits passive data, as shown by our experimental results.
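To make the greedy action selection concrete, below is a minimal sketch in Python. It assumes a hypothetical learned predictor distance_model(observation, goal, action) that returns the estimated distance to the goal after taking the action; it illustrates the selection rule only, not our actual implementation.

import numpy as np

def greedy_policy(observation, goal, candidate_actions, distance_model):
    # Score each candidate action by its predicted distance to the goal,
    # then pick the action with the lowest predicted distance.
    distances = [distance_model(observation, goal, a) for a in candidate_actions]
    return candidate_actions[int(np.argmin(distances))]

# Toy usage with a hand-written distance model on 2D positions (illustration only):
def toy_distance(obs, goal, action):
    # Predicted distance after applying the action as a displacement.
    return float(np.linalg.norm((obs + action) - goal))

obs = np.array([0.0, 0.0])
goal = np.array([1.0, 1.0])
actions = [np.array(a) for a in [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7)]]
print(greedy_policy(obs, goal, actions, toy_distance))  # -> [0.7 0.7]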

Tasks

Pushing, Stacking, Opening, Turning, and Taking Cloth. Each task is shown from a first-person view (the policy's observation view) and a third-person view.