Learning to Manipulate from Passive Videos
Abstract
How do we use videos of human-object interaction, without any action labels, to train manipulation policies? The traditional approach is to estimate actions from the video stream (e.g., by mapping human behavior to robot actions) and learn a policy that imitates them (e.g., using behavior cloning). This action-prediction approach has two serious issues. First, the estimated actions are noisy, and using them in behavior learning leads to brittle policies. Second, because actions are naturally multi-modal (i.e., multiple actions lead to the same effect), learning policies that capture this multi-modality across diverse demonstrations is difficult. We provide an alternative approach -- instead of learning a policy that directly predicts the action to reach the desired goal, we learn a distance prediction function that estimates how far one will be from the goal after taking a candidate action. These distances parameterize a policy via simple greedy action selection (i.e., pick the action with the lowest predicted distance). A key advantage of our formulation is that it allows us to better exploit passive data, as shown by our experimental results.
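The greedy action selection described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the transition model and the distance function here are hypothetical stand-ins, whereas in the actual method the distance would come from a learned network trained on passive video.

```python
import numpy as np

def predicted_distance(state, action, goal):
    # Hypothetical stand-in for the learned distance function d(s, a, g).
    # For illustration, we assume a toy additive transition model and use
    # Euclidean distance of the resulting state to the goal.
    next_state = state + action
    return np.linalg.norm(next_state - goal)

def greedy_policy(state, goal, candidate_actions):
    # Greedy action selection: score every candidate action with the
    # distance function and pick the one with the lowest predicted distance.
    distances = [predicted_distance(state, a, goal) for a in candidate_actions]
    return candidate_actions[int(np.argmin(distances))]

# Toy 2-D example: the action pointing toward the goal should be selected.
state = np.array([0.0, 0.0])
goal = np.array([1.0, 0.0])
candidate_actions = [
    np.array([1.0, 0.0]),   # toward the goal
    np.array([0.0, 1.0]),   # orthogonal
    np.array([-1.0, 0.0]),  # away from the goal
]
best_action = greedy_policy(state, goal, candidate_actions)
```

Note that this formulation sidesteps multi-modality: if several actions reach the goal equally well, any of them scores the same low distance, and no averaging across modes occurs.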
Tasks
We evaluate on five manipulation tasks, each shown from two viewpoints: a first-person view (the view the policy observes) and a third-person view.

- Pushing
- Stacking
- Opening
- Turning
- Taking Cloth