Imitation from Observation: Learning to Imitate Behaviors from Raw Video via Context Translation

[View Paper as PDF]  [View Code on GitHub]



Imitation learning is an effective approach for autonomous systems to acquire control policies when an explicit reward function is unavailable, using supervision provided as demonstrations from an expert, typically a human operator. However, standard imitation learning methods assume that the agent receives examples of observation-action tuples that could be provided, for instance, to a supervised learning algorithm. This stands in contrast to how humans and animals imitate: we observe another person performing some behavior and then figure out which actions will realize that behavior, compensating for changes in viewpoint, surroundings, and embodiment. We term this kind of imitation learning imitation-from-observation, and propose an imitation learning method based on video prediction with context translation and deep reinforcement learning. This lifts the standard assumption that the demonstration consists of observations and actions in the same environment as the agent, and enables a variety of interesting applications, including learning robotic skills that involve tool use simply by observing videos of human tool use. Our experimental results show that our approach can perform imitation-from-observation for a variety of real-world robotic tasks modeled on common household chores, acquiring skills such as sweeping from videos of a human demonstrator.


Our imitation-from-observation algorithm is based on learning a context translation model that can convert a demonstration from one context (e.g., a third-person viewpoint and a human demonstrator) to another context (e.g., a first-person viewpoint and a robot). By training a model to perform this conversion, we acquire a feature representation that is suitable for tracking demonstrated behavior. We then use deep reinforcement learning to find actions that closely track the translated demonstration in the target context.
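As a concrete illustration, the tracking objective described above can be sketched as a per-timestep reward that penalizes the distance between the agent's observation features and the translated demonstration's features. This is a hedged sketch, not the paper's exact reward; the function name and the use of plain NumPy feature vectors are our assumptions.

```python
import numpy as np

def tracking_reward(obs_feat: np.ndarray, demo_feat: np.ndarray) -> float:
    """Negative squared Euclidean distance between the agent's current
    observation features and the translated demonstration's features
    at the same timestep: higher reward means closer tracking."""
    return -float(np.sum((obs_feat - demo_feat) ** 2))
```

In practice, a reward of this kind would be computed from the learned encoder's features at every timestep of a rollout and passed to the reinforcement learning algorithm as the tracking cost.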

For the encoders Enc1 and Enc2 in simulation, we use stride-2 convolutions with a 5 × 5 kernel. We perform 4 convolutions with filter sizes 64, 128, 256, and 512, followed by two fully connected layers of size 1024. We use LeakyReLU activations with leak 0.2 for all layers. The translation module T(z1, z2) consists of one hidden layer of size 1024, which takes the concatenation of z1 and z2 as input and produces an output of size 1024. For the decoder Dec in simulation, a fully connected layer from the input feeds into four fractionally-strided convolutions with filter sizes 256, 128, 64, and 3 and stride 1/2. We add skip connections from every layer in the context encoder Enc2 to its corresponding layer in the decoder Dec by concatenation along the filter dimension.
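The simulation-scale architecture described above can be sketched in PyTorch as follows. This is a minimal sketch under stated assumptions: the 64 × 64 input resolution, the 4 × 4 transposed-convolution kernels, and the class names are all ours, not the released implementation.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # stride-2 convolution with a 5x5 kernel, LeakyReLU with leak 0.2
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 5, stride=2, padding=2),
        nn.LeakyReLU(0.2))

class Encoder(nn.Module):
    """Enc1/Enc2: four stride-2 convolutions (64, 128, 256, 512 filters),
    then two fully connected layers of size 1024."""
    def __init__(self, feat=1024, img=64):  # 64x64 input is an assumption
        super().__init__()
        self.convs = nn.ModuleList(
            [conv_block(3, 64), conv_block(64, 128),
             conv_block(128, 256), conv_block(256, 512)])
        flat = 512 * (img // 16) ** 2
        self.fc = nn.Sequential(
            nn.Flatten(), nn.Linear(flat, feat), nn.LeakyReLU(0.2),
            nn.Linear(feat, feat), nn.LeakyReLU(0.2))

    def forward(self, x):
        feats = []
        for layer in self.convs:
            x = layer(x)
            feats.append(x)
        return self.fc(x), feats[::-1]  # deepest activation first, for skips

class Translator(nn.Module):
    """T(z1, z2): one hidden layer of size 1024 applied to the
    concatenation of z1 and z2, output of size 1024."""
    def __init__(self, feat=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, feat))

    def forward(self, z1, z2):
        return self.net(torch.cat([z1, z2], dim=-1))

class Decoder(nn.Module):
    """Dec: a fully connected layer, then four fractionally-strided
    convolutions (256, 128, 64, 3 filters; stride 1/2, i.e. 2x upsampling),
    with Enc2 skip activations concatenated along the filter dimension."""
    def __init__(self, feat=1024, img=64):
        super().__init__()
        self.base = img // 16
        self.fc = nn.Linear(feat, 512 * self.base ** 2)
        # input channels are doubled by the skip concatenation
        self.ups = nn.ModuleList([
            nn.ConvTranspose2d(512 + 512, 256, 4, stride=2, padding=1),
            nn.ConvTranspose2d(256 + 256, 128, 4, stride=2, padding=1),
            nn.ConvTranspose2d(128 + 128, 64, 4, stride=2, padding=1),
            nn.ConvTranspose2d(64 + 64, 3, 4, stride=2, padding=1)])

    def forward(self, z, skips):
        h = self.fc(z).view(-1, 512, self.base, self.base)
        for up, s in zip(self.ups, skips):
            h = up(torch.cat([h, s], dim=1))
        return h
```

A forward pass then encodes both inputs, translates, and decodes with Enc2's skip activations: `Decoder()(Translator()(z1, z2), skips)`.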

For real-world images, the encoders perform 4 convolutions with filter sizes 32, 16, 16, and 8 and strides 1, 2, 1, and 2, respectively. All fully connected layers and feature layers have size 100 instead of 1024. The decoder uses fractionally-strided convolutions with filter sizes 16, 16, 32, and 3 and strides 1/2, 1, 1/2, and 1, respectively. For the real-world model only, we apply dropout after every fully connected layer with keep probability 0.5, and we tie the weights of Enc1 and Enc2.
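The real-world encoder changes can be sketched similarly. The kernel size 5 and the 64 × 64 input resolution are assumptions, since the text only specifies filter counts, strides, and layer sizes; note that PyTorch's `Dropout(p=0.5)` takes the drop probability, which equals a keep probability of 0.5.

```python
import torch
import torch.nn as nn

class RealEncoder(nn.Module):
    """Real-world encoder: four convolutions (filters 32, 16, 16, 8;
    strides 1, 2, 1, 2), then two fully connected layers of size 100,
    each followed by dropout with keep probability 0.5."""
    def __init__(self, img=64):  # input resolution is an assumption
        super().__init__()
        cfg = [(3, 32, 1), (32, 16, 2), (16, 16, 1), (16, 8, 2)]
        self.convs = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(ci, co, 5, stride=s, padding=2),
                          nn.LeakyReLU(0.2))
            for ci, co, s in cfg])
        flat = 8 * (img // 4) ** 2  # two stride-2 convs shrink by 4x total
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, 100), nn.LeakyReLU(0.2), nn.Dropout(0.5),
            nn.Linear(100, 100), nn.LeakyReLU(0.2), nn.Dropout(0.5))

    def forward(self, x):
        return self.fc(self.convs(x))

# Weight tying: instantiate one encoder and use it as both Enc1 and Enc2,
# so the two share all parameters.
enc1 = enc2 = RealEncoder()
```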

We train using the Adam optimizer with learning rate 10^−4. We train with 3000 videos for simulated reach, 4500 videos for simulated push, 894 videos for simulated sweep, 3500 videos for simulated strike, 180 videos for simulated push with real videos, 135 videos for real push with real videos, 85 videos for real sweep with paper, 100 videos for real sweep with almonds, and 60 videos for cooking almonds.
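The optimizer setup above amounts to the following, assuming PyTorch; the linear model and mean-squared-error loss here are placeholders for the actual translation model and objective, not the paper's training code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 8)  # stand-in for the translation model
opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # Adam, lr 10^-4

# placeholder regression data standing in for video frames
x, target = torch.randn(64, 8), torch.randn(64, 8)
losses = []
for _ in range(100):  # each iteration is one gradient step
    loss = nn.functional.mse_loss(model(x), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())
```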