AVID: Learning Multi-Stage Tasks via Pixel-Level Translation of Human Videos

Laura Smith, Nikita Dhawan, Marvin Zhang, Pieter Abbeel, Sergey Levine

Robotics: Science and Systems, 2020



Robotic reinforcement learning (RL) holds the promise of enabling robots to learn complex behaviors through experience. However, realizing this promise requires not only effective and scalable RL algorithms, but also mechanisms to reduce human burden in terms of defining the task and resetting the environment. In this paper, we study how these challenges can be alleviated with an automated robotic learning framework, in which multi-stage tasks are defined simply by providing videos of a human demonstrator and then learned autonomously by the robot from raw image observations. A central challenge in imitating human videos is the difference in morphology between the human and robot, which typically requires manual correspondence. We instead take an automated approach and perform pixel-level image translation via CycleGAN to convert the human demonstration into a video of a robot, which can then be used to construct a reward function for a model-based RL algorithm. The robot then learns the task one stage at a time, automatically learning how to reset each stage to retry it multiple times without human-provided resets. This makes the learning process largely automatic, from intuitive task specification via a video to automated training with minimal human intervention. We demonstrate that our approach is capable of learning complex tasks, such as operating a coffee machine, directly from raw image observations, requiring only 20 minutes to provide human demonstrations and about 180 minutes of robot interaction with the environment.

Automated Visual Instruction-Following with Demonstrations (AVID)

Schematic of the overall method. Left: Human instructions for each stage (top) are translated at the pixel level into robot instructions (bottom) via the CycleGAN. Note the artifacts in the generated translations, e.g., the displaced robot gripper in the bottom right image. Right: The robot attempts the task stage-wise, automatically resetting and retrying until the instruction classifier signals success, which prompts the human to confirm via a key press.

Overview Video

In our problem setting, we assume that the task we want the robot to learn is specified by a set of human demonstration videos, each of which is a trajectory of image observations depicting the human performing the task. In contrast to most prior work in robotic imitation learning, we do not assume access to robot demonstrations given through teleoperation or kinesthetic teaching. We also do not assume access to rewards provided through motion capture or other instrumented setups. Our goal is to reduce both the human cost of task specification, by having the robot learn directly from human videos, and the human burden during learning.

We devise a robotic learning method that we call automated visual instruction-following with demonstrations (AVID). Starting from a human demonstration, our method translates this demonstration, at the pixel level, into images of the robot performing the task, by means of unpaired image-to-image translation via CycleGAN. In order to handle multi-stage tasks, such as operating a coffee machine, we break up the human demonstration into a discrete set of instruction images, denoting the stages of the task. These instruction images are translated into synthesized robot images, which are then used to provide a reward for model-based RL, so that the robot can practice the skill and learn to execute it physically. This phase is largely automated, as the robot learns to reset each stage on its own to practice it multiple times; thus, human supervision is only needed in the form of key presses and a few manual resets. We demonstrate that this approach is capable of solving complex, long-horizon tasks with minimal human involvement, removing most of the human burden associated with instrumenting the task setup, manually resetting the environment, and supervising the learning process.
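The stage-wise practice loop described above can be sketched roughly as follows. This is a minimal illustration under simplifying assumptions, not the implementation used in the experiments: `run_stagewise`, `classifier_says_success`, and `human_confirms` are hypothetical names, each call to `classifier_says_success` stands in for one full robot attempt at a stage, and the reset behavior is collapsed into a log entry.

```python
from typing import Callable, List, Tuple


def run_stagewise(
    num_stages: int,
    classifier_says_success: Callable[[int], bool],  # hypothetical: one attempt at a stage
    human_confirms: Callable[[int], bool],           # hypothetical: human key press
    max_attempts: int = 50,
) -> Tuple[bool, List[Tuple[int, str]]]:
    """Attempt stages 1..num_stages in order, retrying each until it is confirmed.

    Returns (finished, log), where log holds one (stage, outcome) pair per attempt.
    """
    log: List[Tuple[int, str]] = []
    stage = 1
    for _ in range(max_attempts):
        if stage > num_stages:
            break  # every stage has been confirmed
        if not classifier_says_success(stage):
            # The classifier did not fire, so the human is not queried.
            log.append((stage, "no feedback: reset and retry"))
            continue
        if human_confirms(stage):
            log.append((stage, "confirmed: advance"))
            stage += 1
        else:
            log.append((stage, "rejected: reset and retry"))
    return stage > num_stages, log
```

In this sketch the human is only consulted when the classifier signals success, which is what keeps the number of key presses small relative to the number of attempts.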

CycleGAN Training

Examples of human and robot data collected to train the CycleGAN for coffee making (top) and cup retrieval (bottom). Though the robot moves randomly, we cover different settings such as the robot holding the cup and the drawer being open by manually changing the environment.
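For reference, the standard CycleGAN objective for unpaired translation can be written as below. The notation is an assumption made for this illustration: X is the domain of human images, Y the domain of robot images, G: X → Y and F: Y → X are the two translators, D_X and D_Y are the discriminators, and λ weights the cycle-consistency term. This is the generic formulation and does not specify the loss variant or weights used for the experiments.

```latex
\begin{aligned}
\mathcal{L}_{\text{cyc}}(G, F)
  &= \mathbb{E}_{x \sim p_X}\big[\lVert F(G(x)) - x \rVert_1\big]
   + \mathbb{E}_{y \sim p_Y}\big[\lVert G(F(y)) - y \rVert_1\big], \\
\mathcal{L}(G, F, D_X, D_Y)
  &= \mathcal{L}_{\text{GAN}}(G, D_Y)
   + \mathcal{L}_{\text{GAN}}(F, D_X)
   + \lambda \, \mathcal{L}_{\text{cyc}}(G, F).
\end{aligned}
```

The cycle-consistency term is what allows training without paired human and robot images: translating an image to the other domain and back should recover the original.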

Model-Based RL

Diagram of the latent space model-based planning method that makes up the core of AVID. Clockwise from top left: first, the human instruction images are translated by the CycleGAN into synthetic robot instruction images. Next, these images along with the real robot data are used to learn a latent variable model and instruction classifiers. Then, the model is used for planning, where the reward function is given by the classifier corresponding to the stage the robot is attempting. Finally, after the classifier signals success, the human is queried to confirm or correct the classifier. The process repeats until the robot has successfully completed all stages of the task and is able to autonomously perform the entire task.
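To make the "classifier as reward" idea concrete, here is a toy sketch of planning in a latent space. It is only an illustration under loud assumptions: the latent state is a single float, `toy_dynamics` and `toy_success_prob` are hand-written stand-ins for the learned latent variable model and a learned instruction classifier, and a simple random-shooting planner is swapped in for whatever planner is actually used. All names are hypothetical.

```python
import math
import random
from typing import Callable, Optional


def plan_action(
    z0: float,
    dynamics: Callable[[float, float], float],
    success_prob: Callable[[float], float],
    horizon: int = 5,
    num_samples: int = 200,
    rng: Optional[random.Random] = None,
) -> float:
    """Random shooting: sample action sequences, roll each one out through the
    model, score it by the log-probability of success at the final latent
    state, and return the first action of the best-scoring sequence."""
    rng = rng or random.Random(0)
    best_score, best_first = -math.inf, 0.0
    for _ in range(num_samples):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        z = z0
        for a in actions:
            z = dynamics(z, a)
        score = math.log(success_prob(z) + 1e-8)  # classifier output used as reward
        if score > best_score:
            best_score, best_first = score, actions[0]
    return best_first


def toy_dynamics(z: float, a: float) -> float:
    """Assumed stand-in for a learned latent dynamics model."""
    return z + 0.5 * a


def toy_success_prob(z: float, goal: float = 2.0) -> float:
    """Assumed stand-in for a stage classifier: peaks when z reaches the goal."""
    return math.exp(-((z - goal) ** 2))
```

Used in a receding-horizon fashion, one would call `plan_action` at every step, apply the returned action, observe the new latent state, and replan; switching stages then amounts to switching which classifier is passed as `success_prob`.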


Sample sequence of instructions for coffee making (left) and cup retrieval (right) segmented from a human demonstration (top), translated into the robot's domain (bottom). The stages for coffee making from left to right are: initial state, pick up the cup, place the cup in the coffee machine, and press the button on top of the machine. The stages for cup retrieval from left to right are: initial state, grasp the handle, open the drawer, lift the arm, pick up the cup, and place the cup on top.

We evaluate our method and several comparisons on two complex, temporally extended tasks: operating a coffee machine and retrieving a cup from a drawer. We compare AVID to learning from full human demonstrations with time contrastive networks (TCN) and to an ablation of AVID based on behavioral cloning from observations (BCO). To understand the effects of latent space planning, we also compare to an ablation of AVID based on deep visual foresight (DVF). Finally, we evaluate BCO and behavioral cloning on the same tasks but with access to robot demonstrations, and we also analyze the human supervision burden of AVID.


We report success rates up to and including each stage for both tasks, over 10 trials. The top rows are methods that learn from human demonstrations, and we bold the best performance in this category. The bottom two rows are methods trained with direct access to real robot demonstrations. AVID outperforms all other methods that learn from human demonstrations, succeeding 8 times out of 10 on coffee making and 7 times out of 10 on cup retrieval, and even outperforms behavioral cloning from real robot demonstrations on the later stages of cup retrieval.
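The "up to and including each stage" metric can be computed as in the short sketch below. The function name and input encoding are assumptions for illustration: each trial is summarized by the index of the last stage it completed (0 meaning no stage was completed), and the example numbers in the usage note are made up rather than taken from the results table.

```python
from typing import List


def cumulative_stage_success(furthest_stage: List[int], num_stages: int) -> List[float]:
    """For each stage s in 1..num_stages, return the fraction of trials that
    completed every stage up to and including s.

    furthest_stage[i] is the last stage completed in trial i (0 = none).
    """
    n = len(furthest_stage)
    return [
        sum(1 for f in furthest_stage if f >= s) / n
        for s in range(1, num_stages + 1)
    ]
```

For instance, with five hypothetical trials that reached stages 3, 3, 2, 1, and 0 of a three-stage task, `cumulative_stage_success([3, 3, 2, 1, 0], 3)` gives 4/5, 3/5, and 2/5. By construction the rates can only stay level or drop from one stage to the next.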

Visualizing human feedback during the learning process for coffee making (left) and cup retrieval (right). The x-axis is the total number of robot stage attempts, and the y-axis indicates the proportions of feedback types, smoothed over the ten most recent attempts. "No feedback" means that the classifier did not signal success and the robot automatically switched to resetting. Coffee making and cup retrieval use a total of 131 and 126 human key presses, respectively, for the learning process.
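The smoothing used in this kind of plot can be sketched as a sliding window over the sequence of per-attempt feedback labels. The label strings (`"no_feedback"`, `"confirm"`, `"reject"`) and the function name are assumptions for illustration; only the window of the ten most recent attempts comes from the caption.

```python
from collections import Counter, deque
from typing import Deque, Dict, Iterable, List


def smoothed_feedback_proportions(
    events: Iterable[str], window: int = 10
) -> List[Dict[str, float]]:
    """After each attempt, return the proportion of each feedback type among
    the `window` most recent attempts (fewer at the start of training)."""
    recent: Deque[str] = deque(maxlen=window)  # old attempts fall off automatically
    proportions: List[Dict[str, float]] = []
    for event in events:
        recent.append(event)
        counts = Counter(recent)
        proportions.append({label: c / len(recent) for label, c in counts.items()})
    return proportions
```

Each entry of the returned list corresponds to one point on the x-axis, and the values within an entry always sum to one.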

Future Work

One advantage of our framework is that, in principle, the translation module is decoupled from the RL process and is not limited to a particular task. Thus, the most exciting direction for future work is to amortize the cost of data collection and CycleGAN training across multiple tasks, rather than training separate CycleGANs for each task of interest. For example, the cup retrieval CycleGAN can also be used to translate demonstrations for placing the cup back in the drawer, meaning that this new task can be learned from a handful of human demonstrations with no additional upfront cost. As an initial step toward this goal, we experimented with training a single CycleGAN on data collected from both the coffee making and cup retrieval tasks, the results of which are found below. We aim to further generalize this result by training a CycleGAN on an initial large dataset, e.g., many different human and robot behaviors in a kitchen that has a coffee machine, multiple drawers, and numerous other objects. This should enable any new task in the kitchen to be learned with just a few human demonstrations of the task, and this is a promising direction toward truly allowing robots to learn by watching humans.

Translation results from a single model trained on the consolidated dataset: human videos (left) and their model-generated translations (right) for each task.