What Can I Do Here? Learning New Skills by Imagining Visual Affordances

Alexander Khazatsky*, Ashvin Nair*, Daniel Jing, Sergey Levine

*Equal Contribution

University of California, Berkeley

International Conference on Robotics and Automation (ICRA) 2021

Problem Statement

  • How can robots learn about affordances from prior datasets and, when faced with new and unfamiliar environments, utilize this knowledge to practice relevant skills and update their policies efficiently?

  • On the left we first see videos from a prior dataset collected with the robot accomplishing various tasks. Next, during "unsupervised practice" the robot is placed in an environment with a pot lid it has never seen before. How can the robot learn to successfully manipulate this environment and grasp the pot lid without any external supervision or knowledge of the downstream task?

  • Visuomotor Affordance Learning (VAL) tackles this challenge.

Method: Offline Phase

Given a prior dataset demonstrating the affordances of various environments, VAL digests this information in three offline steps:

  1. First, VAL learns a compressed representation of this data using a Vector Quantized Variational Autoencoder (VQ-VAE).

  2. Next, VAL learns an affordance model by training a conditional PixelCNN on the 12x12 discrete latent space of the VQ-VAE.

  3. Last in the offline phase, VAL trains a goal-conditioned policy on the prior dataset using Advantage Weighted Actor Critic (AWAC), an algorithm designed for offline training while remaining amenable to online fine-tuning.
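Two of the components above can be sketched in a few lines. The snippet below is a minimal, illustrative sketch in numpy, not the paper's implementation: `vq_quantize` shows the nearest-neighbor codebook lookup at the heart of a VQ-VAE, and `awac_weights` shows the exponentiated-advantage weights that AWAC uses to reweight its policy update (the function names, the `beta` temperature default, and the array shapes are our assumptions for illustration).

```python
import numpy as np

def vq_quantize(z_e, codebook):
    """Map continuous encoder outputs to their nearest codebook entries.

    z_e: (..., D) continuous latents from the encoder.
    codebook: (K, D) learned embedding vectors.
    Returns discrete indices (used by the PixelCNN prior) and the
    quantized latents (passed to the decoder).
    """
    flat = z_e.reshape(-1, z_e.shape[-1])                             # (N, D)
    dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    idx = dists.argmin(axis=1)                                        # (N,)
    z_q = codebook[idx].reshape(z_e.shape)
    return idx.reshape(z_e.shape[:-1]), z_q

def awac_weights(q_values, values, beta=1.0):
    """Advantage weights for AWAC's weighted policy update:
    w = exp(A / beta), where A = Q(s, a) - V(s)."""
    advantages = q_values - values
    return np.exp(advantages / beta)
```

In the full method, the discrete indices feed the conditional PixelCNN affordance model, and the AWAC weights multiply the policy's log-likelihood loss on replayed transitions.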

Method: Online Phase

Now, when VAL is placed in an unseen environment, it:

  1. Uses its prior knowledge to imagine visual representations of useful affordances.

  2. Collects helpful interaction data by trying to achieve these affordances.

  3. Updates its parameters using its new experience.

  4. Repeats the process all over again.
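The four steps above form a simple loop. The sketch below mirrors that loop with stand-in stubs (all function names, shapes, and the random perturbation are our own illustrative assumptions, not the paper's code): in VAL, `sample_affordance_goal` would sample from the conditional PixelCNN prior and `finetune` would run AWAC updates.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_affordance_goal(current_latent):
    """Stub for the affordance model: VAL samples a discrete latent goal
    from a conditional PixelCNN; here we just perturb the current latent."""
    return current_latent + rng.integers(0, 2, size=current_latent.shape)

def rollout(policy_params, goal):
    """Stub rollout: returns (obs, action, reward) transitions from one attempt."""
    return [(goal, rng.standard_normal(2), float(rng.random())) for _ in range(10)]

def finetune(policy_params, replay):
    """Stub update: in VAL this is an AWAC step on the replay buffer."""
    return policy_params + 0.01 * len(replay)

def val_online_phase(policy_params, initial_latent, num_cycles=3):
    replay = []
    for _ in range(num_cycles):
        goal = sample_affordance_goal(initial_latent)    # 1. imagine an affordance
        replay.extend(rollout(policy_params, goal))      # 2. try to achieve it
        policy_params = finetune(policy_params, replay)  # 3. update the policy
    return policy_params, replay                         # 4. repeat each cycle
```

The key design point is that no external reward or supervision enters this loop: goals come from the learned affordance model, and rewards for fine-tuning are computed from goal reaching.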

Real World Evaluation

  • We evaluate VAL in five real-world test environments containing unseen interaction objects, and assess its ability to achieve the corresponding affordances.

  • In every case, we begin with the offline-trained policy, which solves the task inconsistently. Then, we collect more experience using our affordance model to sample goals. In total, this amounts to about 5 minutes of real-world robot interaction time. Finally, we evaluate the fine-tuned policy, which consistently solves the task.

Real World Results

  • We find that on each of these environments, VAL consistently demonstrates effective zero-shot generalization after offline training, followed by rapid improvement with its affordance-directed fine-tuning scheme.

  • Meanwhile, prior self-supervised methods barely improve upon poor zero-shot performance in these new environments.

Simulated Experiments

  • For further analysis, we run VAL in a procedurally generated, multi-task environment with visual and dynamic variation. The objects in the scene, their colors, and their positions are randomized per environment.

  • Visualizing the policy after fine-tuning, we see that again, given a single off-policy dataset, our method quickly learns advanced manipulation skills for a diverse set of novel objects.

Project Video



The latest revision of the paper is available at http://arxiv.org/abs/2106.00671

Dataset Download Instructions

Our real-world robot datasets are available for download; see Datasets.


Instructions here: https://github.com/anair13/rlkit/blob/master/examples/val/README.md

Links to the algorithm, new simulation environments, and simulation datasets are provided.