Can Foundation Models Perform Zero-Shot Task Specification For Robot Manipulation?

Yuchen Cui, Scott Niekum, Abhinav Gupta, Vikash Kumar and Aravind Rajeswaran

Paper with Appendix Visual Task Dataset

Published as a conference paper at the 4th Annual Learning for Dynamics & Control Conference.

TL;DR:

  • ZeST is a framework for studying how foundation models can enable zero-shot task specification for robot manipulation tasks.

  • Specifically, we study foundation models including an ImageNet-pretrained ResNet, MoCo, and CLIP.

  • We find that ZeST is effective at zero-shot goal selection, yielding a 14-fold increase in performance over a random-guessing baseline.

  • In offline RL, we find that using ZeST scores as a proxy for the reward function enables the learning of policies that perform better than a behavior cloning baseline.

Figure 1: Overview of ZeST

Motivation:

Task specification is at the core of programming autonomous robots. A low-effort modality for task specification is critical for engagement of non-expert end users and ultimate adoption of personalized robot agents.

A well-known approach to task specification is through goals. Existing approaches to goal specification use either low-dimensional state vectors or goal images from the same robot scene. The former is often not easily interpretable by humans and offloads the difficulty to state estimation and scene understanding. The latter requires generating a desired goal image, which usually means a human must first complete the task, defeating the purpose of having autonomous robots.

In this work, we explore alternative and more general forms of goal specification that are expected to be easier for humans to provide and use, such as images obtained from the internet. As a first step towards this, we study the capabilities of foundation models for zero-shot goal specification, and find that they are surprisingly effective on a collection of simulated robot manipulation tasks and real-world datasets.


Contributions:

(1) We introduce a framework for studying foundation models for zero-shot task specification (ZeST). See Figure 1.

(2) We evaluate the effectiveness of ZeST for enabling zero-shot policy execution through a set of goal selection tasks.

(3) We evaluate ZeST for enabling policy learning in offline reinforcement learning.
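
Contributions (2) and (3) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the ZeST score is the cosine similarity between an observation embedding and a task-specification embedding, and it uses toy 2-D vectors in place of features from a foundation model such as ResNet, MoCo, or CLIP. The names `select_goal`, `relabel_rewards`, and the `next_obs_feat` key are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def select_goal(candidate_feats, spec_feat):
    """Goal selection (assumed form): pick the candidate final-state
    embedding that is most similar to the task specification."""
    scores = [cosine(f, spec_feat) for f in candidate_feats]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

def relabel_rewards(transitions, spec_feat):
    """Offline RL (assumed form): return copies of the transitions with
    the similarity score stored as a proxy reward."""
    return [dict(t, reward=cosine(t["next_obs_feat"], spec_feat))
            for t in transitions]

# Toy stand-ins for embeddings; real features would come from an encoder.
spec = [0.1, 1.0]
candidates = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
best, scores = select_goal(candidates, spec)

dataset = [{"next_obs_feat": [1.0, 0.0]}, {"next_obs_feat": [0.0, 1.0]}]
relabeled = relabel_rewards(dataset, spec)
```

With N candidates, a random-guessing baseline selects the correct goal with probability 1/N, which is the reference point for the goal-selection results. The relabeled dataset could then be passed to any offline RL algorithm in place of ground-truth rewards.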

An instantiation of the ZeST framework with delta features. The similarity between the observation (top row, robot frames) and the task specification (open cabinet) increases as the robot executes a successful trajectory.
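
The scoring described in this caption can be sketched as follows. This page does not define delta features, so the sketch assumes one plausible reading: the embedding of each frame minus the embedding of the first frame, compared by cosine similarity against the goal embedding minus a reference embedding from the goal's own domain. The function names, the `goal_ref_feat` argument, and the toy 2-D vectors are all illustrative assumptions, not the authors' code.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def delta(feat, ref):
    """Element-wise difference between a feature and a reference feature."""
    return [a - b for a, b in zip(feat, ref)]

def delta_scores(frame_feats, goal_feat, goal_ref_feat):
    """Score every frame of a trajectory against the task specification.

    Assumption: the robot-side delta is taken relative to the first frame,
    and the goal-side delta relative to a reference image embedding
    (for example, the same scene before the task is done).
    """
    start = frame_feats[0]
    goal_delta = delta(goal_feat, goal_ref_feat)
    return [cosine(delta(f, start), goal_delta) for f in frame_feats]

# Toy embeddings standing in for encoder outputs along a successful trajectory.
frame_feats = [[1.0, 0.0], [1.5, 0.2], [1.3, 0.8], [1.0, 1.0]]
scores = delta_scores(frame_feats, goal_feat=[0.0, 1.0], goal_ref_feat=[0.0, 0.0])
```

On this toy trajectory the scores rise from frame to frame, mirroring the trend in the figure. Subtracting a reference embedding is one way to discount appearance differences between the robot scene and the goal image, though whether it helps in practice depends on the encoder.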