Composing Pre-Trained Object-Centric Representations for Robotics
From "What" and "Where" Foundation Models

Junyao Shi*,   Jianing Qian*,   Jason Ma,   Dinesh Jayaraman

*Equal Contribution

University of Pennsylvania
ICRA 2024

Paper     Video Summary     Supplementary Materials    Code (Coming Soon)

Overview

POCR builds Pre-Trained Object-Centric Representations for Robotics by chaining “what” and “where” foundation models. The “where” foundation model produces a set of segmentation masks representing object candidates in the scene. Slot binding selects which of these masks to bind to the slots in our object-centric representation. The image contents of each slot are represented by the “what” foundation model's features together with the mask's bounding-box coordinates. The robot then learns policies over these slot representations.
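The composition described above can be sketched in a few lines. This is a minimal, hypothetical illustration only: `where_model` stands in for a segmenter such as SAM, and `what_encoder` stands in for a pre-trained visual encoder such as LIV; both are dummy placeholders here, not the actual models.

```python
import numpy as np

def where_model(image):
    """Dummy 'where' model: returns a list of boolean object masks.
    A real system would call a pre-trained segmenter such as SAM here."""
    h, w, _ = image.shape
    masks = []
    for i in range(3):  # pretend three object candidates were found
        m = np.zeros((h, w), dtype=bool)
        m[i * 10:(i + 1) * 10, i * 10:(i + 1) * 10] = True
        masks.append(m)
    return masks

def what_encoder(patch):
    """Dummy 'what' encoder: mean color as a stand-in for deep features.
    A real system would call a pre-trained encoder such as LIV here."""
    return patch.reshape(-1, 3).mean(axis=0)

def bbox(mask):
    """Normalized bounding-box coordinates of a binary mask ('where')."""
    ys, xs = np.nonzero(mask)
    h, w = mask.shape
    return np.array([ys.min() / h, xs.min() / w, ys.max() / h, xs.max() / w])

def pocr_representation(image):
    """One slot per bound mask: 'what' features concatenated with bbox."""
    slots = []
    for mask in where_model(image):
        patch = image * mask[..., None]   # keep only this object's pixels
        feat = what_encoder(patch)        # 'what': content description
        slots.append(np.concatenate([feat, bbox(mask)]))
    return np.stack(slots)                # (num_slots, feat_dim + 4)

image = np.random.rand(64, 64, 3)
rep = pocr_representation(image)
print(rep.shape)  # (3, 7)
```

A downstream policy network would then consume this per-slot array in place of a monolithic image embedding.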

Abstract

There have recently been large advances both in pre-training visual representations for robotic control and in segmenting unknown-category objects in general images. To leverage these for improved robot learning, we propose POCR, a new framework for building pre-trained object-centric representations for robotic control. Building on theories of “what-where” representations in psychology and computer vision, we use segmentations from a pre-trained model to stably locate various entities in the scene across timesteps, capturing “where” information. To each such segmented entity, we apply other pre-trained models that build vector descriptions suitable for robotic control tasks, thus capturing “what” the entity is. Our pre-trained object-centric representation for control is therefore constructed by appropriately combining the outputs of off-the-shelf pre-trained models, with no new training. On various simulated and real robotic tasks, we show that imitation policies for robotic manipulators trained on POCR achieve better performance and systematic generalization than state-of-the-art pre-trained representations for robotics, as well as prior object-centric representations that are typically trained from scratch.

Results

In both simulation and real-world settings, POCR with SAM as the “where” component and LIV as the “what” component outperforms the best current representations for multi-object manipulation.

POCR vs. Prior Representations (simulation)

POCR vs. Prior Representations (real-world)

Real-World Policy Rollouts

We evaluate POCR on real-world robotic manipulation tasks. Our real-world environment consists of a realistic countertop kitchen setup with a variety of distractors, in which a Franka robot is tasked with placing various fruits ({apple, eggplant, pineapple}) in the green pot located on the far side of the table. When used as the visual representation for behavior cloning, POCR offers substantial gains over the prior state of the art. See Section V-E in the paper for more details. Below, we provide some example success rollouts from POCR.

POCR on Pineapple in Green Pot

POCR on Apple in Green Pot

POCR on Eggplant in Green Pot


Qualitative Visualization of Masks

We show how POCR utilizes the Segment Anything Model (SAM) to segment the scenes of various real-world and simulated multi-object manipulation tasks. Here's how to interpret the figures below:


1st row: the original RGB images from a demonstration sequence.

2nd row: the corresponding overlay of object masks for each image. Each color corresponds to a different object slot.

3rd row: the corresponding per-slot breakdown for each image. The leftmost column shows the original RGB image, the second column shows the background, and the remaining columns are object slots.


As demonstrated by the visualizations, POCR accurately identifies the foreground and background in the scene. Additionally, it consistently tracks target and distractor objects across different frames in a video. POCR occasionally makes mistakes when binding objects to their respective slots, especially in real-world settings (for instance, see the bottom-left pan in the visualization of "RealRobot: Apple in Green Pot"). However, because our policy architecture is permutation-invariant, this noise does not significantly affect policy learning performance.
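The robustness to slot-binding noise can be illustrated with a toy example: if the policy aggregates slots with a symmetric pooling operation, shuffling the slot order leaves the policy input unchanged. This is a generic sketch of permutation-invariant pooling, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
slots = rng.normal(size=(5, 16))  # 5 slot vectors, 16-dim each

def pool(slots):
    # Symmetric pooling: concatenate mean and max over the slot axis.
    # Any reordering of rows yields the same result.
    return np.concatenate([slots.mean(axis=0), slots.max(axis=0)])

shuffled = slots[rng.permutation(len(slots))]
print(np.allclose(pool(slots), pool(shuffled)))  # True
```

Because the pooled vector is identical under any slot permutation, occasional swaps in which object lands in which slot do not change what the policy sees.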

RealRobot: Apple in Green Pot

RealRobot: Pineapple in Green Pot

RealRobot: Eggplant in Green Pot

RLBench: Pick up Cup

RLBench: Put Rubbish in Bin

RLBench: Stack Wine

RLBench: Phone on Base

RLBench: Water Plants

RLBench: Close Box

RLBench: Close Laptop