Composing Pre-Trained Object-Centric Representations for Robotics
From "What" and "Where" Foundation Models
Junyao Shi*, Jianing Qian*, Jason Ma, Dinesh Jayaraman
*Equal Contribution
University of Pennsylvania
ICRA 2024
Paper | Video Summary | Supplementary Materials | Code (Coming Soon)
Overview
POCR builds Pre-Trained Object-Centric Representations for Robotics by chaining “what” and “where” foundation models. The “where” foundation model produces a set of segmentation masks representing object candidates in the scene, and slot binding selects which of them to bind to the slots in our object-centric representation. The image contents of each slot are represented by the “what” foundation model's features together with the slot mask's bounding box coordinates. The robot then learns policies over these slot representations.
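To make the composition concrete, here is a minimal sketch of how a POCR-style slot representation could be assembled from a list of candidate masks and a generic pre-trained “what” encoder. The mask format follows SAM's automatic mask generator output; the slot-binding step shown here (keeping the largest-area masks) is a simplified stand-in rather than the paper's procedure, and what_encoder is a placeholder for any pre-trained feature extractor.

import numpy as np

def pocr_slots(image, masks, what_encoder, num_slots=8):
    # image: HxWx3 RGB array; masks: list of dicts with "segmentation" (HxW bool),
    # "bbox" (x, y, w, h), and "area", as produced by SAM's automatic mask generator.
    H, W = image.shape[:2]
    # Simplified slot binding: keep the largest candidate masks. (The paper's binding
    # procedure additionally keeps slot assignments stable across timesteps.)
    candidates = sorted(masks, key=lambda m: m["area"], reverse=True)[:num_slots]
    slots = []
    for m in candidates:
        seg = m["segmentation"][..., None]                   # HxWx1 boolean mask
        what = what_encoder(image * seg)                     # "what": pre-trained features of the masked entity
        x, y, w, h = m["bbox"]
        where = np.array([x / W, y / H, (x + w) / W, (y + h) / H])  # "where": normalized box corners
        slots.append(np.concatenate([what, where]))
    return np.stack(slots)                                   # (num_slots, d_what + 4)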
Abstract
There have recently been large advances both in pre-training visual representations for robotic control and in segmenting unknown-category objects in general images. To leverage these for improved robot learning, we propose POCR, a new framework for building pre-trained object-centric representations for robotic control. Building on theories of “what-where” representations in psychology and computer vision, we use segmentations from a pre-trained model to stably locate various entities in the scene across timesteps, capturing “where” information. To each such segmented entity, we apply other pre-trained models that build vector descriptions suitable for robotic control tasks, thus capturing “what” the entity is. Our pre-trained object-centric representation for control is therefore constructed by appropriately combining the outputs of off-the-shelf pre-trained models, with no new training. On various simulated and real robotic tasks, we show that imitation policies for robotic manipulators trained on POCR achieve better performance and systematic generalization than state-of-the-art pre-trained representations for robotics, as well as prior object-centric representations that are typically trained from scratch.
Results
In both simulated and real-world settings, POCR with SAM as the “where” component and LIV as the “what” component outperforms the best current representations for multi-object manipulation.
POCR vs. Prior Representations (simulation)
POCR vs. Prior Representations (real-world)
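For reference, the “where” masks in this instantiation can be obtained with the off-the-shelf Segment Anything API; a minimal sketch is below. The checkpoint path and input frame are placeholders, and the LIV “what” features would then be computed per mask as sketched in the Overview.

import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load the publicly released SAM ViT-H checkpoint (path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

# One RGB frame from a demonstration (placeholder filename).
image = cv2.cvtColor(cv2.imread("frame.png"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)
# Each entry is a dict with "segmentation" (HxW bool), "bbox" (x, y, w, h), "area", ...
# These are the candidate object masks passed to slot binding.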
Real-World Policy Rollouts
We evaluate POCR on real-world robotic manipulation tasks. Our real-world environment consists of a realistic countertop kitchen setup with a variety of distractors, in which a Franka robot is tasked with placing various fruits (apple, eggplant, pineapple) in a green pot located on the far side of the table. When used as the visual representation for behavior cloning, POCR offers substantial gains over the prior state of the art. See Section V-E in the paper for more details. Below, we provide some example success rollouts from POCR.
POCR on Pineapple in Green Pot
POCR on Apple in Green Pot
POCR on Eggplant in Green Pot
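The rollouts above are produced by a behavior-cloned policy that consumes POCR slot representations. As an illustration only, here is a minimal sketch of one way to build a permutation-invariant policy head over slots, using a shared per-slot MLP followed by symmetric pooling; the layer sizes, pooling choice, and dimensions are assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class SlotPolicy(nn.Module):
    # A permutation-invariant policy head: a shared MLP encodes each slot, a symmetric
    # (mean) pooling aggregates them, and a small head predicts the action, so the
    # output is unchanged if slots arrive in a different order.
    def __init__(self, slot_dim, action_dim, hidden=256):
        super().__init__()
        self.per_slot = nn.Sequential(nn.Linear(slot_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, action_dim))

    def forward(self, slots):          # slots: (batch, num_slots, slot_dim)
        h = self.per_slot(slots)       # shared encoder applied to every slot
        pooled = h.mean(dim=1)         # symmetric pooling -> order invariance
        return self.head(pooled)

# Behavior cloning: regress demonstrated actions from slot representations.
# policy = SlotPolicy(slot_dim=1028, action_dim=7)   # dims are illustrative
# loss = nn.functional.mse_loss(policy(slot_batch), action_batch)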
Qualitative Visualization of Masks
We show how POCR uses the Segment Anything Model (SAM) to segment scenes from various real-world and simulated multi-object manipulation tasks. Here is how to interpret the figures below:
1st row: the original RGB images from a demonstration sequence.
2nd row: the corresponding overlay of object masks for each image. Each color corresponds to a different object slot.
3rd row: the per-slot breakdown of assignments for each image. The leftmost column shows the original RGB image, the second column shows the background, and the remaining columns show the object slots.
As the visualizations show, POCR accurately identifies the foreground and background in the scene, and it consistently tracks target and distractor objects across frames in a video. POCR occasionally makes mistakes when binding objects to their respective slots, especially in real-world settings (for instance, the bottom-left pan in the "RealRobot: Apple in Green Pot" visualization). However, thanks to our permutation-invariant policy architecture, this binding noise does not significantly affect policy learning performance.
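For completeness, second-row overlays like the ones above can be reproduced with a few lines of plotting code; a sketch is below, assuming one boolean mask per bound slot (the color choices are arbitrary).

import numpy as np
import matplotlib.pyplot as plt

def overlay_slots(image, slot_masks):
    # image: HxWx3 uint8 RGB frame; slot_masks: list of HxW boolean arrays, one per slot.
    overlay = image.astype(float) / 255.0
    colors = plt.cm.tab10(np.linspace(0, 1, max(len(slot_masks), 1)))[:, :3]
    for mask, color in zip(slot_masks, colors):
        overlay[mask] = 0.5 * overlay[mask] + 0.5 * color   # alpha-blend the slot color
    return overlay

# plt.imshow(overlay_slots(image, slot_masks)); plt.axis("off"); plt.show()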
RealRobot: Apple in Green Pot
RealRobot: Pineapple in Green Pot
RealRobot: Eggplant in Green Pot
RLBench: Pick up Cup
RLBench: Put Rubbish in Bin
RLBench: Stack Wine
RLBench: Phone on Base
RLBench: Water Plants
RLBench: Close Box
RLBench: Close Laptop