Junyao Shi*, Jianing Qian*, Jason Ma, Dinesh Jayaraman
*Equal Contribution
University of Pennsylvania
ICRA 2024
There have recently been large advances both in pre-training visual representations for robotic control and in segmenting unknown-category objects in general images. To leverage these for improved robot learning, we propose POCR, a new framework for building pre-trained object-centric representations for robotic control. Building on theories of “what-where” representations in psychology and computer vision, we use segmentations from a pre-trained model to stably locate the various entities in the scene across timesteps, capturing “where” information. To each such segmented entity, we apply other pre-trained models that build vector descriptions suitable for robotic control tasks, capturing “what” the entity is. Our pre-trained object-centric representation for control is thus constructed by appropriately combining the outputs of off-the-shelf pre-trained models, with no new training. On various simulated and real robotic tasks, we show that imitation policies for robotic manipulators trained on POCR achieve better performance and systematic generalization than state-of-the-art pre-trained representations for robotics, as well as prior object-centric representations that are typically trained from scratch.
In both simulation and real-world settings, POCR with SAM as the “where” component and LIV as the “what” component outperforms the best current representations for multi-object manipulation.
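The high-level recipe can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the helper encode_patch and the simple centroid/size "where" features are placeholders for the actual pre-trained models (SAM producing the per-object masks, LIV producing the per-object embeddings).

```python
import numpy as np

def encode_patch(patch):
    """Hypothetical stand-in for a pre-trained 'what' encoder (e.g., LIV).
    Returns a small fixed-size placeholder embedding (per-channel mean/std)."""
    flat = patch.reshape(-1, patch.shape[-1]).astype(np.float32)
    return np.concatenate([flat.mean(axis=0), flat.std(axis=0)])

def pocr_representation(image, masks, max_slots=8):
    """Build a POCR-style object-centric state: for each segmented entity
    ('where', a binary mask), embed the masked region ('what') and append
    normalized location/size features. Slots are zero-padded so every
    timestep yields a fixed (max_slots, D) array."""
    H, W = image.shape[:2]
    slots = []
    for mask in masks[:max_slots]:
        ys, xs = np.nonzero(mask)
        if len(ys) == 0:
            continue
        # 'what': embed only the pixels belonging to this entity
        masked = image * mask[..., None]
        patch = masked[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        what = encode_patch(patch)
        # 'where': normalized centroid and bounding-box size of the mask
        where = np.array([ys.mean() / H, xs.mean() / W,
                          (ys.max() - ys.min() + 1) / H,
                          (xs.max() - xs.min() + 1) / W], dtype=np.float32)
        slots.append(np.concatenate([what, where]))
    slots = np.stack(slots) if slots else np.zeros((0, 1), dtype=np.float32)
    out = np.zeros((max_slots, slots.shape[1]), dtype=np.float32)
    out[:len(slots)] = slots
    return out

# Toy usage: a 64x64 RGB frame with two fake object masks.
img = np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8)
m1 = np.zeros((64, 64), dtype=bool); m1[10:20, 10:20] = True
m2 = np.zeros((64, 64), dtype=bool); m2[40:55, 30:50] = True
state = pocr_representation(img, [m1, m2])
print(state.shape)  # fixed-size, slot-structured representation
```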
POCR vs. Prior Representations (simulation)
POCR vs. Prior Representations (real-world)
POCR on Pineapple in Green Pot
POCR on Apple in Green Pot
POCR on Eggplant in Green Pot
As the visualizations demonstrate, POCR accurately identifies the foreground and background in the scene, and it consistently tracks target and distractor objects across frames in a video. POCR occasionally makes mistakes when binding objects to their respective slots, especially in real-world settings (for instance, see the bottom-left pan in the visualization of "RealRobot: Apple in Green Pot"). However, due to our permutation-invariant policy architecture, this noise does not significantly affect policy learning performance.
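To illustrate why slot-binding noise is tolerable, below is a minimal sketch of one standard way to make a policy head invariant to slot ordering (a Deep Sets-style architecture with mean pooling). This is an assumed, illustrative design; the exact POCR policy architecture is described in the paper and may differ in its details.

```python
import torch
import torch.nn as nn

class SlotInvariantPolicy(nn.Module):
    """Each slot is encoded independently, slot features are mean-pooled
    (a permutation-invariant operation), and the pooled feature is mapped
    to an action, so shuffling the slots leaves the action unchanged."""
    def __init__(self, slot_dim, action_dim, hidden=256):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(slot_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, action_dim))

    def forward(self, slots):           # slots: (batch, num_slots, slot_dim)
        per_slot = self.phi(slots)      # (batch, num_slots, hidden)
        pooled = per_slot.mean(dim=1)   # permutation-invariant pooling
        return self.rho(pooled)         # (batch, action_dim)

# Shuffling the slot order does not change the predicted action.
policy = SlotInvariantPolicy(slot_dim=10, action_dim=7)
slots = torch.randn(4, 8, 10)
a1 = policy(slots)
a2 = policy(slots[:, torch.randperm(8)])
assert torch.allclose(a1, a2, atol=1e-4)
```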
RealRobot: Apple in Green Pot
RealRobot: Pineapple in Green Pot
RealRobot: Eggplant in Green Pot
RLBench: Pick up Cup
RLBench: Put Rubbish in Bin
RLBench: Stack Wine
RLBench: Phone on Base
RLBench: Water Plants
RLBench: Close Box
RLBench: Close Laptop