SORNet: Spatial Object-Centric Representations for Sequential Manipulation 

Wentao Yuan, Chris Paxton, Karthik Desingh, Dieter Fox

University of Washington, NVIDIA

Code and Data

An implementation of SORNet using PyTorch, as well as the training and evaluation data, can be found at


Sequential manipulation tasks require a robot to constantly reason about spatial relationships among entities in the scene. Prior works relying on explicit state estimation or end-to-end learning struggle with novel objects or novel tasks. Thus, we propose SORNet (Spatial Object-centric Representation Network), a framework for extracting object-centric representations from RGB observations that enables zero-shot generalization to unseen objects on various spatial reasoning tasks. The video on the right shows SORNet in action on a real robot.


SORNet consists of two parts. The embedding network extracts object-centric embeddings from RGB images conditioned on canonical views of the objects of interest. A set of readout networks take the embedding vectors and predict discrete or continuous spatial relations among entities in the scene. Note that the object queries (canonical object views) can be captured under conditions different from the input image (e.g. with different lighting and camera view).

Embedding Network

The embedding network takes an RGB image and a set of object queries represented as canonical views, and outputs one object embedding per query. The RGB image is broken into context patches of the same size as the canonical views. These patches are flattened, projected and passed through a multi-layer multi-head transformer. The output embeddings corresponding to the canonical views are used for downstream tasks such as relation prediction. Additional views and observation modalities such as depth can optionally be added to the network. The top left inset shows examples of canonical object views used during training.
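The patch-plus-query tokenization described above can be sketched in PyTorch as follows. This is a minimal illustration, not the paper's implementation: the patch size, embedding dimension, layer counts, and the omission of positional embeddings are all simplifying assumptions.

```python
import torch
import torch.nn as nn

class EmbeddingNetSketch(nn.Module):
    """Sketch of a SORNet-style embedding network.

    The RGB image is split into fixed-size context patches; each patch and
    each canonical object view is flattened, linearly projected, and fed
    jointly to a transformer encoder. The output tokens at the query
    positions serve as object-centric embeddings. All hyperparameters here
    are illustrative, and positional embeddings are omitted for brevity.
    """
    def __init__(self, patch=32, dim=128, heads=4, layers=2):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(3 * patch * patch, dim)
        enc = nn.TransformerEncoderLayer(
            dim, heads, dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, layers)

    def patchify(self, img):
        # img: (B, 3, H, W) -> (B, N, 3*patch*patch) flattened patches
        B, C, H, W = img.shape
        p = self.patch
        x = img.unfold(2, p, p).unfold(3, p, p)   # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        return x

    def forward(self, img, queries):
        # queries: (B, Q, 3, patch, patch) canonical object views
        B, Q = queries.shape[:2]
        ctx = self.proj(self.patchify(img))          # context patch tokens
        qry = self.proj(queries.reshape(B, Q, -1))   # object query tokens
        out = self.encoder(torch.cat([qry, ctx], dim=1))
        return out[:, :Q]                            # (B, Q, dim) embeddings
```

Because the query views are ordinary tokens, swapping in canonical views of unseen objects requires no architectural change, which is what enables the zero-shot generalization described above.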

Readout Networks

Readout networks are simple MLPs which use the object embeddings to predict spatial relations, such as logical statements that can serve as skill preconditions, or continuous 3D directions. The readout networks accommodate any number of input embeddings without changing their parameters.
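A pairwise readout of the kind described above might look like the sketch below. The hidden size and output dimension are illustrative assumptions; the key property is that the same MLP weights apply to any pair of object embeddings, so the number of objects can vary without retraining.

```python
import torch
import torch.nn as nn

class PairwiseReadout(nn.Module):
    """Sketch of a readout network for pairwise spatial relations.

    Concatenates the embeddings of two queried objects and maps them
    through a small MLP. With out=1 the output is a logit for a discrete
    predicate; with out=3 it could regress a 3D direction. Sizes are
    illustrative, not the paper's values.
    """
    def __init__(self, dim=128, hidden=256, out=1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out),
        )

    def forward(self, emb_a, emb_b):
        # emb_a, emb_b: (B, dim) embeddings of the two objects
        return self.mlp(torch.cat([emb_a, emb_b], dim=-1))
```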


In order to train and evaluate different methods on spatial reasoning for robot manipulation, we built two large-scale simulation datasets, "leonardo" and "kitchen". They contain visually and physically realistic multi-view RGB-D sequences of manipulation scenes, and include features such as clutter and occlusion by the robot arm that are absent from existing datasets.

Example sequences from the kitchen dataset, featuring long task horizon, large object variety, clutter, occlusion and complex object relations. In addition to the tabletop, objects can be placed on top of each other or on the shelf beside the table.

Example tasks from the leonardo dataset, which involve complex reasoning and long-horizon task planning.

Downstream Tasks

Predicate Classification

Given an RGB observation from a sensor, along with canonical object views for querying objects of interest in the scene, SORNet outputs an object-centric embedding for each query object. A set of readout networks are trained to classify spatial relations between objects as True or False.
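At inference time, turning raw readout outputs into True/False predicate labels amounts to thresholding the logits. A minimal sketch, where the predicate names and the zero threshold are illustrative assumptions:

```python
import torch

def classify_predicates(logits, names, threshold=0.0):
    """Map raw readout logits to True/False predicate labels.

    logits: (P,) tensor, one logit per candidate predicate
            (e.g. "on_surface(mustard, table)" -- names are illustrative).
    names:  list of P predicate strings.
    A logit above the threshold is classified as True.
    """
    return {n: bool(l > threshold) for n, l in zip(names, logits.tolist())}

# Example: a positive logit yields True, a negative one False.
labels = classify_predicates(
    torch.tensor([2.3, -1.1]),
    ["on_surface(mustard, table)", "stacked(bowl, mug)"])
```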

Qualitative predicate classification results on real-world tabletop scenes. None of the objects shown here were seen by the network during training. Each column is a different scenario. The first and second rows show the side and front views of the scene respectively, followed by canonical views of the query objects. Black text denotes correctly labelled true predicates; blue text denotes false positives; and red text denotes false negatives. True negatives are not shown due to limited space.

Relative Direction Regression 

What else can the embedding capture?

Although the embedding network was trained only with logical predicates, the large amount of training data, coupled with the transformer-based network architecture, enables the network to capture continuous spatial information.

We froze the embedding network and trained readout networks to regress the relative 3D direction between entities in the scene. In the figure on the left, the left column shows the relative direction from the end-effector to the center of each object, and the right column shows the relative direction between object centers. Network predictions are plotted as solid lines and ground truths as dotted lines.


With the ability to predict the 3D direction from the robot end-effector to a target object, we can use SORNet to generate position control targets that guide the robot to move toward a particular object, as shown in the video on the left.
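One simple way to turn a predicted direction into a position control target is to step the end-effector along the normalized direction. This is a sketch of that idea only; the step size and the open-loop update are illustrative assumptions, not the controller used on the robot.

```python
import torch

def position_target(ee_pos, direction, step=0.05):
    """Step toward a target object along a predicted direction.

    ee_pos:    (3,) current end-effector position.
    direction: (3,) predicted direction from end-effector to object
               (need not be unit length; it is normalized here).
    step:      distance to move per control cycle (illustrative value).
    Returns the next position control target.
    """
    d = direction / (direction.norm() + 1e-8)  # avoid division by zero
    return ee_pos + step * d
```

Re-predicting the direction from a fresh observation at every control cycle would close the loop, letting the robot servo toward the object as in the video.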