Guangyao Zhai, Xiaoni Cai, Dianye Huang, Yan Di,
Fabian Manhardt, Federico Tombari, Nassir Navab, Benjamin Busam
ICRA 2024
TL;DR
This is the project website for SG-Bot, a novel framework that rearranges objects via scene imagination on scene graphs.
SG-Bot stacks three stages: Observation, Imagination, and Execution. The first stage extracts objects as semantic nodes. In the second stage, these nodes form a scene graph using preserved knowledge, representing a coarse goal state; the scene graph is subsequently translated into a plausible scene as a fine goal state. In the final stage, the initial scene is matched against the goal scene to generate rearrangement policies. The embodiment keeps rearranging objects until the goal state is achieved, i.e., every object is at its target pose.
Results
Simulation comparison:
Real-world trials:
Baseline Reproduction
At each step, StructFormer [1] selects an object and reasons about its movement from the current observation of the selected object and the previous movement, in an autoregressive way.
Following the original format, we prepare our dataset by splitting the transformation from the initial to the goal scene into several steps to obtain the autoregressive fashion. As shown in the figure above, the red points represent the current and previous rearrangement. We also manually define that the category 'Obstacle' (unknown objects) remains static during rearrangement, to mimic the Object Selection process in the original paper; these objects are removed from the table only after all other objects are in position. After this preparation, we fully trained the network on the same training split as ours.
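To make the parsing concrete, below is a minimal sketch of this conversion; the pose convention, field names, and helper functions are illustrative assumptions, not the actual dataset code.

```python
# Illustrative sketch: given per-object initial and goal poses (ordered by the
# rearrangement order), emit one autoregressive training sample per step.
import numpy as np

def relative_transform(T_init: np.ndarray, T_goal: np.ndarray) -> np.ndarray:
    """4x4 transform that moves an object from its initial to its goal pose."""
    return T_goal @ np.linalg.inv(T_init)

def make_autoregressive_samples(objects):
    """objects: list of dicts with 'name', 'category', 'T_init', 'T_goal'."""
    samples, prev_move = [], np.eye(4)
    for obj in objects:
        if obj["category"] == "Obstacle":
            continue  # obstacles stay static; removed only at the very end
        move = relative_transform(obj["T_init"], obj["T_goal"])
        samples.append({
            "object": obj["name"],
            "previous_movement": prev_move.copy(),
            "target_movement": move,
        })
        prev_move = move
    return samples

# Toy example: a fork is shifted by 10 cm along x, an obstacle stays put.
eye = np.eye(4)
shift = np.eye(4); shift[0, 3] = 0.1
print(make_autoregressive_samples([
    {"name": "fork", "category": "Cutlery", "T_init": eye, "T_goal": shift},
    {"name": "can", "category": "Obstacle", "T_init": eye, "T_goal": eye},
]))
```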
Our dataset in StructFormer's format will be uploaded here. The reproduced StructFormer will be uploaded here.
[1] Liu, et al. "Structformer: Learning spatial structure for language-guided semantic rearrangement of novel objects." ICRA 2022.
[2] Zeng, et al. "Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language." ICLR 2023.
[3] Gu, et al. "Open-vocabulary Object Detection via Vision and Language Knowledge Distillation." ICLR 2022.
[4] OpenAI. "ChatGPT." https://openai.com/blog/chatgpt/, 2023. Accessed: 2023-02-08.
[5] Shridhar, et al. "CLIPort: What and where pathways for robotic manipulation." CoRL 2022.
Socratic Models [2] stacks several language-driven models: ViLD [3], ChatGPT [4], and CLIPort [5], of which only the last needs to be trained. Unlike StructFormer, which works sequentially, CLIPort is a step-wise 2D planning network that acts on the current language description without requiring connections between adjacent steps.
Following the original repository, we prepare the dataset by recording each object's target position in the initial scene and pairing the transformation from source to target position with a spatial description, as shown in the xy map above (top-down view). Assume a fork is supposed to be placed to the right of and close to the plate: we first record its goal position in the initial scene and then compute the displacement between the initial and goal positions. Finally, we compose a sentence describing the relationship between the fork and a randomly selected object using this displacement, for example, "Put the fork to the right of and close by the plate." Note that for better generalization, the description can also be rephrased to describe the target relative to another object, such as a knife: "Put the fork to the right of the knife."
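A minimal sketch of how such a sentence could be composed from the recorded displacement is given below; the thresholds and wording rules are assumptions for illustration, not the exact rules used in the dataset.

```python
# Verbalize a top-down (x, y) displacement into a CLIPort-style instruction.
def describe_displacement(dx: float, dy: float, near_thresh: float = 0.15) -> str:
    parts = []
    if abs(dx) > 1e-3:
        parts.append("to the right of" if dx > 0 else "to the left of")
    if abs(dy) > 1e-3:
        parts.append("in front of" if dy > 0 else "behind")
    relation = " and ".join(parts) if parts else "next to"
    distance = "close by" if (dx**2 + dy**2) ** 0.5 < near_thresh else "far from"
    return f"{relation} and {distance}"

def make_instruction(source: str, anchor: str, dx: float, dy: float) -> str:
    return f"Put the {source} {describe_displacement(dx, dy)} the {anchor}."

print(make_instruction("fork", "plate", dx=0.08, dy=0.0))
# -> "Put the fork to the right of and close by the plate."
```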
Our dataset in CLIPort's format will be uploaded here. The reproduced Socratic Models will be uploaded here.
Baselines vs. Ours on Methodology
StructFormer models the goal implicitly by autoregressively estimating an object's relative pose from its points and the estimated relative pose of the previous object.
As shown in the figure above, the framework starts with the sentence "Set the table." It first predicts the structure pose T0, upon which it selects an object P1 and predicts its relative pose T1. After rearranging P1, the prediction of T2 considers both P2 and T1, and so forth.
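The control flow of this autoregressive goal modeling can be sketched as follows; the three prediction functions stand in for StructFormer's learned modules, and their signatures are assumptions made for illustration.

```python
# Control-flow sketch of autoregressive goal modeling (placeholder modules).
import numpy as np

def predict_structure_pose(instruction, scene_points):
    return np.eye(4)  # placeholder for the learned T0 prediction

def select_object(scene_points, placed):
    remaining = [k for k in scene_points if k not in placed]
    return remaining[0] if remaining else None

def predict_relative_pose(points, prev_pose):
    return prev_pose.copy()  # placeholder for the learned Ti prediction

def plan_rearrangement(instruction, scene_points):
    prev_pose = predict_structure_pose(instruction, scene_points)      # T0
    placed, plan = set(), []
    while (name := select_object(scene_points, placed)) is not None:
        pose = predict_relative_pose(scene_points[name], prev_pose)    # Ti from Pi and T(i-1)
        plan.append((name, pose))
        placed.add(name)
        prev_pose = pose
    return plan

print(plan_rearrangement("Set the table.", {"plate": None, "fork": None}))
```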
Socratic Models obtains explicit goal states from GPT descriptions. It needs long templates encompassing experiences from the training dataset; the purpose of these templates is to let GPT use its contextual-completion ability to produce a solution for a new scene.
For example, in the first part of the figure above, we design n different template blocks, each containing a solution for a scene rearrangement. In the second part, the framework uses a detection module to extract objects as prompts and sends them to GPT along with the templates. The steps output by GPT are then converted into sentences, which CLIPort parses into movements.
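A rough sketch of how the prompt could be assembled from the template blocks and the detected objects is shown below; the template wording is a placeholder, not the exact prompt used by the baseline.

```python
# Assemble a few-shot prompt: example template blocks followed by the query scene.
TEMPLATE_BLOCKS = [
    "Scene: plate, fork, knife.\n"
    "Steps:\n"
    "1. Put the fork to the left of and close by the plate.\n"
    "2. Put the knife to the right of and close by the plate.",
    # ... further template blocks covering other rearrangement examples
]

def build_prompt(detected_objects):
    examples = "\n\n".join(TEMPLATE_BLOCKS)
    query = "Scene: " + ", ".join(detected_objects) + ".\nSteps:"
    return examples + "\n\n" + query

prompt = build_prompt(["plate", "fork", "cup"])
print(prompt)  # sent to GPT; its answer is converted into CLIPort instructions
```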
SG-Bot obtains the goal states in a coarse-to-fine way. The coarse state is a scene graph describing pairwise relationships between objects; the fine state is a goal scene imagined from the scene graph.
As shown in the figure above, SG-Bot first extracts objects as nodes with a segmentation module. The nodes are then connected into a scene graph using either commonsense knowledge or user-defined rules. The final goal state is a scene generated from the scene graph by the generative scene model Graph-to-3D, and this scene serves as the final guidance for object rearrangement.
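A minimal sketch of the graph construction step under user-defined rules follows; the rule table is a toy stand-in for the commonsense or user-defined knowledge.

```python
# Detected object classes become nodes; pairwise edges come from simple rules.
RULES = {
    ("fork", "plate"): "left of",
    ("knife", "plate"): "right of",
    ("cup", "plate"): "right front of",
}

def build_scene_graph(classes):
    nodes = list(classes)
    edges = []
    for a in nodes:
        for b in nodes:
            if (a, b) in RULES:
                edges.append((a, RULES[(a, b)], b))
    return nodes, edges

print(build_scene_graph(["plate", "fork", "cup"]))
# -> nodes and relation triplets such as ('fork', 'left of', 'plate')
```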
SG-Bot: Inference Pipeline
Observation: Given an RGB-D image, we extract objects from the initial scene with a segmentation module. Concurrently, the depth is back-projected and segmented into object point clouds, each of which is normalized in the camera view.
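A short sketch of this step, assuming pinhole intrinsics and an instance mask; the intrinsic and depth values below are made up for illustration.

```python
# Back-project the depth map, split it by the instance mask, and normalize
# each object's points in the camera frame.
import numpy as np

def backproject(depth, K):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=-1)            # (H, W, 3) in camera frame

def per_object_points(depth, mask, K):
    points = backproject(depth, K)
    objects = {}
    for obj_id in np.unique(mask):
        if obj_id == 0:                             # 0 = background
            continue
        pts = points[mask == obj_id]
        center = pts.mean(axis=0)
        scale = np.linalg.norm(pts - center, axis=1).max()
        objects[obj_id] = (pts - center) / scale    # normalized per object
    return objects

K = np.array([[600., 0., 320.], [0., 600., 240.], [0., 0., 1.]])
depth = np.full((480, 640), 0.8)
mask = np.zeros((480, 640), dtype=int); mask[200:260, 300:380] = 1
print(per_object_points(depth, mask, K)[1].shape)
```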
Imagination: We use a graph constructor to model these objects as a semantic scene graph, which serves as the coarse goal state. The scene graph is then transformed into a plausible scene for the fine imagination. To keep the generated shapes consistent with those in the initial scene, a trained shape encoder encodes shape priors from the initial scene and injects them into the graph. The layout decoder then decodes bounding boxes as the layout, while the code decoder produces shape codes, which are further decoded into canonical shapes that populate the layout to form the scene.
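The data flow of this stage can be sketched as follows; every function is a placeholder for a learned module (shape encoder, layout decoder, code/shape decoder), and the tensor sizes are assumptions, not the real model.

```python
# Data-flow sketch of the imagination stage with placeholder modules.
import numpy as np

def encode_shape_prior(partial_points):   # shape encoder: partial shape -> prior
    return np.zeros(128)

def decode_layout(node_feature):          # layout decoder -> bounding box
    return {"size": np.ones(3) * 0.1, "position": np.zeros(3), "yaw": 0.0}

def decode_shape(node_feature):           # code decoder + shape decoder -> canonical points
    return np.random.rand(1024, 3) - 0.5

def imagine_scene(initial_objects):
    goal_scene = {}
    for name, partial_points in initial_objects.items():
        feat = encode_shape_prior(partial_points)             # prior injected into the node
        box = decode_layout(feat)
        shape = decode_shape(feat) * box["size"] + box["position"]  # populate the layout
        goal_scene[name] = {"box": box, "points": shape}
    return goal_scene

print(list(imagine_scene({"plate": np.random.rand(500, 3)}).keys()))
```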
Execution: The initial observation is matched with the goal scene to generate a policy, according to which the robot rearranges an object and updates the observation. The observation is matched with the goal again, and so forth, until all objects are at their target poses.
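A sketch of this closed-loop execution is given below; the registration and robot calls are placeholders (the real system matches observed point clouds against the imagined goal scene).

```python
# Closed-loop execution sketch: match, move one object, re-observe, repeat.
import numpy as np

def register(observed_points, goal_points):
    """Placeholder for point-cloud registration (e.g. ICP): returns the 4x4
    transform that moves the observed object onto its goal pose."""
    T = np.eye(4)
    T[:3, 3] = goal_points.mean(axis=0) - observed_points.mean(axis=0)
    return T

def at_goal(T, tol=0.01):
    return np.linalg.norm(T[:3, 3]) < tol and np.allclose(T[:3, :3], np.eye(3), atol=1e-2)

def execute(observe, goal_scene, move_object, max_steps=20):
    for _ in range(max_steps):
        observation = observe()                               # current object point clouds
        transforms = {n: register(pts, goal_scene[n]) for n, pts in observation.items()}
        pending = [n for n, T in transforms.items() if not at_goal(T)]
        if not pending:
            return True                                       # all objects at their target poses
        move_object(pending[0], transforms[pending[0]])       # rearrange one object, then re-observe
    return False
```

Moving one object per iteration and re-observing keeps the loop robust to execution noise, since each new policy is computed from the latest observation.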
SG-Bot: Modular Training
a) AE and AD are trained on full shapes in the canonical view to obtain the shape code α, while BE and BD are trained on partial shapes from the initial scenes in the camera view to obtain the shape priors β. AD and BE are retained for inference.
b) A scene graph with textual information is processed through embedding layers MO and MΓ to obtain implicit class features ci and ci→j on each node and edge, respectively.
c) To train Graph-to-3D on goal scenes, the processed scene graph is first concatenated with α on the shape branch ΦE–ΦD and with the bounding box parameters B on the layout branch LE–LD. Φ and L jointly model the latent shape-aware scene graph.
Modules in b) and c) are jointly trained, with MO, MΓ, ΦD and LD used during inference.
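To make the concatenation in c) concrete, a tiny tensor-level sketch is given below; the feature dimensions are arbitrary stand-ins, not the actual model sizes.

```python
# Per-node inputs to the two Graph-to-3D branches during training.
import torch

num_nodes, c_dim, alpha_dim, box_dim = 5, 64, 128, 7   # box_dim: size + position + yaw (assumed)
c_i   = torch.randn(num_nodes, c_dim)       # node class features from MO
alpha = torch.randn(num_nodes, alpha_dim)   # shape codes from the frozen autoencoder
boxes = torch.randn(num_nodes, box_dim)     # bounding-box parameters B

shape_branch_in  = torch.cat([c_i, alpha], dim=-1)   # fed to the shape branch  (ΦE)
layout_branch_in = torch.cat([c_i, boxes], dim=-1)   # fed to the layout branch (LE)
print(shape_branch_in.shape, layout_branch_in.shape)
```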
If this work has helped your research, please consider citing it:
@article{zhai2023sgbot,
  title={SG-Bot: Object Rearrangement via Coarse-to-Fine Robotic Imagination on Scene Graphs},
  author={Zhai, Guangyao and Cai, Xiaoni and Huang, Dianye and Di, Yan and Manhardt, Fabian and Tombari, Federico and Navab, Nassir and Busam, Benjamin},
  journal={arXiv preprint arXiv:2309.12188},
  year={2023}
}