Compositional 3D scene synthesis has diverse applications across industries such as robotics, film, and video games, as it closely mirrors the complexity of real-world multi-object environments. Conventional works typically employ shape-retrieval-based frameworks, which naturally suffer from limited shape diversity. Recent progress has been made in object shape generation with generative models such as diffusion models, which improves shape fidelity. However, these approaches treat 3D shape generation and layout generation separately. The synthesized scenes are often hampered by layout collisions, suggesting that scene-level fidelity remains under-explored. In this paper, we aim to generate realistic and plausible 3D indoor scenes from scene graphs. To enrich the priors of the given scene graph inputs, a large language model (LLM) is utilized to aggregate global features with local node-wise and edge-wise features. With a unified graph encoder, graph features are extracted to guide joint layout-shape generation. Additional regularization is introduced to explicitly constrain the produced 3D layouts. Benchmarked on the SG-FRONT dataset, our method achieves better 3D scene synthesis, especially in terms of scene-level fidelity.
Unlike previous works that exhibit poor shape consistency or spatial arrangements, Planner3D synthesizes higher-fidelity 3D scenes with more realistic layout configurations while preserving shape consistency and diversity.
Given a scene graph describing the desired multi-object 3D scene using objects as nodes and their relationships as edges, Planner3D is able to synthesize realistic 3D scenes with consistent 3D object shapes and spatial layouts. Planner3D consists of two main components: a scene graph prior enhancement mechanism and a dual-branch encoder-decoder architecture for graph-to-scene generation.
First, the scene graph prior is enriched with an LLM and the vision-language model CLIP, explicitly aggregating node-wise, edge-wise, and global textual representations of the input scene graph. The aggregated representation is fed to Graph Convolutional Network (GCN) and Multilayer Perceptron (MLP) layers, which are trained to model the posterior distribution of the 3D scene conditioned on the given scene graph. With the learned distribution Z, we update the node representation by replacing the original layout vector with a random vector sampled from Z. A graph encoder then extracts graph features from the updated graph representation. A scene decoder takes the graph features as input and learns to generate 7-degree-of-freedom (7-DoF) 3D layouts and shape latents through a layout decoder and a shape decoder, respectively, as sketched below. Compositional 3D scenes are eventually synthesized by fitting the generated layouts with the 3D shapes reconstructed from the shape latents by the pre-trained Vector Quantized Variational Autoencoder (VQ-VAE) decoder.
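The following PyTorch sketch illustrates how such a dual-branch graph-to-scene decoder could be structured, assuming a simple triplet message-passing GCN and MLP heads. All class names, layer widths, and the shape-latent dimension here are illustrative assumptions, not the exact Planner3D implementation.

```python
# Minimal sketch of a graph encoder layer and a dual-branch scene decoder.
# Names, widths, and the shape-latent dimension are assumptions for illustration.
import torch
import torch.nn as nn


class GraphConvLayer(nn.Module):
    """A simple message-passing layer over (subject, predicate, object) triplets."""
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.upd = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, node_feat, edge_feat, edges):
        # edges: (E, 2) long tensor of (subject, object) node indices
        src, dst = edges[:, 0], edges[:, 1]
        triplet = torch.cat([node_feat[src], edge_feat, node_feat[dst]], dim=-1)
        msg = self.msg(triplet)
        agg = torch.zeros_like(node_feat).index_add_(0, dst, msg)  # aggregate messages per node
        return self.upd(torch.cat([node_feat, agg], dim=-1)), msg


class SceneDecoder(nn.Module):
    """Dual-branch decoder: 7-DoF layout (size, translation, yaw) + shape latent."""
    def __init__(self, dim, shape_latent_dim=512):
        super().__init__()
        self.layout_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 7))
        self.shape_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, shape_latent_dim))

    def forward(self, graph_feat):
        layout = self.layout_head(graph_feat)       # (N, 7): w, l, h, x, y, z, yaw
        shape_latent = self.shape_head(graph_feat)  # decoded to a 3D shape by a pre-trained VQ-VAE
        return layout, shape_latent


# Example usage with random features for a scene of 4 objects and 5 relations.
dim = 128
gcn, decoder = GraphConvLayer(dim), SceneDecoder(dim)
nodes, edge_feat = torch.randn(4, dim), torch.randn(5, dim)
edges = torch.randint(0, 4, (5, 2))
nodes, edge_feat = gcn(nodes, edge_feat, edges)
layout, shape_latent = decoder(nodes)  # (4, 7) boxes and (4, 512) shape codes
```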
On the one hand, our layout synthesis is constrained with IoU-based regularization (see the sketch below), which effectively alleviates object collisions compared to un-regularized layout regression approaches, as shown in the nightstand-double bed-wardrobe case (in the bedroom). On the other hand, we observe that the 3D shapes generated by Planner3D are more controllable and less random than those of prior works. As can be seen in the table-and-chair cases, the table is generated with a round top (in the living room) or with a single table leg (in the dining room), matching the arrangement of the surrounding chairs.
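As a concrete illustration of such an IoU-based layout regularizer, the sketch below computes a mean pairwise 3D IoU penalty over the predicted boxes. For simplicity it assumes axis-aligned boxes (yaw is ignored); the function name and this exact formulation are assumptions for illustration, not the paper's precise loss.

```python
# Hedged sketch of an IoU-style collision regularizer over predicted 7-DoF layouts.
# Boxes are treated as axis-aligned (yaw ignored), an approximation for illustration.
import torch


def pairwise_box_iou_loss(layout: torch.Tensor) -> torch.Tensor:
    """layout: (N, 7) = (w, l, h, x, y, z, yaw). Returns mean pairwise 3D IoU."""
    size, center = layout[:, :3], layout[:, 3:6]
    lo, hi = center - size / 2, center + size / 2                  # (N, 3) box corners
    # Pairwise intersection extents along each axis.
    inter_lo = torch.maximum(lo[:, None, :], lo[None, :, :])       # (N, N, 3)
    inter_hi = torch.minimum(hi[:, None, :], hi[None, :, :])
    inter = (inter_hi - inter_lo).clamp(min=0).prod(dim=-1)        # (N, N) overlap volume
    vol = size.prod(dim=-1)                                        # (N,) box volumes
    union = vol[:, None] + vol[None, :] - inter
    iou = inter / union.clamp(min=1e-8)
    # Exclude self-IoU on the diagonal; penalize overlap between distinct objects.
    mask = ~torch.eye(layout.shape[0], dtype=torch.bool, device=layout.device)
    return iou[mask].mean()
```

Adding this term to the layout regression loss pushes predicted boxes apart whenever they intersect, which is the behavior reflected in the bedroom example above.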
[Figure: qualitative examples for Bedroom, Living Room, and Dining Room scenes, each synthesized from its input Scene Graph.]