EchoScene

Indoor Scene Generation via Information Echo over Scene Graph Diffusion

Guangyao Zhai, Evin Pınar Örnek, Dave Zhenyu Chen, Ruotong Liao,

Yan Di, Nassir Navab, Federico Tombari, Benjamin Busam

ECCV 2024

TL;DR

We present EchoScene, a dual-branch diffusion model that generates and manipulates 3D scenes from given scene graphs. In both branches, each node is assigned its own denoising process, so the model supports graphs with a variable number of nodes and edges.

EchoScene incorporates an information echo scheme in which every node in the graph exchanges denoising data with the other nodes at each time step, so that the generated scene progressively satisfies the global graph constraints.

Texture-Rendering Demonstration

EchoScene

A bedroom

A living room

EchoScene🤝SceneTex

"A bedroom in mid-century style"

"A Living room in French-country style"

EchoScene Schematic

EchoScene leverages a dual-branch diffusion model to create 3D scenes from scene graphs. Within this model, each node undergoes a denoising process in both branches. These processes gain awareness of the global state through layout and shape "echoes" (depicted as waves in various colors), which are facilitated by an information exchange unit (the grey block) throughout the denoising phases.

Method:

I. Pipeline

We first convert a scene graph into a contextual graph, as introduced in CommonScenes. As shown in Figure A, we obtain the latent contextual graph (A.2) by encoding, and optionally manipulating, the relationships between the nodes of the contextual graph (A.1). Each node in the latent contextual graph encapsulates embeddings of its relations with other nodes.
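
To make the input concrete, here is a minimal sketch of a symbolic scene graph with one manipulation step. It only illustrates the data structure; the contextual encoding and the learned relation embeddings are not shown. The class name `SceneGraph`, its methods, and the example labels and predicates are all hypothetical and are not taken from the EchoScene or CommonScenes code.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Hypothetical container: nodes are object labels, edges are relations."""
    nodes: list = field(default_factory=list)   # e.g. ["bed", "nightstand"]
    edges: dict = field(default_factory=dict)   # (subject, object) -> predicate

    def add_node(self, label):
        # Append a new object and return its node index.
        self.nodes.append(label)
        return len(self.nodes) - 1

    def set_relation(self, subj, obj, predicate):
        # Add a relation, or overwrite an existing one (a manipulation).
        self.edges[(subj, obj)] = predicate

    def relations_of(self, node):
        # All (subject, predicate, object) triples that involve this node.
        return sorted((s, p, o) for (s, o), p in self.edges.items()
                      if node in (s, o))


graph = SceneGraph()
bed = graph.add_node("bed")
stand = graph.add_node("nightstand")
graph.set_relation(stand, bed, "left of")

# Manipulation: add a new object and change an existing relation.
lamp = graph.add_node("lamp")
graph.set_relation(lamp, stand, "standing on")
graph.set_relation(stand, bed, "right of")
```

In this toy version, `relations_of(node)` plays the role of the per-node relation context; in the actual pipeline that context is a learned embedding rather than a list of triples.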

Then, as shown in Figure B, these relation embeddings are sent to the diffusion-based layout branch (B.1) and shape branch (B.2), which generate the bounding boxes that form the scene layout and the shape of each object, respectively. Finally, the shapes are placed into the layout to synthesize the 3D scene, which can be textured using an off-the-shelf generator, SceneTex.
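
The final step, placing a generated shape into its bounding box, can be sketched as follows. This is an illustration under stated assumptions rather than the project's implementation: we assume each box is given as a center, a size, and a rotation angle about the vertical axis (taken here to be z), and that a shape is available as a point set. The function name `place_shape` is hypothetical.

```python
import numpy as np


def place_shape(points, center, size, yaw):
    """Fit a shape (N x 3 points) into an oriented bounding box.

    Assumed box parameterization: center (3,), size (3,), yaw in radians
    about the z axis.
    """
    pts = np.asarray(points, dtype=float)
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    extent = np.maximum(hi - lo, 1e-8)          # guard against flat shapes
    unit = (pts - (lo + hi) / 2.0) / extent     # normalize to [-0.5, 0.5]^3
    scaled = unit * np.asarray(size, dtype=float)
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])           # rotation about z
    return scaled @ rot.T + np.asarray(center, dtype=float)


# Usage: place a unit cube into a 2 x 4 x 6 box centered at (1, 2, 3).
cube = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)],
                dtype=float)
placed = place_shape(cube, center=(1, 2, 3), size=(2, 4, 6), yaw=0.0)
```

Normalizing the shape before scaling means the shape branch only has to produce geometry in a canonical frame, while the layout branch decides where the object goes and how large it is.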

II. Information echo scheme

The dual-branch architecture relies on an information echo scheme, which lets the per-node denoising processes exchange information with one another.

In the layout branch (Figure A), information echo becomes "layout echo." A layout echo happens at each denoising step. Every node sends its current diffused bounding box, its relation embedding, and the current time step to the layout exchange unit U_l, which performs message passing and feature aggregation. The aggregated features for each node then echo back to that node's own denoising process, making the denoised bounding boxes aware of other nodes' geometric information.

In the shape branch (Figure B), information echo becomes "shape echo." The dynamics of shape echo are similar to those of layout echo, except that diffused shape codes take the place of diffused bounding boxes. At each denoising step, the shapes are aware of each other, which keeps the global style (scene appearance) consistent.
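
The control flow of a layout echo can be sketched in a few lines of NumPy. This is a toy illustration, not the trained model: the exchange unit is replaced by a single round of mean-aggregated message passing with random weights, the update rule is a simplified stand-in for a real diffusion sampler, and the 7-dimensional box (size, location, one rotation angle) is an assumption. The names `exchange_unit` and `layout_echo_denoise` are hypothetical.

```python
import numpy as np


def exchange_unit(boxes, rel_emb, t, edges, w_msg, w_out):
    """Toy stand-in for the layout exchange unit (random, untrained weights)."""
    n = boxes.shape[0]
    t_feat = np.full((n, 1), t)
    # Each node sends its noisy box, its relation embedding and the time step.
    h = np.concatenate([boxes, rel_emb, t_feat], axis=1)
    msgs = np.tanh(h @ w_msg)
    # Message passing: average the messages arriving over incoming edges.
    agg = np.zeros_like(msgs)
    deg = np.zeros((n, 1))
    for src, dst in edges:
        agg[dst] += msgs[src]
        deg[dst] += 1
    agg /= np.maximum(deg, 1)
    # The aggregated features echo back as a per-node noise estimate.
    return np.concatenate([msgs, agg], axis=1) @ w_out


def layout_echo_denoise(rel_emb, edges, steps=10, box_dim=7, hidden=16, seed=0):
    rng = np.random.default_rng(seed)
    n = rel_emb.shape[0]
    w_msg = rng.normal(scale=0.1, size=(box_dim + rel_emb.shape[1] + 1, hidden))
    w_out = rng.normal(scale=0.1, size=(2 * hidden, box_dim))
    boxes = rng.normal(size=(n, box_dim))      # one noisy box per node
    for step in reversed(range(steps)):
        eps_hat = exchange_unit(boxes, rel_emb, step / steps, edges, w_msg, w_out)
        boxes = boxes - eps_hat / steps        # simplified update, not DDPM
    return boxes


# Usage: three nodes; nodes 0 and 1 are connected, node 2 is isolated.
rel_emb = np.ones((3, 4))
edges = [(0, 1), (1, 0)]
boxes = layout_echo_denoise(rel_emb, edges)
```

Because every node runs the same per-node process and only the exchange step couples them, the same code handles graphs of any size: changing the embedding of node 0 alters the result for its neighbor, node 1, but leaves the isolated node 2 untouched.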

If this work has helped your research, please consider citing it:

@article{echoscene,

title={EchoScene: Indoor Scene Generation via Information Echo over Scene Graph Diffusion},

author={Zhai, Guangyao and {\"O}rnek, Evin Pinar and Chen, Dave Zhenyu and Liao, Ruotong and Di, Yan and Navab, Nassir and Tombari, Federico and Busam, Benjamin},

journal={arXiv preprint arXiv:2405.00915},

year={2024}

}