URDFormer 

Constructing Interactive Realistic Scenes from Real Images via Simulation and Generative Modeling

Zoey Chen, Marius Memmel, Alex Fang, Aaron Walsman, Dieter Fox* and Abhishek Gupta*

University of Washington, Nvidia

*equal advising

Abstract: 

Constructing accurate and targeted simulation scenes that are both visually and physically realistic is of significant practical interest in domains ranging from robotics to computer vision. However, this process is typically done largely by hand: a graphic designer and a simulation engineer work together with predefined assets to construct rich scenes with realistic dynamic and kinematic properties. While this may scale to small numbers of scenes, achieving the generalization required by data-driven machine learning algorithms calls for a pipeline that can synthesize large numbers of realistic scenes, complete with "natural" kinematic and dynamic structure. To do so, we develop models for inferring structure and generating simulation scenes from natural images, allowing for scalable scene generation from web-scale datasets. To train these image-to-simulation models, we show how effective generative models can be used to generate paired training data, so that the network can be inverted to map from realistic images back to complete scene models. We show how this paradigm allows us to build large datasets of scenes with semantic and physical realism, enabling a variety of downstream applications in robotics and computer vision.

(Forward) Data Generation

We visualize examples of generated pairs for articulated objects, rigid objects, and full scenes. The top row shows the original synthetic images from the simulation; the bottom row shows the paired realistic rendered RGB images.

(Inverse) URDF Prediction from Images

[Figure: Qualitative URDF predictions for individual articulated objects. Each example is annotated with its edit distance to the ground-truth structure (0.0 indicates an exact match) and, for imperfect predictions, the type of mismatch: base type (e.g., cabinet base -> shelf, dishwasher -> fridge, fridge -> cabinet), part mesh (e.g., handles -> knobs, oven door -> down door), part scale, or part position (e.g., handles). Edit distances for the shown examples range from 0.0 to 2.4.]

[Figure: URDF predictions for full kitchen scenes from internet images (left column: internet images; right column: URDF predictions). Scene-level edit distances for the shown examples range from 6.0 to 14.0, with mismatches in base types, part meshes, part scales, global scales, and global positions.]
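The figures above report a kinematic edit distance between the predicted structure and the ground truth. As a rough, hypothetical illustration of this kind of metric (not the exact formulation or penalty weights used in the paper), the sketch below index-aligns predicted and ground-truth parts and accumulates penalties for missing parts, wrong parents, and wrong part meshes.

```python
# Hypothetical, simplified kinematic edit distance: each part is a node with a
# mesh label and a parent index. The index-based alignment and the penalty
# weights below are simplifying assumptions for illustration only.

def kinematic_edit_distance(pred, gt, attr_penalty=0.2):
    """pred / gt: lists of parts, each a dict with 'mesh' and 'parent' keys."""
    dist = float(abs(len(pred) - len(gt)))     # missing or extra parts
    for p, g in zip(pred, gt):                 # compare index-aligned parts
        if p["parent"] != g["parent"]:
            dist += 1.0                        # part attached to the wrong parent
        elif p["mesh"] != g["mesh"]:
            dist += attr_penalty               # right structure, wrong part mesh
    return dist


# Example: one handle predicted as a knob on an otherwise correct cabinet.
gt_cabinet   = [{"mesh": "door", "parent": 0}, {"mesh": "handle", "parent": 1}]
pred_cabinet = [{"mesh": "door", "parent": 0}, {"mesh": "knob",   "parent": 1}]
print(kinematic_edit_distance(pred_cabinet, gt_cabinet))   # 0.2
```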

Experiment Details:

(1) Details for Dataset Assets

We procedurally generate scenes using both rigid and articulated objects. In particular, we collected 9 categories of common rigid objects found in kitchens and living rooms and 5 categories of common articulated kitchen objects, and we randomly rescale them during data generation.

[Figure: Example articulated object assets from the five kitchen categories: dishwasher, cabinet, oven, washer, and fridge.]
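The random rescaling of assets can be sketched in a few lines. The snippet below samples a per-axis scale and writes it into every <mesh> tag of an asset's URDF; the scale range and the uniform per-axis sampling are illustrative assumptions, not the exact procedure used in the paper.

```python
# Minimal sketch of randomized asset rescaling during procedural generation.
import random
import xml.etree.ElementTree as ET

def randomly_rescale_urdf(urdf_path, out_path, scale_range=(0.8, 1.2)):
    """Sample one scale factor per axis and write it into every <mesh> tag."""
    tree = ET.parse(urdf_path)
    scale = [random.uniform(*scale_range) for _ in range(3)]
    for mesh in tree.iter("mesh"):
        mesh.set("scale", " ".join(f"{s:.3f}" for s in scale))
    tree.write(out_path)
```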

We collected 100 textures for cabinets, 5 textures for handles, and 5 textures for knobs to guide diffusion models in "transferring" the original textures into much more diverse ones. Examples from our texture dataset are shown below:

(2) Details for object/scene generation

(A) Texture Generation

We leverage text-to-image diffusion models to generate a much more diverse texture dataset from the original one. This helps us create a dataset of realistic images that covers a wider distribution of objects in the real world.
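As an illustration of this step, the sketch below uses the Hugging Face diffusers img2img Stable Diffusion pipeline to stylize one of the collected cabinet textures into several variants. The checkpoint, prompts, strength, and file paths are assumptions of this sketch, not the paper's exact settings.

```python
# Minimal sketch of texture diversification with Stable Diffusion img2img.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

source = Image.open("textures/cabinet_000.png").convert("RGB").resize((512, 512))
styles = ["dark walnut wood", "white painted wood", "brushed stainless steel"]

for i, style in enumerate(styles):
    prompt = f"a {style} cabinet door texture, photorealistic, 4k, natural light"
    # Moderate strength keeps the panel layout of the source texture while
    # changing its material and style.
    variant = pipe(prompt=prompt, image=source, strength=0.6).images[0]
    variant.save(f"textures_diverse/cabinet_000_var{i}.png")
```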

(B) Object-level Generation: texture-guided part-by-part generation

We observe that directly applying diffusion models at the image level often ignores local details; for example, depth-guided or inpainting Stable Diffusion can change the original cabinets into completely different ones. Instead, we use Stable Diffusion only to change the style of the texture, and then warp the stylized texture onto each part of the object, using part masks obtained directly from the simulator.
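The per-part compositing can be sketched as a perspective warp of a stylized texture onto each part region provided by the simulator. The function below is a hypothetical illustration; the actual pipeline may handle masks and blending differently.

```python
# Minimal sketch: warp a (stylized) texture onto one part of a rendered object.
import cv2
import numpy as np

def paste_texture_onto_part(canvas, texture, part_corners):
    """canvas:       H x W x 3 rendered image being composited (uint8)
    texture:      h x w x 3 texture image (e.g., a diffusion-stylized texture)
    part_corners: 4 x 2 image-space corners of the part, from the simulator
    """
    h, w = texture.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = np.float32(part_corners)
    H = cv2.getPerspectiveTransform(src, dst)
    warped = cv2.warpPerspective(texture, H, (canvas.shape[1], canvas.shape[0]))
    mask = cv2.warpPerspective(
        np.full((h, w), 255, np.uint8), H, (canvas.shape[1], canvas.shape[0])
    )
    canvas[mask > 0] = warped[mask > 0]   # overwrite only the part's pixels
    return canvas
```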

(C) Full Scenes Generation

For global scene generation, we directly apply Stable Diffusion to the original synthetic RGB images together with a text prompt. Note that this might change the category of some individual objects (e.g., oven -> cabinet); we therefore only use the global scene dataset to predict the global position of each bounding box and its parent (floor, front wall, etc.), and use the object-level dataset to predict object types and the detailed kinematic structure within each object. Please see the main paper, Sections 3.2 and 3.3, for more details.
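As an illustration, the sketch below applies the diffusers Stable Diffusion img2img pipeline to a synthetic scene render with a prompt following the template in (D) below. The checkpoint, strength, and file paths are assumptions of this sketch rather than the paper's exact configuration.

```python
# Minimal sketch of scene-level generation from a synthetic render.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

synthetic_rgb = Image.open("renders/kitchen_0042.png").convert("RGB").resize((768, 512))
prompt = "A kitchen, nice detailed, fancy, photorealistic, inside a home, 4k, natural light"

# Higher strength gives a more realistic appearance but may change some object
# categories, which is why this dataset only supervises global positions and
# parent assignments rather than per-object kinematic structure.
realistic_rgb = pipe(prompt=prompt, image=synthetic_rgb, strength=0.75).images[0]
realistic_rgb.save("renders_realistic/kitchen_0042.png")
```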

(D) Prompts Used to Guide Text-to-Image Generation

"A {object_name}, nice detailed, fancy, photorealistic, inside a home, 4k, natural light"

(E) Evaluation Dataset

To evaluate the effectiveness of our approach, we collected 300 images of articulated objects and 80 images of indoor scenes. Examples are visualized below:

(3) Details for Training URDFormer