URDFormer
Constructing Interactive Realistic Scenes from Real Images via Simulation and Generative Modeling
Zoey Chen, Marius Memmel, Alex Fang, Aaron Walsman, Dieter Fox* and Abhishek Gupta*
University of Washington, Nvidia
*equal advising
Abstract:
Constructing accurate and targeted simulation scenes that are both visually and physically realistic is of significant practical interest in domains ranging from robotics to computer vision. However, this process is typically done largely by hand: a graphic designer and a simulation engineer work with predefined assets to construct rich scenes with realistic dynamic and kinematic properties. While this may scale to small numbers of scenes, achieving the generalization properties required by data-driven machine learning algorithms calls for a pipeline that can synthesize large numbers of realistic scenes, complete with "natural" kinematic and dynamic structure. To do so, we develop models for inferring structure and generating simulation scenes from natural images, allowing for scalable scene generation from web-scale datasets. To train these image-to-simulation models, we show how generative models can be used effectively to produce training data, and how the resulting network can be inverted to map from realistic images back to complete scene models. We show how this paradigm allows us to build large datasets of scenes with semantic and physical realism, enabling a variety of downstream applications in robotics and computer vision.
(Forward) Data Generation
We visualize examples of generated pairs for articulated objects, rigid objects, and full scenes. The top row shows the original synthetic images from the simulation; the bottom row shows the corresponding realistically rendered RGB images.
(Inverse) URDF Prediction from Images
Object-Level Prediction
[Figure: object-level URDF predictions, each annotated with its edit distance to the ground truth (from 0.0 for exact matches up to 2.4). Labeled mismatch types include base type (e.g. cabinet base → shelf, cabinet base → fridge, dishwasher → fridge, fridge → cabinet), part mesh (e.g. handles → knobs, oven door → down door), part scale, and part position (e.g. handles).]
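The edit distances above measure structural mismatch between a predicted URDF and the ground truth. The exact metric is defined in the paper; as a rough illustration only (not the paper's implementation), a weighted per-part mismatch count over hypothetical part attributes might look like:

```python
# Hypothetical sketch of a URDF mismatch score, NOT the paper's exact
# edit-distance definition. Each part is a dict of predicted primitives.

def urdf_mismatch(pred_parts, gt_parts, weights=None):
    """Count weighted primitive mismatches between aligned part lists."""
    # Assumed weights per primitive type; the paper's metric may differ.
    weights = weights or {"base_type": 1.0, "mesh": 0.5,
                          "scale": 0.1, "position": 0.1}
    dist = 0.0
    for pred, gt in zip(pred_parts, gt_parts):
        for key, w in weights.items():
            if pred.get(key) != gt.get(key):
                dist += w
    # Unmatched (missing or extra) parts each count as a full mismatch.
    dist += abs(len(pred_parts) - len(gt_parts))
    return dist

pred = [{"base_type": "cabinet", "mesh": "knob"}]
gt = [{"base_type": "cabinet", "mesh": "handle"}]
print(urdf_mismatch(pred, gt))  # 0.5: only the part mesh differs
```

Under this toy scoring, a prediction that gets every base type, mesh, scale, and position right scores 0.0, matching the "Edit Dist = 0.0" examples above.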
Full Scenes Prediction
[Figure: full-scene URDF predictions from internet images, shown as paired columns (internet image, predicted URDF). Edit distances range from 6.0 to 14.0; labeled mismatch types include base types, part meshes, part scales, part positions, global scales, and global positions.]
Experiment Details:
(1) Details for Dataset Assets
We procedurally generate scenes using both rigid and articulated objects. In particular, we collected 9 categories of common rigid objects in the kitchen and living room and 5 categories of common articulated objects for kitchens, and randomly rescale them during data generation.
Rigid objects (9):
Articulated objects (5): we randomly generate different configurations for each articulated object category, shown below:
dishwasher, cabinet, oven, washer, fridge
Texture dataset
We collected 100 textures for cabinets, 5 textures for handles, and 5 textures for knobs to guide diffusion models in "transferring" the original textures into much more diverse ones. Examples from our texture dataset are shown below:
(2) Details for object/scene generation
(A) Texture Generation
We leverage text-to-image diffusion models to generate a much more diverse texture dataset from the original one. This helps us create a dataset of realistic images that covers a wider distribution of objects in the real world.
(B) Object-level Generation: texture-guided part-by-part generation
We observe that directly applying diffusion models at the image level often ignores local details. For example, depth-guided or in-painting Stable Diffusion can change the original cabinets into completely different ones. Instead, we use Stable Diffusion only to change the style of the texture, then warp the generated texture onto each part of the object; the mask of each part can be obtained directly from the simulator.
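As a minimal illustration of this part-wise compositing step (assuming binary part masks rendered from the simulator, and omitting the perspective warp used in the actual pipeline), pasting a generated texture into each part's mask region might look like:

```python
import numpy as np

def composite_textures(image, part_masks, textures):
    """Paste a tiled texture into each part's mask region.

    image: HxWx3 uint8 render; part_masks: list of HxW boolean masks
    from the simulator; textures: list of small hxwx3 texture patches.
    This sketch tiles each texture over the image and copies the masked
    pixels; the real pipeline warps textures to each part's geometry.
    """
    out = image.copy()
    H, W, _ = image.shape
    for mask, tex in zip(part_masks, textures):
        th, tw, _ = tex.shape
        # Tile the texture to cover the full image, then crop to size.
        reps = (H // th + 1, W // tw + 1, 1)
        tiled = np.tile(tex, reps)[:H, :W]
        out[mask] = tiled[mask]
    return out

img = np.zeros((8, 8, 3), dtype=np.uint8)
mask = np.zeros((8, 8), dtype=bool)
mask[:4, :4] = True  # e.g. a cabinet-door part occupies the top-left
tex = np.full((2, 2, 3), 200, dtype=np.uint8)
result = composite_textures(img, [mask], [tex])
print(result[0, 0, 0], result[7, 7, 0])  # 200 0
```

Because the masks come from the simulator, part identities and kinematic structure are preserved exactly even as the appearance changes.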
(C) Full Scenes Generation
For global scene generation, we directly apply Stable Diffusion to the original synthetic RGB images with a text prompt. Note that this might change the category of some individual objects (e.g. oven → cabinet), so we only use the global scene dataset to predict the global positions of each bounding box and their parents (floor, front wall, etc.), and use the object-level dataset to predict object types and detailed kinematic structures within each object. Please see Section 3.2 and Section 3.3 of the main paper for more details.
(D) Prompts Used to Guide Text-to-Image Generation
Textures
material prompt: 'bright', 'colorful', 'modern', 'multicolor', 'fancy color', 'accent', 'glass', 'chestnut', 'Oakwood', 'Maplewood', 'Cherrywood','Birchwood', 'Walnut', 'Mahogany', 'Pine', 'Beech', 'Ash', 'Hickory', 'Teak', 'Rosewood', 'Alder', 'Cedar', 'Bamboo', 'Plywood', 'Acacia', 'Poplar', 'fir'
full texture prompt: "a {material} wooden panel texture, high resolution, 4k, photorealistic".
Objects
"A {object_name}, nice detailed, fancy, photorealistic, inside a home, 4k, natural light"
Full Scenes
style prompt: "bright", "warm", "modern", "mediterranean", "vintage", "contemporary", "transitional"
kitchen: "a high-resolution picture of a bright {style} kitchen, very pretty, very natural lighting, ultra high resolution, 8k, 16k, natural light, photorealistic, realism."
living room: "a high-resolution picture of a bright {style} living room, with sofa, chairs, tv, ottoman, floor lamps, etc, very pretty, very natural lighting, ultra high resolution, 8k, 16k, natural light, photorealistic, realism".
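The prompt templates above can be expanded programmatically by filling the material and style slots; a simple sketch (using a subset of the lists above):

```python
# Subsets of the material and style prompt lists above.
materials = ["chestnut", "Oakwood", "Walnut"]
styles = ["modern", "vintage", "contemporary"]

texture_prompts = [
    f"a {m} wooden panel texture, high resolution, 4k, photorealistic"
    for m in materials
]
kitchen_prompts = [
    ("a high-resolution picture of a bright {style} kitchen, very pretty, "
     "very natural lighting, ultra high resolution, 8k, 16k, natural light, "
     "photorealistic, realism.").format(style=s)
    for s in styles
]

print(texture_prompts[0])
print(kitchen_prompts[0])
```

Each filled-in prompt is then passed to the text-to-image model, so the number of distinct prompts grows linearly with the material/style lists.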
(E) Evaluation Dataset
Images:
To evaluate the effectiveness of our approach, we collect 300 images of articulated objects and 80 images of indoor scenes. The examples are visualized below:
(3) Details for Training URDFormer
Network: We use a pretrained vit-small-patch16-224 trained in (Radosavovic et al., 2023) as the vision backbone, which outputs the global image features dimensions of 14x14x384. To predict the base type, the global features are first max-pooled followed by a MLP to predict a class type over 14 object types. We then perform ROI alignment on cropped features with bounding boxes of the objects or parts. In ROI Alignment, we set the spatial_scale=1 / 16 and the sampling_ratio=2. The ROI size is set to 14. The roi aligned features are then fed into a 3-layer MLP followed by a norm layer. To compute positional encoder, we feed the bounding box coordinates into a 3-layer MLP as well as a norm layer. These normalized roi features together with the normalized spatial features are summed as the token features and feed into the transformer, which are then fed into MLPs to compute URDF primitives: position_start (relative to parent), position_end (relative to parent), mesh type, and parent-child relation matrix. Here instead of regressing to a position value, we treat it as a classification problem, where we discretize the x/y/z axis of the parent mesh into 12 bins. During training, the maximum sequence length is set to 32, which means the maximum number of bounding boxes per image is 32. We train URDFormer on one A-40 GPU with a batch size of 256.