Learning to Infer 3D Object Models from Images

Chang Chen*, Fei Deng*, Sungjin Ahn

Rutgers University

[paper]

Introduction

At the core of human intelligence is the ability to build up models of individual 3D objects from partial observations of the world. Our proposed model, ROOTS, is the first probabilistic generative model with a similar ability. Given partial observations of a 3D scene containing multiple objects with occlusion, ROOTS decomposes the scene into objects and infers, for each object, a 3D model that captures its complete 3D appearance. The inferred object models support rendering of individual objects from arbitrary viewpoints, allowing novel scenes to be composited.

ROOTS Pipeline

ROOTS Encoder (a-c): (a) Context observations are encoded and aggregated into a scene-level representation. (b) The scene-level representation is reorganized into a feature map of the 3D space, from which 3D center positions are predicted for each object. By applying perspective projection to the predicted 3D center positions, we identify image regions for each object across viewpoints. (c) Object regions are cropped and grouped into object-level contexts.

Object Models (d): The object-level contexts allow us to obtain the 3D appearance model of each object through an object-level GQN.

ROOTS Decoder (e-f): To render the full scene for a given query viewpoint, we composite the rendering results of individual objects.
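The projection in step (b) can be illustrated with a minimal sketch. It assumes a simple pinhole camera described by a world-to-camera rotation R, a translation t, and a focal length; the function names, the fixed object_size, and the camera parametrization are hypothetical and are not taken from the ROOTS implementation.

```python
import numpy as np

def project_centers(centers_world, R, t, focal=1.0):
    """Project N x 3 world-space centers to 2D image coordinates.

    Assumption: pinhole camera looking along +z in its own frame.
    R is a 3x3 world-to-camera rotation, t a 3-vector translation.
    """
    cam = centers_world @ R.T + t          # N x 3, camera frame
    depth = cam[:, 2]                      # distance along the optical axis
    uv = focal * cam[:, :2] / depth[:, None]
    return uv, depth

def object_regions(uv, depth, object_size=1.0, focal=1.0):
    """Square boxes (x0, y0, x1, y1) around each projected center.

    Assumption: every object has the same world-space extent, so the
    box half-width shrinks in proportion to 1 / depth.
    """
    half = 0.5 * focal * object_size / depth
    return np.stack(
        [uv[:, 0] - half, uv[:, 1] - half, uv[:, 0] + half, uv[:, 1] + half],
        axis=1,
    )

# Two hypothetical object centers seen from a camera at the origin.
centers = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 4.0]])
uv, depth = project_centers(centers, np.eye(3), np.zeros(3))
boxes = object_regions(uv, depth)
```

Repeating this for each context viewpoint (each with its own R and t) would give one image region per object per view, which is the input to the cropping in step (c).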

Note that the entire pipeline is trained end-to-end without supervision.

Object Models



From images containing multiple objects with occlusion, ROOTS is able to learn the complete 3D appearance model of each object, predict accurate object positions, and correctly handle occlusion.

We visualize the learned object models from a set of query viewpoints, as well as the full scene generations composited from them. Green bounding boxes indicate the predicted object position and scale.

Disentanglement



ROOTS explicitly disentangles position and appearance in the learned object models. Hence, by manipulating the position latent, we are able to move objects around without changing other factors like object appearance.

We visualize the position latent traversals of the yellow ball and the blue cylinder from two query viewpoints. Changing one coordinate does not affect the others. In addition, the appearance of the moving objects remains complete and clean during the traversal, and the untouched components (the purple ball and the background) remain unchanged. We also observe some desirable rendering effects. For example, objects appear smaller as they move farther away from the camera.
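A position traversal of this kind can be sketched as follows. The ObjectLatent class, its fields, and the apparent_size helper are hypothetical names introduced for illustration; the sketch assumes a camera at the origin looking along +z and is not the ROOTS latent structure.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ObjectLatent:
    position: tuple    # (x, y, z) in world space (assumed layout)
    appearance: tuple  # opaque appearance code, never modified here

def traverse(latent, axis, values):
    """Return copies of `latent` with one position coordinate varied."""
    out = []
    for v in values:
        pos = list(latent.position)
        pos[axis] = v
        out.append(replace(latent, position=tuple(pos)))
    return out

def apparent_size(latent, object_size=1.0, focal=1.0):
    """Projected size under a pinhole camera: focal * size / depth."""
    return focal * object_size / latent.position[2]

base = ObjectLatent(position=(0.0, 0.0, 2.0), appearance=(0.3, 0.7))
steps = traverse(base, axis=2, values=[2.0, 4.0, 8.0])
sizes = [apparent_size(s) for s in steps]
```

Only the chosen coordinate changes along the traversal, the appearance code is carried over untouched, and the projected size falls as the depth grows.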

Compositionality


Once object models are learned, they can be reconfigured to form novel scenes that are out of the training distribution.

Here we show a novel scene with six objects, generated by combining object models learned from two scenes with three objects each. Note that the training dataset contains only scenes with no more than three objects.
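One way to picture this recombination is depth-ordered alpha compositing of per-object renderings. The sketch below is an assumption for illustration, not the ROOTS decoder: each object is represented by a hypothetical (rgb, alpha, depth) layer, and layers are painted from farthest to nearest so that nearer objects occlude farther ones.

```python
import numpy as np

def composite(layers):
    """Back-to-front 'over' compositing.

    layers: list of (rgb, alpha, depth) with rgb of shape H x W x 3,
    alpha of shape H x W in [0, 1], and depth a float (assumed to be
    the object's distance from the query camera).
    """
    h, w, _ = layers[0][0].shape
    canvas = np.zeros((h, w, 3))
    # Paint the farthest layer first so nearer layers overwrite it.
    for rgb, alpha, _ in sorted(layers, key=lambda layer: -layer[2]):
        a = alpha[..., None]
        canvas = a * rgb + (1.0 - a) * canvas
    return canvas

# Toy 2 x 2 example: a far blue layer covering every pixel and a
# near red layer covering only the top-left pixel.
red = np.zeros((2, 2, 3)); red[..., 0] = 1.0
blue = np.zeros((2, 2, 3)); blue[..., 2] = 1.0
near_mask = np.zeros((2, 2)); near_mask[0, 0] = 1.0
scene_a = [(red, near_mask, 1.0)]
scene_b = [(blue, np.ones((2, 2)), 2.0)]

# Combining scenes amounts to concatenating their object layers.
novel = composite(scene_a + scene_b)
```

Because the scene is just a list of object layers, layers from different source scenes can be concatenated freely, which is the sense in which a six-object scene can be assembled from two three-object scenes.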

More Generation Samples

We show generation samples (above) and learned object models (below) on the more challenging Multi-Shepard-Metzler dataset. Each scene contains 2-4 randomly positioned Shepard-Metzler objects, each consisting of 5 randomly colored cubes.