Perspective Transformer Nets: Learning Single-View 3D Object Reconstruction without 3D Supervision
Xinchen Yan1    Jimei Yang2    Ersin Yumer2    Yijie Guo1    Honglak Lee1,3
1University of Michigan, Ann Arbor
2Adobe Research
3Google Brain
Understanding the 3D world is a fundamental problem in computer vision. However, learning a good representation of 3D objects is still an open problem due to the high dimensionality of the data and the many factors of variation involved. In this work, we investigate the task of single-view 3D object reconstruction from a learning agent's perspective. We formulate the learning process as an interaction between 3D and 2D representations and propose an encoder-decoder network with a novel projection loss defined by the perspective transformation. More importantly, the projection loss enables unsupervised learning from 2D observations without explicit 3D supervision. We demonstrate the ability of the model to generate a 3D volume from a single 2D image with three sets of experiments: (1) learning from single-class objects; (2) learning from multi-class objects; and (3) testing on novel object classes. Results show superior performance and better generalization ability for 3D object reconstruction when the projection loss is involved.
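The key idea above is that the network is trained by comparing projected 2D silhouettes against observed masks, so no voxel-level labels are needed. The sketch below illustrates this, with hedged assumptions: the exact loss in the paper may differ, and here `proj` is a hypothetical stand-in for the perspective transformer layer while the silhouette is taken as the max occupancy along the depth axis and compared to each view's mask with squared error.

```python
import numpy as np

def project_silhouette(volume, proj):
    """Project a 3D occupancy volume (D, H, W) to a 2D silhouette.
    `proj` is a callable that resamples the volume into camera
    coordinates (a stand-in for the perspective transformer layer);
    the silhouette is the max occupancy along the depth axis."""
    return proj(volume).max(axis=0)

def projection_loss(volume, masks, projs):
    """Mean squared error between projected silhouettes and the observed
    2D masks, averaged over views -- no explicit 3D supervision needed."""
    losses = [np.mean((project_silhouette(volume, p) - m) ** 2)
              for p, m in zip(projs, masks)]
    return float(np.mean(losses))
```

With a perfect reconstruction, the projected silhouette matches every observed mask and the loss is zero; gradients of this loss drive the 3D volume toward consistency with all 2D views.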

Torch Implementation: Code & Pre-trained Models
TensorFlow Implementation Link
* Note: if you want to use your own camera matrix, please refer to the section "using your own camera" here.

Network Architecture
As shown in the figure, our encoder-decoder network has three components: a 2D convolutional encoder, a 3D up-convolutional decoder, and a perspective transformer layer. The 2D convolutional encoder consists of 3 convolution layers followed by 3 fully-connected layers (the convolution layers have 64, 128 and 256 channels with a fixed filter size of 5x5; the three fully-connected layers have 1024, 1024 and 512 neurons, respectively). The 3D up-convolutional decoder consists of one fully-connected layer followed by 3 up-convolution layers (the fully-connected layer has 3x3x3x512 neurons; the up-convolution layers have 256, 96 and 1 channels with filter sizes of 4x4x4, 5x5x5 and 6x6x6). For the perspective transformer layer, we use a perspective transformation to project the 3D volume to a 2D silhouette, where the transformation matrix is parametrized by 16 variables and the sampling grid is set to 32x32x32.

Perspective Transformer Layer
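The layer itself is differentiable grid sampling under a perspective camera: a 4x4 transformation matrix (the 16 variables above) maps a 32x32x32 target grid into the source volume, the volume is resampled at those locations, and taking the max along the depth axis yields the 2D silhouette. The NumPy sketch below uses nearest-neighbor sampling for clarity; the actual layer uses differentiable (trilinear) sampling, and the normalized-coordinate convention here is an assumption.

```python
import numpy as np

def perspective_transform(volume, T, out_size=32):
    """Resample a cubic occupancy `volume` through a 4x4 perspective
    matrix T, then flatten depth to produce a 2D silhouette."""
    n = volume.shape[0]
    # target sampling grid in normalized coordinates [-1, 1]
    lin = np.linspace(-1.0, 1.0, out_size)
    zs, ys, xs = np.meshgrid(lin, lin, lin, indexing="ij")
    pts = np.stack([xs, ys, zs, np.ones_like(xs)], axis=-1).reshape(-1, 4)
    src = pts @ T.T
    src = src[:, :3] / np.clip(src[:, 3:4], 1e-8, None)  # perspective divide
    # map normalized coordinates back to voxel indices (nearest neighbor)
    idx = np.round((src + 1.0) * 0.5 * (n - 1)).astype(int)
    valid = np.all((idx >= 0) & (idx < n), axis=1)
    out = np.zeros(out_size ** 3)
    out[valid] = volume[idx[valid, 2], idx[valid, 1], idx[valid, 0]]
    out = out.reshape(out_size, out_size, out_size)
    return out.max(axis=0)  # max over depth -> 2D silhouette
```

With T set to the identity, the silhouette is simply the volume's occupancy flattened along the depth axis; a camera matrix composed of intrinsics and extrinsics produces the view-dependent silhouettes used by the projection loss.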

Video Demo

Video Demo: Results on chair category

Video Demo: Results on vehicle categories