Perspective Transformer Nets: Learning Single-View 3D Object Reconstruction without 3D Supervision

Xinchen Yan^{1}, Jimei Yang^{2}, Ersin Yumer^{2}, Yijie Guo^{1}, Honglak Lee^{1,3}
^{1}University of Michigan, Ann Arbor  ^{2}Adobe Research  ^{3}Google Brain

Abstract

Understanding the 3D world is a fundamental problem in computer vision. However, learning a good representation of 3D objects remains an open problem due to the high dimensionality of the data and the many factors of variation involved. In this work, we investigate the task of single-view 3D object reconstruction from a learning agent's perspective. We formulate the learning process as an interaction between 3D and 2D representations and propose an encoder-decoder network with a novel projection loss defined by the perspective transformation. More importantly, the projection loss enables unsupervised learning from 2D observations without explicit 3D supervision. We demonstrate the model's ability to generate a 3D volume from a single 2D image with three sets of experiments:
(1) learning from single-class objects; (2) learning from multi-class objects; and (3) testing on novel object classes. Results show superior performance and better generalization for 3D object reconstruction when the projection loss is used.

Resources

Torch Implementation: Code & Pre-trained Models
TensorFlow Implementation: Link
* Note: if you want to use your own camera matrix, please refer to the section "Using your own camera" here.

Network Architecture

As shown in the figure, our encoder-decoder network has three components: a 2D convolutional encoder, a 3D up-convolutional decoder, and a perspective transformer layer. The 2D convolutional encoder consists of 3 convolution layers followed by 3 fully-connected layers (the convolution layers have 64, 128, and 256 channels with a fixed filter size of 5x5; the three fully-connected layers have 1024, 1024, and 512 neurons, respectively). The 3D up-convolutional decoder consists of one fully-connected layer followed by 3 convolution layers (the fully-connected layer has 3x3x3x512 neurons; the convolution layers have 256, 96, and 1 channels with filter sizes of 4x4x4, 5x5x5, and 6x6x6, respectively). For the perspective transformer layer, we use a perspective transformation to project the 3D volume to a 2D silhouette, where the transformation matrix is parametrized by 16 variables and the sampling grid is set to 32x32x32.

Perspective Transformer Layer

Video Demo
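The projection performed by the perspective transformer layer can be sketched in a few lines. Below is a minimal NumPy sketch, assuming the 16 variables form a 4x4 transformation matrix applied to normalized [-1, 1] grid coordinates; it uses nearest-neighbour sampling for brevity, whereas the actual layer uses differentiable trilinear sampling, and the `projection_loss` shown is a simple squared-error stand-in rather than the paper's exact formulation:

```python
import numpy as np

def perspective_transform(volume, theta, out_size=32):
    """Project a 3D occupancy volume to a 2D silhouette.

    volume: (D, H, W) occupancy grid with values in [0, 1]
    theta:  (4, 4) perspective transformation matrix (the 16 parameters)
    """
    D = H = W = out_size
    # Normalized target grid coordinates in [-1, 1]
    zs, ys, xs = np.meshgrid(
        np.linspace(-1, 1, D), np.linspace(-1, 1, H), np.linspace(-1, 1, W),
        indexing="ij")
    grid = np.stack([xs.ravel(), ys.ravel(), zs.ravel(),
                     np.ones(D * H * W)])           # (4, N) homogeneous coords
    src = theta @ grid                              # transform into source frame
    src = src[:3] / np.clip(src[3], 1e-6, None)     # perspective divide (w > 0)
    # Map [-1, 1] back to voxel indices; nearest-neighbour sampling for brevity
    idx = np.round((src + 1) / 2 * (out_size - 1)).astype(int)
    valid = np.all((idx >= 0) & (idx < out_size), axis=0)
    sampled = np.zeros(D * H * W)
    sampled[valid] = volume[idx[2, valid], idx[1, valid], idx[0, valid]]
    sampled = sampled.reshape(D, H, W)
    # Silhouette: max over the depth (ray) dimension
    return sampled.max(axis=0)

def projection_loss(pred_sil, true_sil):
    # Squared-error comparison between projected and observed silhouettes
    # (a stand-in; see the paper for the exact loss definition)
    return np.mean((pred_sil - true_sil) ** 2)
```

Training then penalizes the discrepancy between the projected silhouette and the observed 2D mask under the known camera transformation, which is what allows learning without explicit 3D supervision.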