Xinchen Yan¹  Jimei Yang²  Kihyuk Sohn³  Honglak Lee¹
¹University of Michigan, Ann Arbor
²Adobe Research  ³NEC Labs
This work investigates a novel problem of generating images from visual attributes. We model an image as a composite of foreground and background and develop a layered generative model with disentangled latent variables that is learned end-to-end as a variational auto-encoder. Experiments on natural images of faces and birds demonstrate that the proposed models generate realistic and diverse samples with disentangled latent representations. We further use a general energy minimization algorithm for posterior inference of the latent variables given a novel image. As a result, the learned generative models achieve excellent quantitative and visual results on the tasks of attribute-conditioned image reconstruction and completion.
Full version on arXiv: Paper
Torch Implementation: Code
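To make the layered model concrete: the generator outputs a foreground image, a background image, and a gating mask that composites them per pixel. Below is a minimal sketch of this composition in Torch; the function name and tensor shapes are ours for illustration and the alpha-style gating formula is our reading of the layered model, so consult the paper and released code for the exact formulation.

```lua
require 'torch'

-- Layered composition: x = g .* xF + (1 - g) .* xB, written here in the
-- equivalent form x = xB + g .* (xF - xB). The gating mask g lies in
-- [0, 1] and selects foreground versus background at each pixel.
local function composeLayers(xF, xB, g)
  local gE = g:expandAs(xF)             -- broadcast 1-channel mask over RGB
  return xB + torch.cmul(gE, xF - xB)
end

-- Example: a batch of two 64x64 RGB layers with a 1-channel mask.
local xF = torch.rand(2, 3, 64, 64)
local xB = torch.rand(2, 3, 64, 64)
local g  = torch.rand(2, 1, 64, 64)
local x  = composeLayers(xF, xB, g)     -- 2x3x64x64 composite image
```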
Network Architecture
We design a two-stream convolutional encoder-decoder architecture for foreground and background generation. The foreground encoder network consists of 5 convolution layers followed by 2 fully-connected layers (the convolution layers have 64, 128, 256, 256, and 1024 channels with filter sizes of 5x5, 5x5, 3x3, 3x3, and 4x4, respectively; the two fully-connected layers have 1024 and 192 neurons). The attribute stream is merged with the image stream at the end of the recognition network. The foreground decoder network consists of 2 fully-connected layers followed by 5 convolution layers with 2x2 upsampling (the fully-connected layers have 256 and 8x8x256 neurons; the convolution layers have 256, 256, 128, 64, and 3 channels with filter sizes of 3x3, 5x5, 5x5, 5x5, and 5x5, respectively). The foreground prediction stream and the gating prediction stream separate at the last convolution layer.
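A minimal sketch of the foreground encoder and decoder using Torch's nn package follows. Channel counts, filter sizes, and layer widths are taken from the description above; the strides, padding, activation functions, the number of upsampling steps (three, assuming 64x64 images), the attribute dimension, the point where attributes are merged, and the split of the 192 encoder outputs into a 96-d mean and 96-d log-variance are our assumptions, so consult the released code for the exact configuration.

```lua
require 'nn'

local attrDim = 73   -- number of visual attributes (assumption; dataset-dependent)
local zDim    = 96   -- assuming the 192 encoder outputs are mean + log-variance

-- Image stream: 5 convolutions with 64/128/256/256/1024 channels.
-- Strides and padding below are assumptions for 64x64 inputs.
local convTower = nn.Sequential()
convTower:add(nn.SpatialConvolution(3, 64, 5, 5, 2, 2, 2, 2))     -- 64x64 -> 32x32
convTower:add(nn.ReLU(true))
convTower:add(nn.SpatialConvolution(64, 128, 5, 5, 2, 2, 2, 2))   -- 32x32 -> 16x16
convTower:add(nn.ReLU(true))
convTower:add(nn.SpatialConvolution(128, 256, 3, 3, 2, 2, 1, 1))  -- 16x16 -> 8x8
convTower:add(nn.ReLU(true))
convTower:add(nn.SpatialConvolution(256, 256, 3, 3, 2, 2, 1, 1))  -- 8x8 -> 4x4
convTower:add(nn.ReLU(true))
convTower:add(nn.SpatialConvolution(256, 1024, 4, 4))             -- 4x4 -> 1x1
convTower:add(nn.ReLU(true))
convTower:add(nn.View(1024):setNumInputDims(3))

-- Merge the attribute stream with the image stream at the end of the
-- recognition network, then apply the two FC layers (1024 and 192 units).
local encoder = nn.Sequential()
encoder:add(nn.ParallelTable():add(convTower):add(nn.Identity()))
encoder:add(nn.JoinTable(1, 1))
encoder:add(nn.Linear(1024 + attrDim, 1024))
encoder:add(nn.ReLU(true))
encoder:add(nn.Linear(1024, 2 * zDim))

-- Foreground decoder: 2 FC layers (256 and 8x8x256 units), then 5
-- convolutions (256/256/128/64/3 channels) with nearest-neighbor upsampling.
local decoder = nn.Sequential()
decoder:add(nn.Linear(zDim + attrDim, 256))
decoder:add(nn.ReLU(true))
decoder:add(nn.Linear(256, 8 * 8 * 256))
decoder:add(nn.ReLU(true))
decoder:add(nn.View(256, 8, 8):setNumInputDims(1))
decoder:add(nn.SpatialUpSamplingNearest(2))                       -- 8x8 -> 16x16
decoder:add(nn.SpatialConvolution(256, 256, 3, 3, 1, 1, 1, 1))
decoder:add(nn.ReLU(true))
decoder:add(nn.SpatialUpSamplingNearest(2))                       -- 16x16 -> 32x32
decoder:add(nn.SpatialConvolution(256, 256, 5, 5, 1, 1, 2, 2))
decoder:add(nn.ReLU(true))
decoder:add(nn.SpatialUpSamplingNearest(2))                       -- 32x32 -> 64x64
decoder:add(nn.SpatialConvolution(256, 128, 5, 5, 1, 1, 2, 2))
decoder:add(nn.ReLU(true))
decoder:add(nn.SpatialConvolution(128, 64, 5, 5, 1, 1, 2, 2))
decoder:add(nn.ReLU(true))
-- The foreground and gating prediction streams separate at the last convolution.
local heads = nn.ConcatTable()
heads:add(nn.Sequential()
  :add(nn.SpatialConvolution(64, 3, 5, 5, 1, 1, 2, 2))
  :add(nn.Tanh()))      -- foreground image
heads:add(nn.Sequential()
  :add(nn.SpatialConvolution(64, 1, 5, 5, 1, 1, 2, 2))
  :add(nn.Sigmoid()))   -- gating mask
decoder:add(heads)
```

For example, `encoder:forward({torch.rand(2, 3, 64, 64), torch.rand(2, attrDim)})` yields a 2x192 tensor of posterior parameters, and the decoder returns a table containing the 3-channel foreground image and the 1-channel gating mask, which can be composited with a background as sketched earlier.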
Conditional Generation: Side-by-Side Comparison
Attribute-Conditional Image Progression
Image Reconstruction and Completion
Video Demo