Conditional GAN architecture
In this work, I have used a Conditional Generative Adversarial Network (cGAN) for an image-to-image translation task. The approach uses two ConvNets, a Generator and a Discriminator, each built from convolutional layers with batch normalization and ReLU activations. The Generator (G) is an encoder-decoder network with skip connections that takes an RGB image from the dataset and a noise vector z as inputs and produces a realistic fake image. The Discriminator (D) classifies randomly sampled images as real or fake using a cross-entropy loss. The Generator is expected to produce images close to the ground truth, while the Discriminator is supposed to distinguish fake images from real ones; the objectives of the two networks are therefore adversarial.
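To make the architecture concrete, the following is a minimal PyTorch sketch of the two networks. The channel counts, the number of encoder/decoder stages, and the output resolution are illustrative assumptions, not the exact configuration used in this work. The explicit noise input z is omitted for brevity (pix2pix-style implementations often inject stochasticity through dropout instead), and the Discriminator is assumed to be conditioned on the RGB input by channel concatenation.

import torch
import torch.nn as nn

def down(in_ch, out_ch):
    # Encoder stage: strided conv halves the spatial resolution
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def up(in_ch, out_ch):
    # Decoder stage: transposed conv doubles the spatial resolution
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class Generator(nn.Module):
    # Encoder-decoder with skip connections (U-Net style):
    # maps a 3-channel RGB image to a 1-channel depth map.
    def __init__(self):
        super().__init__()
        self.enc1 = down(3, 64)
        self.enc2 = down(64, 128)
        self.enc3 = down(128, 256)
        self.dec3 = up(256, 128)
        self.dec2 = up(128 + 128, 64)  # + skip channels from enc2
        self.dec1 = nn.ConvTranspose2d(64 + 64, 1, 4, stride=2, padding=1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        d3 = self.dec3(e3)
        d2 = self.dec2(torch.cat([d3, e2], dim=1))  # skip connection
        return torch.tanh(self.dec1(torch.cat([d2, e1], dim=1)))

class Discriminator(nn.Module):
    # Classifies an (RGB, depth) pair as real or fake,
    # emitting a grid of patch-level logits.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            down(3 + 1, 64),  # conditioned on the RGB input
            down(64, 128),
            down(128, 256),
            nn.Conv2d(256, 1, 4, stride=1, padding=1),
        )

    def forward(self, rgb, depth):
        return self.net(torch.cat([rgb, depth], dim=1))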
Training a cGAN involves a few steps. First, the Discriminator is trained on real and fake depth images with the correct labels for a few epochs. The Generator is then trained using the predictions of the trained Discriminator as its objective. This procedure is repeated until the generated fake depth maps become difficult to distinguish from the real depth maps. The cGAN architecture is illustrated in the figure. The approach also incorporates an L1 loss so that generated images stay close to the ground truth. Results of depth prediction on several distinct images are shown above.
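The sketch below illustrates this alternating procedure, reusing the Generator and Discriminator classes from the previous sketch. It alternates the two updates per mini-batch, a common variant of the per-epoch scheduling described above; num_epochs, loader, and the L1 weight lambda_l1 are placeholders (the value 100 is the pix2pix default, assumed here rather than taken from this work).

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCEWithLogitsLoss()  # cross-entropy on real/fake logits
l1 = nn.L1Loss()
lambda_l1 = 100.0  # weight of the L1 term (assumed value)

for epoch in range(num_epochs):
    for rgb, real_depth in loader:  # loader yields (RGB, depth) pairs
        fake_depth = G(rgb)

        # Discriminator step: push real pairs toward 1, fake pairs toward 0
        d_real = D(rgb, real_depth)
        d_fake = D(rgb, fake_depth.detach())  # no gradient into G here
        loss_d = 0.5 * (bce(d_real, torch.ones_like(d_real)) +
                        bce(d_fake, torch.zeros_like(d_fake)))
        opt_d.zero_grad()
        loss_d.backward()
        opt_d.step()

        # Generator step: fool D, plus L1 toward the ground-truth depth
        d_fake = D(rgb, fake_depth)
        loss_g = (bce(d_fake, torch.ones_like(d_fake)) +
                  lambda_l1 * l1(fake_depth, real_depth))
        opt_g.zero_grad()
        loss_g.backward()
        opt_g.step()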