Training the SRGAN and the Autoencoder was the most challenging task. We learned several things during model training that we would not have discovered had we not implemented the architectures from scratch. We started with the DIV2K dataset, which has only 800 training images. While training both the Autoencoder and the SRGAN, we realized that the models were not achieving an acceptable PSNR, which we believe was due to two reasons: too little training data, and a persistent artifact in the generated images.
As shown in Fig. 8, the images generated by both the SRGAN and the Autoencoder had a strange "tint" to them. The pixelation from the low-resolution inputs was smoothing out, but the artifact remained (and kept the PSNR low) no matter how long we trained. To circumvent this issue, we switched to two different datasets, CelebA and Stanford-Cars, each of which contains substantially more training examples than DIV2K. We hoped that the models would benefit from the larger datasets by learning the subtle features needed to approximate the inverse function. To our surprise, the "tint" still polluted the generated images. Debugging it took a long time: at first we suspected a bug in our implementation of the architectures, and we also tried different hyperparameters and trained the models for 4000 epochs, but the artifact did not diminish. Finally, we realized that because both models use a perceptual loss, we had to apply ImageNet's mean = [0.485, 0.456, 0.406] and standard deviation = [0.229, 0.224, 0.225] not only to the inputs of the generator but also to its output. This was surprising because the last convolution layer is usually followed by an activation function, and its output values are generally left unnormalized.
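To make the fix concrete, below is a minimal PyTorch sketch of a perceptual loss with the ImageNet normalization applied on both sides; the VGG19 truncation point and the helper names (`imagenet_normalize`, `perceptual_loss`) are illustrative choices, not the exact code from our training script:

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# ImageNet statistics expected by the pretrained VGG backbone.
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def imagenet_normalize(x):
    """Normalize a batch of images in [0, 1] with ImageNet statistics."""
    return (x - IMAGENET_MEAN.to(x.device)) / IMAGENET_STD.to(x.device)

# Frozen VGG19 feature extractor for the perceptual loss, truncated at an
# intermediate layer (a common choice for SRGAN-style losses).
vgg = models.vgg19(pretrained=True).features[:36].eval()
for p in vgg.parameters():
    p.requires_grad = False

def perceptual_loss(sr, hr):
    """VGG feature-space MSE. The key detail: BOTH the generator output
    `sr` and the target `hr` are ImageNet-normalized before the VGG pass,
    even though `sr` comes straight out of the generator's last layer."""
    return F.mse_loss(vgg(imagenet_normalize(sr)),
                      vgg(imagenet_normalize(hr)))
```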
Another challenge during model training was hyperparameter tuning. Since we were not using the dataset from the original SRGAN/Autoencoder paper [7], we could not reuse its hyperparameters, and it was difficult to balance the weights of the MSE content loss, the VGG loss, and the adversarial loss: increasing the MSE weight raised the PSNR but smoothed the image overall, whereas increasing the VGG weight made strange artifacts appear in the output images. We were also frequently disconnected from Google Colab, since a free account offers limited run-times and running on CPU is far slower, so training for many epochs became very difficult. On the CelebA dataset each epoch took around 4 hours, while a single epoch on Stanford-Cars or DIV2K took 1.5-2 hours.
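For reference, here is a sketch of how the three generator loss terms might be weighted and combined; the weights below and the discriminator-logit interface are illustrative assumptions rather than the values we finally settled on:

```python
import torch
import torch.nn.functional as F

# Hypothetical loss weights; the balance had to be re-tuned per dataset.
W_MSE, W_VGG, W_ADV = 1.0, 0.006, 1e-3

def generator_loss(sr, hr, disc_fake_logits, perceptual_loss):
    # Pixel-space content loss: raising W_MSE boosts PSNR but
    # over-smooths the output.
    mse = F.mse_loss(sr, hr)
    # Feature-space (VGG) loss: raising W_VGG sharpens textures but can
    # introduce strange artifacts.
    vgg = perceptual_loss(sr, hr)
    # Adversarial loss: push the discriminator to label the generated
    # images as real (target = 1).
    adv = F.binary_cross_entropy_with_logits(
        disc_fake_logits, torch.ones_like(disc_fake_logits))
    return W_MSE * mse + W_VGG * vgg + W_ADV * adv
```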