Training the SRGAN and the Autoencoder was the most challenging task. We learned several things during model training that we would not have discovered had we not implemented the architectures from scratch. We started with the DIV2K dataset, which has only 800 training images. While training both the Autoencoder and the SRGAN, we realized that the models were not achieving an acceptable PSNR, which we believe was due to two reasons: too little training data, and a persistent artifact in the generated images.
As shown in Fig. 8, the images generated by both the SRGAN and the Autoencoder had a strange "tint" to them. The pixelation from the low-resolution inputs was smoothing out, but the artifact remained (and kept the PSNR low) no matter how long we trained. To circumvent this issue, we switched to two different datasets, CelebA and Stanford-Cars, each of which contains substantially more training examples than DIV2K. We hoped that the models would benefit from the larger datasets by learning the subtle features needed to approximate the inverse function. To our surprise, the "tint" still polluted the generated images. Debugging it took a long time: at first we suspected a bug in our implementation of the architectures, and we also tried different hyperparameters and trained the models for 4000 epochs, but the artifact did not diminish. Finally, we realized that because both models use a perceptual loss, we had to apply ImageNet's mean = [0.485, 0.456, 0.406] and standard deviation = [0.229, 0.224, 0.225] not only to the inputs of the generator but also to its output. This was surprising because the last convolution layer is usually followed by an activation function, and its output values are generally left unnormalized.
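To make the fix concrete, below is a minimal PyTorch sketch of a perceptual loss with the ImageNet normalization applied on both sides; the VGG19 truncation point and the helper names (`imagenet_normalize`, `perceptual_loss`) are illustrative choices, not the exact code from our training script:

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# ImageNet statistics expected by the pretrained VGG backbone.
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def imagenet_normalize(x):
    """Normalize a batch of images in [0, 1] with ImageNet statistics."""
    return (x - IMAGENET_MEAN.to(x.device)) / IMAGENET_STD.to(x.device)

# Frozen VGG19 feature extractor for the perceptual loss, truncated at an
# intermediate layer (a common choice for SRGAN-style losses).
vgg = models.vgg19(pretrained=True).features[:36].eval()
for p in vgg.parameters():
    p.requires_grad = False

def perceptual_loss(sr, hr):
    """VGG feature-space MSE. The key detail: BOTH the generator output
    `sr` and the target `hr` are ImageNet-normalized before the VGG pass,
    even though `sr` comes straight out of the generator's last layer."""
    return F.mse_loss(vgg(imagenet_normalize(sr)),
                      vgg(imagenet_normalize(hr)))
```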
Another challenge during model training was hyperparameter tuning. Since we were not using the dataset from the original SRGAN/Autoencoder paper [7], we could not reuse its hyperparameters, and it was difficult to balance the weights of the MSE content loss, the VGG loss, and the adversarial loss: increasing the MSE weight raised the PSNR but smoothed the image overall, whereas increasing the VGG weight made strange artifacts appear in the output images. We were also frequently disconnected from Google Colab, since a free account offers limited run-times and running on CPU is far slower, so training for many epochs became very difficult. On the CelebA dataset each epoch took around 4 hours, while a single epoch on Stanford-Cars or DIV2K took 1.5-2 hours.
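For reference, here is a sketch of how the three generator loss terms might be weighted and combined; the weights below and the discriminator-logit interface are illustrative assumptions rather than the values we finally settled on:

```python
import torch
import torch.nn.functional as F

# Hypothetical loss weights; the balance had to be re-tuned per dataset.
W_MSE, W_VGG, W_ADV = 1.0, 0.006, 1e-3

def generator_loss(sr, hr, disc_fake_logits, perceptual_loss):
    # Pixel-space content loss: raising W_MSE boosts PSNR but
    # over-smooths the output.
    mse = F.mse_loss(sr, hr)
    # Feature-space (VGG) loss: raising W_VGG sharpens textures but can
    # introduce strange artifacts.
    vgg = perceptual_loss(sr, hr)
    # Adversarial loss: push the discriminator to label the generated
    # images as real (target = 1).
    adv = F.binary_cross_entropy_with_logits(
        disc_fake_logits, torch.ones_like(disc_fake_logits))
    return W_MSE * mse + W_VGG * vgg + W_ADV * adv
```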