GAN Image Colorization and Superresolution

Jayden Ye, Matthew Wittwer, and Libby Maese

Abstract

Image colorization and superresolution present a number of challenges as computer vision problems. Deep learning models, especially Generative Adversarial Networks (GANs), have proven able to address these problems in novel and interesting ways, but individual GAN solutions can be difficult to train and compare. We created an accessible web demo that allows users to compare multiple GANs trained to address both superresolution and colorization simultaneously.

Motivation


Image colorization and single image superresolution are both ill-posed problems in the computer vision field which nonetheless hold an appeal that is obvious even to laypeople. Image colorization has a history that dates as far back as the invention of photography; hand-coloring of photographs was prevalent over a century before computational approaches were even possible. Meanwhile, the fantasy of being able to take a blurry image and immediately render it at a higher resolution is so prevalent that scenes where characters demand technicians “zoom and enhance” on low quality footage have become cliché in modern media. However, until recently the ability to computationally accomplish either of these goals has been severely limited.


The development and spread of deep learning techniques has led to rapid progress on both of these problems. In particular, Generative Adversarial Networks (GANs) have been leveraged to great effect. But while these techniques are the subject of active research and debate within the field, their results remain largely inaccessible to the general public. Commercial products that address colorization or superresolution tout their use of AI, but they lack transparency and exist to sell a specific service without providing further understanding of the process. Our motivation was to create an accessible web demo that addresses superresolution and colorization simultaneously by implementing multiple models. This presentation allows users to compare results between the different models both by sight and through calculated quality metrics. Additionally, our demo works by taking an existing high-quality color image, downscaling and desaturating it, then applying the models to the resulting low-quality black-and-white image. By comparing the original image to the results, the user can quickly see which details are changed or lost by each model implementation.

Previous Work and State of the Art

As previously stated, single image superresolution and image colorization are both inherently ill-posed problems; countless potential solutions exist when choosing a value to assign to a pixel, whether for colorization or image clarity.

Prior to model-based approaches, colorization depended upon labor-intensive workflows in which humans manually assigned colors image by image. Early automated approaches date back to the 1980s, when techniques such as luminance keying relied on manually created tables to assign color values to images [Yatziv]. Machine-learning-based approaches began appearing in the early 2000s and were largely driven by conventional, brute-force techniques. In recent years there has been a shift towards deep learning based models [Anwar].

Superresolution has followed a similar trajectory towards deep learning models. Prior to the advent of machine learning, superresolution approaches depended on statistical analysis, patch- and edge-based methods, and sparse-representation approaches [Bashir]. Like colorization, superresolution has benefited greatly from advances in machine learning, with neural networks increasingly leading the way.

Since the mid-2010s, superresolution research and colorization research have followed a similar trajectory. The first Convolutional Neural Network (CNN) based approach to colorization [Cheng] appeared only a year after the first superresolution CNN [Dong]. CNNs are capable of advanced, complex image processing, and various approaches have successfully learned mappings at the image level, instance level [Su], and pixel level [Cui]. In 2014, just as CNNs were first being applied to superresolution, the concept of Generative Adversarial Networks (GANs) was proposed [Goodfellow]. GANs function by creating and training two models: a generator and a discriminator. The generator attempts to create images that "fool" the discriminator, so that it cannot tell the difference between a case the generator has produced and a real case. In recent years, GANs have shown very robust results for both colorization and superresolution [Anwar] [Bashir].

Datasets

For training and testing purposes, we used the Large-scale CelebFaces Attributes (CelebA) Dataset [Liu]. This set contains over 200,000 headshots of various celebrities in different poses and configurations. We specifically used the aligned and cropped version of the dataset. We chose this set because we wanted to focus our training on human subjects and it provided a large set of images representing diverse faces and appearances.
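As a rough illustration of how this data can be loaded, the sketch below uses torchvision's built-in CelebA wrapper, which works with the aligned-and-cropped images; the crop and resize values shown are illustrative assumptions rather than our exact preprocessing.

```python
import torchvision
from torchvision import transforms

# Illustrative preprocessing only: center-crop the 178x218 aligned faces to a
# square, then resize. Our actual training pipeline may use different sizes.
transform = transforms.Compose([
    transforms.CenterCrop(178),
    transforms.Resize(128),
    transforms.ToTensor(),
])

# torchvision's CelebA wrapper fetches the aligned-and-cropped image set.
celeba_train = torchvision.datasets.CelebA(
    root="data", split="train", target_type="attr",
    transform=transform, download=True)
```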

Methods

In order to explore the problems of image colorization and upscaling, we reviewed multiple popular models and applied what we found to a model of our own. One thing we found in our research is that while purely deep convolutional neural network (DCNN) approaches have been tried on image-to-image tasks, they have generally fallen behind GAN-based models at the kind of realistic image generation we wanted to explore in this project. For comparison, and to gain experience training and working with GANs, we first experimented with two proven image-to-image models: ESRGAN [Ledig] and Pix2Pix [Isola]. The performance of these models and the structures implemented in each went on to inform our decisions when it came time to create our own model for this specific image task.

ESRGAN

The Enhanced Super Resolution GAN (ESRGAN) was created specifically for the purpose of upscaling images to a higher resolution. The model uses a backbone of dense blocks composed of multiple convolutional layers followed by leaky ReLU activations, with skip connections, as depicted in Figure 1.

Figure 1: Architecture of ESRGAN Dense Blocks

These dense blocks help extract much of the semantic information from the input image, giving the final upscaling layers context about the material or pattern present in the image. This is especially useful when recovering content like hair and grass, where high-frequency pattern information is missing and must be inferred by the model. The recovered semantic information is then combined with the input image to ensure that the noisy, high-frequency values that existed in the input are preserved before upscaling occurs. The full architecture of the ESRGAN can be seen in Figure 2.

Figure 2: Full ESRGAN Architecture
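As a rough sketch of the idea behind these dense blocks (not the actual ESRGAN implementation, which stacks larger Residual-in-Residual Dense Blocks with residual scaling), a simplified dense block with leaky ReLU activations and a skip connection might look like this in PyTorch:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Simplified ESRGAN-style dense block: each 3x3 convolution sees the
    concatenation of the block input and all previous layer outputs, and the
    block output is added back onto its input via a skip connection."""
    def __init__(self, channels=64, growth=32):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, growth, 3, padding=1)
        self.conv2 = nn.Conv2d(channels + growth, growth, 3, padding=1)
        self.conv3 = nn.Conv2d(channels + 2 * growth, channels, 3, padding=1)
        self.lrelu = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        f1 = self.lrelu(self.conv1(x))
        f2 = self.lrelu(self.conv2(torch.cat([x, f1], dim=1)))
        out = self.conv3(torch.cat([x, f1, f2], dim=1))
        return x + 0.2 * out  # residual skip connection with scaling
```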

Finally, the ESRGAN model implemented a simple single-output relativistic discriminator based on a DCNN that performs a series of convolutional passes before determining a probability that the input image is real or fake. This probabilistic output can then be used as a loss to tune the generator model, leading to more believable results.
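As an illustration of the relativistic idea, the discriminator loss can be sketched as follows (a hypothetical helper, assuming the discriminator returns raw logits for a batch of real and generated images): the discriminator is asked whether a real image looks more realistic than the average fake, and vice versa.

```python
import torch
import torch.nn.functional as F

def relativistic_d_loss(real_logits, fake_logits):
    """Relativistic average discriminator loss, in the spirit of ESRGAN."""
    real_rel = real_logits - fake_logits.mean()   # real vs. average fake
    fake_rel = fake_logits - real_logits.mean()   # fake vs. average real
    loss_real = F.binary_cross_entropy_with_logits(
        real_rel, torch.ones_like(real_rel))
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_rel, torch.zeros_like(fake_rel))
    return (loss_real + loss_fake) / 2
```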

Pix2Pix

Unlike the ESRGAN model, which was designed specifically around the task of image upscaling, the Pix2Pix model was designed to perform multiple image-to-image tasks such as style transfer, image coloring, and edge detection, along with specific types of image generation. This multipurpose model is already well adapted to image coloring and, as we found, it handles upscaling with good overall performance. The architecture of the model's generator is based on the U-Net [Ronneberger] architecture, named after its 'U' shape, as seen in Figure 3. It works by first encoding high-level semantic information, progressively downsampling the input image to a bottleneck, then upscaling the encoded values and combining them with information from skip connections in order to retain useful noisy detail that would otherwise be lost in downsampling.

Figure 3: U-Net Architecture similar to that used in Pix2Pix
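To make the encoder-decoder-with-skips idea concrete, here is a deliberately tiny U-Net-style generator sketch (far shallower than the real Pix2Pix generator, with illustrative channel widths), assuming a single-channel grayscale input and a three-channel color output:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal U-Net-style sketch: two downsampling stages, a bottleneck,
    and two upsampling stages that concatenate skip connections."""
    def __init__(self, in_ch=1, out_ch=3, base=32):
        super().__init__()
        self.down1 = nn.Sequential(
            nn.Conv2d(in_ch, base, 4, stride=2, padding=1), nn.LeakyReLU(0.2))
        self.down2 = nn.Sequential(
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2))
        self.bottleneck = nn.Sequential(
            nn.Conv2d(base * 2, base * 2, 3, padding=1), nn.ReLU())
        self.up1 = nn.Sequential(
            nn.ConvTranspose2d(base * 4, base, 4, stride=2, padding=1), nn.ReLU())
        self.up2 = nn.Sequential(
            nn.ConvTranspose2d(base * 2, out_ch, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, x):
        d1 = self.down1(x)                       # 1/2 resolution
        d2 = self.down2(d1)                      # 1/4 resolution
        b = self.bottleneck(d2)
        u1 = self.up1(torch.cat([b, d2], 1))     # skip connection from d2
        return self.up2(torch.cat([u1, d1], 1))  # skip connection from d1
```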

Pix2Pix also introduced a DCNN discriminator to act as a loss on the generated images; however, unlike the ESRGAN discriminator, it outputs a grid of patches instead of a single probability. This set of patches allows individual regions of the generator's output to be tuned based on their ability to trick the discriminator, rather than using a single loss over the entire image.
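A minimal sketch of such a patch-based discriminator is shown below (the real Pix2Pix discriminator is deeper and is also conditioned on the input image by concatenating it with the generated output; the channel sizes here are assumptions):

```python
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """PatchGAN-style sketch: instead of one scalar per image, the network
    outputs a grid of logits, each judging one local patch of the input."""
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, base, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1),
            nn.BatchNorm2d(base * 2),
            nn.LeakyReLU(0.2),
            nn.Conv2d(base * 2, 1, 4, stride=1, padding=1),  # per-patch logits
        )

    def forward(self, x):
        return self.net(x)  # (N, 1, H', W') grid of real/fake scores
```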

Custom Model

Taking inspiration from both of the high-performing models we studied and trained over the course of this project, we created a model designed specifically for the combined task of image coloring and superresolution. The backbone of our model is also much smaller than either ESRGAN or Pix2Pix, because we wanted to be able to both train and run the model efficiently on our hardware. The generator begins by downsampling, using blocks made of two sets of 3x3 convolution, batch normalization, and ReLU layers. This downsampling is performed three times before the bottleneck layer, at which point we upsample. Upsampling is performed similarly to Pix2Pix: features from the corresponding downsampling level are passed in, followed by a 3x3 convolution and a 2x2 upsampling convolution. This is performed twice in order to return to the input image resolution; however, because our task requires further upscaling, the final two upscaling blocks apply the same layers without adding in information from the downsampling side. A visualization of the model described can be seen in Figure 4.

Figure 4: Architecture used in our custom model, designed to take in black-and-white images and return full-color images upscaled 4x in each direction.
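The sketch below captures the shape of this generator: an encoder-decoder with skip connections back to the input resolution, followed by two additional upscaling blocks without skips for the final 4x enlargement. The exact number of stages, channel widths, and output activation here are illustrative assumptions rather than the precise configuration we trained.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Two 3x3 convolution + batch norm + ReLU layers."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU())

class ColorUpscaleGenerator(nn.Module):
    """Sketch of a grayscale-to-color generator with 4x upscaling."""
    def __init__(self, base=32):
        super().__init__()
        self.down1 = conv_block(1, base)
        self.down2 = conv_block(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(base * 2, base * 4)
        self.up1 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec1 = conv_block(base * 4, base * 2)   # concatenated with down2 features
        self.up2 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec2 = conv_block(base * 2, base)       # concatenated with down1 features
        self.up3 = nn.Sequential(nn.ConvTranspose2d(base, base, 2, stride=2), nn.ReLU())
        self.up4 = nn.Sequential(nn.ConvTranspose2d(base, base, 2, stride=2), nn.ReLU())
        self.out = nn.Conv2d(base, 3, 3, padding=1)

    def forward(self, x):
        d1 = self.down1(x)                   # input resolution
        d2 = self.down2(self.pool(d1))       # 1/2 resolution
        b = self.bottleneck(self.pool(d2))   # 1/4 resolution
        u1 = self.dec1(torch.cat([self.up1(b), d2], 1))   # back to 1/2, skip from d2
        u2 = self.dec2(torch.cat([self.up2(u1), d1], 1))  # input resolution, skip from d1
        return torch.tanh(self.out(self.up4(self.up3(u2))))  # 4x upscaled RGB
```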

We also created a discriminator that functions similarly to the Pix2Pix discriminator, returning patches corresponding to the predicted believability of regions within the generated image. This allowed our model to be trained with emphasis on areas of particular interest, like the eyes and mouth, unlike ESRGAN, which treats the entire image equally.

Results

Figure 5: Formulas for MSE, PSNR, and SSIM

While training the models we utilized objective metrics that are standard for computer vision tasks. Mean Squared Error (MSE) is the average squared deviation between estimated values and actual values. Peak Signal-to-Noise Ratio (PSNR) is based on MSE and is presented as the ratio between the maximum power of a signal and the power of the deviation. MSE and PSNR estimate absolute errors. Structural Similarity Index Measure (SSIM) instead captures structural information, exploiting the fact that nearby pixels have strong inter-dependencies. Their formulas are shown in Fig. 5. Together these metrics are very useful for training and evaluating machine learning models.
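For reference, these metrics can be computed along the following lines (a sketch assuming 8-bit RGB arrays and a recent scikit-image for the SSIM call):

```python
import numpy as np
from skimage.metrics import structural_similarity

def evaluate(reference, result, max_val=255.0):
    """Return (MSE, PSNR, SSIM) between a ground-truth image and a model
    output, both given as HxWx3 arrays on the same intensity scale."""
    reference = reference.astype(np.float64)
    result = result.astype(np.float64)
    mse = np.mean((reference - result) ** 2)
    psnr = 10 * np.log10(max_val ** 2 / mse)  # in decibels
    ssim = structural_similarity(reference, result,
                                 channel_axis=-1, data_range=max_val)
    return mse, psnr, ssim
```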

Using these objective metrics, we are able to compare the performance of each of the models we trained, as seen in Fig. 6. From these results we can see that the techniques perform quite similarly on our test set: Pix2Pix performs best on MSE, ESRGAN performs best on SSIM, and our model performs best on PSNR. However, the image that best fits these metrics is not necessarily the one that is most believable to human eyes.

Figure 6: Quantitative metrics used to compare the results of each model on a test set.

Each of the models trained in this project attempts to overcome the issue of imperfect metrics by utilizing a discriminator that is trained alongside the generator to judge how believable the generated image looks. The discriminators implemented by both the Pix2Pix model and our model are patch-based, allowing the generator to train specifically on regions that are less believable. Figure 7 shows an example result from our model's discriminator in the middle of training. In this example, the black areas, where the discriminator believes the image is fake, are concentrated around the face as well as the right side, where a visual patch is clearly incorrect.


Figure 7: Example of the discriminator loss used to train models beyond standard image metrics. From left to right, the images are the original, the downscaled input, and the generated result, with the top row visualizing the loss.
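To give a concrete sense of how these patch outputs feed back into training, here is a sketch of a Pix2Pix-style generator objective: every patch of the generated image is pushed toward being judged "real", with an added L1 term toward the ground truth (the weight of 100 follows the Pix2Pix paper; our own weighting may differ).

```python
import torch
import torch.nn.functional as F

def generator_loss(disc_patch_logits, fake, target, l1_weight=100.0):
    """Adversarial patch loss plus L1 reconstruction loss for the generator."""
    adv = F.binary_cross_entropy_with_logits(
        disc_patch_logits, torch.ones_like(disc_patch_logits))
    l1 = F.l1_loss(fake, target)
    return adv + l1_weight * l1
```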

Demo

To use our demo, go here. To see how our demo works, watch the video above. Code for our models and demo can be accessed via our GitHub project.

Figure 8: Demo Input

Inspired by Gradio, we produced a Django-based interactive web demo of our colorization and superresolution solution. Like Gradio, the tool is self-explanatory and responsive. Unlike Gradio, it provides additional functionality so that users can gain a deeper understanding of our work.

In the demo (Fig. 8), all of our models are trained using the CelebA dataset. Users can upload their own photos. The system automatically generates a downscaled (to 1/16 of the original size) grayscale version as input, and then runs it through our pretrained models in real time. For regular-size images, the running time is within 10 seconds. For simplicity, users can also click the samples on the left for a quick start.
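The preprocessing step amounts to something like the following sketch (a hypothetical helper using Pillow; since our models upscale 4x in each direction, the demo shrinks each side by 4, i.e. to 1/16 of the pixels, before converting to grayscale):

```python
from PIL import Image

def prepare_input(path):
    """Shrink an uploaded photo to 1/4 of its width and height and convert
    it to single-channel grayscale, producing the degraded model input."""
    img = Image.open(path).convert("RGB")
    low_res = img.resize((img.width // 4, img.height // 4), Image.BICUBIC)
    return low_res.convert("L")
```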


We embedded multiple models within this demo: ESRGAN, Pix2Pix, and our own custom network. The demo is built for accessibility: users can preview the results of each model, and by presenting output from the different models side by side, a quick comparison can be made. To help users better understand the differences between models, we add a slider module to each generated result. Initially, each block shows the model result; users can drag the slider to compare it with the ground truth. As shown in Figure 9, the left side denotes the ground truth and the right side denotes the result generated by our model. This feature reduces the user's memory load and visual comparison effort. With these lightweight interactions, our tool helps users quickly identify the good spots and bad spots. Beyond these subjective evaluations, we also provide objective metrics to quantify our results (Figure 10).


Figure 9: Slider function

Figure 10: Sample result from our demo

Future Work

Training GANs can be particularly slow, and the models themselves run the risk of failing to converge or becoming unstable. Since we want to implement multiple models and run them simultaneously, we need to consider the challenges of performance. Therefore, we are working to optimize the models in our implementation through a variety of methods. Zhong et al.'s FastGAN algorithm, which recommends continual updates to the discriminator, is one inspiration.


As shown above, our results still leave much room for improvement. Even with the advanced ESRGAN model, the result is far from perfect; sliding between the model result and the ground truth, the difference is evident. One possible reason is that we have not utilized all of the CelebA dataset because of time and memory constraints. Another is that our input is both colorless and low-resolution (1/16 of its original size). Even with the most advanced models, training tricks and much more parameter tuning are needed. Ablation studies of approaches and parameters would be interesting future work.


For now, we have only exercised our application with celebrity face input. We had planned to try our model on legacy black-and-white photos to generate more meaningful results, and we could also conduct a subjective human-perception study on Amazon Mechanical Turk.

References


Anwar, Saeed, et al. "Image colorization: A survey and dataset." arXiv preprint arXiv:2008.10774 (2020).


Bashir, Syed Muhammad Arsalan, et al. "A comprehensive review of deep learning-based single image super-resolution." PeerJ Computer Science 7 (2021): e621.


Cheng, Zezhou, Qingxiong Yang, and Bin Sheng. "Deep colorization." Proceedings of the IEEE international conference on computer vision. 2015.


Cui, Meng-Yao, et al. "Towards natural object-based image recoloring." Computational Visual Media 8.2 (2022): 317-328.


Dong, Chao, et al. "Learning a deep convolutional network for image super-resolution." European conference on computer vision. Springer, Cham, 2014.


Goodfellow, Ian, et al. "Generative adversarial nets." Advances in neural information processing systems 27 (2014).


Isola, Phillip, et al. "Image-to-image translation with conditional adversarial networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.


Ledig, Christian, et al. "Photo-realistic single image super-resolution using a generative adversarial network." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.

Liu, Ziwei, et al. "Deep learning face attributes in the wild." Proceedings of the IEEE international conference on computer vision. 2015.

Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-Net: Convolutional networks for biomedical image segmentation." Lecture Notes in Computer Science (2015): 234-241.

Su, Jheng-Wei, Hung-Kuo Chu, and Jia-Bin Huang. "Instance-aware image colorization." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.

Yatziv, Liron, and Guillermo Sapiro. "Fast image and video colorization using chrominance blending." IEEE transactions on image processing 15.5 (2006): 1120-1129.