For a deeper understanding of our approach and results, please visit the 'Problem Statement' and 'Our Approach' sections. A summary of our findings, along with a discussion, is given in the 'Conclusion' section.
In this section, we present:
A comparison of different models across different learning settings
Images colorized by our best model
A comparison of our best model's closeness score with state-of-the-art methods
Application of our model to a short video
Note: a lower closeness score indicates better performance
Comparison of models in different learning settings
The vanilla Unet performs best among all the models under the same settings
Adding high-level extracted features to the bottleneck layer does not improve performance
A moderately complex network is needed to generalize well on the test set
Low values of lambda lead to degraded performance
Comparison of different loss functions used for training
Using perceptual loss together with per-pixel loss improves performance in all tested cases
Per-pixel loss computed with L1 performs better than with L2
The optimal training loss is therefore (L1 per-pixel loss + perceptual loss)
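The combined objective above can be sketched in a framework-agnostic way. Here the feature maps that would come from a fixed pretrained network (e.g. VGG) are passed in as plain arrays, and `lam` stands for the lambda weight on the perceptual term; all names are illustrative, not the project's actual code.

```python
import numpy as np

def combined_loss(pred, target, feat_pred, feat_target, lam=1.0):
    """L1 per-pixel loss plus an L1 perceptual term on feature maps.

    feat_pred / feat_target stand in for activations of a fixed
    pretrained network (e.g. VGG) evaluated on pred and target;
    lam weights the perceptual term.
    """
    pixel = np.abs(pred - target).mean()                 # per-pixel L1
    perceptual = np.abs(feat_pred - feat_target).mean()  # feature-space L1
    return pixel + lam * perceptual
```

With lam = 0 this reduces to plain per-pixel L1; raising lam pushes the model toward matching high-level features rather than exact pixel values, which matches the observation above that low lambda degrades performance.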
Fractional Strided Convolution vs Upsampling
Fractional strided convolution performs better than fixed upsampling
Fractional strided convolution learns its upsampling parameters during training and can therefore generate higher-quality results
Fixed upsampling is faster than fractional strided convolution because it uses a pre-defined interpolation strategy
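The contrast can be illustrated with a minimal sketch of the fixed-strategy side: nearest-neighbor upsampling simply replicates values by a pre-defined rule, whereas a fractionally strided (transposed) convolution would achieve the same resolution increase with weights learned during training. The function below is illustrative, not the project's implementation.

```python
import numpy as np

def nearest_upsample(x, factor=2):
    """Pre-defined (non-learned) upsampling: each pixel is replicated
    factor x factor times. A fractionally strided convolution would
    instead learn its interpolation weights from data."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

x = np.array([[1, 2],
              [3, 4]])
y = nearest_upsample(x)  # 4x4 array; each input pixel becomes a 2x2 block
```

Because the rule is fixed, this runs with no parameters and no training cost, which is why fixed upsampling is faster but produces lower-quality results than its learned counterpart.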
Deep Unet vs Shallow Unet
With a deeper Unet architecture, we obtained much better results than with its shallow counterpart
The deeper network's advantage is even larger on a big dataset
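One way to see why depth helps: each Unet encoder block halves the spatial resolution, so a deeper encoder summarizes a wider image context at the bottleneck. A small sketch of this arithmetic (the input size and depths are illustrative, not the project's exact configurations):

```python
def bottleneck_size(input_size, depth):
    """Spatial side length at the Unet bottleneck, assuming each
    encoder block halves resolution (stride-2 downsampling)."""
    size = input_size
    for _ in range(depth):
        size //= 2
    return size

# For a 256x256 input, a shallow 3-level encoder keeps a 32x32
# bottleneck, while a deeper 5-level encoder compresses to 8x8,
# forcing the bottleneck to encode more global, context-aware features.
```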
Images from the test set of the 102Flowers dataset
Images from the test set of the Doraemon cartoon series
Images from the test set of the Peppa Pig cartoon series
The method of Iizuka et al. [6] minimizes the loss by choosing dull colors whenever the model is unsure about the object or texture
The method of Zhang et al. [7] produces bright, colorful images. This boldness is a gamble, however: whenever the bright choice is not reflected in the ground-truth image, the closeness score (where lower is better) rises.
Unlike the other two methods, our model was trained on the 17 Flowers dataset, and the final scores for all methods were computed on the 17 Flowers test set; for this reason the results are slightly biased in our favor.
Images produced by our network
Images colorized by Iizuka et al. [6]
Images colorized by Zhang et al. [7]
To colorize a video, we extracted all frames from the target grayscale video, applied the colorization model to each frame, and finally re-encoded the colorized frames into a video.