In this study, we analyzed the impact of different components of a learning system for Image Colorization. We then compared our best model with state-of-the-art models. Finally, we trained our model on a larger dataset (12,000 images) and showcased the results by colorizing a grayscale video.
Incorporating perceptual loss in the generator loss leads to improved learning and better test results
We experimented with six different configurations of the loss function, as described in 'Our Approach'
We found that incorporating perceptual loss in addition to per-pixel loss led to better results in all our experiments
The best setting for the loss was L1 per-pixel loss + perceptual loss
This may be because perceptual loss encourages learning of style and texture in addition to finer details
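The combined objective can be sketched as follows. This is a minimal NumPy illustration, assuming an L1 distance in both pixel space and feature space; the fixed random projection is a toy stand-in for the real pretrained feature extractor, and the weight names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def l1_pixel_loss(pred, target):
    # mean absolute error over all pixels
    return np.mean(np.abs(pred - target))

# Toy stand-in for a fixed pretrained feature extractor; a real
# perceptual loss would use CNN activations instead.
W = rng.standard_normal((64, 16))

def features(img):
    # flatten the image and apply a fixed ReLU projection
    return np.maximum(img.reshape(-1) @ W, 0.0)

def perceptual_loss(pred, target):
    # L1 distance in feature space rather than pixel space
    return np.mean(np.abs(features(pred) - features(target)))

def generator_content_loss(pred, target, w_pix=1.0, w_perc=1.0):
    # best reported setting: L1 per-pixel loss + perceptual loss
    return w_pix * l1_pixel_loss(pred, target) + w_perc * perceptual_loss(pred, target)

pred = rng.standard_normal((8, 8))
target = rng.standard_normal((8, 8))
print(generator_content_loss(pred, target))
```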
The Unet architecture performs better than Resnet under all tested conditions for Image Colorization
We compared Resnet against Unet in our experiments, using different values of lambda and different network complexities while keeping all other essential hyperparameters the same
Unet consistently outperformed Resnet for the task of Image Colorization
This appears to be a result of the longer skip connections, which allow the decoder to reuse the features learned during the encoding phase
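These long skip connections can be illustrated with a shape-only sketch; the channel counts and resolutions below are assumed for illustration, not taken from the actual architecture:

```python
import numpy as np

# A Unet decoder concatenates the matching encoder feature map
# (a "long" skip connection) at each resolution, so features learned
# during encoding reach the decoder directly.
enc1 = np.zeros((1, 64, 128, 128))   # early encoder features (fine detail)
enc2 = np.zeros((1, 128, 64, 64))    # deeper encoder features

dec2 = np.zeros((1, 128, 64, 64))    # decoder features after upsampling
dec2 = np.concatenate([dec2, enc2], axis=1)  # skip: 128 -> 256 channels

dec1 = np.zeros((1, 64, 128, 128))
dec1 = np.concatenate([dec1, enc1], axis=1)  # skip: 64 -> 128 channels

print(dec2.shape, dec1.shape)
```

A plain Resnet generator lacks these encoder-to-decoder concatenations, which is one plausible reason for the gap observed here.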
High values of lambda (above 50) lead to overfitting on smaller training sets
With high values of lambda, our model tended to overfit. Additionally, the advantages of using a GAN were lost at higher values, since the model became effectively equivalent to a plain CNN
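The trade-off can be seen from the pix2pix-style generator objective, loss_G = adversarial_loss + lambda * L1_loss. The magnitudes below are illustrative placeholders, not measured values:

```python
# pix2pix-style generator objective (notation assumed):
#   loss_G = adversarial_loss + lambda * L1_loss
adv, l1 = 0.7, 0.05  # illustrative loss magnitudes

shares = []
for lam in (1, 10, 50, 100):
    total = adv + lam * l1
    shares.append(adv / total)  # fraction of the loss driven by the GAN
    print(lam, round(adv / total, 3))

# As lambda grows, the adversarial share shrinks, so training behaves
# more and more like plain CNN regression on the L1 term.
```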
Smaller networks are not able to generalize well on the test set; a mildly complex network is required for improved performance
We tried different values for the minimum number of kernels and concluded that a mildly complex architecture is required for optimal learning and performance
Smaller networks could not generalize well on the test set
Using high-level features from a pre-trained network at the bottleneck layer may lead to degraded performance of the baseline architecture
We used Inception V2 to extract high-level features and fused them at the bottleneck layer of our model
We applied this strategy to both Resnet and Unet
We found that this strategy degrades the closeness score in both architectures
This may be due to irrelevant high-level features extracted by the pre-trained network
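One common way to fuse a global feature vector at a bottleneck is to tile it across the spatial grid and concatenate it channel-wise; the sketch below assumes this fusion scheme, with illustrative shapes rather than the actual ones:

```python
import numpy as np

# Generator bottleneck and a global feature vector from a pretrained
# network (shapes assumed for illustration).
bottleneck = np.zeros((1, 256, 8, 8))
pretrained = np.zeros((1, 1000))  # e.g. Inception-style global features

# Tile the vector over the 8x8 spatial grid, then concatenate channels.
tiled = np.tile(pretrained[:, :, None, None], (1, 1, 8, 8))  # (1, 1000, 8, 8)
fused = np.concatenate([bottleneck, tiled], axis=1)          # (1, 1256, 8, 8)
print(fused.shape)
```

If the pretrained features are irrelevant to colorization, the extra channels add noise the decoder must learn to ignore, which is consistent with the degradation observed here.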
Fractional Strided Convolution performs much better than Convolution + Upsample
We compared a Unet model implemented with Fractional Strided Convolution against one using Convolution + Upsample
We concluded that Fractional Strided Convolution performs much better than Convolution + Upsample
This may be because fractional strided convolution learns its mapping parameters during training, so the up-convolution layers can adapt to the dataset
That being said, Upsample is faster and may perform equally well when a suitable upsampling strategy for the dataset is already known
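The difference can be illustrated in 1D: nearest-neighbour upsampling is a fixed, parameter-free operation, while a fractional strided (transposed) convolution inserts zeros between inputs and convolves with a learned kernel. The kernel values below are arbitrary placeholders for learned weights:

```python
import numpy as np

def upsample_nearest(x):
    # fixed 2x upsampling: no parameters, same mapping for every dataset
    return np.repeat(x, 2)

def frac_strided_conv(x, k):
    # 1D transposed convolution with stride 2: insert zeros between
    # the inputs, then convolve with kernel k (learned during training)
    z = np.zeros(2 * len(x))
    z[::2] = x
    return np.convolve(z, k, mode="same")

x = np.array([1.0, 2.0, 3.0])
print(upsample_nearest(x))  # [1. 1. 2. 2. 3. 3.]
print(frac_strided_conv(x, np.array([0.5, 1.0, 0.5])))
```

Because k is trained, the transposed convolution can adapt its interpolation to the data, whereas the nearest-neighbour mapping is fixed in advance.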
A larger Unet architecture performs better than a smaller one
We compared a smaller version of Unet with a larger version of the same architecture (more layers)
We concluded that the larger network performs significantly better than its scaled-down version
A few focus areas that remain to be explored are:
In our study, we constrained ourselves to a smaller dataset and fewer epochs due to time and resource constraints. By training the models on a larger dataset for a higher number of epochs, we could obtain significantly better results.
We need a better quantitative metric to compare our results. As we have explained before, the Closeness score is not an appropriate measure for problems such as Image Colorization, which do not have a unique solution.
Optimizing other hyperparameters, such as weight initialization and the learning schedule, can lead to better results.
Temporal consistency can be explored for implementing Image Colorization in videos.