Image Colorization Based on Vision Transformer

Hint-Based Image Colorization Based on Hierarchical Vision Transformer

Hint-based image colorization is an image-to-image translation task that aims to create a full-color image from an input luminance image, given a small set of color values for some pixels as hints. Although various deep-learning-based methods have been proposed in the literature, they are based on convolutional neural networks (CNNs), whose convolution operations impose strong spatial locality. This often causes non-trivial visual artifacts in the colorization results, such as false colors and color bleeding. To overcome this limitation, this study proposes a vision-transformer-based colorization network. The proposed hint-based colorization network has a hierarchical vision transformer architecture in the form of an encoder-decoder structure built from transformer blocks. Because the transformer blocks can learn rich long-range dependencies, the proposed method achieves visually plausible colorization results even with a small number of color hints. Verification experiments reveal that the proposed transformer model outperforms conventional CNN-based models. In addition, we qualitatively analyze the effect of the transformer's long-range dependency on hint-based image colorization.
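To make the task setup concrete, the sketch below shows one common way to assemble the network input from a luminance channel and sparse color hints: the ground-truth ab chrominance is revealed only at the hint pixels, together with a binary mask marking where hints exist. This is a minimal illustration, not code from the paper; the function name and channel layout are assumptions.

```python
import numpy as np

def make_hint_input(L, ab, hint_coords):
    """Stack luminance with a sparse ab hint map and a hint mask.

    L:  (H, W) luminance channel of a CIELAB image.
    ab: (H, W, 2) ground-truth chrominance (sampled only at hint pixels).
    hint_coords: list of (row, col) pixels revealed as color hints.
    Returns an (H, W, 4) array with channels [L, hint_a, hint_b, mask].
    """
    H, W = L.shape
    hints = np.zeros((H, W, 2), dtype=L.dtype)
    mask = np.zeros((H, W, 1), dtype=L.dtype)
    for r, c in hint_coords:
        hints[r, c] = ab[r, c]   # reveal the true chrominance at this pixel
        mask[r, c] = 1.0         # mark where a hint is present
    return np.concatenate([L[..., None], hints, mask], axis=-1)
```

The colorization network then predicts the full ab channels from this stacked input, so pixels with mask = 0 must be inferred from context.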

Paper link: https://www.mdpi.com/1424-8220/22/19/7419 

Figure 6. Comparison of hint-based colorization. (a) Input luminance and color hints. (b) Results of a CNN (Zhang’s method [7]). (c) Results of our transformer-based colorization. (d) Ground truth. Note that the transformer-based method outperforms the CNN-based method because it considers rich long-range dependencies. Color hints have been enlarged for visual representation.

In this study, we proposed a vision transformer network with an encoder-decoder architecture for hint-based colorization. Through validation of the proposed model, we showed that the long-range dependency of the transformer works effectively in hint-based colorization tasks. Even in regions with few or no hints, the proposed model produced better results by exploiting the color hints of similar objects elsewhere in the image via its long-range dependency. Furthermore, the experiments showed that the fewer the hints, the greater the effect of the long-range dependency.

Figure 1. Overall architecture of the proposed hint-based colorization transformer network (HCoTnet).

As shown in Figure 1, the proposed HCoTnet is divided into three main parts: first, a patch embedding (tokenization) module that prepares the input data (i.e., the luminance image and color hint map) for the transformer; second, Unet-like encoder and decoder modules consisting of transformer blocks [17]; and finally, a projection module that produces the output by restoring the embedded features and projecting them onto the ab channels of the CIELAB color space. The luminance image and color hint map are the inputs of the network, and the HCoT network outputs the colorization result as the ab color channels.
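The patch embedding (tokenization) step can be sketched as follows: the stacked luminance-and-hint map is split into non-overlapping patches, each flattened and linearly projected to a token vector that the transformer encoder consumes. This is a minimal numpy illustration of the idea, not the paper's implementation; the patch size, embedding dimension, and random projection weights are illustrative assumptions.

```python
import numpy as np

def patch_embed(x, patch=16, dim=128, seed=0):
    """Tokenize an (H, W, C) input map into patch embeddings.

    x: input map (e.g., luminance + ab hints + mask channels).
    Returns an (N, dim) token sequence, one token per patch.
    """
    H, W, C = x.shape
    assert H % patch == 0 and W % patch == 0, "image must tile into patches"
    # Rearrange into (N, patch*patch*C) flattened non-overlapping patches.
    p = x.reshape(H // patch, patch, W // patch, patch, C)
    p = p.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    # Linear projection to the embedding dimension (random weights here;
    # in a real network this is a learned layer).
    W_proj = np.random.default_rng(seed).standard_normal(
        (patch * patch * C, dim)) * 0.02
    return p @ W_proj  # (N, dim) tokens for the transformer encoder
```

The projection module at the other end of the network performs the inverse mapping: tokens are reshaped back to a spatial grid and projected to two output channels (a and b).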

Figure 4. Visual comparison of hint-based colorization methods. (a) Input luminance image. (b) Results of Unet [17]. (c) Results of Iizuka [1]. (d) Results of Zhang [7]. (e) Results of ViT [11]. (f) Results of the proposed HCoTnet. (g) Ground truth.

Figure 8. Visual comparison of the effect of long-range dependency on hint-based image colorization. (a) Input luminance and color hints. (b) Result of Zhang’s CNN model [7]. (c) Result of the proposed HCoTnet model. (d) Ground truth.