The first technique we chose was “Universal Style Transfer via Feature Transforms” (Li et al., 2017), the original WCT implementation for style transfer. This technique formulates style transfer as an image reconstruction process: it employs the VGG-19 network as the feature extractor (encoder) and trains a symmetric decoder to invert the VGG-19 features back to the original image. WCT is applied to one layer of content features so that their covariance matrix matches that of the style features; the transformed features are then fed into the downstream decoder to generate the stylized image. The technique is called universal, or learning-free, because the algorithm only requires learning the image reconstruction decoder, with no style images involved. Given a new style, we simply extract its feature covariance matrices and apply them to the content features via WCT. A control parameter defines the degree of style transfer. A multi-level stylization approach that matches the style statistics at all levels (five layers of the VGG-19 network) gives stylized images of better visual quality.
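The core whitening and coloring transform can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the feature shapes, the small eps regularizer, and the default alpha (the style-strength control parameter mentioned above) are our assumptions.

```python
import numpy as np

def wct(fc, fs, alpha=0.6, eps=1e-5):
    """Whitening and coloring transform on flattened feature maps.

    fc, fs: (C, H*W) content and style features from one VGG layer.
    alpha blends stylized features with the original content features.
    """
    C, n = fc.shape

    # Whitening: center content features, then remove their correlations
    # so the whitened features have (approximately) identity covariance.
    mc = fc.mean(axis=1, keepdims=True)
    fc_c = fc - mc
    Ec, Dc, _ = np.linalg.svd(fc_c @ fc_c.T / (n - 1) + eps * np.eye(C))
    whitened = Ec @ np.diag(Dc ** -0.5) @ Ec.T @ fc_c

    # Coloring: impose the style covariance on the whitened features,
    # then shift by the style mean.
    ms = fs.mean(axis=1, keepdims=True)
    fs_c = fs - ms
    Es, Ds, _ = np.linalg.svd(fs_c @ fs_c.T / (fs.shape[1] - 1) + eps * np.eye(C))
    colored = Es @ np.diag(Ds ** 0.5) @ Es.T @ whitened + ms

    # Style-strength control: alpha = 1 is full stylization.
    return alpha * colored + (1 - alpha) * fc
```

With alpha = 1, the output features match the style features' covariance and mean, which is exactly the statistic-matching step the decoder then inverts into an image.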
Compared to other style-transfer algorithms, this technique is very efficient and easy to implement, as no separate learning is involved for each style image. However, the algorithm was designed for artistic stylization and did not take photorealism into consideration; it therefore suffers from structural artifacts and produces distortions on object boundaries. This is clearly visible in the stylized images below. It was evident that the output images require some kind of smoothing step to make them photorealistic.
The second technique we implemented (Li et al., 2018) was PhotoWCT, an algorithm based on WCT that uses a novel network design to preserve photorealism. PhotoWCT uses the same network structure as WCT but is motivated by the observation that the max-pooling operation in WCT discards spatial information in the feature maps. In the plain WCT implementation described above, the decoder simply upsamples the feature maps to generate the stylized image, which fails to recover detailed structures of the input image. PhotoWCT therefore replaces the upsampling layers of WCT with unpooling layers used together with the max-pooling masks, which do a much better job of preserving spatial information. This considerably reduces structural artifacts, as is evident from the comparison below. Finally, to reduce inconsistent stylization across semantically similar regions, the technique applies an additional smoothing step. The smoothing step, which uses a graph-based ranking algorithm, aims to generate consistent stylization (preserving local affinity) while ensuring the output does not deviate significantly from the original PhotoWCT output (maintaining the global stylization effect). This technique produces satisfactory results as far as photorealism is concerned. Other variants include using alternative smoothing techniques in the post-processing step and incorporating smoothing into the PhotoWCT implementation itself.
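The difference between the two decoder strategies can be shown with a toy single-channel NumPy sketch (our own illustration, not taken from either paper): WCT-style upsampling repeats each pooled value over its block, while PhotoWCT-style unpooling places each value back at the exact position recorded by the max-pooling mask.

```python
import numpy as np

def max_pool_with_mask(x):
    """2x2 max pooling that also records where each maximum came from."""
    H, W = x.shape
    pooled = np.zeros((H // 2, W // 2))
    mask = np.zeros_like(x, dtype=bool)
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            block = x[i:i + 2, j:j + 2]
            di, dj = np.unravel_index(np.argmax(block), (2, 2))
            pooled[i // 2, j // 2] = block[di, dj]
            mask[i + di, j + dj] = True   # the max-pooling mask
    return pooled, mask

def upsample(pooled):
    """WCT-style decoder step: repeat each value over the whole 2x2 block."""
    return np.repeat(np.repeat(pooled, 2, axis=0), 2, axis=1)

def unpool(pooled, mask):
    """PhotoWCT-style decoder step: restore each value at its recorded
    max location, keeping the original spatial structure."""
    out = np.zeros(mask.shape)
    out[mask] = upsample(pooled)[mask]
    return out
```

Upsampling smears every value across its block; unpooling keeps the value pinned to where it originally occurred, which is the spatial detail PhotoWCT preserves.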
The post-processing step from PhotoWCT showed good results, as described above. We therefore incorporated this additional smoothing step between the layers of the ordinary WCT. This in fact reduced several artifacts and produced a better artistic style, “blending” the style and content more naturally than WCT alone. We tried different variants, incorporating the smoothing step between all layers versus just one. Based on our results, incorporating smoothing in the first two layers produced the best artistic results; unfortunately, photorealism was still missing.
Finally, we ran the test images through WCT2 (Yoo et al., 2019). WCT2 (wavelet-corrected transfer) makes two key improvements over WCT.
The first is the use of Haar wavelet pooling and unpooling operations in place of the standard pooling and unpooling layers in the VGG-based autoencoder used in WCT. Haar wavelet pooling splits the input into four channels {LLᵀ, LHᵀ, HLᵀ, HHᵀ}, where the low-pass and high-pass filters are given as Lᵀ = (1/√2)[1 1] and Hᵀ = (1/√2)[−1 1]. The benefit of Haar wavelet pooling for photorealistic image stylization is that the pooling operation can be exactly mirrored in the unpooling layer to recover the original signal, meaning that the features of the content image can be perfectly reconstructed in the output image. The high-frequency components (LH, HL, HH) of each pooling layer are passed directly to the corresponding layer of the decoder, while the low-frequency (LL) component continues through the entire encoder.
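The perfect-reconstruction property follows from the four 2x2 kernels (outer products of the filters above) forming an orthonormal basis for each block. A minimal NumPy sketch, using our own stride-2 block implementation rather than the authors' convolutional one:

```python
import numpy as np

# Haar low-pass and high-pass filters.
L = np.array([1.0, 1.0]) / np.sqrt(2.0)
H = np.array([-1.0, 1.0]) / np.sqrt(2.0)

# The four 2x2 kernels LL, LH, HL, HH are outer products of the 1-D filters.
KERNELS = {name: np.outer(a, b) for name, (a, b) in
           {"LL": (L, L), "LH": (L, H), "HL": (H, L), "HH": (H, H)}.items()}

def haar_pool(x):
    """Split x (H, W) into four half-resolution sub-bands with
    non-overlapping (stride-2) 2x2 filters."""
    blocks = x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2)
    return {name: np.einsum('ipjq,pq->ij', blocks, k)
            for name, k in KERNELS.items()}

def haar_unpool(bands):
    """Exactly invert haar_pool: the kernels are an orthonormal basis
    per 2x2 block, so summing the scaled kernels recovers the signal."""
    h, w = bands["LL"].shape
    out = np.zeros((h * 2, w * 2))
    for name, k in KERNELS.items():
        out += np.kron(bands[name], k)   # place each coefficient's kernel
    return out
```

Unlike max pooling, nothing is discarded: pooling followed by unpooling reproduces the input exactly, which is why WCT2 can perfectly reconstruct content features.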
Secondly, WCT2 uses progressive stylization in the architecture of the autoencoder, rather than the multi-level stylization used in WCT. The progressive autoencoder architecture passes input through only one encoder-decoder network, applying WCT operations sequentially after each pooling layer. This approach improves the efficiency of the algorithm both in training and inference, because only one decoder is trained. It also avoids amplification of artifacts, which can occur in multi-level stylization.
The resulting images from WCT2 were the most photorealistic of the approaches we tested; however, for best results, this method requires segmentation maps for both the content and style images as additional input. We ran a set of style/content image pairs through WCT2 both with and without segmentation maps to make a preliminary analysis of where this approach may fail.
In all conditions, WCT2 seems to fail when the content image is noisy, as well as when the style image has particularly sharp local patterns (as in the teapot image). When segmentation maps were used, many of the test images showed artifacts at the boundaries of segments. When no segmentation maps were used, the style was applied uniformly across the entire content image, without appropriate local styling; in this condition, the algorithm fails to transfer the style to the content image.
Overall, the WCT2 approach can transfer style coloring to the content image given appropriate segmentation maps; however, it fails to transfer localized texture from the style image. We believe the use of Haar wavelet pooling prevents localized texture from transferring by enforcing perfect reconstruction of the content image.
Our added automated segmentation maps, built with a heuristic approach, allow WCT2 to run without manually designed segmentation for each image. While the heuristic approach failed under some conditions, it was surprisingly effective on many of the images we tested and did not require a deep neural network.
Since our segmentation algorithm dynamically determined how many segments each image should have, there were many cases where it selected only one segment (when images had low variation in hue but some variation in value). This suggests that in future versions, the merging component of the algorithm should be more lenient with respect to the value component.
There were also a number of cases in which the algorithm matched content and style segments inappropriately. To some extent, this may be unavoidable using a heuristic approach. However, it may be improved by weighting elements of segment hue, value, and location differently, or by considering the shape, size, and compactness of the segment.
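The weighting idea above can be sketched as a small matching routine. This is purely illustrative: the segment representation (mean hue, mean value, normalized centroid) and the weights are hypothetical, not our actual algorithm.

```python
import numpy as np

def match_segments(content_segs, style_segs, w_hue=1.0, w_val=0.5, w_loc=0.25):
    """Greedily match each content segment to the closest style segment.

    Each segment is a dict with mean 'hue' (0-1, circular), mean 'value'
    (0-1), and normalized centroid 'loc' = (x, y). The weights control how
    much each cue matters and are the knobs one would tune.
    """
    def dist(c, s):
        dh = abs(c['hue'] - s['hue'])
        dh = min(dh, 1.0 - dh)                       # hue is circular
        dv = abs(c['value'] - s['value'])
        dl = np.hypot(c['loc'][0] - s['loc'][0],
                      c['loc'][1] - s['loc'][1])
        return w_hue * dh + w_val * dv + w_loc * dl

    # For each content segment, pick the style segment with the lowest
    # weighted distance.
    return [min(range(len(style_segs)), key=lambda j: dist(c, style_segs[j]))
            for c in content_segs]
```

Adding terms for segment shape, size, or compactness to `dist` would be the natural place to implement the improvements suggested above.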
Some images require smoothing to look photorealistic, but PhotoWCT's smoothing step is too expensive and over-smooths the resulting image. Smoothing could instead be incorporated between the layers.
Due to time constraints, we were only able to train one SPN network, for encoder4/decoder4. Training an SPN network on encoder5/decoder5 also ran out of GPU memory on the HTC system. Training an SPN network for each encoder/decoder pair could yield images with more contrast and preserve more features.