Our style transfer approach consists of two steps, both of which have closed-form solutions and can be computed efficiently. The first step is the WCT transform, which stylizes the content image based on feature projections. As mentioned earlier, however, WCT was designed for artistic style transfer. We therefore employ a novel decoder design to achieve better image reconstruction. The WCT step alone still generates several structural artifacts that make the resulting image look non-photorealistic. We resolve this issue by training a spatial propagation network that serves as an anti-distortion filter. Our resulting images show fewer artifacts and are comparable to previous work, while retaining other advantages: faster, universal transfer that does not rely on costly post-processing steps.
A simple decoder trained with max-pooling layers served the purpose of artistic transfer well for WCT. PhotoWCT went one step further by employing an unpooling layer in place of plain upsampling: the indices of the pixels chosen by max-pooling are saved and passed to the decoder as the unpooling mask. This, however, still produced several distortions in the resulting image, with several "blurry" patches that needed to be corrected by a costly smoothing step.
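The pooling/unpooling pairing described above can be illustrated with a minimal sketch; the tensor sizes and layer placement are illustrative, not PhotoWCT's exact architecture:

```python
import torch
import torch.nn.functional as F

# Sketch of the pooling/unpooling pairing used by PhotoWCT: the encoder's
# max-pooling records which pixel "won" each window, and the decoder reuses
# those indices instead of blind upsampling.
x = torch.rand(1, 64, 32, 32)
pooled, idx = F.max_pool2d(x, kernel_size=2, stride=2, return_indices=True)  # encoder side
restored = F.max_unpool2d(pooled, idx, kernel_size=2, stride=2)              # decoder side, same shape as x
```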
We employ the novel local importance pooling (LIP) in our encoders, while keeping the unpooling layers in the decoders. LIP is an effective pooling layer based on local importance modeling: it automatically enhances discriminative features during downsampling by learning adaptive, input-dependent importance weights. Because the importance function is a learnable network, it is not limited to hand-crafted forms and can learn its own criterion for how discriminative a feature is. The window size of LIP is restricted to be no smaller than the stride, so that the full feature map is utilized and the fixed-interval sampling problem is avoided. More specifically, the importance function in LIP is implemented as a tiny fully convolutional network that learns to produce the importance map.
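A minimal sketch of LIP as an importance-weighted average pool follows; the single 1x1 convolution used as the logit network here is a simplification of the small fully convolutional network described above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LIP2d(nn.Module):
    """Local importance pooling: an importance-weighted average pool whose
    importance (logit) map is produced by a small learnable network.
    The 1x1 conv logit network is a hypothetical simplification."""
    def __init__(self, channels, kernel_size=3, stride=2, padding=1):
        super().__init__()
        self.logit = nn.Conv2d(channels, channels, kernel_size=1)
        self.kernel_size, self.stride, self.padding = kernel_size, stride, padding

    def forward(self, x):
        w = torch.exp(self.logit(x))   # positive, input-dependent importance weights
        num = F.avg_pool2d(x * w, self.kernel_size, self.stride, self.padding)
        den = F.avg_pool2d(w, self.kernel_size, self.stride, self.padding)
        return num / (den + 1e-8)      # weighted average over each window

# drop-in replacement for a stride-2 max-pooling layer in the VGG encoder
pool = LIP2d(channels=64)
out = pool(torch.rand(1, 64, 224, 224))  # -> torch.Size([1, 64, 112, 112])
```

Note that the 3x3 window with stride 2 satisfies the constraint that the window is no smaller than the stride.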
We incorporate LIP as described above, replacing the max-pooling layers in all four encoders (1-4) of the VGG_19 network with the new pooling layer. For training we use the entire unlabeled MSCOCO-17 dataset of about 122,000 images.
We used pre-trained models and weights to initialize the weights of each encoder/decoder pair. While training a decoder, the weights of the corresponding encoder are kept fixed. Our loss is the sum of a pixel-reconstruction loss and a feature loss: the pixel-reconstruction loss is the MSE between the reconstructed image and the original image, and the feature loss is the MSE between the features the encoder extracts from the reconstructed image and from the content image. The weight lambda on the feature loss is set to 1.
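A minimal sketch of this per-decoder loss, assuming `encoder` is the fixed VGG encoder up to the corresponding ReLU layer and `decoder` is the network being trained:

```python
import torch.nn.functional as F

lam = 1.0  # weight on the feature loss (lambda in the text)

def reconstruction_loss(image, encoder, decoder):
    feat = encoder(image)                            # content features
    recon = decoder(feat)                            # reconstructed image
    pixel_loss = F.mse_loss(recon, image)            # pixel-reconstruction loss
    feature_loss = F.mse_loss(encoder(recon), feat)  # feature loss
    return pixel_loss + lam * feature_loss
```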
Decoder2 and Decoder3 took about 4 and 6 hours respectively to train on an NVIDIA K-80 GPU; Decoder4 took 48 hours on the same GPU. The batch size was set to 2, since we ran out of GPU memory for larger batches. To compensate for this, we set the maximum number of epochs to 2.
Despite using the novel LIP pooling, the stylized results still contain significant artifacts due to the distortion caused by the deep auto-encoders. To resolve this issue, we exploit the spatial propagation network (SPN) proposed by Liu et al., a framework that learns to model pairwise pixel relations. The SPN can serve as an "anti-distortion filter", producing a stylized image with close affinity to the content image. The coefficients of the filter are learned in a data-driven manner through a CNN guidance network. We train an SPN using the reconstructed and whitened content images: the content image is passed through the encoder, and the SPN is then applied to the stylized image to minimize distortions caused by the auto-encoder. We use the whitened content image instead of the original content because we do not want any color information; affinity learned directly from the original content image would include color information and affect the stylized image.
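The sketch below is a heavily simplified illustration of this training step, not the full SPN of Liu et al.: a hypothetical guidance CNN predicts per-pixel propagation weights from the whitened content, and a single left-to-right linear propagation refines the distorted reconstruction toward the clean content image (the real SPN propagates in four directions with a three-way connection).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidanceNet(nn.Module):
    """Hypothetical tiny CNN predicting propagation weights from the whitened content."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, x):
        return self.net(x)  # propagation weights in (0, 1)

def propagate_rows(x, w):
    """h_t = (1 - w_t) * x_t + w_t * h_{t-1}, applied column by column."""
    cols = [x[..., 0]]
    for t in range(1, x.shape[-1]):
        cols.append((1 - w[..., t]) * x[..., t] + w[..., t] * cols[-1])
    return torch.stack(cols, dim=-1)

guidance = GuidanceNet()
optimizer = torch.optim.Adam(guidance.parameters(), lr=1e-4)

reconstructed = torch.rand(1, 3, 64, 64)  # distorted auto-encoder output
whitened = torch.rand(1, 3, 64, 64)       # whitened content (no color cue)
content = torch.rand(1, 3, 64, 64)        # clean content target

refined = propagate_rows(reconstructed, guidance(whitened))
loss = F.mse_loss(refined, content)       # learn to undo the auto-encoder distortion
loss.backward()
optimizer.step()
```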
The architecture of our approach remains similar to the multi-level stylization scheme used in the original WCT approach. However, instead of using five ReLU layers of the VGG_19 network as in the original WCT, we use only the first four. We found that using four layers produces fewer artifacts in many images while still transferring style to all aspects of the content image, and it also removes one costly WCT step (an SVD computation).
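The coarse-to-fine pipeline can be summarized by the following sketch; `encoders[k]`, `decoders[k]` and `wct` are placeholder names for the relu_k_1 encoder/decoder pairs and the whitening-coloring transform, not an exact interface:

```python
def stylize(content, style, encoders, decoders, wct):
    img = content
    for k in (4, 3, 2, 1):   # relu4_1 down to relu1_1; relu5_1 is dropped
        img = decoders[k](wct(encoders[k](img), encoders[k](style)))
    return img               # subsequently refined by the SPN step above
```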
While WCT2 proved highly effective at reconstructing content images, one of its major flaws is its heavy reliance on segmentation maps. These segmentation maps either need to be designed manually using appropriate software, or can be produced by another computationally expensive auto-encoder trained for that purpose. This makes WCT2 ill-suited for large-scale image stylization.
To address this problem we designed a heuristic approach that automates the design of segmentation maps without the need for a deep neural network. We use a modified version of k-means clustering in HSV color space to label segments of the content and style images, and then use color and location information of the labeled segments to match each segment in the content image to a similar segment in the style image. This approach allows the algorithm to dynamically determine how many segments each image should have.
Importantly, every segment in the content segmentation map should correspond to a segment in the style segmentation map. However, elements in the style image that do not correspond to elements in the content image should not be reflected in the content segmentation map.
(Figure: content image with its segmentation map, and style image with its segmentation map.)
Both content and style images were first transformed into HSV color space, which identifies pixel color by hue, saturation and value (brightness) rather than RGB values. This is preferable because it allows pixels to be labeled primarily by hue and saturation, placing less weight on differences in brightness. The HSV images are then segmented using standard k-means clustering (k values between 5 and 10 work effectively).
After clustering, a correction step was applied to combine segments that were too close in hue, and to account for the fact that the hue value in HSV space is cyclical.
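The sketch below illustrates the HSV k-means segmentation; as a shortcut it folds the cyclic hue directly into the clustering features by embedding hue as (cos H, sin H), rather than applying a separate post-clustering correction as described above, and the channel weighting is an illustrative choice:

```python
import cv2
import numpy as np

def segment_hsv(image_bgr, k=8):
    """Label image pixels by k-means clustering on hue (cyclic) and saturation."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    h = hsv[..., 0] * (np.pi / 90.0)   # OpenCV hue lies in [0, 180): map to radians
    feats = np.stack([np.cos(h), np.sin(h), hsv[..., 1] / 255.0], axis=-1)
    samples = feats.reshape(-1, 3)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, _ = cv2.kmeans(samples, k, None, criteria, 5, cv2.KMEANS_PP_CENTERS)
    return labels.reshape(image_bgr.shape[:2])
```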
As the images here demonstrate, the content and style segments will not necessarily match at this point.
(Figure: content and style segmentation maps before matching.)
Each labeled segment in the content image was then matched to the best-matching segment in the style image, in terms of the segment's location in the image and its color.
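A minimal sketch of this matching step follows; each content segment is assigned the style segment whose mean color and normalized centroid are closest, with the weight `alpha` between color and location being an illustrative choice:

```python
import numpy as np

def match_segments(content_labels, content_hsv, style_labels, style_hsv, alpha=0.5):
    """Return a mapping from content segment id to the closest style segment id."""
    def descriptors(labels, hsv):
        descs = {}
        h, w = labels.shape
        for seg in np.unique(labels):
            mask = labels == seg
            ys, xs = np.nonzero(mask)
            centroid = np.array([ys.mean() / h, xs.mean() / w])  # normalized location
            color = hsv[mask].mean(axis=0) / 255.0               # mean segment color
            descs[seg] = (centroid, color)
        return descs

    c_desc = descriptors(content_labels, content_hsv)
    s_desc = descriptors(style_labels, style_hsv)
    mapping = {}
    for c_seg, (c_pos, c_col) in c_desc.items():
        costs = {s_seg: alpha * np.linalg.norm(c_col - s_col) +
                        (1 - alpha) * np.linalg.norm(c_pos - s_pos)
                 for s_seg, (s_pos, s_col) in s_desc.items()}
        mapping[c_seg] = min(costs, key=costs.get)
    return mapping
```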
Finally, both segmentation maps were smoothed using a bilateral filter, which removes artifacts while preserving the edges of image elements.
As shown here, at this point all segments in the content map should match segments in the style map.