Methods

Our Approach to Implementing the Project

Part I. Artistic Style Image Transfer


For the first part of our project, we implement the Neural Algorithm of Artistic Style, which uses a CNN to transfer the style of an artistic reference image onto our input image.

One important finding of the Neural Algorithm is that the representations of content and style are separable in the CNN. As a result, we can manipulate the two representations independently and produce an image of high perceptual quality by minimizing a loss function. Specifically, the loss is a weighted trade-off between the content loss with respect to the input image and the style loss with respect to the reference image.

The CNN model extracts image features: its layers capture increasingly abstract features such as colors, edges, and corners, so we use only the higher layers as input features. After extracting features from the input and output images, we minimize the loss function, the weighted trade-off between content loss and style loss described above. The two parts of the loss function are defined as follows:

  1. The content loss is defined as the squared-error loss between the feature representations of the content input image and the result image at the selected CNN layer. Intuitively, we want the result image to have content similar to the input image.
  2. To make the style effects independent of spatial arrangement, we use the Gram matrix, the inner product between the vectorized feature maps in each layer, to represent feature correlations. Our style loss is then defined as the mean-squared distance between the Gram matrix of the reference image and the Gram matrix of the image being generated (both losses are sketched below).
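
Below is a minimal PyTorch sketch of these two per-layer losses. PyTorch, the (C, H, W) tensor shapes, and the 1/(4 C^2 (HW)^2) normalization from Gatys et al. are our assumptions, not details taken from this report:

    import torch

    def content_loss(F, P):
        # F, P: (C, H, W) feature maps of the result image and the content
        # image at the chosen layer; plain squared-error loss between them.
        return 0.5 * ((F - P) ** 2).sum()

    def gram_matrix(F):
        # Inner products between vectorized feature maps: a C x C matrix that
        # keeps feature correlations but discards spatial arrangement.
        C, H, W = F.shape
        f = F.reshape(C, H * W)
        return f @ f.t()

    def style_loss(F, A):
        # Mean-squared distance between the Gram matrices of the result image
        # (features F) and the style reference (features A) at one layer.
        C, H, W = F.shape
        return ((gram_matrix(F) - gram_matrix(A)) ** 2).sum() / (4.0 * C**2 * (H * W)**2)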

The very first step of our algorithm is to copy the content image into the output image before the first iteration. Then, to generate an image that mixes the style representation of the artwork with the content of the input image, we iteratively minimize the combined distance of the output image from the content representation at one CNN layer and from the style representation of the artwork at the chosen layers.
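
A minimal sketch of this loop, assuming PyTorch; content_image, num_iterations, and compute_total_loss (the weighted sum of the content and style losses over the chosen layers) are placeholders, not code from our implementation:

    import torch

    output = content_image.clone().requires_grad_(True)  # start from a copy of the content image
    optimizer = torch.optim.LBFGS([output])              # L-BFGS is a common choice here

    def closure():
        optimizer.zero_grad()
        loss = compute_total_loss(output)  # weighted content + style losses
        loss.backward()
        return loss

    for _ in range(num_iterations):
        optimizer.step(closure)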

For the second part of our project, we transfer our images using a photorealistic reference.

Part II. Realistic Style Image Transfer

One limitation of the Neural Algorithm is that it only transfers style successfully when the reference image is artistic.

As a result, we modify our algorithm in two ways:

  • a photorealism regularization term, added by modifying the loss function
  • an augmented style loss with semantic segmentation (see below)

We still use the same CNN model as the feature extractor, but we modify the loss function: the photorealism regularization is added, the simple style loss is updated to an augmented style loss with semantic segmentation, and the content loss is the same as before. The three parts of the loss function are defined as follows:

  1. The content loss is defined as the squared-error loss between the feature representations of the content input image and the result image at the selected CNN layer, as in Part I, because we always want the result to have content similar to the input image.
  2. For the photorealism regularization, we seek a transformation of the image that is locally affine in color space: for each output patch, there is an affine function that maps the input RGB values to their output counterparts. Each patch may use a different affine function, which allows spatial variation. The intuition is that affine combinations of the RGB channels can span a wide set of color variations, while an edge cannot move because it sits at the same location in all channels.
  3. The style loss defined in Part I captures the style of the whole image and therefore cannot adapt to the different semantic regions of the input image. So we generate segmentation masks for the input and reference images (segmentation details are described below), and the style loss becomes a sum of the style losses over the segmentation classes. For each class, we obtain feature representations from the selected CNN layers as before, but multiply them by the mask currently being processed. This yields a feature representation of the current segment only, so the result image does not mix styles across different parts of the content image. We still compute the Gram matrix of the masked feature representation of each segment, and the style loss is defined as the mean-squared distance between the masked Gram matrices of the style reference image and the output image (both new terms are sketched after this list).
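
Hedged sketches of the two new terms follow. The photorealism regularization is commonly realized (as in Deep Photo Style Transfer) as a quadratic form with the Matting Laplacian of the input photo, which we assume is precomputed as a SciPy sparse matrix; the masked style loss reuses gram_matrix from the Part I sketch. All names and shapes here are our assumptions:

    import numpy as np

    def photorealism_loss(output_img, matting_laplacian):
        # output_img: (H, W, 3) array; matting_laplacian: sparse (H*W, H*W)
        # Matting Laplacian of the *input* photo, assumed precomputed.
        loss = 0.0
        for c in range(3):
            v = output_img[:, :, c].reshape(-1)   # vectorized output channel
            loss += v @ (matting_laplacian @ v)   # quadratic form v^T L v
        return loss

    def masked_style_loss(F_out, F_ref, masks_out, masks_ref):
        # F_out, F_ref: (C, H, W) feature maps of the output and the style
        # reference at one layer; masks_*: (K, H, W) per-class segmentation
        # masks, already downsampled to the feature resolution.
        loss = 0.0
        for k in range(masks_out.shape[0]):
            G_out = gram_matrix(F_out * masks_out[k])  # Gram of masked features
            G_ref = gram_matrix(F_ref * masks_ref[k])
            loss += ((G_out - G_ref) ** 2).mean()
        return loss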

This part proceeds as in Part I but with the modified loss function, and we choose the CNN layers accordingly.

Semantic Segmentation

We generate semantic segmentation masks for the input and reference images, one per object class, to separate the content into parts. We do this manually in Photoshop to obtain the most accurate segmentation, but image segmentation can also be done with thresholding, a KNN method, or a CNN model; a reference for generating semantic segmentation with a CNN is listed in the references. A sketch of turning such a mask image into per-class masks follows.
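
A small sketch of converting a color-coded mask image into per-class binary masks; the filename and class colors are hypothetical:

    from PIL import Image
    import numpy as np

    def load_masks(path, colors):
        # path: color-coded mask image painted in Photoshop; colors: the RGB
        # value used for each class in that image.
        img = np.asarray(Image.open(path).convert("RGB"))
        return np.stack([np.all(img == np.array(c), axis=-1).astype(np.float32)
                         for c in colors])  # (K, H, W) binary masks

    masks = load_masks("input_seg.png", colors=[(255, 0, 0), (0, 0, 255)])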

Implementation Details

Part I. Implementation Model Details

VGG-19 Pre-trained CNN model as feature extractor

Content Layer: conv4_2

Weight: 1

Style Layers: relu1_1, relu2_1, relu3_1, relu4_1, relu5_1

Weight: 0.2, 0.2, 0.2, 0.2, 0.2
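
For reference, the Part I settings above can be collected into a configuration dictionary (the layout is our own convention; the layer names and weights are from the list above):

    PART1_CONFIG = {
        "content_layers": {"conv4_2": 1.0},
        "style_layers": {"relu1_1": 0.2, "relu2_1": 0.2, "relu3_1": 0.2,
                         "relu4_1": 0.2, "relu5_1": 0.2},
    }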

Part II. Implementation Model Details

VGG-19 Pre-trained CNN model as feature extractor

Content Layer: conv4_2

Weight: 1

Style Layers: conv1_1, conv2_1, conv3_1, conv4_1, conv5_1

Weight: 0.2, 0.2, 0.2, 0.2, 0.2

Γ = 10^2, λ = 10^4
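
Here Γ weights the style term and λ the photorealism term, so the Part II objective has the following overall shape (a sketch only; content_loss_at, masked_style_loss_at, and photorealism_term are placeholder helpers, not our implementation's names):

    GAMMA, LAMBDA = 1e2, 1e4  # Γ and λ above

    def part2_total_loss(output):
        # Content term at conv4_2 (weight 1), augmented style terms at the
        # five conv*_1 layers (weight 0.2 each), plus the photorealism term.
        l_content = content_loss_at("conv4_2", output)
        l_style = sum(0.2 * masked_style_loss_at(layer, output)
                      for layer in ["conv1_1", "conv2_1", "conv3_1",
                                    "conv4_1", "conv5_1"])
        return l_content + GAMMA * l_style + LAMBDA * photorealism_term(output)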

We generated results with different numbers of iterations to find the most compelling one. Here are examples from 100 iterations to 2200 iterations.

Our results were generated in a CPU-only environment. The average running time is 60 minutes for Part I and 100 minutes for Part II.