CS639 Project

Frequency Based Identity-preserving Image Translation and Editing

Mu Cai

Department of Computer Sciences, UW-Madison

Abstract

Image translation aims at synthesizing a new image that transfers the texture from a reference image while maintaining the structure of the source image. Recent state-of-the-art models are mostly built on autoencoder-based Generative Adversarial Networks, which rely on the spatial dimensions of two tensors to represent the structure and texture information. However, such an implicit disentanglement turns out to be far from satisfactory in terms of structure preservation. Inspired by frequency analysis, I decompose the original image into high- and low-frequency images, and further devise a spatial attention module to encourage the structure-texture decoupling process. Results on the LSUN Church and CelebA-HQ datasets demonstrate the superiority of my proposed method. Furthermore, I apply my method to the challenging image editing task and achieve more structure-preserving results on human face images.

1. Introduction

1.1 Image Translation

Given two images, image translation aims at synthesizing a new image that transfers the style of the reference image while maintaining the content of the source image, as illustrated in Fig. 1.

This technology can synthesize new images that do not actually exist. For example, we can synthesize a winter image of a mountain even if we only have a summer image of that mountain, as illustrated in Fig. 1.


Figure 1: The illustration of the image translation task.

1.2 Image Editing

Image editing aims at editing certain attributes of a given image. As shown in Fig. 2, given human face images, image editing models can edit the age, eyeglasses, gender, and even the pose of these images.

Figure 2: The illustration of the image editing task.

2. State-of-the-art Models

Swapping Autoencoder [1] is a state-of-the-art image translation and image editing model that appeared at NeurIPS 2020. The model utilizes an autoencoder and several discriminators to generate image hybrids. Its convolution blocks are built on the well-known StyleGAN v2 model [2].

To be specific, the autoencoder is composed of the encoder E and the generator G, while the discriminators consist of the vanilla discriminator D and the patch discriminator D_{patch}. The overall network architecture is shown in Fig. 3, where different modules are marked with different colors.

The compressed square tensor denotes the content information, and the flat tensor denotes the style information. By swapping these components, the paper claims to accomplish the image translation task.
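For concreteness, the swapping operation can be sketched as follows. This is a minimal illustration assuming that E returns a (structure, texture) code pair and that G consumes one code of each kind; the names and shapes are chosen only for this example and are not the exact interfaces of the released code.

```python
import torch

def swap_generate(E, G, x_source, x_reference):
    """Minimal sketch of the swapping idea (shapes are illustrative only)."""
    z_struct_src, z_tex_src = E(x_source)        # spatial structure code + flat texture code
    z_struct_ref, z_tex_ref = E(x_reference)
    reconstruction = G(z_struct_src, z_tex_src)  # should reproduce the source image
    hybrid = G(z_struct_src, z_tex_ref)          # structure of the source, texture of the reference
    return reconstruction, hybrid
```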

Other well-known image translation and editing models, such as StarGAN [3], StyleGAN [4], and MUNIT, are all based on an encoder-generator structure along with discriminators.

Figure 3: The network structure of Swapping Autoencoder [1].

3. Existing Problems

Swapping Autoencoder relies on the spatial sizes of the tensors to represent the content and style information, which is an implicit way of representation.

Therefore, it leads to unsatisfactory results in some cases, as demonstrated by the image generation results of Swapping Autoencoder [1] in Fig. 4. This problem also appears in other image translation models, such as StarGAN [3] and MUNIT.

Figure 4: The results of the Swapping Autoencoder [1] on the LSUN Church dataset.

4. My Approach

4.1 Frequency Domain Supervision

Here I mainly focus on explicitly representing the content and style information to supervise the training process.

An intriguing observation in traditional image processing is that the high/low frequency information corresponds to the content/style information quite well, as shown in Fig. 5.

Figure 5: Examples of high/low frequency images from the LSUN Church dataset.

In order to achieve the feature disentanglement in the frequency domain, we apply a Gaussian blur filter to the original images. Specifically, the blur kernel k is given by

k[i, j] = (1 / (2πσ^2)) · exp(−(i^2 + j^2) / (2σ^2)),

where [i, j] indexes the position in the image, and σ^2 denotes the variance of the Gaussian function, which grows proportionally with the Gaussian kernel size. We can then get the 'blurred' low-frequency image x_L by convolving the kernel with the original image x:

x_L = k ∗ x.

The high-frequency image can be expressed as

x_H = r2g(x) − r2g(x_L),

where r2g is short for the rgb2gray function, which converts a color image to grayscale. In this way, the color and illumination of the low-frequency image x_L are preserved while the fine details are wiped out. Meanwhile, with the help of the rgb2gray function, the high-frequency image x_H mostly contains the sharp edges, i.e., the sketch of the original image. We can then directly use such information to supervise the training process. To be specific, I add two loss functions to the model (both, together with the decomposition itself, are sketched in code after the definitions below):

  • Frequency reconstruction loss: restricts the reconstructed image to match both the high-frequency image x_H and the low-frequency image x_L of the original input.

  • High frequency match loss: restricts the synthesized hybrid G(z^1_c, z^2_s) to share the same high-frequency information as the source image, which corresponds to the structure information; here z^1_c and z^2_s are the structure code of the source image and the texture code of the reference image, respectively.
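The decomposition and the two losses can be sketched in PyTorch as follows. This is a minimal sketch under stated assumptions: the kernel size, σ, the use of L1 norms, and the helper names (decompose, frequency_losses) are illustrative choices rather than the exact settings of the trained model.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size: int, sigma: float) -> torch.Tensor:
    """2-D Gaussian kernel k[i, j] ∝ exp(-(i^2 + j^2) / (2 sigma^2)), normalized to sum to 1."""
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2.0
    g = torch.exp(-coords ** 2 / (2.0 * sigma ** 2))
    kernel = torch.outer(g, g)
    return kernel / kernel.sum()

def rgb2gray(x: torch.Tensor) -> torch.Tensor:
    """Luminance conversion for a (B, 3, H, W) tensor."""
    w = torch.tensor([0.299, 0.587, 0.114], device=x.device).view(1, 3, 1, 1)
    return (x * w).sum(dim=1, keepdim=True)

def decompose(x: torch.Tensor, size: int = 21, sigma: float = 4.0):
    """Return (x_L, x_H): the Gaussian-blurred low-frequency image and the
    grayscale high-frequency residual x_H = r2g(x) - r2g(x_L)."""
    k = gaussian_kernel(size, sigma).to(x.device).view(1, 1, size, size)
    k = k.repeat(x.shape[1], 1, 1, 1)                       # one kernel per channel (depthwise)
    x_low = F.conv2d(x, k, padding=size // 2, groups=x.shape[1])
    x_high = rgb2gray(x) - rgb2gray(x_low)
    return x_low, x_high

def frequency_losses(x_src, recon_src, hybrid):
    """Hypothetical loss terms built on the decomposition; L1 distances are assumed."""
    src_low, src_high = decompose(x_src)
    rec_low, rec_high = decompose(recon_src)
    _, hyb_high = decompose(hybrid)
    # Frequency reconstruction loss: the reconstruction must match both bands of the source.
    loss_rec = F.l1_loss(rec_low, src_low) + F.l1_loss(rec_high, src_high)
    # High frequency match loss: the hybrid G(z^1_c, z^2_s) must keep the source's high-frequency content.
    loss_match = F.l1_loss(hyb_high, src_high)
    return loss_rec, loss_match
```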

4.2 Spatial Attention Module

In image manipulation tasks, the structure and texture information should be represented in a disentangled way. To better assist this decoupled representation, a Channel-wise Spatial Attention Module is utilized in my algorithm, as shown in Fig. 6.

Here F_sq(·) squeezes the feature map into a channel-wise vector; F_ex excites the channel-wise vector into two attention maps; and F_scale multiplies the feature map by each of the attention maps, producing the content and style features. Therefore, this module helps the network disentangle these two components.

Figure 6: Channel-wise Spatial Attention Module
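The module follows the squeeze-and-excitation pattern; a minimal PyTorch sketch is given below. The reduction ratio, the layer types, and the sigmoid at the end are assumptions made for illustration and may differ in detail from the module used in the experiments.

```python
import torch
import torch.nn as nn

class ChannelwiseSpatialAttention(nn.Module):
    """Sketch of the squeeze/excite/scale idea: one channel-attention branch for
    the content features and one for the style features."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)        # F_sq: (B, C, H, W) -> (B, C, 1, 1)
        self.excite = nn.Sequential(                  # F_ex: one vector -> two attention maps
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, 2 * channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor):
        b, c, _, _ = x.shape
        v = self.squeeze(x).view(b, c)                # channel-wise vector
        attn = self.excite(v).view(b, 2, c, 1, 1)
        a_content, a_style = attn[:, 0], attn[:, 1]
        # F_scale: re-weight the feature map with each attention map
        return x * a_content, x * a_style
```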

5. Experiment Results

5.1 Reproducing

I first reproduce the method of the original paper [1] using PyTorch.

5.2 Image Translation Results

I first conduct experiments on the LSUN Church dataset [5]. As shown in Fig. 7, my algorithm achieves much better structure-preserving results, as demonstrated by the objects within the red boxes. Besides, I use the Fréchet Inception Distance (FID) [6] as the quantitative comparison metric; the lower the FID, the better the image quality. With my approach, the FID of the generated images decreases from 52.34 to 51.52. Moreover, in terms of the reconstruction quality of the high-frequency information, the MSE loss decreases by 17.21%.
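For reference, FID [6] compares the mean and covariance of Inception features extracted from real and generated images. Below is a minimal numpy/scipy sketch of that formula; feature extraction is assumed to happen elsewhere, and this snippet is illustrative rather than the exact evaluation script used for the numbers above.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """FID between two (N, D) arrays of Inception features:
    ||mu_r - mu_f||^2 + Tr(S_r + S_f - 2 (S_r S_f)^{1/2})."""
    mu_r, sigma_r = feats_real.mean(axis=0), np.cov(feats_real, rowvar=False)
    mu_f, sigma_f = feats_fake.mean(axis=0), np.cov(feats_fake, rowvar=False)
    diff = mu_r - mu_f
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_f, disp=False)
    if np.iscomplexobj(covmean):                 # drop tiny imaginary parts from numerical error
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma_r + sigma_f - 2.0 * covmean))
```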

I also conduct experiments on CelebA-HQ [7]. As shown in Fig. 8, my algorithm achieves much better face-identity-preserving results.

Figure 7: Image translation results of my model on the LSUN Church dataset.

Figure 8: Image translation results on CelebA-HQ Dataset.

5.3 Image Editing Results

I conduct experiments on the image editing task on the CelebA-HQ dataset. Specifically, I use continuous interpolation between the texture codes of two domains.

Continuous interpolation aims at creating a series of smoothly changing images between two sets of distinct images. Vector arithmetic is one commonly used way to achieve this. For example, we can sample n images from each of the two target domains, and then compute the average difference between the latent vectors of these two sets of images.

This mean difference vector ẑ can be viewed as the directional vector from one domain to the other. Similar to the idea in InterFaceGAN [8], we can assume that a hyperplane lies between the latent codes of the two domains. Therefore, a model achieves better content-style disentanglement when vector arithmetic over one domain does not affect the representation of the other information. After employing this algorithm on the texture code, as shown in Fig. 9, my model achieves much better face-attribute-preserving results than the vanilla Swapping Autoencoder and another image manipulation model, StarGAN v2 [3].
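The editing procedure can be sketched as follows; E and G are the encoder and generator from Section 2, while the function names, batch handling, and interpolation strengths are illustrative assumptions rather than the exact editing script.

```python
import torch

def attribute_direction(E, images_a, images_b):
    """Estimate the editing direction z_hat as the mean difference between the
    texture codes of two attribute domains (e.g. "summer" vs. "winter")."""
    _, tex_a = E(images_a)                        # E is assumed to return (structure, texture) codes
    _, tex_b = E(images_b)
    return tex_b.mean(dim=0) - tex_a.mean(dim=0)  # direction from domain A to domain B

def edit(E, G, x, z_hat, strengths=(0.0, 0.5, 1.0, 1.5)):
    """Continuous interpolation: move the texture code along z_hat while keeping
    the structure code of x fixed, yielding a series of smoothly changing edits."""
    z_struct, z_tex = E(x)
    return [G(z_struct, z_tex + alpha * z_hat) for alpha in strengths]
```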

More results of season transfer on the LSUN Church dataset are shown in Fig. 10.

Figure 9: Image editing results on the CelebA-HQ dataset: (a) StarGAN v2, (b) Swapping Autoencoder, (c) My Method.

Figure 10: Image attribute editing results on the LSUN Church dataset. The leftmost column shows the source images; in the remaining columns, moving further right corresponds to a stronger semantic latent difference vector for "winter".

6. Discussion

Conclusion:

Traditional image processing techniques such as frequency analysis can substantially assist deep learning models.

Future work:

Frequency based methods can be applied in other image manipulation tasks, like image inpainting, super-resolution and so on.

Materials

If you want to read more about my approach, please see the following materials: [Report] [Video] [Slides] [Code]

References

[1] Taesung Park et al. "Swapping Autoencoder for Deep Image Manipulation". In: Advances in Neural Information Processing Systems. 2020.
[2] Tero Karras et al. "Analyzing and Improving the Image Quality of StyleGAN". In: Proc. CVPR. 2020.
[3] Yunjey Choi et al. "StarGAN v2: Diverse Image Synthesis for Multiple Domains". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 8188–8197.
[4] Tero Karras, Samuli Laine, and Timo Aila. "A Style-Based Generator Architecture for Generative Adversarial Networks". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 4401–4410.
[5] Fisher Yu et al. "LSUN: Construction of a Large-Scale Image Dataset Using Deep Learning with Humans in the Loop". In: arXiv preprint arXiv:1506.03365 (2015).
[6] Martin Heusel et al. "GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium". In: NeurIPS. 2017.
[7] Tero Karras et al. "Progressive Growing of GANs for Improved Quality, Stability, and Variation". In: arXiv preprint arXiv:1710.10196 (2017).
[8] Yujun Shen et al. "InterFaceGAN: Interpreting the Disentangled Face Representation Learned by GANs". In: arXiv preprint arXiv:2005.09635 (2020).