CycleGAN for Facial Expression Recognition
By Michelle Lin and Fatemeh Ghezloo
GAN
Let's back up a little and talk about what a GAN is and why a CycleGAN makes sense here. GAN stands for Generative Adversarial Network. The input to a GAN is random noise, and the output should be a realistic image. GANs are typically used to diversify the input space, creating more data to improve a model's accuracy. However, this requires many images of the output space to train on, which we do not have.
CycleGAN
CycleGANs are models that train two GANs on unpaired input. All you need is a set of images from your reference class and a set of images from your target class. Since the input images are unpaired, it is completely fine to have far fewer images from your target class. One GAN learns to translate images from the reference class to the target class, and the other learns to translate images from the target class to the reference class. Because we lack images for some emotions but have many images for others, we can use CycleGANs to perform image-to-image translation and create more images.
The diagram above shows how CycleGAN training works. Essentially, we have reference and target class images as input. The reference class images are fed into the generator for the target class to produce fake target images, and the target class images are fed into the generator for the reference class. The real and fake images for both classes go into their respective discriminators, which try to discern whether each image is real. The loss backpropagated through the generators is a combination of cycle-consistency loss (if I generate a fake target class image and use it to generate a fake reference class image, how similar is that to the input reference class image?) and GAN loss (how confident is the discriminator that I'm lying?). The loss backpropagated through each discriminator is just a mean squared error (MSE) loss against the real/fake labels.
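As a minimal PyTorch-style sketch of how these two loss terms combine (the function names, the L1 form of the cycle loss, and the λ = 10 cycle weight are our assumptions following the original CycleGAN paper, not something stated above):

```python
import torch
import torch.nn.functional as F

def generator_loss(G_ref2tgt, G_tgt2ref, D_tgt, real_ref, lam=10.0):
    # Translate reference -> target, then back again.
    fake_tgt = G_ref2tgt(real_ref)
    rec_ref = G_tgt2ref(fake_tgt)

    # GAN loss: how confident is the discriminator that we're lying?
    # Least-squares form, matching the MSE loss used for the discriminators.
    pred_fake = D_tgt(fake_tgt)
    gan = F.mse_loss(pred_fake, torch.ones_like(pred_fake))

    # Cycle-consistency loss: the round trip should reproduce the input.
    cycle = F.l1_loss(rec_ref, real_ref)
    return gan + lam * cycle

def discriminator_loss(D, real, fake):
    # MSE loss pushing real scores toward 1 and fake scores toward 0.
    pred_real = D(real)
    pred_fake = D(fake.detach())   # detach: don't backprop into the generator
    return (F.mse_loss(pred_real, torch.ones_like(pred_real)) +
            F.mse_loss(pred_fake, torch.zeros_like(pred_fake)))
```

The same helpers work in both directions, so the full generator objective is just the sum of the ref-to-tgt and tgt-to-ref calls.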
CycleGAN Paper Architecture
This is the architecture of the CycleGAN we replicated from the paper. Every layer of the generator network is either a convolution or a deconvolution, followed by batch normalization and ReLU. The middle layers mostly use 3 by 3 convolutions, while the input and output layers use 7 by 7 convolutions.
The discriminator is a series of 4 by 4 convolutions, with batch normalization and ReLU at every layer.
Generator structure
Discriminator structure
CycleGAN Approach & Evaluation
For our final CycleGAN, we used a modified architecture based on the network specified in the paper. We replaced all the batch normalization layers with instance normalization, except in the ResNet blocks, because instance normalization made more sense to us with a batch size of 1. We experimented with ReLU and logistic (sigmoid) activations at the output layer, but both slowed generator learning considerably, leading the model to generate mostly black images even after 50 epochs. Thus, our final model has no activation at the output layers.
In terms of hyperparameters, we mostly adhered to the ones specified in the paper. However, the paper lists two learning rates without specifying the schedule used to switch between them, so we stuck with just one, choosing the higher rate of 0.0002.
One of the biggest changes we made, and the one that seemed to improve generator learning the most, was scheduling the discriminators' updates: by backpropagating the discriminator loss only every 5 epochs, the generator produced fairly good images after a total of 50 epochs of training, whereas before it took more than 100 epochs.
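As a rough sketch, this is how that schedule might look in the training loop. It reuses the loss helpers from the sketch above and the Generator and Discriminator modules sketched after the architecture lists below; loader is a placeholder for an unpaired data loader, and the Adam optimizer setup is our assumption (only the 0.0002 learning rate comes from our setup above).

```python
import itertools
import torch

G_ref2tgt, G_tgt2ref = Generator(), Generator()      # sketched below
D_ref, D_tgt = Discriminator(), Discriminator()      # sketched below

# One optimizer per side, both at the 0.0002 learning rate we kept.
opt_G = torch.optim.Adam(
    itertools.chain(G_ref2tgt.parameters(), G_tgt2ref.parameters()), lr=2e-4)
opt_D = torch.optim.Adam(
    itertools.chain(D_ref.parameters(), D_tgt.parameters()), lr=2e-4)

for epoch in range(50):
    for real_ref, real_tgt in loader:   # placeholder: unpaired batches, batch size 1
        # The generators update on every batch.
        opt_G.zero_grad()
        loss_G = (generator_loss(G_ref2tgt, G_tgt2ref, D_tgt, real_ref) +
                  generator_loss(G_tgt2ref, G_ref2tgt, D_ref, real_tgt))
        loss_G.backward()
        opt_G.step()

        # The discriminators only update every 5th epoch.
        if epoch % 5 == 0:
            opt_D.zero_grad()
            loss_D = (discriminator_loss(D_tgt, real_tgt, G_ref2tgt(real_ref)) +
                      discriminator_loss(D_ref, real_ref, G_tgt2ref(real_tgt)))
            loss_D.backward()
            opt_D.step()
```

The final architecture we used is below, each list followed by a rough code sketch: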
Generator:
1 layer 7*7 convolution (stride=1, padding=3), InstanceNorm, and ReLU
2 layers of 3*3 convolution (stride=2, padding=1), InstanceNorm, and ReLU
6 Resnet blocks: 2 layers of 3*3 convolution (stride=1, padding=1), BatchNorm, and ReLU
2 layers of 3*3 deconvolution (stride=2, padding=1), InstanceNorm, and ReLU
1 layer 7*7 convolution (stride=1, padding=3), with no output activation
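A minimal PyTorch sketch of this generator. The channel widths (64 doubling to 256) are our assumption following the paper's defaults, since the list above does not specify them, and the output_padding on the deconvolutions is likewise our assumption to make the upsampled sizes line up:

```python
import torch.nn as nn

class ResnetBlock(nn.Module):
    """Two 3x3 convs with BatchNorm and ReLU, plus a skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(ch, ch, 3, stride=1, padding=1),
            nn.BatchNorm2d(ch), nn.ReLU(True),
            nn.Conv2d(ch, ch, 3, stride=1, padding=1),
            nn.BatchNorm2d(ch))
    def forward(self, x):
        return x + self.block(x)

class Generator(nn.Module):
    def __init__(self, in_ch=3, base=64):   # channel widths assumed from the paper
        super().__init__()
        layers = [nn.Conv2d(in_ch, base, 7, stride=1, padding=3),
                  nn.InstanceNorm2d(base), nn.ReLU(True)]
        # Two stride-2 downsampling convs: 64 -> 128 -> 256 channels.
        ch = base
        for _ in range(2):
            layers += [nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1),
                       nn.InstanceNorm2d(ch * 2), nn.ReLU(True)]
            ch *= 2
        # Six ResNet blocks, which keep their BatchNorm as noted above.
        layers += [ResnetBlock(ch) for _ in range(6)]
        # Two stride-2 deconvs (transposed convs) to upsample back.
        for _ in range(2):
            layers += [nn.ConvTranspose2d(ch, ch // 2, 3, stride=2,
                                          padding=1, output_padding=1),
                       nn.InstanceNorm2d(ch // 2), nn.ReLU(True)]
            ch //= 2
        # Final 7x7 conv with no activation, per the discussion above.
        layers += [nn.Conv2d(ch, in_ch, 7, stride=1, padding=3)]
        self.net = nn.Sequential(*layers)
    def forward(self, x):
        return self.net(x)
```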
Discriminator:
1 layer of 4*4 convolutions (stride=2, padding=1), InstanceNorm, and ReLU
3 layers of 4*4 convolutions (stride=2, padding=2), InstanceNorm, and ReLU
1 layer of 4*4 convolutions (stride=1)
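And a matching sketch of the discriminator, again with channel widths (64 doubling to 512) assumed from the paper rather than stated in the list above:

```python
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, in_ch=3, base=64):   # channel widths assumed from the paper
        super().__init__()
        layers = [nn.Conv2d(in_ch, base, 4, stride=2, padding=1),
                  nn.InstanceNorm2d(base), nn.ReLU(True)]
        # Three more stride-2 4x4 convs: 64 -> 128 -> 256 -> 512 channels.
        ch = base
        for _ in range(3):
            layers += [nn.Conv2d(ch, ch * 2, 4, stride=2, padding=2),
                       nn.InstanceNorm2d(ch * 2), nn.ReLU(True)]
            ch *= 2
        # Final 4x4 conv produces a single-channel map of real/fake scores.
        layers += [nn.Conv2d(ch, 1, 4, stride=1)]
        self.net = nn.Sequential(*layers)
    def forward(self, x):
        return self.net(x)
```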
Below we have included our training loss plots for the generators and discriminators. While we did track these losses, we mainly evaluated our CycleGAN manually, by inspecting the generated images, because, as the plots show, the discriminator loss varies widely across epochs.
Training loss for generators
Training loss for the reference class discriminator
Training loss for the target class discriminator
CycleGAN Results
Below we have included some sample disgust images generated by our CycleGAN. Although the images are not perfect, we can see that the model is learning some characteristics of disgust faces and attempting to apply them to neutral faces. For example, disgust faces often have wrinkles by the nose, around the mouth, and under the eyes. They also tend to have furrowed eyebrows and a frown. You can see in the images below that the model recognizes each of these areas on neutral faces and tries to alter them to fit the disgust emotion.
Reference image
Generated image
Since many disgust faces have wrinkles under the eyes and in the cheek area, we think the model is trying to imitate that here.
Here, we think the model is trying to furrow the eyebrows of the input image.
The lines around the mouth seem to be the model trying to open the mouth, create wrinkles around it, or adjust its shape into a frown.
In this generated image, the model has wrinkled the nose and made the mouth more of a frown shape.