CycleGAN for Facial Expression Recognition
By Michelle Lin and Fatemeh Ghezloo
Abstract
Facial Expression Recognition can be a challenging task when a dataset has too little data and an imbalanced label distribution. In this project we use CycleGAN as a data augmentation method to generate synthesized images for underrepresented emotions, supplementing the training set for our classification task. By generating more image samples for the disgust class, we were able to increase the accuracy on this class by almost 10%.
Project Demo Video
Problem
Detecting and analyzing emotion from a person's facial expression is important because of its many applications, including lie detection, human-computer collaboration, data-driven animation, and human-robot communication. Since it is a hot topic in computer vision, a lot of research has been conducted on developing facial expression recognition (FER) systems. These systems enable us to classify six basic emotions from image data: anger, disgust, fear, happiness, sadness, and surprise.
The problem of detecting facial expressions seems like it could easily be solved with convolutional neural networks. In fact, convolutional neural networks have been used in facial expression recognition research for a while, but the recognition rate is inconsistent and fluctuates across classes. Most research in this area reports a lower recognition rate for emotions like disgust and fear due to limited and imbalanced datasets. One possible solution is data augmentation: generating synthesized images to supplement the training set for image classification.
Six basic expressions drawing
Dataset
We are using the FER2013 dataset from Kaggle's Facial Expression Recognition challenge. The data consist of 48x48 pixel grayscale images of faces labeled with 7 emotions. The training set consists of about 29k examples. As the class distribution below shows, some emotions, such as disgust, have far fewer examples than others.
Class distribution:
Angry: 4593 images
Disgust: 547 images
Fear: 5121 images
Happy: 8989 images
Sad: 6077 images
Surprise: 4002 images
Neutral: 6198 images
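For reference, here is a minimal sketch of how the FER2013 CSV can be loaded into a PyTorch dataset. The column names ("emotion", "pixels", "Usage") follow the Kaggle release; the exact preprocessing in our code may differ.

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset

class FER2013Dataset(Dataset):
    """Loads 48x48 grayscale faces from the Kaggle FER2013 CSV."""

    def __init__(self, csv_path, usage="Training"):
        df = pd.read_csv(csv_path)
        df = df[df["Usage"] == usage]       # Training / PublicTest / PrivateTest
        self.labels = df["emotion"].values  # 0=Angry, 1=Disgust, ..., 6=Neutral
        # Each row stores 48*48 = 2304 space-separated pixel values.
        self.images = np.stack([
            np.array(p.split(), dtype=np.uint8).reshape(48, 48)
            for p in df["pixels"]
        ])

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        img = torch.from_numpy(self.images[idx]).float().unsqueeze(0) / 255.0
        return img, int(self.labels[idx])
```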
Six facial expressions example images
Related Work & Approach
Current research in facial expression recognition focuses on recognizing seven basic and universal emotions from the human face. The imbalanced distribution among emotion classes leads to low accuracy on classes with fewer samples. To deal with this issue, many methods have been proposed, such as undersampling, synthesizing minority samples, and drawing boxes around minority classes [5].
Undersampling is a popular method for dealing with class-imbalance problems: it uses only a subset of the majority class, but its main deficiency is that many majority-class examples are ignored. A study by Nitesh Chawla et al. [4] suggests that combining undersampling of the majority class with oversampling of the minority class can achieve better performance than undersampling alone.
Another method of generating synthesized data for minority classes is to use an image-to-image translation GAN. Xinyue Zhu et al. use CycleGAN to generate images in a target class using images in the neutral class as reference [1]. The generated images were then used to augment and balance the dataset, which increased accuracy for that emotion and also for the dataset as a whole. This is the study we replicated in this project.
We used the Facial Expression Recognition (FER2013) Database to evaluate our results. We first trained a CNN on the dataset, then augmented it with generated images from our CycleGAN, retrained, and compared the accuracy. In addition to looking at overall accuracy, we also looked at the per class accuracy to see if the generated images actually improved the accuracy of our target class.
GAN
Let's back up a little bit and talk about what a GAN is and why a CycleGAN makes sense here. GAN stands for Generative Adversarial Network. The input to a GAN is random noise, and the output should be a realistic image. GANs are often used to diversify a dataset by creating more samples, which can improve accuracy. However, this requires many images of the output space to train on, which we do not have.
CycleGAN
CycleGANs are models which train two GANs on unpaired input. All you need is a set of images from your reference class and a set of images from your target class. Since the input images are unpaired, it is completely fine to have far fewer images from your target class. One GAN learns to translate images from the reference class to the target class, and the other learns to translate images from the target class to the reference class. Because we are lacking images for some emotions but have a lot of images for others, we can use a CycleGAN to perform image-to-image translation and create more images.
CycleGAN training diagram
The diagram above shows how CycleGAN training works. Essentially, we have reference and target class images as input. The reference class images are fed into the generator for the target class to generate fake target images, and the target class images are fed into the generator for the reference class to generate fake reference images. The real and fake images for both the reference and target classes are put into their respective discriminators, which try to discern whether each image is real or not. The loss that is backpropagated through the generators is a combination of cycle consistency loss (if I generate a fake target class image and use that to generate a fake reference class image, how similar is it to the input reference class image?) and GAN loss (how confident is the discriminator that I'm lying?). The loss backpropagated through the discriminators is just the mean squared error (MSE) loss.
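To make these two loss terms concrete, here is a hedged PyTorch sketch of the combined generator loss and the MSE discriminator loss described above. The function names and the cycle weight lam are our own illustrative choices, not values taken from the paper.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()   # adversarial (LSGAN-style) loss, as in [3]
l1 = nn.L1Loss()     # cycle consistency loss
lam = 10.0           # weight on the cycle term (our assumption, not from the paper)

def generator_loss(G_ref2tgt, G_tgt2ref, D_ref, D_tgt, real_ref, real_tgt):
    # Translate in both directions.
    fake_tgt = G_ref2tgt(real_ref)
    fake_ref = G_tgt2ref(real_tgt)

    # GAN loss: how confident is each discriminator that the fakes are real?
    pred_tgt, pred_ref = D_tgt(fake_tgt), D_ref(fake_ref)
    adv = mse(pred_tgt, torch.ones_like(pred_tgt)) + mse(pred_ref, torch.ones_like(pred_ref))

    # Cycle consistency: translating there and back should recover the input.
    cyc = l1(G_tgt2ref(fake_tgt), real_ref) + l1(G_ref2tgt(fake_ref), real_tgt)
    return adv + lam * cyc

def discriminator_loss(D, real, fake):
    # Real images should score 1, generated (detached) images should score 0.
    pred_real, pred_fake = D(real), D(fake.detach())
    return 0.5 * (mse(pred_real, torch.ones_like(pred_real)) +
                  mse(pred_fake, torch.zeros_like(pred_fake)))
```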
CycleGAN Paper Architecture
This is the architecture of the CycleGAN we replicated from the paper. Every layer of the generator network is either a convolution or a deconvolution, followed by batch normalization and ReLU. The middle layers mostly use 3x3 convolutions, while the input and output layers use 7x7 convolutions.
The discriminator is a series of 4x4 convolutions with batch normalization and ReLU at every layer.
Generator structure
Discriminator structure
CycleGAN Approach & Evaluation
For our final CycleGAN, we used a modified architecture based on the network specified in the paper. We replaced all the batch normalization layers with instance normalization, except in the ResNet blocks, because we thought it made more sense with a batch size of 1. We experimented with ReLU and logistic activations at the output layer, but both of these slowed down generator learning considerably, leading the model to generate mostly black images even after 50 epochs. Thus, our final model has no activation at the output layers.
In terms of hyperparameters, we mostly adhered to the ones specified in the paper. However, the paper lists two learning rates without the schedule used to switch between them, so we stuck with a single learning rate, choosing the higher one of 0.0002.
One of the biggest changes we made that seemed to improve generator learning was scheduling the discriminator's loss backpropagation. Only backpropagating the discriminator loss every 5 epochs allowed the generator to produce fairly good images after training for a total of 50 epochs, whereas before it took more than 100 epochs. The final architecture we used is below:
Generator:
1 layer 7*7 convolution (stride=1, padding=3), InstanceNorm, and ReLU
2 layers of 3*3 convolution (stride=2, padding=1), InstanceNorm, and ReLU
6 Resnet blocks: 2 layers of 3*3 convolution (stride=1, padding=1), BatchNorm, and ReLU
2 layers of 3*3 deconvolution (stride=2, padding=1), InstanceNorm, and ReLU
1 layer 7*7 convolution (stride=1, padding=3)
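A PyTorch sketch of this generator is below. The channel widths (64/128/256) and the output_padding on the deconvolutions are our assumptions; the input is a single-channel 48x48 image.

```python
import torch.nn as nn

class ResnetBlock(nn.Module):
    # BatchNorm is kept inside the ResNet blocks, as described above.
    def __init__(self, ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(ch, ch, 3, stride=1, padding=1), nn.BatchNorm2d(ch), nn.ReLU(True),
            nn.Conv2d(ch, ch, 3, stride=1, padding=1), nn.BatchNorm2d(ch))

    def forward(self, x):
        return x + self.block(x)

class Generator(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        layers = [nn.Conv2d(1, ch, 7, stride=1, padding=3), nn.InstanceNorm2d(ch), nn.ReLU(True)]
        # Two stride-2 downsampling convolutions: 48x48 -> 24x24 -> 12x12.
        for i in range(2):
            layers += [nn.Conv2d(ch * 2**i, ch * 2**(i + 1), 3, stride=2, padding=1),
                       nn.InstanceNorm2d(ch * 2**(i + 1)), nn.ReLU(True)]
        # Six residual blocks at the bottleneck resolution.
        layers += [ResnetBlock(ch * 4) for _ in range(6)]
        # Two stride-2 deconvolutions back up to 48x48.
        for i in range(2, 0, -1):
            layers += [nn.ConvTranspose2d(ch * 2**i, ch * 2**(i - 1), 3, stride=2,
                                          padding=1, output_padding=1),
                       nn.InstanceNorm2d(ch * 2**(i - 1)), nn.ReLU(True)]
        # Final 7x7 convolution with no normalization or activation.
        layers += [nn.Conv2d(ch, 1, 7, stride=1, padding=3)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```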
Discriminator:
1 layer of 4*4 convolutions (stride=2, padding=1), InstanceNorm, and ReLU
3 layers of 4*4 convolutions (stride=2, padding=2), InstanceNorm, and ReLU
1 layer of 4*4 convolutions (stride=1)
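And a corresponding discriminator sketch under the same assumptions (the channel widths are ours; the output is a small grid of real/fake scores fed into the MSE loss):

```python
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            # First 4x4 stride-2 convolution.
            nn.Conv2d(1, ch, 4, stride=2, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(True),
            # Three more 4x4 stride-2 convolutions.
            nn.Conv2d(ch, ch * 2, 4, stride=2, padding=2), nn.InstanceNorm2d(ch * 2), nn.ReLU(True),
            nn.Conv2d(ch * 2, ch * 4, 4, stride=2, padding=2), nn.InstanceNorm2d(ch * 4), nn.ReLU(True),
            nn.Conv2d(ch * 4, ch * 8, 4, stride=2, padding=2), nn.InstanceNorm2d(ch * 8), nn.ReLU(True),
            # Final 4x4 stride-1 convolution producing the real/fake scores.
            nn.Conv2d(ch * 8, 1, 4, stride=1))

    def forward(self, x):
        return self.net(x)
```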
Below we have included our training loss plots for the generators and discriminators. While we did keep track of loss, we mainly evaluated our CycleGAN manually by looking at the images being generated. This is because, as you can see, the discriminator loss varies widely across epochs.
Training loss for generators
Training loss for the reference class discriminator
Training loss for the target class discriminator
CycleGAN Results
Below we have included some sample disgust images which were generated by our CycleGAN. Although the images are not perfect, we can see that the model is learning some characteristics of disgust faces and attempting to apply them to neutral faces. For example, disgust faces often have wrinkles by the nose, around the mouth, and under the eyes. They also tend to have furrowed eyebrows and a frown. You can see in the images below that the model recognizes each of these areas on neutral faces and tries to alter them to fit the disgust emotion.
Reference image ➞ Generated image
Since many faces of disgust have wrinkles under the eyes and in the cheek area, we think this is the model trying to imitate that.
Here, we think this is the model trying to alter the input image to furrow the eyebrows.
The lines around the mouth seem to be the model trying to open the mouth, create wrinkles around it, or adjust the shape into a frown.
In this generated image, the model has wrinkled the nose and made the mouth more of a frown shape.
CNN Approach
For the emotion classification task we implemented the CNN suggested by the paper. It consists of two convolution layers and two fully connected layers, followed by a softmax for classification. Each convolution layer is followed by max pooling, normalization, and a ReLU activation.
The table below contains the detailed layer configurations. The accuracy of this model was 56%, which is much lower than the accuracy reported in the paper, so we tried modifying the paper's network by adding one more convolution layer. We also added dropout and ReLU layers among the fully connected layers. This increased the accuracy to 60%.
CNN Structure
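A rough sketch of our modified classifier is below. The filter counts, hidden size, and dropout rate are illustrative assumptions rather than the exact values from our final network; the block order (conv, max pool, normalization, ReLU) follows the description above.

```python
import torch.nn as nn

class EmotionCNN(nn.Module):
    """Three conv blocks (conv -> max pool -> batch norm -> ReLU) plus two FC layers."""

    def __init__(self, num_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.MaxPool2d(2), nn.BatchNorm2d(32), nn.ReLU(True),
            nn.Conv2d(32, 64, 3, padding=1), nn.MaxPool2d(2), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.Conv2d(64, 128, 3, padding=1), nn.MaxPool2d(2), nn.BatchNorm2d(128), nn.ReLU(True))
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 6 * 6, 256), nn.ReLU(True), nn.Dropout(0.5),
            nn.Linear(256, num_classes))  # softmax is applied inside CrossEntropyLoss

    def forward(self, x):
        return self.classifier(self.features(x))
```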
CNN Evaluation & Results
On the left you can see the plots and a table of per-class accuracy for the CNN model before data augmentation, and on the right the same information for the CNN after data augmentation. As shown in the left table, the CNN model performs poorly at detecting the disgust expression, correctly detecting only 2 out of 177 images. We also see a lot of fluctuation in the test loss and test accuracy plots. After augmenting the disgust class, we observed that the training loss drops faster than before and there is less fluctuation in the test loss and test accuracy plots. In the paper, they mention that they were able to get a 5-10% accuracy improvement on the whole dataset after augmentation. While we were not able to replicate these results, since overall accuracy stayed almost the same after augmentation, the accuracy on the disgust class improved by about 10 percent.
Before Data Augmentation
After Data Augmentation
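For completeness, the per-class accuracies above can be tallied from model predictions with a short helper like the following. This is a generic sketch, not our exact evaluation script.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, num_classes=7):
    """Fraction of correctly classified examples within each emotion class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return {c: float((y_pred[y_true == c] == c).mean())
            for c in range(num_classes) if (y_true == c).any()}
```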
Discussion
There were many challenges we encountered while doing this project. Some challenges were due to inconsistencies in the paper. For example, the CNN architecture, hyperparameters, and training set size they used gave us a much worse test accuracy than reported in the paper, though this may also have been due to their confusing notation for CNNs, which forced us to make some assumptions. There was also no padding information at all for any of the networks, so we had to make some educated guesses there as well. For both the generator and the discriminator, the lack of padding information resulted in some incorrect output sizes at the beginning. At first, we used a CycleGAN model that matched the architecture given in the paper as closely as possible. With this architecture, the model was still generating pixelated, mostly black images after training for 50 epochs.
We were also very confused by their use of batch normalization and ReLU at the output layers for the generator and discriminator. Since our output space was 0 to 255, it did not make sense to us to normalize the output. We were similarly confused about the output layer of our discriminator. It was unclear why batch normalization was even chosen, as the batch size specified in the paper for the CycleGAN was 1. To make sure our output was in the right space, we tried removing any normalization at output layers. This resulted in our model learning to generate clearer images much faster, and the generated images now looked more realistic. The original paper used ReLU but we also experimented with logistic activation and no activation. In the end, we decided to use no activation because the other activations resulted in slower learning.
Beyond the paper, we also struggled with GANs themselves. GANs are notorious for being hard to evaluate automatically, so we had to manually look at the generated images to determine when to stop training. The loss jumps all over the place as the generator tries to fool the discriminator and the discriminator tries to see past its tricks, which makes it a poor measurement of how good the model is. One of the biggest issues we faced after getting our generator to use the right output space was that the discriminator was learning too quickly. This resulted in low discriminator loss, which meant that updates to the generator were small, slowing down generator learning. To mitigate this, we tried using transformations to augment the data and delaying/lessening discriminator learning by only backpropagating the discriminator loss every 5 epochs. These modifications seemed to help prevent the model from simply copying the input and helped it generate more diverse output.
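To make the scheduling concrete, here is a hedged sketch of one training epoch with the every-5-epochs discriminator update described above. The models, optimizers, data loader, and loss functions are assumed to be defined as in the earlier sketches; this is not our exact training script.

```python
def train_one_epoch(epoch, loader, G_ref2tgt, G_tgt2ref, D_ref, D_tgt,
                    opt_G, opt_D, generator_loss, discriminator_loss):
    """One epoch of CycleGAN training with delayed discriminator updates."""
    for real_ref, real_tgt in loader:
        # Generators are updated on every batch.
        opt_G.zero_grad()
        g_loss = generator_loss(G_ref2tgt, G_tgt2ref, D_ref, D_tgt, real_ref, real_tgt)
        g_loss.backward()
        opt_G.step()

        # Discriminator loss is only backpropagated every 5th epoch, so the
        # discriminators do not overpower the generators early in training.
        if epoch % 5 == 0:
            opt_D.zero_grad()
            d_loss = (discriminator_loss(D_tgt, real_tgt, G_ref2tgt(real_ref)) +
                      discriminator_loss(D_ref, real_ref, G_tgt2ref(real_tgt)))
            d_loss.backward()
            opt_D.step()
```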
Future Work
In this project, we learned how important it is to have clear architecture and hyperparameters reported. Without them, we struggled to replicate some of the paper's results and had to make many educated guesses. We also learned a lot about CycleGANs, how to implement them, and how to deal with the unique issues that come with using one. Given more time, we would have liked to train for longer, augment a different emotion, and train on a different dataset. We reached the Colab GPU usage limit and were not able to train for as many epochs as we wanted. The above images were generated by models trained for fewer than 100 epochs, but most literature we read about CycleGANs mentioned training for at least 200 epochs. We would also have liked to try augmenting a different emotion, such as fear, which also has low accuracy on the FER2013 dataset. In the original paper, the authors also used the CycleGAN to augment other datasets such as JAFFE. With more time, it would have been interesting to see how well the model generalizes to different datasets and what other unique challenges we would have to tackle.
Code
Try running our code and experimenting with the CycleGAN yourself. You can find instructions and a video demonstrating how to run the code here: https://github.com/1MmM1/CycleGAN_for_FER
References
[1] Zhu, Xinyue, et al. "Emotion classification with data augmentation using generative adversarial networks." Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2018, pp. 349-360.
[2] Zhu, Jun-Yan, et al. "Unpaired image-to-image translation using cycle-consistent adversarial networks." Proceedings of the IEEE international conference on computer vision. 2017.
[3] Mao, Xudong, et al. "Least squares generative adversarial networks." Proceedings of the IEEE international conference on computer vision. 2017.
[4] Chawla, Nitesh V., et al. "SMOTE: synthetic minority over-sampling technique." Journal of artificial intelligence research 16 (2002): 321-357.
[5] Goh, Siong Thye, and Cynthia Rudin. "Box drawings for learning with imbalanced data." Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 2014.
Image sources:
Header: https://acart.com/wp-content/uploads/2017/04/facial-recognition-img2.jpg
Six facial expressions drawing: https://miro.medium.com/max/1200/1*L93DIgkABLyRb0tX08T-MA.jpeg
Six facial expressions examples images: https://cbim.rutgers.edu/component/content/article?id=141:expression-recognition
CycleGAN training diagram: https://arxiv.org/abs/1711.00648
Generator structure: https://arxiv.org/abs/1711.00648
Discriminator structure: https://arxiv.org/abs/1711.00648
CNN structure: https://arxiv.org/abs/1711.00648