CycleGAN for Facial Expression Recognition
By Michelle Lin and Fatemeh Ghezloo
Discussion
We encountered many challenges while doing this project. Some were due to inconsistencies in the paper. For example, the CNN architecture, hyperparameters, and training set size they reported gave us a much worse test accuracy than the one reported in the paper, though this may also have been due to their confusing notation for CNNs, which forced us to make some assumptions. There was also no padding information at all for any of the networks, so we had to make educated guesses there as well. For both the generator and the discriminator, the lack of padding information initially resulted in incorrect output sizes. At first, we used a CycleGAN model that matched the architecture given in the paper as closely as possible. After training this architecture for 50 epochs, the model was still generating pixelated, mostly black images.
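To make our padding guess concrete, here is a minimal PyTorch-style sketch (our own illustration; the framework, channel counts, and kernel size are assumptions, not values from the paper) of the "same"-padding convention we settled on for stride-1 convolutions:

```python
import torch
import torch.nn as nn

# Our padding assumption (not stated in the paper): for stride-1 convolutions we
# used "same" padding, i.e. padding = (kernel_size - 1) // 2, so the spatial
# size of the feature maps is preserved through the layer.
def conv_block(in_ch, out_ch, kernel_size=3, stride=1):
    padding = (kernel_size - 1) // 2
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride=stride, padding=padding),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# Sanity check on a 48x48 grayscale FER2013-sized input: the output stays 48x48.
x = torch.randn(1, 1, 48, 48)
print(conv_block(1, 64)(x).shape)  # torch.Size([1, 64, 48, 48])
```

With this convention in place, the output sizes of the generator and discriminator layers finally matched what we expected.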
We were also confused by the paper's use of batch normalization and ReLU at the output layers of the generator and discriminator. Since our output space was 0 to 255, it did not make sense to us to normalize the output, and we were similarly puzzled by the output layer of the discriminator. It was also unclear why batch normalization was chosen at all, given that the batch size specified in the paper for the CycleGAN was 1. To make sure our output was in the right space, we tried removing all normalization at the output layers. This let our model learn to generate clear images much faster, and the generated images looked more realistic. The original paper used ReLU, but we also experimented with a logistic (sigmoid) activation and with no activation at all. In the end, we chose no activation because the other activations resulted in slower learning.
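As an illustration, the sketch below contrasts the paper-style output layer (as we understood it) with the one we ended up using; the channel counts and kernel size are placeholders we chose for the example, not values from the paper:

```python
import torch.nn as nn

# Paper-style generator output layer as we understood it: a convolution followed
# by batch normalization and ReLU, which rescales and clips the 0-255 pixel range.
paper_style_output = nn.Sequential(
    nn.Conv2d(64, 1, kernel_size=7, padding=3),
    nn.BatchNorm2d(1),
    nn.ReLU(),
)

# What we switched to: a plain convolution with no normalization and no
# activation, so the generator can emit raw pixel intensities directly.
our_output = nn.Conv2d(64, 1, kernel_size=7, padding=3)
```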
Beyond the paper, we also struggled with GANs themselves. GANs are notoriously hard to evaluate automatically, so we had to inspect the generated images manually to decide when to stop training. The loss jumps around as the generator tries to fool the discriminator and the discriminator tries to see past its tricks, which makes it a poor measure of how good the model is. One of the biggest issues we faced after getting our generator to use the right output space was that the discriminator was learning too quickly. Its loss became very low, which meant the gradients flowing back to the generator were small and generator learning slowed down. To mitigate this, we tried augmenting the data with transformations and delaying/lessening discriminator learning by only backpropagating its loss every 5 epochs (sketched below). These modifications seemed to help prevent the model from simply copying the input and encouraged more diverse output.
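Below is a heavily simplified, self-contained sketch of that schedule; the tiny placeholder models, random data, and plain adversarial loss stand in for our full CycleGAN generators, discriminators, and losses:

```python
import torch
import torch.nn as nn

# Placeholder one-layer models and random data, just to make the loop runnable;
# the real project used our full CycleGAN generator and discriminator.
generator = nn.Conv2d(1, 1, 3, padding=1)
discriminator = nn.Sequential(nn.Conv2d(1, 1, 3, padding=1), nn.Flatten(), nn.Linear(48 * 48, 1))
bce = nn.BCEWithLogitsLoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

D_UPDATE_EVERY = 5  # the interval we experimented with for discriminator updates

for epoch in range(10):
    real = torch.randn(4, 1, 48, 48)             # stand-in for a batch of real faces
    fake = generator(torch.randn(4, 1, 48, 48))  # stand-in for translated images

    # Generator step: runs every epoch and tries to make the discriminator say "real".
    g_opt.zero_grad()
    g_loss = bce(discriminator(fake), torch.ones(4, 1))
    g_loss.backward()
    g_opt.step()

    # Discriminator step: its loss is only backpropagated every 5th epoch,
    # which keeps it from overpowering the generator.
    if epoch % D_UPDATE_EVERY == 0:
        d_opt.zero_grad()
        d_loss = (bce(discriminator(real), torch.ones(4, 1))
                  + bce(discriminator(fake.detach()), torch.zeros(4, 1)))
        d_loss.backward()
        d_opt.step()
```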
Future Work
In this project, we learned how important it is for a paper to clearly report its architecture and hyperparameters. Without that information, we struggled to replicate some of the paper's results and had to make many educated guesses. We also learned a lot about CycleGANs: how to implement them and how to deal with the unique issues that come with using one. Given more time, we would have liked to train for longer, augment a different emotion, and train on a different dataset. We reached the Colab GPU usage limit and were not able to train for as many epochs as we wanted. The images above were generated by models trained for fewer than 100 epochs, while most of the CycleGAN literature we read mentioned training for at least 200 epochs. We would also have liked to try augmenting a different emotion, such as fear, which also has low accuracy on the FER2013 dataset. In the original paper, the authors also used the CycleGAN to augment other datasets such as JAFFE. With more time, it would have been interesting to see how well the model generalizes to different datasets, and what other unique challenges we would have had to tackle.