Generate anime faces using a DCGAN model
Explore the image synthesis process established by GANs in the paper Generative Adversarial Networks, extended here with a Deep Convolutional Generative Adversarial Network (DCGAN).
Abstract
Generative Adversarial Networks (GANs) pit two neural networks, a generator and a discriminator, against each other in a game where one network's gain is the other's loss. GANs have been used for tasks such as image synthesis, data augmentation, and improved image compression; NVIDIA, for example, has used GANs to create realistic human faces. Such networks are valuable because they let us create new data from already existing data. For this project, a Deep Convolutional Generative Adversarial Network (DCGAN) will be implemented; it still contains both a generator and a discriminator, but its architecture differs from that of a standard GAN. The generator maps a noise vector up to the size of the output image, and the discriminator performs the reverse mapping. Output consisting only of noise would indicate a faulty network structure, meaning some part of the network needs tweaking. Successful results should show some form of anime face, whether fully defined or not, rather than noise. I propose a bare-bones DCGAN capable of generating images, in order to build a better understanding of image synthesis and GANs.
Data
The dataset used in this project was the Anime Faces dataset, which can be found on Kaggle. It contains 20,000+ 64x64 RGB images of anime faces (Japanese cartoon character faces). The dataset was chosen to make the project 'interesting' and 'fun', for lack of better words, while training the model to generate anime faces.
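As a rough illustration of how such a dataset can be fed to the model, below is a minimal PyTorch loading sketch. The directory path and batch size are assumptions, not the project's actual configuration; the normalization scales pixels to [-1, 1] to match a tanh generator output.

```python
import torch
from torchvision import datasets, transforms

# Normalize to [-1, 1] so real images match the range of a tanh generator.
transform = transforms.Compose([
    transforms.Resize(64),            # dataset images are already 64x64
    transforms.CenterCrop(64),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

# "data/anime_faces" is an assumed path; ImageFolder expects at least one
# subdirectory of images beneath it.
dataset = datasets.ImageFolder("data/anime_faces", transform=transform)
loader = torch.utils.data.DataLoader(dataset, batch_size=128,
                                     shuffle=True, num_workers=2)
```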
Approach
The architecture used is the DCGAN, as mentioned, since the images are of low resolution/quality. This choice stems not only from the architecture being well suited to low-resolution images but also from an implementation and modification standpoint: a DCGAN is far less complex than a StyleGAN, which involves many components beyond a generator and a discriminator. The model contains a generator and a discriminator that play a 'game', so to speak: the generator attempts to 'fool' the discriminator, while the discriminator attempts to differentiate between fake and real images.
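To make the architecture concrete, here is a minimal sketch of the two networks in PyTorch, following the standard DCGAN recipe for 64x64 images. The 100-dimensional noise vector and the channel widths are assumptions, not necessarily the exact configuration used in this project.

```python
import torch.nn as nn

latent_dim = 100  # assumed noise-vector size

# Generator: upsample a latent vector to a 64x64 RGB image with
# transposed convolutions, BatchNorm, and ReLU, ending in tanh.
generator = nn.Sequential(
    nn.ConvTranspose2d(latent_dim, 512, 4, 1, 0, bias=False),  # 1x1 -> 4x4
    nn.BatchNorm2d(512), nn.ReLU(True),
    nn.ConvTranspose2d(512, 256, 4, 2, 1, bias=False),         # 4x4 -> 8x8
    nn.BatchNorm2d(256), nn.ReLU(True),
    nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False),         # 8x8 -> 16x16
    nn.BatchNorm2d(128), nn.ReLU(True),
    nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False),          # 16x16 -> 32x32
    nn.BatchNorm2d(64), nn.ReLU(True),
    nn.ConvTranspose2d(64, 3, 4, 2, 1, bias=False),            # 32x32 -> 64x64
    nn.Tanh(),
)

# Discriminator: mirror the generator with strided convolutions and
# LeakyReLU, ending in a single real/fake probability.
discriminator = nn.Sequential(
    nn.Conv2d(3, 64, 4, 2, 1, bias=False),                     # 64x64 -> 32x32
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(64, 128, 4, 2, 1, bias=False),                   # 32x32 -> 16x16
    nn.BatchNorm2d(128), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(128, 256, 4, 2, 1, bias=False),                  # 16x16 -> 8x8
    nn.BatchNorm2d(256), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(256, 512, 4, 2, 1, bias=False),                  # 8x8 -> 4x4
    nn.BatchNorm2d(512), nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(512, 1, 4, 1, 0, bias=False),                    # 4x4 -> 1x1
    nn.Sigmoid(),
)
```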
Images generated after about 2,000 epochs of training were produced quickly but showed some facial distortion.
Training quickly stalled: the discriminator's accuracy bottomed out at roughly 83%, showing that the generator was able to 'fool' the discriminator rather quickly.
Images shown were generated at every 1,000th epoch.
The DCGAN model quickly reached a peak in discriminator accuracy and then stalled for the remainder of training. This is a problem: the generator fooled the discriminator early on, and image quality did not progress any further.
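For reference, one alternating step of the adversarial 'game' could be sketched as follows, assuming the generator, discriminator, and latent_dim from the sketch above. The optimizer settings are the common DCGAN defaults, not necessarily those used in this project; the returned accuracy is the quantity that plateaued around ~83% here.

```python
import torch
import torch.nn as nn

criterion = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))

def train_step(real):
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator update: real images labeled 1, generated images labeled 0.
    noise = torch.randn(batch, latent_dim, 1, 1)
    fake = generator(noise)
    d_real = discriminator(real).view(batch, 1)
    d_fake = discriminator(fake.detach()).view(batch, 1)
    d_loss = criterion(d_real, ones) + criterion(d_fake, zeros)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator update: try to make the discriminator output 1 on fakes.
    g_loss = criterion(discriminator(fake).view(batch, 1), ones)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

    # Discriminator accuracy over real and fake batches combined.
    acc = ((d_real > 0.5).float().mean() + (d_fake <= 0.5).float().mean()) / 2
    return d_loss.item(), g_loss.item(), acc.item()
```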
Testing Dataset
The dataset used to measure how effective the model is relative to other models was the selfie2anime dataset, the same dataset used when comparing the FID scores of those models.
FID evaluation
Evaluation of the GAN model was done with the Frechet Inception Distance (FID), which measures the distance between the feature distributions of the real images and the generated images. A lower FID score is better, since a smaller distance means the two sets of images are more similar. The proposed DCGAN model underperformed both on its training dataset and on the dataset used to compare against other models. This could be due to not freezing the generator and training only the discriminator when training stalls.
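As a reference point, FID is computed from the means and covariances of Inception-v3 features extracted from the two image sets. Below is a minimal sketch assuming those statistics have already been computed; it is not the exact evaluation code used in this project.

```python
import numpy as np
from scipy import linalg

def fid(mu_r, sigma_r, mu_g, sigma_g):
    """FID between Gaussians fitted to Inception-v3 features of the
    real (r) and generated (g) image sets."""
    diff = mu_r - mu_g
    # Matrix square root of the product of the two covariance matrices.
    covmean = linalg.sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```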
Future Work
Additional model modifications, such as turning off training on the generator while training the discriminator, need to be explored (see the sketch after this list).
Pick a broader dataset: the current dataset contains only facial features, and it would be interesting to attempt image synthesis with backgrounds or full-body images.
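As a hypothetical sketch of the first idea above, freezing the generator in PyTorch amounts to toggling requires_grad on its parameters while discriminator updates continue:

```python
# Hypothetical sketch: pause generator updates while the discriminator catches up.
for p in generator.parameters():
    p.requires_grad = False  # generator weights stay fixed

# ... run discriminator-only training steps here until its accuracy recovers ...

for p in generator.parameters():
    p.requires_grad = True   # resume normal adversarial training
```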