Learning with limited target data
Best Viewed at https://sites.google.com/andrew.cmu.edu/multigan-distillation (CMU access)
Code available here
While GANs have shown great success in achieving photo-realism when trained on large and diverse datasets, they fail in the limited-data scenario. In this series of experiments, we consider the setting where we have only a limited set of "target" data on which we aim to train a GAN. As we show empirically, training a GAN from scratch on such a limited dataset leads to issues such as mode collapse. Moreover, one cannot expect diversity in the generated images when training on a small set of real images.
One way to mitigate these issues is by transferring prior knowledge of the real data distribution from a pretrained "source" GAN. While fine-tuning a pretrained GAN is one solution, we observe that the generated results are of inferior quality. Thus, we aim to explore solutions beyond finetuning.
In particular, we study a "mining" technique [18] that mines the most useful subspace of the source latent distribution to transfer to the target dataset. The overarching objective is to fine-tune the source GAN on the target data while incorporating mining, which results in superior transferability. In the next series of experiments, we motivate and study this mining solution.
In the following sections, we experiment with fine-tuning pre-trained generators on small target datasets. In particular, we experiment with StyleGAN-2 based generators trained on FFHQ Faces and LSUN Cats. The architecture of StyleGAN is described in greater detail in the next topic, "Learning with no target data".
We demonstrate different training strategies, such as naive fine-tuning and the use of mining networks that facilitate stable fine-tuning of generators in an adversarial setting. Subsequently, we also discuss strategies for fine-tuning with multiple pre-trained generators.
In this section, we first study the effect of mining in the presence of a single source GAN. We identify two scenarios: on-manifold and off-manifold learning. The on-manifold setting corresponds to the scenario where the target data belongs to the same distribution as the source data (on which the source GAN was trained). For example, if the source GAN is trained to generate human faces, then a specific category of faces (e.g., blond hair) would lie on the manifold of faces. An example of off-manifold learning would be the case where the source model is trained on human faces, but the target data contains images of cats (which have some related features, such as facial structure, and some unrelated features, such as a tail).
First, we motivate why transfer learning is required and, in particular, why a mining strategy should be used. Then we show empirical results and provide our interpretation of the results. For all our experiments, we use the StyleGAN-2 as the source model.
To test the generalizability of GAN training in the limited-data scenario, we first naively train a StyleGAN-2 model from scratch on a small dataset consisting of 160 cat faces from the AFHQ dataset. Fig. 2.2 shows sample results from the trained GAN. Clearly, training suffered from mode collapse, indicating that StyleGAN-2 does not generalize to this limited-data scenario.
Figure 2.2. Images generated by a StyleGAN-2 model trained from scratch on 160 AFHQ Cat faces. FID score: 91.55.
Training a GAN from scratch is a very demanding process. Instead, one would prefer to leverage pre-trained GANs and fine-tune them on the available target dataset. Depending on the available target data, we explore the following two scenarios: on-manifold learning and off-manifold learning.
This setting assumes that there is a significant overlap between the (source) distribution of the pre-trained GAN and that of the (limited) target dataset. To construct this scenario, we select a small subset of blond face images from the CelebA dataset (Blonde-1k) and use it to fine-tune a StyleGAN-2 generator pre-trained on FFHQ faces. Note that blond faces can be considered a subset of the FFHQ faces distribution.
Figure 2.3.1 illustrates the results of fine-tuning this model on Blonde-1k data. We can clearly see that the model is able to capture some low-level facial details, but it lacks photo-realism.
Figure 2.3.1: Images generated by the StyleGAN2 model fine-tuned on Blonde-1k data. FID score: 41.31.
In the off-manifold setting, there is an almost negligible overlap between the source generator and the target data distribution. For example, the source generator was trained on face images and we want to fine-tune it to generate cat faces. To test this setting, we fine-tune the StyleGAN-2 faces generator on a small subset (160 images) of cat images from AFHQ Cats. The generated results are visualized in Fig. 2.3.2 below. Note that although the model generates cat images, there is limited diversity.
Figure 2.3.2: Generated results on finetuning the StyleGAN2-faces model on AFHQ-cat data (Off-manifold setting). FID score: 54.85.
Even in the on-manifold case, the target distribution might differ from the source GAN distribution. For instance, in the case of generating blond faces, the target faces correspond to a very tiny subspace in the FFHQ faces distribution. Thus, to better approximate the target distribution, we adopt the mining strategy from MineGAN [18].
Fig. 2.4 describes the approach in detail. The mining operation (Miner) is implemented as a small MLP M that transforms the input latent vector to a more suitable prior that describes the regions closely aligning with the target distribution. In particular, we sample a random vector u ~ N(0,1), map it to the relevant regions of the generator's input space: z = M(u), and then the image is generated as G(z).
Effectively, in the presence of the miner network, the generated image is G(M(u)) as opposed to G(z) when trained without the miner. As shown in Fig. 2.4b, this enables generating a subset of the source data that most closely resembles the target dataset. In this manner, the network receives a better initialization to be fine-tuned on the available target data.
Figure 2.4: Learning a miner network. (a) A small MLP M is employed before the generator that mines the prior distribution which is most promising w.r.t the target data. Here, the source GAN is trained on the FFHQ faces dataset, while the target data is chosen as the 1k-Blond faces subset of CelebA dataset. The miner network learns a transformation from the input u-space to a subset of the z-space. (b) A characteristic feature of the miner network is that it enables mining for the subset of the source dataset that closely resembles the target dataset.
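To make the mining operation concrete, below is a minimal PyTorch sketch of such a miner network and the sampling procedure G(M(u)). The layer widths and depth are illustrative assumptions, and the 512-dimensional latent space simply matches StyleGAN-2's z-space; this is not the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class Miner(nn.Module):
    """Small MLP that maps a random vector u to a mined latent code z = M(u).

    Hypothetical sketch: the width and depth below are assumptions, not the
    exact architecture from MineGAN [18] or our experiments.
    """
    def __init__(self, dim=512, hidden=512, num_layers=4):
        super().__init__()
        layers, in_dim = [], dim
        for _ in range(num_layers - 1):
            layers += [nn.Linear(in_dim, hidden), nn.LeakyReLU(0.2)]
            in_dim = hidden
        layers.append(nn.Linear(in_dim, dim))
        self.net = nn.Sequential(*layers)

    def forward(self, u):
        return self.net(u)

# Sampling with the miner: u ~ N(0, I) is mapped to z = M(u), and the
# pre-trained generator then produces the image G(M(u)).
miner = Miner(dim=512)
u = torch.randn(8, 512)      # batch of random vectors
z = miner(u)                 # mined latent codes
# images = generator(z)      # `generator` is a pre-trained StyleGAN-2 (not shown)
```

In MineGAN [18], the miner is first trained (together with the discriminator) while the generator is kept fixed, and all components are then fine-tuned jointly with the adversarial loss on the target data.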
As described above, we train the miner setup on Blonde-1k data with a StyleGAN-2 model pre-trained on the FFHQ Faces dataset. The generated results are visualized in Fig. 2.4.1 below. Comparing these with the fine-tuning experiment (Fig. 2.3.1), we see that the miner network captures more diversity in blond faces and generates more photo-realistic results.
Figure 2.4.1: Generated results on learning with the Miner Strategy on Blonde-1k data (On-manifold setting). FID score: 31.18.
To test the miner network's performance in the off-manifold setting, we fine-tune the FFHQ-Faces-trained GAN on the 160 cat faces from the AFHQ Cats dataset. The outputs are visualized in Fig. 2.4.2 below. Clearly, these outputs exhibit more diversity than those in Fig. 2.3.2, corroborating the benefit of the miner network.
Figure 2.4.2: Generated results on learning with the Miner Strategy on AFHQ-Cat data (Off-manifold setting). FID score: 59.23.
Table 2.5.1 compares the FID scores of the experiments performed in Sec. 2. Based on these results, we can see that the model trained with the miner network achieves a lower FID score in the on-manifold setting. For the off-manifold setting, while it has a slightly higher FID than naive fine-tuning, the qualitative results are still promising (reasonably diverse images capturing the target distribution).
We believe that the slight discrepancy in the trend for the off-manifold setting is due to the presence of a mapping network in StyleGAN-based architectures. Specifically, the mapping network in StyleGAN could mitigate the explicit need for a miner. Moreover, treating StyleGAN as a black-box generator for downstream tasks may not always be practical, as we discuss further in the next topic (Learning with no target data).
Table 2.5.1: Quantitative comparison of metrics on StyleGAN-2
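For completeness, here is a minimal sketch of how FID numbers like the ones reported above can be computed between a folder of generated samples and the real target images. We use the third-party clean-fid package purely for illustration; the directory names are placeholders, not the paths from our experiments.

```python
# Hypothetical sketch: compute FID between generated samples and the real
# target images, both saved as image folders on disk.
# Requires the clean-fid package (pip install clean-fid).
from cleanfid import fid

score = fid.compute_fid("samples/miner_blonde1k", "data/celeba_blonde1k")
print(f"FID: {score:.2f}")
```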
We now extend the mining idea to the case of multiple source GANs. As we saw in Sec. 2.4, finetuning with the mining strategy improves the quality of results as compared to training from scratch. We attribute the success to the transfer of prior knowledge of real data. Therefore, it makes sense to employ mining with a more diverse collection of datasets in the form of multiple pre-trained source models.
However, the mere presence of multiple datasets does not necessarily entail improved performance. For instance, in many Domain Adaptation and Transfer Learning scenarios [5, 6], one observes negative transfer, i.e., performance degradation in comparison to naive baselines. Here, since we have multiple sources of data, the key challenge is to selectively transfer the useful information from the available source models and discard the information that hampers generative quality.
In addition to the mining strategy, here we implement a selection mechanism inspired by [18]. The core idea is to selectively prefer those source generators that have a distribution similar to the target data during training, in the hope that it would prevent negative transfer.
In the current training paradigm, we fine-tune the source GANs on the target dataset. As described in Sec. 3.1, we want to select the most useful generator for the target data. We now describe a selector strategy, inspired by [18], that achieves this objective.
Intuitively, a source generator would be useful for the target task if it generates images that resemble the target distribution. The discriminator being trained would assign a high realness score to such generated images (since the real distribution observed by the discriminator is the target dataset). Conversely, one can argue that fake examples that receive a high realness score are the most difficult ("informative") examples that ensure the learning of a good discriminator. From this intuition, one can argue that the generator that yields images with a high realness score would be most conducive to learning the target distribution.
The selection strategy is based on this intuition and is depicted in Fig. 3.2. Specifically, given two generators G_src1 and G_src2, we first generate fake images, i.e., obtain x_1 = G_src1(M(z)) and x_2 = G_src2(M(z)). These are then passed to the discriminator to obtain the realness scores y_1 = D_tgt(x_1) and y_2 = D_tgt(x_2). We select the generator that yields the higher realness score (e.g., we choose generator 1 if y_1 > y_2), and its images are used for backpropagating the adversarial loss. Note that both GANs are eventually fine-tuned. During training, we also maintain an estimate of the selection probability of each generator. This estimate is used during inference to probabilistically sample a generator and subsequently sample an image from the selected generator.
Fig. 3.2 illustrates a sample scenario where we learn a selector to generate faces with blond hair, given two source generators for human and cat faces. One would find the human faces generator to be more apt for this task, and therefore the selector prioritizes sampling from this generator. Note that we still retain the miner network which mines for the most salient regions in the input latent space.
Figure 3.2. We implement a selection strategy for mining the most informative source generator. The images with the argmax realness score are selected for backpropagating the adversarial loss.
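A minimal PyTorch-style sketch of one training step of this selection mechanism is given below. The function, the 512-dimensional latent space, and the non-saturating adversarial loss are our own illustrative assumptions; the actual implementation details (loss terms, how selection statistics are accumulated) may differ.

```python
import torch
import torch.nn.functional as F

def selector_step(generators, miner, discriminator, sel_counts):
    """One generator-side training step with the selection strategy (sketch).

    `generators` is a list of pre-trained source generators, `miner` the
    shared miner network, `discriminator` the discriminator trained on the
    target data, and `sel_counts` a running (float) count of how often each
    generator has been selected.
    """
    u = torch.randn(16, 512)                 # u ~ N(0, I)
    z = miner(u)                             # mined latent codes

    # Generate fakes from every source generator and score their realness.
    fakes = [G(z) for G in generators]
    scores = torch.stack([discriminator(x).mean() for x in fakes])

    # Select the generator whose samples the discriminator finds most realistic.
    idx = int(scores.argmax())
    sel_counts[idx] += 1

    # Only the selected generator's images are used for the adversarial loss
    # (non-saturating GAN loss here, as an assumption).
    g_loss = F.softplus(-discriminator(fakes[idx])).mean()
    return g_loss, idx

# At inference time, the normalized selection counts act as a sampling
# distribution over the generators, e.g.:
#   probs = sel_counts / sel_counts.sum()
#   G = generators[int(torch.multinomial(probs, 1))]
#   image = G(miner(torch.randn(1, 512)))
```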
To test the feasibility of the above approach, we conduct an elementary analysis of using this selector strategy to learn our target distribution using FFHQ Faces (source 1) & LSUN Cats (source 2) generator models.
In our first two experiments, we learn a selector to fit two small subsets of the CelebA dataset comprising people with blond hair (Blonde-1k) and people with eyeglasses (Eyeglasses-1k), respectively. Fig. 3.3a,c show generated samples after training on the target distributions Blonde-1k and Eyeglasses-1k, respectively. We observe high-fidelity generation of the expected target distribution. Furthermore, we observe (Fig. 3.3b,d) that during training, the selector learns to choose the first source generator (FFHQ Faces), which is evidently closer to the target data than the Cats generator.
Figure 3.3 (a): Generated results on learning with the MultiGAN Strategy on Blonde-1k data. FID score: 35.20
Figure 3.3 (b): Plots showing the selector probabilities of the two generators based on target distribution.
Figure 3.3 (c): Generated results on learning with the MultiGAN Strategy on Eyeglasses-1k data. FID score: 35.45
Figure 3.3 (d): Plots showing the selector probabilities of the two generators based on target distribution.
Similarly, we train our model to fit a small target subset of cat images from the LSUN dataset (Fig. 3.3e). In this setting, the selector learns to choose the second generator (StyleGAN2-Cats), which clearly resembles the target distribution. While the generated images are reasonable, we expected better quality, since here the target data is a subset of one of the source datasets. We hypothesize that the slight degradation is due to negative transfer between the two source distributions (FFHQ Faces and LSUN Cats); this is verified numerically in the next section.
Figure 3.3 (e): Generated results on learning with the MultiGAN Strategy on 1000 images from LSUN-Cats data. FID score: 63.22
Figure 3.3 (f): Plots showing the selector probabilities of the two generators based on target distribution.
Now we compare the models trained above with some quantitative results (FID score). We have three models:
finetune: the pre-trained model is simply fine-tuned to the target dataset without any mining strategy as in Sec. 2.3.
with miner: the pre-trained model is augmented with a miner network and then fine-tuned as described in Sec. 2.4.
MultiGAN: the extension to multiple source GANs including the selection strategy as described in Sec. 3.2.
Table 3.4 summarizes the FID score of various models where the source and target data are mentioned. We make the following observations:
Rows 1 & 2: The model trained with miner outperforms the vanilla fine-tuned network by a large margin (31.18 vs 41.31).
Rows 2 & 3 and Rows 4 & 5: The MultiGAN training performs slightly worse than the single-source GAN scenario. In particular, we observe that incorporating the LSUN Cats trained model leads to negative transfer. Nevertheless, the qualitative results seen in Sec. 3 are still promising.
While the selector mechanism correctly selects the right generator in the MultiGAN setup, we find that the FID score is slightly worse than when using a single GAN. We suspect this is due to negative transfer, as mentioned in Sec. 3.1. As future work, we would like to explore strategies to prevent negative transfer.
Table 3.4: Quantitative comparison of different models trained in the sections above.
[1] Hinton et al, “Distilling the Knowledge in a Neural Network”, NeurIPS Deep Learning and Representation Learning Workshop (2015).
[2] Karras et al, “Analyzing and Improving the Image Quality of StyleGAN”, arXiv:1912.04958 (2020).
[3] Addepalli et al, “DeGAN: Data-Enriching GAN for Retrieving Representative Samples”, AAAI (2020).
[4] Kurmi et al, “Domain Impression: A Source Data Free Domain Adaptation Method”, WACV (2021).
[5] Kundu et al, “Universal Source-Free Domain Adaptation”, CVPR (2020).
[6] Kundu et al, “Towards Inheritable Models for Open-Set Domain Adaptation”, CVPR (2020).
[7] Wang et al, “Adversarial Learning of Portable Student Networks”, AAAI (2018).
[8] Chen et al, “Distilling Portable Generative Adversarial Networks for Image Translation”, AAAI (2020).
[9] Wang et al, “KDGAN: Knowledge Distillation with Generative Adversarial Networks”, NeurIPS (2018).
[10] Chang et al, “TinyGAN: Distilling BigGAN for Conditional Image Generation”, ACCV (2020).
[11] Aguinaldo et al, “Compressing GANs using Knowledge Distillation”, arXiv:1902.00159 (2019).
[12] Isola et al, “Image-to-Image Translation with Conditional Adversarial Nets”, CVPR (2017).
[13] Lin et al, “Anycost GANs for Interactive Image Synthesis and Editing”, CVPR (2021).
[14] Sankaranarayanan et al, “Generate To Adapt: Aligning Domains using Generative Adversarial Networks”, CVPR (2018).
[15] Li et al, “GAN Compression: Efficient Architectures for Interactive Conditional GANs”, CVPR (2020).
[16] Li et al, “Semantic Relation Preserving Knowledge Distillation for Image-to-Image Translation”, ECCV (2020).
[17] Lopes et al, “Data-Free Knowledge Distillation for Deep Neural Networks”, NeurIPS Workshop on Learning with Limited Data (2017).
[18] Wang et al, “MineGAN: Effective Knowledge Transfer from GANs to Target Domains with Few Images”, CVPR (2020).