For our experiments in this course, we use the PASCAL VOC 2007 and MS COCO datasets in a multi-label classification setting. We set the weight of the person class to 0 in PASCAL VOC because GLIDE (the image generation model we use) does not generate humanoid objects for ethical reasons. In MS COCO, we work on the animal superclass, which contains 10 classes.
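As a minimal sketch of how the person class can be excluded from the objective, one can zero out its per-class weight in the multi-label BCE loss. The snippet below is illustrative (PyTorch), not our exact training code; the person index assumes the usual alphabetical VOC ordering.

```python
import torch
import torch.nn as nn

# PASCAL VOC 2007 has 20 object classes; in the usual alphabetical ordering
# "person" sits at index 14 -- adjust this to your own label map.
NUM_CLASSES = 20
PERSON_IDX = 14

# Per-class weights for the multi-label BCE loss: 1 everywhere except the
# person class, whose weight of 0 removes it from the loss and gradients.
class_weights = torch.ones(NUM_CLASSES)
class_weights[PERSON_IDX] = 0.0

criterion = nn.BCEWithLogitsLoss(reduction="none")

logits = torch.randn(8, NUM_CLASSES)                       # a batch of 8 predictions
targets = torch.randint(0, 2, (8, NUM_CLASSES)).float()    # multi-hot labels
loss = (criterion(logits, targets) * class_weights).mean() # person term zeroed out
```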
We train a ResNet-18 classifier for multi-label classification on the animal superclass of MS COCO, applying either 7 or 15 augmentations per image in the training data with our method. We compare it to other generative strategies as well as direct image-based augmentations. The general trend we observe is that our method does not overfit on the dataset, while the other methods (with the exception of Mixup) do. Our model also performs the best among all the augmentation techniques.
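The code below is a hedged sketch of this setup in PyTorch: a ResNet-18 with a 10-way multi-label head, plus a dataset wrapper that pairs each original image with its pre-generated augmentations. The `AugmentedAnimals` class and the `load_augmentation` hook are illustrative names, not our actual data pipeline.

```python
import torch.nn as nn
from torch.utils.data import Dataset
from torchvision.models import resnet18

NUM_ANIMAL_CLASSES = 10  # MS COCO "animal" supercategory

# ResNet-18 with a 10-way head; the sigmoid is applied inside the BCE loss.
model = resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, NUM_ANIMAL_CLASSES)

class AugmentedAnimals(Dataset):
    """Pairs every original training image with its n_aug pre-generated
    augmentations (7 or 15 in our runs). `load_augmentation` is a
    placeholder for however the synthetic images are stored."""

    def __init__(self, base_dataset, load_augmentation, n_aug=7):
        self.base = base_dataset
        self.load_augmentation = load_augmentation
        self.n_aug = n_aug

    def __len__(self):
        # each original image contributes itself plus n_aug augmented copies
        return len(self.base) * (1 + self.n_aug)

    def __getitem__(self, idx):
        orig_idx, slot = divmod(idx, 1 + self.n_aug)
        image, labels = self.base[orig_idx]
        if slot > 0:
            # synthetic copies inherit the original image's multi-hot labels
            image = self.load_augmentation(orig_idx, slot - 1)
        return image, labels
```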
We first compare our method to a vanilla ResNet without any augmentations. As expected, the vanilla ResNet overfits to the training data, while augmenting with our method yields a better mAP than using no augmentations. We also see that increasing the number of augmentations makes it harder for the model to learn: in the following graph, the model trained with more augmentations takes more steps to reach the same mAP score than the one trained with fewer.
Figure 1: Performance of our method compared to that of Vanilla ResNet with no augmentations. Test mAP comparison on the left and test loss comparison on the right.
To compare GAN-based generations to diffusion-based generations for our purposes, we compare our method to a conditional GAN-based augmentation using BigGAN [5]. BigGAN conditions its generations on ImageNet classes, so we map the COCO class labels of interest to ImageNet class labels. Unlike COCO, ImageNet does not have the giraffe and cow classes, so the conditional GAN experiment runs on only 8 classes as opposed to 10 in our method. We observe that conditional GAN-based augmentations do not perform as well as our method even though they operate on fewer classes, and that this approach does not prevent overfitting to the same degree as our method. Nevertheless, the conditional GAN augmentation technique comes closest to our model's performance.
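The snippet below sketches the kind of label mapping this requires. The specific ImageNet synsets chosen here are assumptions for illustration, and giraffe and cow are dropped because ImageNet-1k has no direct counterpart.

```python
# Illustrative COCO -> ImageNet-1k mapping used to condition BigGAN.
# The particular synset choices are assumptions, not our exact mapping.
COCO_TO_IMAGENET = {
    "bird":     "goldfinch",
    "cat":      "tabby",
    "dog":      "golden retriever",
    "horse":    "sorrel",
    "sheep":    "ram",
    "elephant": "African elephant",
    "bear":     "brown bear",
    "zebra":    "zebra",
    # "giraffe" and "cow" have no ImageNet-1k class, so they are skipped.
}

def imagenet_targets(coco_labels):
    """Keep only the COCO animal labels that have an ImageNet counterpart."""
    return [COCO_TO_IMAGENET[c] for c in coco_labels if c in COCO_TO_IMAGENET]
```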
Figure 2: Performance of our method compared to that of Vanilla ResNet with conditional GAN augmentations. Test mAP comparison on the left and test loss comparison on the right.
We also compare our method to a mixing-based augmentation method, Mixup [22]. Mixup generates augmentations by blending two images sampled from the dataset with a linear combination whose coefficient is drawn from a beta distribution; the label vectors are blended with the same coefficient. We observe that, like our method, Mixup does not overfit, but it does not perform as well as our method.
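For reference, a minimal multi-label Mixup step looks roughly like the following; alpha = 0.2 is a common default, not necessarily the value used in our runs.

```python
import numpy as np
import torch

def mixup(images, targets, alpha=0.2):
    """Mixup for a multi-label batch: blend pairs of images and their
    multi-hot label vectors with a coefficient drawn from Beta(alpha, alpha)."""
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_targets = lam * targets + (1.0 - lam) * targets[perm]
    return mixed_images, mixed_targets
```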
Figure 3: Performance of our method compared to that of Vanilla ResNet with mixup augmentations. Test mAP comparison on the left and test loss comparison on the right.
We also compare our method to another non-generative, mixing-based augmentation method, AugMix [3]. AugMix applies several augmentation chains to the same image and mixes their outputs into one image via a linear combination, which is then blended with the original image. From the following graph, we can see that AugMix is unable to reduce overfitting, which leads to a dip in test performance.
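A simplified sketch of the AugMix mixing rule is shown below; the original method also randomizes chain depth and adds a Jensen-Shannon consistency loss, and recent versions of torchvision ship a ready-made `transforms.AugMix`, so this is only meant to illustrate the idea.

```python
import numpy as np
import torch

def augmix(image, augment_chain, width=3, alpha=1.0):
    """AugMix-style mixing (simplified): apply `width` independent augmentation
    chains to the same float image tensor, combine their outputs with Dirichlet
    weights, then blend the mixture with the original image using a
    Beta-sampled skip weight. `augment_chain(image)` stands in for a randomly
    sampled sequence of basic ops (rotate, shear, posterize, ...)."""
    chain_weights = np.random.dirichlet([alpha] * width)
    skip_weight = np.random.beta(alpha, alpha)

    mixed = torch.zeros_like(image)
    for w in chain_weights:
        mixed += float(w) * augment_chain(image)
    return skip_weight * image + (1.0 - skip_weight) * mixed
```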
Figure 4: Performance of our method compared to that of Vanilla ResNet with AugMix augmentations. Test mAP comparison on the left and test loss comparison on the right.
To test our hypothesis that image context helps produce augmentations that improve the model's performance, we compare our method against augmentations generated without context. To do this, we train the ResNet on augmentations generated from captions that only specify the class label: for example, instead of generating from a caption such as "a group of three cats in a park", we generate from just "cat". Note that this experiment is still ongoing and the results presented are preliminary; stay tuned to the website for updates. As expected, no-context generation does not perform as well as augmentation with context, and its performance appears to have saturated compared to that of GLIDE with context.
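A small illustration of the two prompt styles, with a hypothetical helper name (this is not our generation code):

```python
ANIMAL_CLASSES = ["bird", "cat", "dog", "horse", "sheep",
                  "cow", "elephant", "bear", "zebra", "giraffe"]

def no_context_prompts(label_vector):
    """No-context ablation: one bare class-name prompt per positive label,
    instead of the image's full caption."""
    return [name for name, y in zip(ANIMAL_CLASSES, label_vector) if y == 1]

# With context, the generator is prompted with the caption itself, e.g.
# "a group of three cats in a park"; without context it just gets "cat".
print(no_context_prompts([0, 1, 0, 0, 0, 0, 0, 0, 0, 0]))  # -> ['cat']
```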
Figure 5: Performance of our method compared to that of Vanilla ResNet with no-context augmentations. Test mAP comparison on the left and test loss comparison on the right.
Table 1: In-domain metric results for different models
From the table above, we can see that our model with 7 augmentations performs the best overall and beats all the other methods by a significant margin. It is closely followed by our model with 15 augmentations, which in turn is followed by the conditional GAN. Among the image-based augmentations, Mixup performs the best, while AugMix performs almost as badly as the vanilla ResNet.
Table 2: Out-of-domain average precision results for different models and classes (train dataset: MS COCO; test dataset: PASCAL VOC)
We see that our model outperforms all the other approaches by a significant margin: our mAP scores are more than twice those obtained with the vanilla model, while all the other models perform similarly to the vanilla model.
To test our hypothesis that the semantic augmentation technique helps the model learn, we evaluate on the PASCAL VOC test split. Here, we do not provide the PASCAL images for training; instead, we provide only the captions corresponding to each image, along with its label. These captions were generated using the One-for-all [24] model. We compare this model's performance to that of a model trained on MS COCO with our method and tested on PASCAL. Across all experiments, we use ResNet-18 as our classifier.
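For clarity, the multi-label mAP we report can be computed roughly as follows; this is a sketch using scikit-learn, and our exact evaluation code may differ.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true, y_scores):
    """Multi-label mAP: average precision per class, then the mean over
    classes that have at least one positive example in the test split.

    y_true:   (num_images, num_classes) binary label matrix
    y_scores: (num_images, num_classes) sigmoid outputs of the classifier"""
    ap_per_class = []
    for c in range(y_true.shape[1]):
        if y_true[:, c].sum() > 0:  # skip classes with no positives
            ap_per_class.append(average_precision_score(y_true[:, c], y_scores[:, c]))
    return float(np.mean(ap_per_class))
```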
The results show that having only the textual descriptions of images from the target domain yields an mAP of 0.750 after a single epoch, compared to 0.472 for the model trained on MS COCO with our method.
Figure 6: Performance of our proof-of-concept method on the PASCAL VOC dataset. Test mAP on the left and test loss on the right.
These results suggest that a description of the kinds of images present in the dataset is enough for the model to achieve good performance, which means our generative model, GLIDE, is powerful enough to generate examples across different domains. Using captions from the target domain helps our model generalize to out-of-domain images without ever seeing any images from the target dataset.
Table 3: Number of parameters in the models used
Table 4: Hyperparameters used for different experiments