Steps to generate an augmented image:
1. Generate a text caption for the image, either with a caption generator or by using a dataset that already provides captions.
2. Paraphrase the caption. (Optional: left for future work)
3. Use a text-to-image model to generate a new image from the caption (a code sketch of these steps follows the list).
4. Send feedback to the paraphrase model, based on how well the downstream model generalizes. (Optional: left for future work)
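Below is a minimal sketch of this pipeline in Python. It assumes the Hugging Face transformers and diffusers libraries, with a BLIP captioner and Stable Diffusion as illustrative stand-ins for the image-to-text and text-to-image models; the model names and the generate_augmentation helper are our own choices for illustration, not prescribed by the method.

```python
import torch
from transformers import pipeline
from diffusers import StableDiffusionPipeline

# Step 1: caption the original image (BLIP used here as a stand-in captioner).
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# Step 3: text-to-image generator (Stable Diffusion used as a stand-in).
generator = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def paraphrase(caption: str) -> str:
    # Step 2 (optional, left for future work): rewrite the caption with a
    # pre-trained paraphrasing model. Identity mapping for now.
    return caption

def generate_augmentation(image):
    """Generate an augmented image from an existing dataset image."""
    caption = captioner(image)[0]["generated_text"]  # Step 1: image -> text
    prompt = paraphrase(caption)                     # Step 2: optional paraphrase
    augmented = generator(prompt).images[0]          # Step 3: text -> image
    return augmented, prompt
```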
Figure 2: Generation of an out-of-domain image from an existing image in the dataset.
Pre-trained models for image-to-text, text-to-image, and paraphrasing have been trained on vast amounts of internet data, which lets them capture both the content and the context of an image. For representing the semantics of image data, language provides a stronger medium than image-based encoders, and capturing image semantics as text is also natural to how humans interpret and understand visual data.
Consider the example depicted in Figure 2. Datasets generally depict camels in some kind of arid environment. Training a classification model on such a dataset without any augmentation may lead it to learn spurious correlations, such as predicting a camel whenever there is sand in the image.
Language provides an easy and natural way to augment the dataset with an out-of-domain image that helps the model. In our example, we only have to change the text caption to 'camel in a field' and generate a corresponding image using a text-to-image model, which offers a simple way to move from one domain to another. Doing the same operation purely in the image domain would require near-perfect disentanglement of the latent space, which is very hard to achieve.
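Continuing the hypothetical sketch above, the domain shift amounts to nothing more than editing the caption before generation:

```python
# Hypothetical usage of the sketch above: edit the caption to push the
# generated image out of the source domain.
caption = "a camel in the desert"                        # caption of the original image
edited = caption.replace("in the desert", "in a field")  # 'a camel in a field'
out_of_domain_image = generator(edited).images[0]
```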
Introducing language also allows us to explore the idea of curriculum learning. The images generated at the start could be close to the source domain and then move further away as learning progresses. In our example, the initial generations could correspond to 'camel in a field' and later ones to 'illustration of a camel soldier'; a possible prompt schedule is sketched below.
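One way to realize such a curriculum is to order candidate prompts by how far they stray from the source domain and advance through them as training progresses. The schedule below is a hypothetical sketch; the prompts and staging are illustrative, not a tuned recipe.

```python
# Hypothetical curriculum: prompts ordered from near-domain to far out-of-domain.
CURRICULUM = [
    "a camel in the desert",            # source domain
    "a camel in a field",               # mild domain shift
    "a camel on a snowy mountain",      # stronger shift
    "illustration of a camel soldier",  # far out-of-domain
]

def prompt_for_epoch(epoch: int, total_epochs: int) -> str:
    """Pick a prompt whose distance from the source domain grows with training progress."""
    stage = min(len(CURRICULUM) - 1,
                epoch * len(CURRICULUM) // max(total_epochs, 1))
    return CURRICULUM[stage]
```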
Another reason for using a text intermediate is to form a bridge between the limited-sized labeled dataset and the vast number of images present on the internet. This is made possible by text-to-image models, which have been trained on comparatively larger amounts of unlabeled internet data. If we can use these generative models to produce an image that carries the same label as the original image, it serves as a good augmentation of our current dataset.
Figure 3: Examples of possible augmentations for 'illustration of a camel soldier'
Diffusion models, such as the one used in GLIDE [8], learn a reverse mapping from a noisy image to a slightly less noisy one.
Figure 4: Graphical model of Diffusion Process
$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad q(x_t \mid x_{t-1}) = \mathcal{N}\!\big(x_t;\ \sqrt{\alpha_t}\,x_{t-1},\ (1-\alpha_t)I\big) \tag{1}$$
As shown in Equation (1), we create a Markov chain by progressively adding Gaussian noise to the input image. Here α_t controls the magnitude of the noise added at step t, and x_t represents the latent of the Markov chain produced by adding noise to x_{t-1}, with x_0 being the original image.
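A minimal NumPy sketch of the forward process in Equation (1), assuming a fixed per-step α_t schedule (the linear schedule at the bottom is illustrative):

```python
import numpy as np

def forward_diffusion(x0: np.ndarray, alphas: np.ndarray, rng=np.random.default_rng()):
    """Sample the Markov chain x_1..x_T of Equation (1), one noising step at a time."""
    x = x0
    chain = [x0]
    for alpha_t in alphas:
        noise = rng.standard_normal(x.shape)
        # q(x_t | x_{t-1}) = N(sqrt(alpha_t) * x_{t-1}, (1 - alpha_t) * I)
        x = np.sqrt(alpha_t) * x + np.sqrt(1.0 - alpha_t) * noise
        chain.append(x)
    return chain

# Example: 1000 steps with alpha_t close to 1, i.e. small noise per step.
alphas = 1.0 - np.linspace(1e-4, 0.02, 1000)
```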
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big) \tag{2}$$
If the noise added at each step is small enough, the true posterior q(x_{t-1} | x_t) is well approximated by a Gaussian, so we can learn a model of the form in Equation (2) to reverse the noising process.
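A corresponding sketch of one reverse step implied by Equation (2); `model` is a placeholder for a trained network that outputs the Gaussian mean μ_θ and variance Σ_θ (in practice the variance may be fixed or learned):

```python
import numpy as np

def reverse_step(model, x_t: np.ndarray, t: int, rng=np.random.default_rng()):
    """Sample x_{t-1} from p_theta(x_{t-1} | x_t) as in Equation (2)."""
    mean, var = model(x_t, t)  # placeholder for the learned mu_theta, Sigma_theta
    if t <= 1:
        return mean            # no noise is added on the final step
    noise = rng.standard_normal(x_t.shape)
    return mean + np.sqrt(var) * noise

# Sampling walks the chain backwards from pure noise x_T down to x_0.
```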
Figure 5: Diffusion Process for the CIFAR-10 Dataset
Figure 6: Data flow diagram for semantic augmentation