Augmenting with generative models performs significantly better than just applying transformations to existing images, likely because combinations of transformations do not cover the full image manifold of the dataset.
Augmenting the dataset with generated images reduces the tendency to overfit. This holds for generations both with and without context.
Generations with context lead to higher overall performance because they capture more information than the labels of the objects in the scene alone.
Adding only textual information about the new domain [PASCAL VOC] yields a large performance boost over the zero-shot performance of models trained solely on information contained in the original data [COCO].
Use approximate nearest neighbors and a language model to learn a mapping across domains.
This can be done by first computing caption embeddings with a language model such as Sentence Transformers.
Then use a library such as Facebook AI Similarity Search (FAISS) to retrieve the captions with the most similar embeddings.
Use the matched sentence pairs as training input for a translation model that maps captions from one domain to the other; a sketch of the matching step follows below.
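A minimal sketch of the matching step, assuming COCO captions as the source domain and PASCAL VOC-style captions as the target; the encoder name, the example captions, and the exact-search index are illustrative choices, not fixed by the notes above.

```python
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

# Illustrative captions; in practice these are the full caption sets.
source_captions = [
    "a man riding a surfboard on a wave",
    "two dogs playing with a frisbee in a park",
]  # e.g. COCO
target_captions = [
    "a dog running across a grassy field",
    "a person standing on a beach",
]  # e.g. PASCAL VOC-style

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Normalize embeddings so that inner product equals cosine similarity.
src = encoder.encode(source_captions, normalize_embeddings=True)
tgt = encoder.encode(target_captions, normalize_embeddings=True)

# Exact inner-product index; for large caption sets an approximate index
# such as faiss.IndexHNSWFlat is the drop-in replacement.
index = faiss.IndexFlatIP(tgt.shape[1])
index.add(tgt.astype(np.float32))

# For each source caption, retrieve the most similar target caption.
_, ids = index.search(src.astype(np.float32), 1)

# Matched pairs become training data for the domain-translation model.
sentence_pairs = [
    (source_captions[i], target_captions[ids[i, 0]])
    for i in range(len(source_captions))
]
```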
Use a paraphrase network that receives feedback based on how much the images generated from its paraphrases improve generalization.
Generate a batch of images based on text prompts.
Train the model on this batch and measure the resulting change in accuracy.
Feed this accuracy delta back to the paraphrase model as a reward signal (a sketch follows below).
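One way to wire these steps together is sketched below; `paraphraser`, `generator`, and the training and evaluation helpers are hypothetical placeholders, and treating the accuracy delta as a reward is only one possible realization of the feedback.

```python
def augmentation_feedback_loop(paraphraser, generator, classifier,
                               base_prompts, val_loader, n_rounds=10):
    """Sketch of the generate -> train -> reward loop (steps 1-3 above)."""
    prev_acc = evaluate(classifier, val_loader)  # hypothetical helper
    for _ in range(n_rounds):
        # 1. Paraphrase the prompts and generate a labelled image batch.
        prompts = [paraphraser.sample(p) for p in base_prompts]
        images, labels = generator.generate(prompts)  # text-to-image model

        # 2. Train on the generated batch and measure the accuracy change.
        train_on_batch(classifier, images, labels)  # hypothetical helper
        acc = evaluate(classifier, val_loader)
        reward = acc - prev_acc
        prev_acc = acc

        # 3. Feed the accuracy delta back to the paraphrase model, e.g. as
        #    a reward in a policy-gradient-style update.
        paraphraser.update(prompts, reward)
```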
Curriculum-based approaches for gradual domain generalization.
Start with tasks similar to those of the original domain. One way is to check whether the paraphrases lie within some distance of the original prompts in the embedding space.
Generate images from prompts within this region of the space and use them to augment the dataset.
Repeat the previous two steps while moving further and further away from the original data points (a sketch follows below).
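A sketch of this curriculum, under the assumption that distance is measured as cosine distance between sentence embeddings; the radius schedule and the `paraphrase_fn`/`generate_images` callables are illustrative placeholders.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def cosine_distance(a, b):
    # Embeddings are unit-normalized, so 1 - dot product is cosine distance.
    return 1.0 - float(np.dot(a, b))

def curriculum_rounds(prompts, paraphrase_fn, generate_images,
                      radii=(0.1, 0.2, 0.3, 0.4)):
    """Yield augmentation batches from bands progressively further away."""
    anchors = encoder.encode(prompts, normalize_embeddings=True)
    lo = 0.0
    for hi in radii:  # widen the band each round
        accepted = []
        for prompt, anchor in zip(prompts, anchors):
            candidates = paraphrase_fn(prompt)  # hypothetical paraphraser
            embs = encoder.encode(candidates, normalize_embeddings=True)
            for cand, emb in zip(candidates, embs):
                # Keep paraphrases in the current distance band only.
                if lo < cosine_distance(anchor, emb) <= hi:
                    accepted.append(cand)
        # Generate images only from prompts inside the current band.
        yield generate_images(accepted)  # hypothetical text-to-image call
        lo = hi
```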