Building a custom Vision-Language Model for Image Captioning and Visual Question Answering (VQA) tasks.

Supervised Fine-Tuning (SFT) of the Phi-2 base model on the MosaicML dolly_hhrlhf dataset.
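
As a rough illustration of what SFT on prompt/response pairs involves, the sketch below builds one training example from already-tokenized text. It assumes the common practice of masking prompt tokens out of the loss so that only the response is learned; the source does not say whether this project does so. The helper `build_sft_example` and the toy token ids are hypothetical, not taken from the project.

```python
# Label value that PyTorch's cross-entropy loss ignores by default.
IGNORE_INDEX = -100


def build_sft_example(prompt_ids, response_ids, eos_id):
    """Join prompt and response tokens and mask the prompt in the labels.

    Hypothetical helper: the model sees the full sequence as input, but
    loss is computed only on the response tokens and the end-of-sequence
    token, because prompt positions carry IGNORE_INDEX.
    """
    input_ids = list(prompt_ids) + list(response_ids) + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids) + [eos_id]
    return {"input_ids": input_ids, "labels": labels}


# Toy token ids standing in for a real tokenizer's output (assumed values).
example = build_sft_example(prompt_ids=[11, 12, 13], response_ids=[21, 22], eos_id=0)
print(example["input_ids"])
print(example["labels"])
```

In a real run the token ids would come from the Phi-2 tokenizer, and the resulting `input_ids` and `labels` would be batched and passed to a causal language model. Training on the full sequence without masking is also a valid choice; this sketch shows only one option.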