Building a custom Vision-Language Model (VLM) for Image Captioning and Visual Question Answering (VQA) tasks.
Vision Encoder - DINOv2 + SigLIP
Language Model - Phi-3 with LoRA fine-tuning
Training Dataset - LLaVA-665K + LRV-Instruct
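A minimal sketch of one plausible glue piece for this setup: a connector that fuses DINOv2 and SigLIP patch tokens and projects them into the language model's embedding space. The dimensions (768 for both base-size encoders, 3072 for Phi-3-mini) and the two-layer MLP design are assumptions, and the sketch presumes the two encoders' patch grids have been matched (e.g., by choosing input resolutions accordingly); random tensors stand in for real encoder outputs.

```python
import torch
import torch.nn as nn

class DualVisionProjector(nn.Module):
    """Fuses DINOv2 and SigLIP patch tokens and projects them into the
    language model's embedding space (a LLaVA-style connector sketch)."""
    def __init__(self, dino_dim=768, siglip_dim=768, lm_dim=3072):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(dino_dim + siglip_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, dino_feats, siglip_feats):
        # Both inputs: (batch, num_patches, dim), patch grids already matched.
        fused = torch.cat([dino_feats, siglip_feats], dim=-1)
        return self.proj(fused)  # (batch, num_patches, lm_dim)

# Random tensors stand in for the two encoders' patch outputs.
dino = torch.randn(2, 256, 768)
siglip = torch.randn(2, 256, 768)
visual_tokens = DualVisionProjector()(dino, siglip)
print(visual_tokens.shape)  # torch.Size([2, 256, 3072])
```

In the full pipeline, these projected tokens would be spliced into the Phi-3 input sequence, with LoRA adapters supplying the trainable language-model weights.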
Supervised Fine-Tuning (SFT) of the Phi-2 base model on the MosaicML dolly_hhrlhf dataset.
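A hedged sketch of what such an SFT run might look like with the Hugging Face Trainer; the prompt/response column names match the dolly_hhrlhf dataset card, but the hyperparameters here are illustrative rather than the project's actual settings.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tok = AutoTokenizer.from_pretrained("microsoft/phi-2")
tok.pad_token = tok.eos_token  # phi-2 ships without a pad token
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

ds = load_dataset("mosaicml/dolly_hhrlhf", split="train")

def tokenize(batch):
    # dolly_hhrlhf provides "prompt" and "response" columns.
    text = [p + r for p, r in zip(batch["prompt"], batch["response"])]
    return tok(text, truncation=True, max_length=1024)

ds = ds.map(tokenize, batched=True, remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="phi2-sft", num_train_epochs=1,
                           per_device_train_batch_size=2, learning_rate=2e-5),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```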
PyTorch implementation of the paper Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning (PACL), published at CVPR 2023 by Meta AI.
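The core idea of PACL, as I read the paper, is to pool vision patch embeddings with text-conditioned weights before applying the usual CLIP-style contrastive loss, which yields patch-level alignment as a byproduct. A minimal sketch of that objective (projection heads and all training scaffolding omitted; tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def pacl_loss(patch_emb, text_emb, temperature=0.07):
    """patch_emb: (B, N, D) projected vision patch embeddings.
    text_emb:  (B, D) pooled text embeddings."""
    patch_emb = F.normalize(patch_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Patch-text similarities become attention weights over patches,
    # computed for every (image, text) pair in the batch.
    sim = torch.einsum("bnd,cd->bcn", patch_emb, text_emb)      # (B, B, N)
    weights = sim.softmax(dim=-1)
    # Text-conditioned weighted pooling of the patches.
    pooled = torch.einsum("bcn,bnd->bcd", weights, patch_emb)   # (B, B, D)
    pooled = F.normalize(pooled, dim=-1)
    logits = (pooled * text_emb.unsqueeze(0)).sum(-1) / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric CLIP-style objective over the pooled embeddings.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = pacl_loss(torch.randn(4, 196, 512), torch.randn(4, 512))
```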
An oversimplified implementation of the CLIP model without the bells and whistles.
Vision Backbone - DINOv2
Text Backbone - DistilBERT
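A bare-bones version of that pairing might look like the following; the checkpoint ids are the common Hugging Face ones (facebook/dinov2-base, distilbert-base-uncased) and are assumptions about which variants the project uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel

class MiniCLIP(nn.Module):
    """Two pre-trained backbones + linear projections into a shared space,
    trained with the symmetric InfoNCE (CLIP) loss."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.vision = AutoModel.from_pretrained("facebook/dinov2-base")
        self.text = AutoModel.from_pretrained("distilbert-base-uncased")
        self.v_proj = nn.Linear(self.vision.config.hidden_size, embed_dim)
        self.t_proj = nn.Linear(self.text.config.hidden_size, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.6593))  # log(1 / 0.07)

    def forward(self, pixel_values, input_ids, attention_mask):
        # Take the [CLS] token from each tower as the global representation.
        v = self.vision(pixel_values=pixel_values).last_hidden_state[:, 0]
        t = self.text(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state[:, 0]
        v = F.normalize(self.v_proj(v), dim=-1)
        t = F.normalize(self.t_proj(t), dim=-1)
        logits = self.logit_scale.exp() * v @ t.t()
        targets = torch.arange(logits.size(0), device=logits.device)
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2
```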
A simple PyTorch implementation of the paper Attention Bottlenecks for Multimodal Fusion (MBT, NeurIPS 2021).
Audio Backbone - AST pre-trained on AudioSet
Visual Backbone - ViT-B/16 pre-trained on ImageNet-21k
Performs parameter-efficient fine-tuning with AdaptFormer (see the sketches below).
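To make the fusion mechanism concrete, here is a minimal sketch of one MBT-style layer: each modality's transformer layer processes its own tokens plus a small set of shared bottleneck tokens, and the two updated copies of the bottleneck are averaged. The layer widths, token counts, and the use of PyTorch's stock TransformerEncoderLayer are all simplifications; the paper stacks such layers inside full AST/ViT backbones.

```python
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    """One MBT-style fusion step: each modality attends only to its own
    tokens plus a few shared bottleneck tokens; the two updated copies of
    the bottleneck are then averaged."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.audio_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.video_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, audio, video, bottleneck):
        n_b = bottleneck.size(1)
        a = self.audio_layer(torch.cat([audio, bottleneck], dim=1))
        v = self.video_layer(torch.cat([video, bottleneck], dim=1))
        audio, b_audio = a[:, :-n_b], a[:, -n_b:]
        video, b_video = v[:, :-n_b], v[:, -n_b:]
        return audio, video, (b_audio + b_video) / 2  # cross-modal exchange

audio = torch.randn(2, 100, 768)       # stand-in for AST patch tokens
video = torch.randn(2, 196, 768)       # stand-in for ViT-B/16 patch tokens
bottleneck = torch.randn(2, 4, 768)    # small budget of shared fusion tokens
audio, video, bottleneck = BottleneckFusionLayer()(audio, video, bottleneck)
```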
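AdaptFormer itself adds a trainable bottleneck branch in parallel with each block's frozen MLP; a sketch under assumed dimensions (bottleneck width 64, a fixed 0.1 scale) follows. In the actual method the scale can also be learned, and the block's own residual connection around this module is handled by the surrounding transformer.

```python
import torch
import torch.nn as nn

class AdaptMLP(nn.Module):
    """AdaptFormer-style module: the frozen pre-trained MLP runs in parallel
    with a small trainable bottleneck branch whose output is scaled and added."""
    def __init__(self, mlp: nn.Module, dim=768, bottleneck=64, scale=0.1):
        super().__init__()
        self.mlp = mlp
        for p in self.mlp.parameters():
            p.requires_grad = False          # only the adapter is trained
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)       # adapter starts as a no-op
        nn.init.zeros_(self.up.bias)
        self.scale = scale

    def forward(self, x):
        return self.mlp(x) + self.scale * self.up(torch.relu(self.down(x)))

# Wrap each transformer block's MLP with the adapter branch.
pretrained_mlp = nn.Sequential(nn.Linear(768, 3072), nn.GELU(),
                               nn.Linear(3072, 768))
out = AdaptMLP(pretrained_mlp)(torch.randn(2, 196, 768))
```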