Building a custom Vision-Language Model (VLM) for Image Captioning and Visual Question Answering (VQA) tasks.
Vision Encoder - DINOv2 + SigLIP
Language Model - Phi-3 with LoRA fine-tuning
Training Dataset - LLaVA-665K + LRV-Instruct
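A minimal sketch of one plausible glue piece for this setup: a connector that fuses DINOv2 and SigLIP patch tokens and projects them into the language model's embedding space. The dimensions (768 for both base-size encoders, 3072 for Phi-3-mini) and the two-layer MLP design are assumptions, and the sketch presumes the two encoders' patch grids have been matched (e.g., by choosing input resolutions accordingly); random tensors stand in for real encoder outputs.

```python
import torch
import torch.nn as nn

class DualVisionProjector(nn.Module):
    """Fuses DINOv2 and SigLIP patch tokens and projects them into the
    language model's embedding space (a LLaVA-style connector sketch)."""
    def __init__(self, dino_dim=768, siglip_dim=768, lm_dim=3072):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(dino_dim + siglip_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, dino_feats, siglip_feats):
        # Both inputs: (batch, num_patches, dim), patch grids already matched.
        fused = torch.cat([dino_feats, siglip_feats], dim=-1)
        return self.proj(fused)  # (batch, num_patches, lm_dim)

# Random tensors stand in for the two encoders' patch outputs.
dino = torch.randn(2, 256, 768)
siglip = torch.randn(2, 256, 768)
visual_tokens = DualVisionProjector()(dino, siglip)
print(visual_tokens.shape)  # torch.Size([2, 256, 3072])
```

In the full pipeline, these projected tokens would be spliced into the Phi-3 input sequence, with LoRA adapters supplying the trainable language-model weights.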
Supervised Fine-Tuning (SFT) of the Phi-2 base model on the MosaicML dolly_hhrlhf dataset.
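A hedged sketch of what such an SFT run might look like with the Hugging Face Trainer; the prompt/response column names match the dolly_hhrlhf dataset card, but the hyperparameters here are illustrative rather than the project's actual settings.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tok = AutoTokenizer.from_pretrained("microsoft/phi-2")
tok.pad_token = tok.eos_token  # phi-2 ships without a pad token
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

ds = load_dataset("mosaicml/dolly_hhrlhf", split="train")

def tokenize(batch):
    # dolly_hhrlhf provides "prompt" and "response" columns.
    text = [p + r for p, r in zip(batch["prompt"], batch["response"])]
    return tok(text, truncation=True, max_length=1024)

ds = ds.map(tokenize, batched=True, remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="phi2-sft", num_train_epochs=1,
                           per_device_train_batch_size=2, learning_rate=2e-5),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```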
PyTorch implementation of the paper Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning (PACL), published at CVPR 2023 by Meta AI.
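The core idea of PACL, as I read the paper, is to pool vision patch embeddings with text-conditioned weights before applying the usual CLIP-style contrastive loss, which yields patch-level alignment as a byproduct. A minimal sketch of that objective (projection heads and all training scaffolding omitted; tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def pacl_loss(patch_emb, text_emb, temperature=0.07):
    """patch_emb: (B, N, D) projected vision patch embeddings.
    text_emb:  (B, D) pooled text embeddings."""
    patch_emb = F.normalize(patch_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Patch-text similarities become attention weights over patches,
    # computed for every (image, text) pair in the batch.
    sim = torch.einsum("bnd,cd->bcn", patch_emb, text_emb)      # (B, B, N)
    weights = sim.softmax(dim=-1)
    # Text-conditioned weighted pooling of the patches.
    pooled = torch.einsum("bcn,bnd->bcd", weights, patch_emb)   # (B, B, D)
    pooled = F.normalize(pooled, dim=-1)
    logits = (pooled * text_emb.unsqueeze(0)).sum(-1) / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric CLIP-style objective over the pooled embeddings.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = pacl_loss(torch.randn(4, 196, 512), torch.randn(4, 512))
```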
An oversimplified implementation of the CLIP model without the bells and whistles.
Vision Backbone - DINOv2
Text Backbone - DistilBERT
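A bare-bones version of that pairing might look like the following; the checkpoint ids are the common Hugging Face ones (facebook/dinov2-base, distilbert-base-uncased) and are assumptions about which variants the project uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel

class MiniCLIP(nn.Module):
    """Two pre-trained backbones + linear projections into a shared space,
    trained with the symmetric InfoNCE (CLIP) loss."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.vision = AutoModel.from_pretrained("facebook/dinov2-base")
        self.text = AutoModel.from_pretrained("distilbert-base-uncased")
        self.v_proj = nn.Linear(self.vision.config.hidden_size, embed_dim)
        self.t_proj = nn.Linear(self.text.config.hidden_size, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.6593))  # log(1 / 0.07)

    def forward(self, pixel_values, input_ids, attention_mask):
        # Take the [CLS] token from each tower as the global representation.
        v = self.vision(pixel_values=pixel_values).last_hidden_state[:, 0]
        t = self.text(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state[:, 0]
        v = F.normalize(self.v_proj(v), dim=-1)
        t = F.normalize(self.t_proj(t), dim=-1)
        logits = self.logit_scale.exp() * v @ t.t()
        targets = torch.arange(logits.size(0), device=logits.device)
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2
```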
A simple PyTorch implementation of the paper Attention Bottlenecks for Multimodal Fusion (MBT, NeurIPS 2021).
Audio Backbone - AST pre-trained on AudioSet
Visual Backbone - ViT-B/16 pre-trained on ImageNet-21k
Performs parameter-efficient fine-tuning with AdaptFormer (see the sketches below).
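To make the fusion mechanism concrete, here is a minimal sketch of one MBT-style layer: each modality's transformer layer processes its own tokens plus a small set of shared bottleneck tokens, and the two updated copies of the bottleneck are averaged. The layer widths, token counts, and the use of PyTorch's stock TransformerEncoderLayer are all simplifications; the paper stacks such layers inside full AST/ViT backbones.

```python
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    """One MBT-style fusion step: each modality attends only to its own
    tokens plus a few shared bottleneck tokens; the two updated copies of
    the bottleneck are then averaged."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.audio_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.video_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, audio, video, bottleneck):
        n_b = bottleneck.size(1)
        a = self.audio_layer(torch.cat([audio, bottleneck], dim=1))
        v = self.video_layer(torch.cat([video, bottleneck], dim=1))
        audio, b_audio = a[:, :-n_b], a[:, -n_b:]
        video, b_video = v[:, :-n_b], v[:, -n_b:]
        return audio, video, (b_audio + b_video) / 2  # cross-modal exchange

audio = torch.randn(2, 100, 768)       # stand-in for AST patch tokens
video = torch.randn(2, 196, 768)       # stand-in for ViT-B/16 patch tokens
bottleneck = torch.randn(2, 4, 768)    # small budget of shared fusion tokens
audio, video, bottleneck = BottleneckFusionLayer()(audio, video, bottleneck)
```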
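AdaptFormer itself adds a trainable bottleneck branch in parallel with each block's frozen MLP; a sketch under assumed dimensions (bottleneck width 64, a fixed 0.1 scale) follows. In the actual method the scale can also be learned, and the block's own residual connection around this module is handled by the surrounding transformer.

```python
import torch
import torch.nn as nn

class AdaptMLP(nn.Module):
    """AdaptFormer-style module: the frozen pre-trained MLP runs in parallel
    with a small trainable bottleneck branch whose output is scaled and added."""
    def __init__(self, mlp: nn.Module, dim=768, bottleneck=64, scale=0.1):
        super().__init__()
        self.mlp = mlp
        for p in self.mlp.parameters():
            p.requires_grad = False          # only the adapter is trained
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)       # adapter starts as a no-op
        nn.init.zeros_(self.up.bias)
        self.scale = scale

    def forward(self, x):
        return self.mlp(x) + self.scale * self.up(torch.relu(self.down(x)))

# Wrap each transformer block's MLP with the adapter branch.
pretrained_mlp = nn.Sequential(nn.Linear(768, 3072), nn.GELU(),
                               nn.Linear(3072, 768))
out = AdaptMLP(pretrained_mlp)(torch.randn(2, 196, 768))
```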