Object Detection and Description using VLMs and LLMs

Project done as part of the course 10-623 (Generative AI) at Carnegie Mellon University (Spring 2024)

GitHub Link: https://github.com/YashPat22/LLM_VLM_Comparison

Kaggle Dataset 1 (Image - Object Detection): https://www.kaggle.com/datasets/yashpatawarijain/iot-components-images

Kaggle Dataset 2 (Text - Prompt Response): https://www.kaggle.com/datasets/yashpatawarijain/iot-component-images

This project aimed to leverage recent advances in computer vision and natural language processing by integrating object detection models with large language models (LLMs) and vision language models (VLMs). The key idea was to inform the language models about detected object classes in input images to improve their ability to describe and provide useful information about those objects.

The researchers created a custom dataset of images from an Internet of Things (IoT) starter kit along with textual descriptions of each component. They trained an object detection model on this dataset and then experimented with various methods to combine the detected object labels with LLMs like Mistral and VLMs like LLaVA. Techniques included zero-shot inference, in-context learning, retrieval-augmented generation, and parameter-efficient fine-tuning.
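As a rough illustration of how detected labels can be passed to a language model, the sketch below builds a text prompt from a list of class names, with optional in-context examples. It is not taken from the project's repository: the function name `build_prompt`, the prompt wording, and the sample labels are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def build_prompt(
    detected_labels: List[str],
    examples: Optional[List[Tuple[str, str]]] = None,
) -> str:
    """Compose a prompt that tells the language model which objects were detected.

    `examples` holds optional (label, description) pairs used as in-context
    demonstrations; with no examples this is the zero-shot case.
    """
    # Deduplicate labels while keeping the order in which they were detected.
    unique_labels = list(dict.fromkeys(detected_labels))

    parts = []
    for label, description in examples or []:
        parts.append(f"Component: {label}\nDescription: {description}\n")

    parts.append(
        "The object detector found the following components in the image: "
        + ", ".join(unique_labels)
        + ".\nDescribe each component and give one typical use case."
    )
    return "\n".join(parts)


# Hypothetical labels standing in for the object detector's output.
prompt = build_prompt(
    ["servo motor", "breadboard", "servo motor"],
    examples=[("LED", "A light-emitting diode used as a status indicator.")],
)
print(prompt)
```

The resulting string would then be sent to the LLM, or to the VLM together with the image; that call is omitted here because it depends on the model and serving setup used.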

Experiments showed that pairing an object detector with an off-the-shelf LLM or VLM, with no further adaptation, produced suboptimal results. Fine-tuning these models on the custom dataset greatly improved their ability to accurately describe the IoT components and their use cases. The best performance was achieved with a retrieval-augmented generation (RAG) approach, which retrieves relevant component descriptions and supplies them to the language model at generation time.
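The retrieval-augmented idea can be sketched as follows: look up the stored description that best matches the detected label, then place it in the prompt as context. This is a minimal sketch under stated assumptions, not the project's implementation. The retriever here is simple token overlap, swapped in as a stand-in for whatever retriever the project actually used, and the knowledge-base entries, function names, and prompt layout are hypothetical.

```python
import re
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, knowledge_base: Dict[str, str], top_k: int = 1) -> List[str]:
    """Return up to top_k passages ranked by token overlap with the query."""
    query_tokens = tokenize(query)
    scored = []
    for name, passage in knowledge_base.items():
        overlap = len(query_tokens & tokenize(name + " " + passage))
        scored.append((overlap, name))
    # Highest overlap first; ties are broken alphabetically by entry name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [knowledge_base[name] for overlap, name in scored[:top_k] if overlap > 0]


def build_rag_prompt(detected_label: str, question: str, knowledge_base: Dict[str, str]) -> str:
    """Combine retrieved context, the detected label, and the question into one prompt."""
    passages = retrieve(detected_label + " " + question, knowledge_base)
    context = "\n".join(passages) if passages else "(no matching passage found)"
    return (
        f"Context:\n{context}\n\n"
        f"Detected component: {detected_label}\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical knowledge base; the real dataset's descriptions are not shown here.
kb = {
    "ultrasonic sensor": "An ultrasonic sensor measures distance by timing reflected sound pulses.",
    "relay module": "A relay module lets a low-voltage signal switch a higher-voltage circuit.",
}
rag_prompt = build_rag_prompt("ultrasonic sensor", "What is this used for?", kb)
print(rag_prompt)
```

A production setup would more likely use embedding similarity than token overlap, but the flow is the same: detect, retrieve, then generate with the retrieved text in the prompt.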

The project demonstrated the effectiveness of integrating vision and language models for specialized domains like IoT component identification. Fine-tuning was crucial for adapting generic models to the niche task. The RAG method proved particularly promising by leveraging both knowledge retrieval and language generation capabilities. Future work could explore continuous RAG agents, larger non-quantized models, and expansion to other domains.

