This project showcases our work on open-source multimodal models, focusing on compact architectures and improved performance on the VisualWebBench benchmark. It centers on fine-tuning and optimizing LLaVA-v1.5-7B with the MultiUI dataset. The work was conducted as part of coursework at Carnegie Mellon University's Language Technologies Institute in the School of Computer Science (Course Website). GitHub Repository.
Key Achievements
State-of-the-Art Results (for our fine-tuned LLaVA model with prompt enhancements):
Achieved 77.94% accuracy on Action Prediction (best result as of 12/15/2024).
Achieved 54.82% accuracy on Heading OCR (best result among open-source, low-parameter models).
Fine-Tuning:
Trained the LLaVA model on the MultiUI dataset, a diverse and comprehensive dataset of web-based UI interactions (a fine-tuning sketch follows this list).
Conducted detailed visual attention analyses to understand and improve alignment between the textual and visual modalities (see the attention sketch below).
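To make the training step concrete, here is a minimal sketch of one plausible recipe: parameter-efficient LoRA fine-tuning of the Hugging Face llava-1.5-7b-hf checkpoint. The checkpoint id, adapter rank, and target modules are illustrative assumptions, not necessarily the exact configuration used in this project.

```python
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Attach low-rank adapters to the language model's attention projections;
# the vision tower and the rest of the base weights stay frozen.
lora_cfg = LoraConfig(
    r=16,                      # adapter rank (assumed hyperparameter)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # sanity check: only adapters are trainable
```

Training then proceeds with a standard causal-LM loss over MultiUI examples converted into LLaVA's conversation format.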
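The visual attention analysis can be approximated with a short probe like the one below: run a single example with `output_attentions=True` and measure how much attention the final text position pays to the image-patch positions. It assumes a recent transformers release whose processor expands the `<image>` placeholder into one token per vision patch; `screenshot.png` is a stand-in input.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("screenshot.png")  # stand-in webpage screenshot
prompt = "USER: <image>\nWhat is the main heading on this page? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# Last layer, first batch element, averaged over heads -> (seq_len, seq_len).
attn = out.attentions[-1][0].float().mean(dim=0).cpu()

# Positions occupied by image-patch placeholder tokens.
image_positions = (inputs.input_ids[0] == model.config.image_token_index).cpu()

# Share of the final text token's attention that lands on the image.
print("attention mass on image patches:", attn[-1, image_positions].sum().item())
```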
Prompt Engineering:
Developed task-specific prompt designs and preprocessing techniques for tasks such as OCR, grounding, and captioning; a sketch of the templating approach follows this list.
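A minimal sketch of the templating idea, with wording that is illustrative rather than a verbatim copy of the prompts used in the project:

```python
# Task-specific prompt templates for VisualWebBench-style tasks.
# The exact phrasings below are assumptions for illustration.
PROMPTS = {
    "heading_ocr": (
        "Read the main heading in this webpage screenshot and return its "
        "text exactly as written, with no extra words."
    ),
    "element_grounding": (
        "Locate the element described as: '{description}'. Answer with the "
        "letter of the candidate bounding box that contains it."
    ),
    "captioning": (
        "Describe the purpose of this webpage in one concise sentence."
    ),
}

def build_prompt(task: str, **fields: str) -> str:
    """Fill the template for `task`; raises KeyError for unknown tasks."""
    return PROMPTS[task].format(**fields)

# Example:
print(build_prompt("element_grounding", description="the search button"))
```

Keeping the templates in one table makes it easy to A/B test wording per task without touching the inference code.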
Comprehensive Research:
Our methodology advanced low-parameter multimodal models and demonstrated that training on a WebUI dataset improves benchmark performance. To our knowledge, this was the first time a model was trained on WebUI tasks specifically to improve generalization to VisualWebBench, rather than using it as just another benchmark.
Contributors:
Akshay Badagabettu
Nikolaj Hindsbo
Aayush Shah
Sai Yarlagadda