This project presents our work on open-source multimodal machine learning models, focusing on compact architectures and on improving performance on the VisualWebBench benchmark. The work centers on fine-tuning and optimizing LLaVA-v1.5-7B using the MultiUI dataset. It was conducted as coursework at the Language Technologies Institute in Carnegie Mellon University's School of Computer Science (Course Website). GitHub Repository.

Key Achievements

Contributors: