Hidden in plain sight: VLMs overlook their visual representations
Stephanie Fu, tyler bonnen, Devin Guillory, Trevor Darrell
ARGUS: Hallucination and Omission Evaluation in Video-LLMs
Ruchit Rawal, Reza Shirkavand, Heng Huang, Gowthami Somepalli, Tom Goldstein
Vision language models are unreliable at trivial spatial cognition
Sangeet S. Khemlani, Tyler Tran, Nathaniel Paul Gyory, Anthony M Harrison, Wallace Lawson, Ravenna Thielstrom, Hunter Grey Joseph Thompson, Taaren Singh, J. Gregory Trafton
PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models?
Mennatullah Siam
Highlight: Learning Visual Prompts for Vision-Language Models
Jana Ricarda Zeller, Aleksandar Shtedritski, Christian Rupprecht
RELOCATE: A Simple Training-Free Baseline for Visual Query Localization Using Region-Based Representations
Savya Khosla, Sethuraman T V, Alex Schwing, Derek Hoiem
Dual Thinking and Logical Processing - Are Multi-modal LLMs Closing the Gap with Human Vision?
Kailas D, Nikhil Kumar, Anand Sinha, Brejesh Lall
VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment
Darshana Saravanan, Varun Gupta, Darshan Singh, Zeeshan Khan, Vineet Gandhi, Makarand Tapaswi
A Training-Free, Task-Agnostic Framework for Enhancing MLLM Performance on High-Resolution Images
Jaeseong Lee, Yeeun Choi, Heechan Choi, Hanjung Kim, Seon Joo Kim
SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models
Justus Westerhoff, Erblina Purelku, Jakob Hackstein, Leo Pinetzki, Lorenz Hufe
VideoSetBench: Identifying and Reasoning Similarities and Differences in Similar Videos
Yue Qiu, Yanjun Sun, Takuma Yagi, Shusaku Egami, Natsuki Miyata, Ken Fukuda, Kensho Hara, Ryusuke Sagawa
Ask, Pose, Unite: Scaling Data Acquisition for Close Interaction Meshes with Vision Language Models
Laura Bravo-Sánchez, Jaewoo Heo, Zhenzhen Weng, Kuan-Chieh Wang, Serena Yeung-Levy
The Art of Deception: Color Visual Illusions and Diffusion Models
Alexandra Gomez-Villa, Kai Wang, C. Alejandro Parraga, Bartłomiej Twardowski, Jesus Malo, Javier Vazquez-Corral, Joost van de Weijer
A Vision Centric Remote Sensing Benchmark
Abduljaleel Adejumo, S. Faegheh Yeganli, Clifford Broni-Bediako, Aoran Xiao, Naoto Yokoya, Mennatullah Siam
ViViD - Vision Language Model for Unified Visual Understanding of Documents
Adithya S Kolavi
FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens
Christian Schlarmann, Francesco Croce, Nicolas Flammarion, Matthias Hein
STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding
Aaryan Garg, Akash Kumar, Yogesh S Rawat
Contextual Self-paced Learning for Weakly Supervised Spatio-Temporal Video Grounding
Akash Kumar, Zsolt Kira, Yogesh S Rawat
A Good CREPE needs more than just Sugar: Investigating Biases in Compositional Vision-Language Benchmarks
Vishaal Udandarao, Mehdi Cherti, Shyamgopal Karthik, Jenia Jitsev, Samuel Albanie, Matthias Bethge
DASH: Detection and Assessment of Systematic Hallucinations of VLMs
Maximilian Augustin, Yannic Neuhaus, Matthias Hein
Emergence of Text Readability in Vision Language Models
Jaeyoo Park, Sanghyuk Chun, Wonjae Kim, Sangdoo Yun, Bohyung Han
Towards Unified Benchmark and Models for Multi-Modal Perceptual Metrics
Sara Ghazanfari, Siddharth Garg, Nicolas Flammarion, Prashanth Krishnamurthy, Farshad Khorrami, Francesco Croce
DiverseFlow: Sample-Efficient Diverse Mode Coverage in Flows
Mashrur M. Morshed, Vishnu Boddeti
RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics
Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, Stan Birchfield