Hidden in plain sight: VLMs overlook their visual representations
Stephanie Fu, tyler bonnen, Devin Guillory, Trevor Darrell
ARGUS: Hallucination and Omission Evaluation in Video-LLMs
Ruchit Rawal, Reza Shirkavand, Heng Huang, Gowthami Somepalli, Tom Goldstein
Vision language models are unreliable at trivial spatial cognition
Sangeet S. Khemlani, Tyler Tran, Nathaniel Paul Gyory, Anthony M Harrison, Wallace Lawson, Ravenna Thielstrom, Hunter Grey Joseph Thompson, Taaren Singh, J. Gregory Trafton
PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models?
Mennatullah Siam
Highlight: Learning Visual Prompts for Vision-Language Models
Jana Ricarda Zeller, Aleksandar Shtedritski, Christian Rupprecht
RELOCATE: A Simple Training-Free Baseline for Visual Query Localization Using Region-Based Representations
Savya Khosla, Sethuraman T V, Alex Schwing, Derek Hoiem
Dual Thinking and Logical Processing - Are Multi-modal LLMs Closing the Gap with Human Vision?
Kailas D, Nikhil Kumar, Anand Sinha, Brejesh Lall
VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment
Darshana Saravanan, Varun Gupta, Darshan Singh, Zeeshan Khan, Vineet Gandhi, Makarand Tapaswi
A Training-Free, Task-Agnostic Framework for Enhancing MLLM Performance on High-Resolution Images
Jaeseong Lee, Yeeun Choi, Heechan Choi, Hanjung Kim, Seon Joo Kim
SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models
Justus Westerhoff, Erblina Purelku, Jakob Hackstein, Leo Pinetzki, Lorenz Hufe
VideoSetBench: Identifying and Reasoning Similarities and Differences in Similar Videos
Yue Qiu, Yanjun Sun, Takuma Yagi, Shusaku Egami, Natsuki Miyata, Ken Fukuda, Kensho Hara, Ryusuke Sagawa
Ask, Pose, Unite: Scaling Data Acquisition for Close Interaction Meshes with Vision Language Models
Laura Bravo-Sánchez, Jaewoo Heo, Zhenzhen Weng, Kuan-Chieh Wang, Serena Yeung-Levy
The Art of Deception: Color Visual Illusions and Diffusion Models
Alexandra Gomez-Villa, Kai Wang, C. Alejandro Parraga, Bartłomiej Twardowski, Jesus Malo, Javier Vazquez-Corral, Joost van de Weijer
A Vision Centric Remote Sensing Benchmark
Abduljaleel Adejumo, S. Faegheh Yeganli, Clifford Broni-Bediako, Aoran Xiao, Naoto Yokoya, Mennatullah Siam
ViViD - Vision Language Model for Unified Visual Understanding of Documents
Adithya S Kolavi
FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens
Christian Schlarmann, Francesco Croce, Nicolas Flammarion, Matthias Hein
STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding
Aaryan Garg, Akash Kumar, Yogesh S Rawat
Contextual Self-paced Learning for Weakly Supervised Spatio-Temporal Video Grounding
Akash Kumar, Zsolt Kira, Yogesh S Rawat
A Good CREPE needs more than just Sugar: Investigating Biases in Compositional Vision-Language Benchmarks
Vishaal Udandarao, Mehdi Cherti, Shyamgopal Karthik, Jenia Jitsev, Samuel Albanie, Matthias Bethge
DASH: Detection and Assessment of Systematic Hallucinations of VLMs
Maximilian Augustin, Yannic Neuhaus, Matthias Hein
Emergence of Text Readability in Vision Language Models
Jaeyoo Park, Sanghyuk Chun, Wonjae Kim, Sangdoo Yun, Bohyung Han
Towards Unified Benchmark and Models for Multi-Modal Perceptual Metrics
Sara Ghazanfari, Siddharth Garg, Nicolas Flammarion, Prashanth Krishnamurthy, Farshad Khorrami, Francesco Croce
DiverseFlow: Sample-Efficient Diverse Mode Coverage in Flows
Mashrur M. Morshed, Vishnu Boddeti
RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics
Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, Stan Birchfield