#31 UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding
Yang Jiao (Fudan University), Haibo Qiu (Meituan), Zequn Jie (Meituan), Shaoxiang Chen (Meituan), Jingjing Chen (Fudan University), Lin Ma (Meituan), Yu-Gang Jiang (Fudan University)
#32 Understanding Depth and Height Perception in Large Visual-Language Models
Shehreen Azad (University of Central Florida), Yash Jain (Microsoft Research), Rishit Garg (Indian Institute of Technology Kharagpur), Vibhav Vineet (Microsoft Research), Yogesh S Rawat (University of Central Florida)
#33 Repurposing SAM for User-Defined Semantics Aware Segmentation
Rohit Kundu (University of California, Riverside), Sudipta Paul (Samsung Research America), Arindam Dutta (University of California, Riverside), Amit Roy-Chowdhury (University of California, Riverside)
#34 PLVM: A tuning-free approach for Personalized Large Vision-Language Model
Chau Pham (University at Buffalo), Hoang Phan (New York University), David Doermann (University at Buffalo), Yunjie Tian (University at Buffalo)
#35 How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs
Muhammad Uzair Khattak (Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI)), Muhammad Ferjad Naeem (Google), Jameel Hassan (MBZUAI), Muzammal Naseer (Khalifa University), Federico Tombari (Google), Fahad Shahbaz Khan (MBZUAI), Salman Khan (MBZUAI)
#36 An Interactive Agent Foundation Model
Zane Durante (Stanford University)
#37 Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning
Neha Kalibhat (University of Maryland)
#38 PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models?
Mennatullah Siam (University of British Columbia)
#39 Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks
Shiyu Xia (AI Lab, Giant Network), Chang Liu (AI Lab, Giant Network), Haomin Zhang (AI Lab, Giant Network), Zihao Chen (AI Lab, Giant Network), Chaofan Ding (AI Lab, Giant Network), Xin Yue (AI Lab, Giant Network), Huizhe Chen (AI Lab, Giant Network), Xinhan Di (Deepearthgo)
#40 RADAR: Robust Anomaly Detection And Recovery for Robot Manipulation
Rui Liu (Tianjin University), Fei Ni (Tianjin University), Longxin Kou (Tianjin University), Jianye Hao (Tianjin University)
#41 HIRE: Lightweight High-Resolution Image Feature Enrichment for Multimodal LLMs
Nikitha SR (MDSR Lab, Adobe Systems), Aradhya Neeraj Mathur (MDSR Lab, Adobe Systems), Tarun Ram Menta (MDSR Lab, Adobe Systems), Rishabh Jain (MDSR Lab, Adobe Systems), Mausoom Sarkar (MDSR Lab, Adobe Systems)
#42 FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens
Christian Schlarmann (University of Tübingen), Francesco Croce (EPFL), Nicolas Flammarion (EPFL), Matthias Hein (University of Tübingen)
#43 From My View to Yours: Ego-Augmented Learning in Large Vision Language Models for Understanding Exocentric Daily Living Activities
Dominick Reilly (University of North Carolina at Charlotte), Manish Kumar Govind (University of North Carolina at Charlotte), Le Xue (Salesforce AI Research), Srijan Das (University of North Carolina at Charlotte)
#44 Analyzing and Fine-Tuning Whisper Models for Multilingual Pilot Speech Transcription in the Cockpit
Kartheek Kumar Reddy Nareddy (German Aerospace Center), Sarah Ternus (German Aerospace Center), Julia Niebling (German Aerospace Center)
#45 HD-VILA-Caption: A Diverse Video-Text Dataset Derived from ASR Narrations
Maheen Saleh (MPI-Informatics Saarbrücken), Nina Shvetsova (Goethe University Frankfurt), Anna Kukleva (MPI-Informatics Saarbrücken), Hilde Kuehne (Goethe University Frankfurt), Bernt Schiele (MPI-Informatics Saarbrücken)
#46 RePOPE: Impact of annotation errors on the POPE benchmark
Yannic Neuhaus (University of Tübingen), Matthias Hein (University of Tübingen)
#47 HEFT: Multi-modal Personalized Federated Learning for Heterogeneous Clients
MinHyuk Seo (KU Leuven), Taeheon Kim (Seoul National University), Hankook Lee (Sungkyunkwan University), Jonghyun Choi (Seoul National University), Tinne Tuytelaars (KU Leuven)
#48 Towards Vision-Language-Garment Models For Web Knowledge Garment Understanding and Generation
Jan Ackermann (ETH Zurich), Kiyohiro Nakayama (Stanford University), Tong Wu (Stanford University), Guandao Yang (Stanford University), Gordon Wetzstein (Stanford University)
#49 GraPE: A Generate-Plan-Edit Framework for Compositional T2I Synthesis
Ashish Goswami (IIT-Delhi), Satyam Modi (IIT-Delhi), Santhosh Rishi Deshineni (IIT-Delhi), Harman Singh (IIT-Delhi), Prathosh A.P (IISc Bangalore), Parag Singla (IIT-Delhi)
#50 Differential Attention for Multimodal Crisis Event Analysis
Nusrat Munia (University of Kentucky), Junfeng Zhu (University of Kentucky), Olfa Nasraoui (University of Louisville), Abdullah-Al-Zubaer Imran (University of Kentucky)
#51 Multimodal Prompting for Parameter-Efficient Audio-Visual Learning
Kai Wang (University of Toronto), Shentong Mo (Carnegie Mellon University), Yapeng Tian (The University of Texas at Dallas), Dimitrios Hatzinakos (University of Toronto)
#52 TemporalBench: Evaluating Fine-Grained Temporal Dynamics Understanding for Multimodal Models
Mu Cai (University of Wisconsin-Madison), Reuben Tan (Microsoft), Jianrui Zhang (University of Wisconsin-Madison), Bocheng Zou (University of Wisconsin-Madison), Kai Zhang (Ohio State University), Feng Yao (UCSD), Fangrui Zhu (Northeastern University), Jing Gu (University of California, Santa Cruz), Yiwu Zhong (CUHK), Yuzhang Shang (Illinois Institute of Technology), Yao Dou (Georgia Tech), Jaden Park (University of Wisconsin-Madison), Jianfeng Gao (Microsoft), Yong Jae Lee (University of Wisconsin-Madison), Jianwei Yang (Microsoft)
#53 Vinoground: Today’s LMMs Don’t Understand Short Counterfactual Videos
Jianrui Zhang (University of Wisconsin-Madison), Mu Cai (University of Wisconsin-Madison), Yong Jae Lee (University of Wisconsin-Madison)
#54 sEEG-based Encoding for Sentence Retrieval: A Contrastive Learning Approach to Brain-Language Alignment
Yijun Liu (Signal and Image Processing Institute, University of Southern California)
#55 Yo’Chameleon: Personalized Vision and Language Generation
Thao Nguyen (University of Wisconsin-Madison), Krishna Kumar Singh (Adobe Research), Jing Shi (Adobe Research), Trung Bui (Adobe Research), Yong Jae Lee (University of Wisconsin-Madison), Yuheng Li (Adobe Research)
#56 LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living
Dominick Reilly (University of North Carolina at Charlotte), Rajatsubhra Chakraborty (University of North Carolina at Charlotte), Arkaprava Sinha (University of North Carolina at Charlotte), Manish Kumar Govind (University of North Carolina at Charlotte), Pu Wang (University of North Carolina at Charlotte), Francois Bremond (INRIA), Le Xue (Salesforce AI Research), Srijan Das (University of North Carolina at Charlotte)
#57 Seeing Speech and Sound: Distinguishing and Locating Audio Sources in Visual Scenes
Hyeonggon Ryu (KAIST), Seongyu Kim (KAIST), Joon Son Chung (KAIST), Arda Senocak (KAIST)
#58 Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering
Federico Cocchi (University of Modena and Reggio Emilia), Nicholas Moratelli (University of Modena and Reggio Emilia), Marcella Cornia (University of Modena and Reggio Emilia), Lorenzo Baraldi (University of Modena and Reggio Emilia), Rita Cucchiara (University of Modena and Reggio Emilia)
#59 Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval
Davide Caffagni (University of Modena and Reggio Emilia), Sara Sarto (University of Modena and Reggio Emilia), Marcella Cornia (University of Modena and Reggio Emilia), Lorenzo Baraldi (University of Modena and Reggio Emilia), Rita Cucchiara (University of Modena and Reggio Emilia)
#60 VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment
Darshana Saravanan (IIIT Hyderabad), Varun Gupta (IIIT Hyderabad), Darshan Singh (IIIT Hyderabad), Zeeshan Khan (Inria, Paris), Vineet Gandhi (IIIT Hyderabad), Makarand Tapaswi (IIIT Hyderabad)
#61 CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
Edson Araujo (Goethe University Frankfurt), Andrew Rouditchenko (Massachusetts Institute of Technology), Yuan Gong (Massachusetts Institute of Technology), Saurabhchand Bhati (Massachusetts Institute of Technology), Samuel Thomas (IBM Research), Brian Kingsbury (IBM Research), Leonid Karlinsky (IBM Research), Rogerio Feris (IBM Research), James Glass (Massachusetts Institute of Technology), Hilde Kuehne (University of Tübingen)
#62 Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents
Yunseok Jang (University of Michigan), Yeda Song (University of Michigan), Sungryull Sohn (LG AI Research), Lajanugen Logeswaran (LG AI Research), Tiange Luo (University of Michigan), Dong-Ki Kim (LG AI Research), Kyunghoon Bae (LG AI Research), Honglak Lee (University of Michigan)
#63 BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
Md Mohaiminul Islam (UNC Chapel Hill), Tushar Nagarajan (Meta), Huiyu Wang (Meta), Gedas Bertasius (UNC Chapel Hill), Lorenzo Torresani (Meta)
#64 VITED: Video Temporal Evidence Distillation
Yujie Lu (UCSB), Yale Song (FAIR, Meta), William Wang (UCSB), Lorenzo Torresani (FAIR, Meta), Tushar Nagarajan (FAIR, Meta)
#65 STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding
Aaryan Garg (BITS Pilani), Akash Kumar (University of Central Florida), Yogesh S Rawat (University of Central Florida)
#66 MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research
James Burgess (Stanford University)