#31 UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding
Yang Jiao (Fudan University), Haibo Qiu (Meituan), Zequn Jie (Meituan), Shaoxiang Chen (Meituan), Jingjing Chen (Fudan University), Lin Ma (Meituan), Yu-Gang Jiang (Fudan University)
#32 Understanding Depth and Height Perception in Large Visual-Language Models
Shehreen Azad (University of Central Florida), Yash Jain (Microsoft Research), Rishit Garg (Indian Institute of Technology Kharagpur), Vibhav Vineet (Microsoft Research), Yogesh S Rawat (University of Central Florida)
#33 Repurposing SAM for User-Defined Semantics Aware Segmentation
Rohit Kundu (University of California, Riverside), Sudipta Paul (Samsung Research America), Arindam Dutta (University of California, Riverside), Amit Roy-Chowdhury (University of California, Riverside)
#34 PLVM: A tuning-free approach for Personalized Large Vision-Language Model
Chau Pham (University at Buffalo), Hoang Phan (New York University), David Doermann (University at Buffalo), Yunjie Tian (University at Buffalo)
#35 How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs
Muhammad Uzair Khattak (Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI)), Muhammad Ferjad Naeem (Google), Jameel Hassan (MBZUAI), Muzammal Naseer (Khalifa University), Federico Tombari (Google), Fahad Shahbaz Khan (MBZUAI), Salman Khan (MBZUAI)
#36 An Interactive Agent Foundation Model
Zane Durante (Stanford University)
#37 Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning
Neha Kalibhat (University of Maryland)
#38 PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models?
Mennatullah Siam (University of British Columbia)
#39 Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks
Shiyu Xia (AI Lab, Giant Network), Chang Liu (AI Lab, Giant Network), Haomin Zhang (AI Lab, Giant Network), Zihao Chen (AI Lab, Giant Network), Chaofan Ding (AI Lab, Giant Network), Xin Yue (AI Lab, Giant Network), Huizhe Chen (AI Lab, Giant Network), Xinhan Di (Deepearthgo)
#40 RADAR: Robust Anomaly Detection And Recovery for Robot Manipulation
Rui Liu (Tianjin University), Fei Ni (Tianjin University), Longxin Kou (Tianjin University), Jianye Hao (Tianjin University)
#41 HIRE: Lightweight High-Resolution Image Feature Enrichment for Multimodal LLMs
Nikitha SR (MDSR Lab, Adobe Systems), Aradhya Neeraj Mathur (MDSR Lab, Adobe Systems), Tarun Ram Menta (MDSR Lab, Adobe Systems), Rishabh Jain (MDSR Lab, Adobe Systems), Mausoom Sarkar (MDSR Lab, Adobe Systems)
#42 FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens
Christian Schlarmann (University of Tübingen), Francesco Croce (EPFL), Nicolas Flammarion (EPFL), Matthias Hein (University of Tübingen)
#43 From My View to Yours: Ego-Augmented Learning in Large Vision Language Models for Understanding Exocentric Daily Living Activities
Dominick Reilly (University of North Carolina at Charlotte), Manish Kumar Govind (University of North Carolina at Charlotte), Le Xue (Salesforce AI Research), Srijan Das (University of North Carolina at Charlotte)
#44 Analyzing and Fine-Tuning Whisper Models for Multilingual Pilot Speech Transcription in the Cockpit
Kartheek Kumar Reddy Nareddy (German Aerospace Center), Sarah Ternus (German Aerospace Center), Julia Niebling (German Aerospace Center)
#45 HD-VILA-Caption: A Diverse Video-Text Dataset Derived from ASR Narrations
Maheen Saleh (MPI-Informatics Saarbrücken), Nina Shvetsova (Goethe University Frankfurt), Anna Kukleva (MPI-Informatics Saarbrücken), Hilde Kuehne (Goethe University Frankfurt), Bernt Schiele (MPI-Informatics Saarbrücken)
#46 RePOPE: Impact of annotation errors on the POPE benchmark
Yannic Neuhaus (University of Tübingen), Matthias Hein (University of Tübingen)
#47 HEFT: Multi-modal Personalized Federated Learning for Heterogeneous Clients
MinHyuk Seo (KU Leuven), Taeheon Kim (Seoul National University), Hankook Lee (Sungkyunkwan University), Jonghyun Choi (Seoul National University), Tinne Tuytelaars (KU Leuven)
#48 Towards Vision-Language-Garment Models For Web Knowledge Garment Understanding and Generation
Jan Ackermann (ETH Zurich), Kiyohiro Nakayama (Stanford University), Tong Wu (Stanford University), Guandao Yang (Stanford University), Gordon Wetzstein (Stanford University)
#49 GraPE: A Generate-Plan-Edit Framework for Compositional T2I Synthesis
Ashish Goswami (IIT-Delhi), Satyam Modi (IIT-Delhi), Santhosh Rishi Deshineni (IIT-Delhi), Harman Singh (IIT-Delhi), Prathosh A.P (IISc Bangalore), Parag Singla (IIT-Delhi)
#50 Differential Attention for Multimodal Crisis Event Analysis
Nusrat Munia (University of Kentucky), Junfeng Zhu (University of Kentucky), Olfa Nasraoui (University of Louisville), Abdullah-Al-Zubaer Imran (University of Kentucky)
#51 Multimodal Prompting for Parameter-Efficient Audio-Visual Learning
Kai Wang (University of Toronto), Shentong Mo (Carnegie Mellon University), Yapeng Tian (The University of Texas at Dallas), Dimitrios Hatzinakos (University of Toronto)
#52 TemporalBench: Evaluating Fine-Grained Temporal Dynamics Understanding for Multimodal Models
Mu Cai (University of Wisconsin-Madison), Reuben Tan (Microsoft), Jianrui Zhang (University of Wisconsin-Madison), Bocheng Zou (University of Wisconsin-Madison), Kai Zhang (Ohio State University), Feng Yao (UCSD), Fangrui Zhu (Northeastern University), Jing Gu (University of California, Santa Cruz), Yiwu Zhong (CUHK), Yuzhang Shang (Illinois Institute of Technology), Yao Dou (Georgia Tech), Jaden Park (University of Wisconsin-Madison), Jianfeng Gao (Microsoft), Yong Jae Lee (University of Wisconsin-Madison), Jianwei Yang (Microsoft)
#53 Vinoground: Today’s LMMs Don’t Understand Short Counterfactual Videos
Jianrui Zhang (University of Wisconsin-Madison), Mu Cai (University of Wisconsin-Madison), Yong Jae Lee (University of Wisconsin-Madison)
#54 sEEG-based Encoding for Sentence Retrieval: A Contrastive Learning Approach to Brain-Language Alignment
Yijun Liu (Signal and Image Processing Institute, University of Southern California)
#55 Yo’Chameleon: Personalized Vision and Language Generation
Thao Nguyen (University of Wisconsin-Madison), Krishna Kumar Singh (Adobe Research), Jing Shi (Adobe Research), Trung Bui (Adobe Research), Yong Jae Lee (University of Wisconsin-Madison), Yuheng Li (Adobe Research)
#56 LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living
Dominick Reilly (University of North Carolina at Charlotte), Rajatsubhra Chakraborty (University of North Carolina at Charlotte), Arkaprava Sinha (University of North Carolina at Charlotte), Manish Kumar Govind (University of North Carolina at Charlotte), Pu Wang (University of North Carolina at Charlotte), Francois Bremond (INRIA), Le Xue (Salesforce AI Research), Srijan Das (University of North Carolina at Charlotte)
#57 Seeing Speech and Sound: Distinguishing and Locating Audio Sources in Visual Scenes
Hyeonggon Ryu (KAIST), Seongyu Kim (KAIST), Joon Son Chung (KAIST), Arda Senocak (KAIST)
#58 Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering
Federico Cocchi (University of Modena and Reggio Emilia), Nicholas Moratelli (University of Modena and Reggio Emilia), Marcella Cornia (University of Modena and Reggio Emilia), Lorenzo Baraldi (University of Modena and Reggio Emilia), Rita Cucchiara (University of Modena and Reggio Emilia)
#59 Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval
Davide Caffagni (University of Modena and Reggio Emilia), Sara Sarto (University of Modena and Reggio Emilia), Marcella Cornia (University of Modena and Reggio Emilia), Lorenzo Baraldi (University of Modena and Reggio Emilia), Rita Cucchiara (University of Modena and Reggio Emilia)
#60 VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment
Darshana Saravanan (IIIT Hyderabad), Varun Gupta (IIIT Hyderabad), Darshan Singh (IIIT Hyderabad), Zeeshan Khan (Inria, Paris), Vineet Gandhi (IIIT Hyderabad), Makarand Tapaswi (IIIT Hyderabad)
#61 CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
Edson Araujo (Goethe University Frankfurt), Andrew Rouditchenko (Massachusetts Institute of Technology), Yuan Gong (Massachusetts Institute of Technology), Saurabhchand Bhati (Massachusetts Institute of Technology), Samuel Thomas (IBM Research), Brian Kingsbury (IBM Research), Leonid Karlinsky (IBM Research), Rogerio Feris (IBM Research), James Glass (Massachusetts Institute of Technology), Hilde Kuehne (University of Tübingen)
#62 Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents
Yunseok Jang (University of Michigan), Yeda Song (University of Michigan), Sungryull Sohn (LG AI Research), Lajanugen Logeswaran (LG AI Research), Tiange Luo (University of Michigan), Dong-Ki Kim (LG AI Research), Kyunghoon Bae (LG AI Research), Honglak Lee (University of Michigan)
#63 BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
Md Mohaiminul Islam (UNC Chapel Hill), Tushar Nagarajan (Meta), Huiyu Wang (Meta), Gedas Bertasius (UNC Chapel Hill), Lorenzo Torresani (Meta)
#64 VITED: Video Temporal Evidence Distillation
Yujie Lu (UCSB), Yale Song (FAIR, Meta), William Wang (UCSB), Lorenzo Torresani (FAIR, Meta), Tushar Nagarajan (FAIR, Meta)
#65 STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding
Aaryan Garg (BITS Pilani), Akash Kumar (University of Central Florida), Yogesh S Rawat (University of Central Florida)
#66 MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research
James Burgess (Stanford University)