Program
08:30am - Welcome
08:40am - Keynote Talk 1: Song Han
09:10am - Keynote Talk 2: Cordelia Schmid
09:40am - Poster session + coffee break (Arch Building Exhibit Hall)
11:00am - Keynote Talk 3: Yong Jae Lee
11:30am - Panel: What is Next in Multimodal Foundation Models?
12:30pm - Challenge Award and Winner Presentation
12:50pm - Concluding Remarks
Accepted Papers
Full-length Papers
Poster ID #48: "Strategies to Leverage Foundational Model Knowledge in Object Affordance Grounding", Arushi Rai, R Buettner, Adriana Kovashka pdf
Poster ID #49: "Recognize Anything: A Strong Image Tagging Model", Youcai Zhang, Xinyu Huang, Jinyu Ma, Zhaoyang Li, Zhaochuan Luo, Yanchun Xie, Yuzhuo Qin, Tong Luo, Yaqian Li, Shilong Liu, Yandong Guo, Lei Zhang pdf
Poster ID #50: "ICSVR: Investigating Compositional and Syntactic Understanding in Video Retrieval Models", Avinash Madasu, Vasudev Lal pdf
Poster ID #51: "Continual Diffusion with STAMINA: STack-And-Mask INcremental Adapters", S Smith, Yen-Chang Hsu, Zsolt Kira, Yilin Shen, Hongxia Jin pdf
Poster ID #52: "Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models", Gong Zhang, Kai Wang, Xingqian Xu, Zhangyang Wang, Humphrey Shi pdf
Poster ID #53: "LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning", Junchi Wang, Lei Ke pdf
Poster ID #54: "Matting Anything", Jiachen Li, Jitesh Jain, Humphrey Shi pdf
Poster ID #55: "Robustness Analysis on Foundational Segmentation Models", C Schiappa, Shehreen Azad, Sachidanand VS, Yunhao Ge, Ondrej Miksik, Yogesh Rawat, Vibhav Vineet pdf
Poster ID #56: "Probing Conceptual Understanding of Large Visual-Language Models", C Schiappa, Raiyaan Abdullah, Shehreen Azad, Jared Claypoole, Michael Cogswell, Ajay Divakaran, Yogesh Rawat pdf
Poster ID #57: "Show, Think, and Tell: Thought-Augmented Fine-Tuning of Large Language Models for Video Captioning", Byoungjip Kim, Dasol Hwang, Sungjun Cho, Youngsoo Jang, Honglak Lee, Moontae Lee pdf
Poster ID #58: "Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs", Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara pdf
Poster ID #59: "Benchmarking Zero-Shot Recognition with Vision-Language Models: Challenges on Granularity and Specificity", Zhenlin Xu, Yi Zhu, Siqi Deng, Abhay Mittal, Yanbei Chen, Manchen Wang, Paolo Favaro, Joseph Tighe, Davide Modolo pdf
Poster ID #60: "Towards Efficient Audio-Visual Learners via Empowering Pre-trained Vision Transformers with Cross-Modal Adaptation", Kai Wang, Yapeng Tian, Dimitrios Hatzinakos pdf video poster
Extended Abstracts
Poster ID #61: "SILC: Improving Vision Language Pretraining with Self-Distillation", Ferjad Naeem, Yongqin Xian, Xiaohua Zhai, Lukas Hoyer, Luc Van, Federico Tombari pdf
Poster ID #62: "Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition", Youngtaek Oh, Pyunghwan Ahn, Jinhyung Kim, Gwangmo Song, Soonyoung Lee, So Kweon, Junmo Kim pdf
Poster ID #63: "Learning to Prompt with Text Only Supervision for Vision-Language Models", Uzair Khattak, Ferjad Naeem, Muzammal Naseer, Luc Van, Federico Tombari pdf
Poster ID #64: "As Firm As Their Foundations: Can open-sourced foundation models be used to create adversarial examples for downstream tasks?", Anjun Hu, Jindong Gu, Francesco Pinto, Konstantinos Kamnitsas, Philip Torr pdf
Poster ID #65: "Frozen Transformers in Language Models Are Effective Visual Encoder Layers", Ziqi Pang, ZiYang Xie, Yunze Man, Yu-Xiong Wang pdf
Poster ID #66: "Compositional Learning for Vision-Language Reinforcement Learning Agents", Zijun Lin, Haidi Azaman, Ganesh Kumar, Cheston Tan pdf
Poster ID #67: "Stop Reasoning! When Multimodal LLMs with Chain-of-Thought Reasoning Meet Adversarial Images", Zefeng Wang, Zhen Han, Shuo Chen, Fan Xue, Zifeng Ding, Xun Xiao, Volker Tresp, Philip Torr, Jindong Gu pdf
Poster ID #68: "Linear Alignment of Vision-language Models for Image Captioning", Fabian Paischer, Markus Hofmarcher, Sepp Hochreiter, Thomas Adler pdf
Poster ID #69: "Training-Free Semantic Segmentation via LLM-Supervision", Yingjun Du, Wenfang Sun, Gaowen Liu, Ramana Kompella, Cees Snoek pdf
Poster ID #70: "Look, Remember and Reason: Grounded Reasoning in Videos with Language Models", Apratim Bhattacharyya, P Panchal, Mingu Lee, Reza Pourreza, Pulkit Madan, Roland Memisevic pdf
Poster ID #71: "What to Say and When to Say it: A Video-Language Model and Benchmark for Situated Interactions", Apratim Bhattacharyya, P Panchal, F. Berger, Antoine Mercier, Cornelius Böhm, Florian Dietrichkeit, Xuanlin Li, Reza Pourreza, Pulkit Madan, Sanjay Haresh, Mingu Lee, Mark Todorovich, Ingo Bax, Roland Memisevic pdf
Poster ID #72: "Are Vision Language Models Texture or Shape Biased and Can We Steer Them?", Paul Gavrikov, Jovita Lukasik, Steffen Jung, Robert Geirhos, Bianca Lamm, Jehanzeb Mirza, Margret Keuper, Janis Keuper pdf poster
Poster ID #73: "Toward a Diffusion-Based Generalist for Dense Vision Tasks", Yue Fan, Yongqin Xian, Xiaohua Zhai, Alexander Kolesnikov, Ferjad Naeem, Bernt Schiele, Federico Tombari pdf
Poster ID #74: "Synthesizing Image with High-Quality Segmentation Mask by Prompting Large Vision Model", Xuan-Tuyen Tran pdf
Poster ID #75: "BLINK: Multimodal Large Language Models Can See but Not Perceive", Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, A Smith, Wei-Chiu Ma, Ranjay Krishna pdf
Poster ID #76: "VideoAgent: Long-form Video Understanding with Large Language Model as Agent", Xiaohan Wang, Yuhui Zhang, Orr Zohar, Serena Yeung-Levy pdf
Poster ID #77: "Conceptual-Learning via Latent Approximations for Reinforcing Interpretability and Transparency", Maor Dikter, Tsachi Blau, Chaim Baskin pdf
Poster ID #78: "Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data", Yuhui Zhang, Elaine Sui, Serena Yeung-Levy pdf
Poster ID #79: "OpenDAS: Domain Adaptation for Open-Vocabulary Segmentation", Gonca Yilmaz, Songyou Peng, Francis Engelmann, Marc Pollefeys, Hermann Blum pdf
Poster ID #80: "Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering", Bowen Jiang, Zhijun Zhuang, Skandan Shivakumar, Dan Roth, Jose Taylor pdf
Poster ID #81: "Diffusion Models for Improved Compositional Generalisation in VLMs", Beth Pearson, Martha Lewis, Michael Wray pdf
Poster ID #82: "LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model", Matthew Olson, Shao-Yen Tseng, Vasudev Lal, David Cobbley, Musashi Hinck pdf
Poster ID #83: "Accurate Medical Image Classification using Contrastive Graph Cross-View Learning with Multimodal Fusion", Jun-En Ding pdf
Poster ID #84: "MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens", Kirolos Ataallah, Xiaoqian Shen, Mohamed Abdelrahman, Essam Sleiman, Deyao Zhu, Jian Ding, Mohamed Elhoseiny pdf
Poster ID #85: "Can Better Text Semantics in Prompt Tuning Improve VLM Generalization?", Chandana K, Srinivas Kancheti, Gowtham Reddy, Vineeth N pdf
Poster ID #86: "iMotion-LLM: Motion Prediction Instruction Tuning", Abdulwahab Felemban, Mohamed Abdelrahman, Xiaoqian Shen, Jian Ding, A Mohamed, Mohamed Elhoseiny pdf
Poster ID #87: "Can CLIP Help Visual Sound Localization?", Sooyoung Park, Arda Senocak, Son Chung pdf
Poster ID #88: "MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-task Learning", Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, Mohamed Elhoseiny pdf
Main CVPR Papers
Poster ID #89: "Low-Resource Vision Challenges for Foundation Models", Yunhua Zhang, Hazel Doughty, Cees Snoek pdf
Poster ID #90: "FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication", Eric Slyman, Stefan Lee, Scott Cohen, Kushal Kafle pdf
Poster ID #91: "ViT-Lens: Towards Omni-modal Representations", Weixian Lei, Yixiao Ge, Kun Yi, Jianfeng Zhang, Difei Gao, Dylan Sun, Yuying Ge, Ying Shan, Zheng Shou pdf
Poster ID #92: "SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities", Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, Fei Xia pdf
Poster ID #93: "Any-Shift Prompting for Generalization over Distributions", Zehao Xiao, Jiayi Shen, Mahdi Derakhshani, Shengcai Liao, Cees Snoek pdf
Poster ID #94: "Grounding Everything: Emerging Localization Properties in Vision-Language Transformers", Walid Bousselham, Felix Petersen, Vittorio Ferrari, Hilde Kuehne pdf
Poster ID #95: "The Neglected Tails of Vision-Language Models", Shubham Parashar, Shu Kong, Tian Liu, James Caverlee, Zhiqiu Lin, Deva Ramanan, Yanan Li, Xiangjue Dong pdf
Poster ID #96: "Situational Awareness Matters in 3D Vision Language Reasoning", Yunze Man, Liangyan Gui, Yu-Xiong Wang pdf
Poster ID #97: "HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models", Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, Tianyi Zhou pdf
Poster ID #98: "PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs", Michael Dorkenwald, Nimrod Barazani, Cees Snoek, M Asano pdf
Poster ID #99: "Honeybee: Locality-enhanced Projector for Multimodal LLM", Junbum Cha, Wooyoung Kang, Jonghwan Mun, Byungseok Roh pdf
Poster ID #100: "X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization", Anna Kukleva, Fadime Sener, Edoardo Remelli, Bugra Tekin, Eric Sauser, Bernt Schiele, Shugao Ma pdf video poster
Poster ID #101: "Multi-Modal Hallucination Control by Visual Information Grounding", Alessandro Favero, Luca Zancato, Matthew Trager, Siddharth Choudhary, Pramuditha Perera, Alessandro Achille, Ashwin Swaminathan, Stefano Soatto pdf
Poster ID #102: "Describing Differences in Image Sets with Natural Language", Lisa Dunlap, Yuhui Zhang, Xiaohan Wang, Ruiqi Zhong, Trevor Darrell, Jacob Steinhardt, E Gonzalez, Serena Yeung-Levy pdf
Poster ID #103: "Segment Every Out-of-Distribution Object", Wenjie Zhao, Jia Li, Xin Dong, Yu Xiang, Yunhui Guo pdf
Poster ID #104: "Diffusion 3D Features (Diff3F): Decorating Untextured Shapes with Distilled Semantic Features", Shekhar Dutt, Sanjeev Muralikrishnan, Niloy Mitra pdf
Poster ID #105: "Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions", Oindrila Saha, Grant Van, Subhransu Maji pdf
Poster ID #106: "Shadows Don't Lie and Lines Can't Bend! Generative Models don't know Projective Geometry...for now", G Sarkar, Hanlin Mai, Amitabh Mahapatra, Svetlana Lazebnik, David Forsyth, Anand Bhattad pdf