08:00am - Poster Setup
08:25am - Workshop Kickoff and Opening Comments
08:30am - Ranjay Krishna
09:00am - Yong Jae Lee
09:30am - Poster Session + Break
The poster session will be in Exhall II. Please place your poster in the spot specified in the accepted posters list below.
10:30am - Mohit Bansal
10:50am - Hongxu (Danny) Yin
11:10am - Georgia Gkioxari
11:30am - Panel: What is Next in Multimodal Foundation Models?
Ranjay Krishna, Mohit Bansal, Danny Yin, Georgia Gkioxari
Panel Chair: David Chan
Ask your questions to the panel here!
12:00pm - Concluding Remarks
Ranjay Krishna is an Assistant Professor at the Paul G. Allen School of Computer Science & Engineering at the University of Washington. He co-directs the RAIVN lab at UW and directs the PRIOR team at Ai2. His research lies at the intersection of computer vision, natural language processing, robotics, and human-computer interaction. This research has received best paper and outstanding paper awards, as well as oral presentations, at CVPR, ACL, CSCW, NeurIPS, UIST, and ECCV, and has been covered by Science, Forbes, the Wall Street Journal, and PBS NOVA. His research has been supported by Google, Apple, Ai2, Amazon, Cisco, Toyota Motor Inc., Toyota Research Institute, NSF, ONR, and Yahoo. He holds a bachelor's degree in Electrical & Computer Engineering and in Computer Science from Cornell University, and a master's degree and a Ph.D. in Computer Science from Stanford University.
Yong Jae Lee is a Professor in the Department of Computer Sciences at the University of Wisconsin-Madison and a Research Scientist at Adobe Research. His research interests are in computer vision and machine learning, with a focus on robust AI systems that learn to understand the multimodal world with minimal human supervision.
Dr. Mohit Bansal is the John R. & Louise S. Parker Distinguished Professor and the Director of the MURGe-Lab (UNC-AI Group) in the Computer Science department at UNC Chapel Hill. He received his PhD from UC Berkeley in 2013 and his BTech from IIT Kanpur in 2008. His research expertise is in multimodal generative models, reasoning and planning agents, faithful language generation, and interpretable, efficient, and generalizable deep learning. He is an AAAI Fellow and a recipient of the Presidential Early Career Award for Scientists and Engineers (PECASE), IIT Kanpur Young Alumnus Award, DARPA Director's Fellowship, NSF CAREER Award, Google Focused Research Award, Microsoft Investigator Fellowship, Army Young Investigator Award (YIP), DARPA Young Faculty Award (YFA), and outstanding paper awards at ACL, CVPR, EACL, COLING, CoNLL, and TMLR. He has been a keynote speaker at the ECAI 2025, ACM-CODS 2025, AACL-IJCNLP 2023, CoNLL 2023, and INLG 2022 conferences. His service includes EMNLP Program Co-Chair, CoNLL Program Co-Chair, ACL Executive Committee, ACM Doctoral Dissertation Award Committee, ACL Doctoral Dissertation Award Co-Organizer, ACL Mentorship Program Co-Founder, and Associate/Action Editor for the TACL, CL, IEEE/ACM TASLP, and CSL journals. Webpage: https://www.cs.unc.edu/~mbansal/
Hongxu (Danny) Yin is a principal research scientist and tech lead at NVIDIA Research. He received his PhD from Princeton University. He currently leads NVIDIA's multimodal herd of VILA VLMs launched at NVIDIA GTC, overseeing multimodal post-training, agents, and reasoning, with full-stack optimization for NVIDIA GPU/Jetson/Thor. He also works on efficient LLMs and vision encoders.
Georgia Gkioxari is an assistant professor in Computing + Mathematical Sciences at Caltech. She obtained her PhD in Electrical Engineering and Computer Science from UC Berkeley, where she was advised by Jitendra Malik. Prior to Berkeley, she earned her diploma from the National Technical University of Athens in Greece. After earning her PhD, she was a research scientist on Meta's FAIR team. In 2021, she received the PAMI Young Researcher Award, which recognizes a young researcher for their distinguished research contribution to computer vision. She is the recipient of the PAMI Mark Everingham Award for the open-source software suite Detectron (2021), the Google Faculty Award (2024), and the Okawa Research Award (2024). In 2017, Georgia and her co-authors received the Marr Prize for “Mask R-CNN”, published and presented at ICCV. She was named one of 30 influential women advancing AI in 2019 by ReWork and was nominated for the Women in AI Awards in 2020 by VentureBeat.
1. WB-DH: Towards Whole Body Digital Human Bench for the Generation of Whole-body Talking Avatar Videos (Poster 40)
2. Deep-Sound: Start to Employ Step-by-Step Priors in the Audio Generation from Videos (Poster 41)
3. Linear Attention Meets Sparse Vision Transformers: Hierarchical Sparse Multi-scale Linear Attention for Fast and Accurate Super-Resolution (Poster 42)
4. CAT: Content-Adaptive Image Tokenization (Poster 43)
5. Learning by Taking Notes: Memory-Guided Continual Learning for Generative Multimodal Models (Poster 44)
6. CobraVPS: Code Template Optimization for Better Question Reasoning Accuracy with Visual Program Synthesis (Poster 45)
7. HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding (Poster 46)
8. A Plug-and-Play Approach for Robust Image Editing in Text-to-Image Diffusion Models (Poster 47)
9. DisenQ: Disentangling Q-Former for Activity-Biometrics (Poster 48)
10. GLAD: Generalizable Tuning for Vision-Language Models (Poster 49)
11. CoT-Pose: Chain-of-Thought Reasoning for 3D Pose Generation from Abstract Prompts (Poster 50)
12. LOCATEdit: Graph Laplacian Optimized Cross Attention for Localized Text-Guided Image Editing (Poster 51)
13. Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data (Poster 52)
14. Roboflow100-VL: A Multi-Modal Object Detection Benchmark for Vision-Language Models (Poster 53)
15. Towards Agentic AI for Multimodal-Guided Video Object Segmentation (Poster 54)
16. LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning (Poster 55)
17. TULIP: Contrastive Image-Text Learning With Richer Vision Understanding (Poster 56)
18. What Holds Back Open-Vocabulary Segmentation? (Poster 57)
19. Audio-Visual LLM for Video Understanding (Poster 58)
20. Generate, Transduct, Adapt: Iterative Transduction with VLMs (Poster 59)
21. MORFI: Multimodal Zero-Shot Reasoning for Financial Time-Series Inference (Poster 60)
22. Infusing fine-grained visual knowledge to Vision-Language Models (Poster 61)
23. Vocabulary-free Fine-grained Visual Recognition via Enriched Contextually Grounded Vision-Language Model (Poster 62)
24. Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning (Poster 63)
25. VGGSounder: Audio-Visual Evaluations for Foundation Models (Poster 64)
26. Low-Rank Prompt Adaptation for Open-Vocabulary Object Detection (Poster 65)
27. HiERO: understanding the hierarchy of human behavior enhances reasoning on egocentric videos (Poster 66)
28. DEL: Dense Event Localization for Multi-modal Audio-Visual Understanding (Poster 67)
29. Evaluating Variance in Visual Question Answering Benchmarks (Poster 68)
30. Uncertainty-Aware ControlNet: Bridging Domain Gaps with Synthetic Image Generation (Poster 69)
31. Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models (Poster 70)
32. Hierarchical Entailment Representations for Linguistic Compositionality in Language-based Object Detection (Poster 188)
33. Enhancing Circuit Diagram Understanding via Near Sight Correction Using VLMs (Poster 189)
34. CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems (Poster 190)
35. X-Fusion: Introducing New Modality to Frozen Large Language Models (Poster 191)
36. Mitigating Language Confusion for Multimodal Foundation Models via Confusion-Aware Preference Optimization Pipeline (Poster 192)
37. Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark (Poster 193)
38. Meta-Learned Prompt Distillation for Multimodal Few-Shot Learning (Poster 194)