This is the final program for our workshop. Event times shown in the schedule are local times in Nashville, TN, USA.
Date: 12 June 2025
Location: Room 101 E, Music City Center, Nashville, TN
Poster Session: ExHall D, Boards #14-24
Adam Harley
Meta
Point Tracking Tomorrow: Dense, Fast, and Data-Driven
The recent progress in point tracking has been staggering, but current methods still struggle to deliver what we want: dense, high-resolution, drift-free, occlusion-resistant tracking. In this talk, I will revisit the foundations of point tracking, and highlight its complex ties to optical flow and 3D scene understanding. I will introduce our new "AllTracker" method, which finally delivers dense (all-pixel) tracking at high resolution. I will also discuss our progress on 3D and semantics-informed point trackers, and outline what to expect in the near future of this area.
Cordelia Schmid
Google & Inria
Dense grounded video object captioning
Fatma Guney
Koc University
Efficient Online Long-Term Point Tracking
I will talk about long-term point tracking, specifically how to make it efficient and online. First, I will talk about our work on evaluating the geometric awareness of visual foundation models for long-term point tracking:
in zero-shot settings, without any training;
by probing with low-capacity layers;
by fine-tuning with Low-Rank Adaptation (LoRA).
Then, I will talk about our recent work, Track-On, which performs online point tracking frame by frame, making it suitable for real-world scenarios. We introduce a simple transformer-based model, augmented with memory modules to capture temporal information and maintain reliable point tracking over long time horizons.
Hengshuang Zhao
University of Hong Kong
Vision Foundation Models with Spatial Intelligence
With the growing capabilities of deep learning models and the efficient acquisition and use of massive amounts of data, the construction of large-scale vision foundation models has garnered widespread attention. These vision foundation models exhibit strong generalization in handling multiple tasks within complex visual scenes across different domains. However, they usually focus on images and videos and lack the ability to understand higher-dimensional visual scenes with essential spatial properties. To address this limitation, we explore vision foundation models with spatial intelligence in higher dimensions such as 2.5D and 3D. In this talk, I will present a series of our recent works on empowering vision foundation models with spatial intelligence and their downstream applications, such as autonomous driving and other emerging scenarios, and discuss several remaining challenges and future frontiers for vision foundation models.
Ishan Misra
Meta
Foundation models for video generation, editing and personalization
Movie Gen is a cast of media-generation foundation models that lets users generate high-quality videos from simple text inputs, personalize or edit them, and add audio. In human evaluations, Movie Gen establishes a new state of the art on all of these tasks compared to existing solutions. In this talk, I'll focus on the core challenges of training such foundation models and on the key simplifications and scaling trends that enable state-of-the-art performance.
Xueyan Zou
University of California San Diego
Pixel, Foundation Models, and Embodied Intelligence
Pixel understanding has long been a core component of computer vision and is essential for recognizing and interpreting the world. Recently, with the rapid progress of foundation models and large language models, pixel-level understanding has become a subtask within these powerful models, which can effectively model both human intelligence and the physical world. As a result, intelligence-powered robotics is experiencing rapid growth. In this talk, we will introduce key research developments along these lines and discuss how to stay up to date with the latest advancements in the field.
Contributed Talk 1
SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation.
Claudia Cuttano, Gabriele Trivigno, Gabriele Rosi, Carlo Masone, Giuseppe Averta
Contributed Talk 2
DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos.
Wenbo Hu, Xiangjun Gao, Xiaoyu Li, Sijie Zhao, Xiaodong Cun, Yong Zhang, Long Quan, Ying Shan
Contributed Talk 3
Hierarchical Semantic Segmentation with Autoregressive Language Modeling.
Josh Myers-Dean, Brian Price, Yifei Fan, Danna Gurari
Contributed Talk 4
Show or Tell? A Benchmark To Evaluate Visual and Textual Prompts in Semantic Segmentation.
Gabriele Rosi, Fabio Cermelli
Contributed Talk 5
STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding.
Aaryan Garg, Akash Kumar, Yogesh S Rawat