Speaker: Daniel Bolya
Title: Perception Encoder: State-of-the-Art Unified Image-Video CLIP Models with Surprisingly General Features
Abstract: We introduce Perception Encoder (PE), a family of state-of-the-art vision encoders for image and video understanding. Traditionally, vision encoders have relied on a variety of pretraining objectives, each excelling at different downstream tasks. Surprisingly, after scaling a carefully tuned image pretraining recipe and refining with a robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network. To draw them out, we introduce two alignment methods: language alignment for multimodal language modeling, and spatial alignment for dense prediction. Together, our PE family of models achieves state-of-the-art results on a wide variety of tasks, including zero-shot image and video classification and retrieval; document, image, and video Q&A; and spatial tasks such as detection, tracking, and depth estimation. To foster further research, we will release our models, code, and a novel dataset of synthetically and human-annotated videos.
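As an illustration of the "hidden in the intermediate layers" observation, below is a minimal, hypothetical PyTorch sketch of reading features from a mid-network block of a contrastively trained vision transformer via a forward hook. The toy model, layer index, and shapes are placeholders, not the released PE API.

```python
# Hypothetical sketch: grab intermediate-layer features from a CLIP-style ViT.
import torch
import torch.nn as nn


class TinyViT(nn.Module):
    """Stand-in for a contrastively trained vision transformer."""

    def __init__(self, dim=256, depth=12):
        super().__init__()
        self.patch_embed = nn.Linear(3 * 16 * 16, dim)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(depth)
        )

    def forward(self, patches):
        x = self.patch_embed(patches)
        for blk in self.blocks:
            x = blk(x)
        return x


model = TinyViT()
features = {}

# Hook an intermediate block rather than the final layer, mirroring the
# observation that the most general embeddings sit mid-network.
layer_idx = 8  # illustrative choice, not a value from the paper
model.blocks[layer_idx].register_forward_hook(
    lambda mod, inp, out: features.update(embedding=out)
)

patches = torch.randn(1, 196, 3 * 16 * 16)  # dummy 14x14 grid of 16x16 RGB patches
_ = model(patches)
print(features["embedding"].shape)  # torch.Size([1, 196, 256])
```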
Speaker: Andrea Vedaldi
Title: Scaling models of geometry
Abstract: While scaling has been extensively studied in language and vision transformer models, comparatively little attention has been given to neural networks for 2D and 3D geometry estimation tasks. In this talk, I will first present VGGT, a large transformer network capable of performing 3D reconstruction in a manner similar to COLMAP—but faster, more reliably, and, crucially, using only off-the-shelf components without any post-processing optimisation. I will highlight the critical role of balancing data quantity and quality in training this model. Next, I will introduce CoTracker3, the latest version of our tracking system, and examine its scaling behaviour. In particular, while earlier versions of CoTracker were trained exclusively on synthetic datasets, I will show that a simple self-training protocol enables CoTracker to learn from large quantities of unlabelled real video, significantly improving its final performance.
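The self-training protocol mentioned above can be illustrated with a minimal sketch: a frozen teacher tracker pseudo-labels unlabelled real video, and a student is fit to those pseudo-labels. The tracker class, shapes, and loss below are illustrative placeholders, not CoTracker3's actual interface.

```python
# Minimal self-training sketch: teacher pseudo-labels unlabelled video,
# student regresses onto the pseudo-labels. All components are toy stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DummyTracker(nn.Module):
    """Placeholder point tracker: maps (video, query points) to per-frame tracks."""

    def __init__(self):
        super().__init__()
        self.offset = nn.Linear(2, 2)

    def forward(self, video, queries):
        B, T = video.shape[:2]
        xy = queries[..., 1:]                                 # (B, N, 2) query positions
        drift = self.offset(xy).unsqueeze(1)                  # (B, 1, N, 2) learned offset
        return xy.unsqueeze(1) + drift.expand(B, T, -1, -1)   # (B, T, N, 2) tracks


teacher, student = DummyTracker(), DummyTracker()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

video = torch.randn(2, 8, 3, 64, 64)     # unlabelled real clips (B, T, C, H, W)
queries = torch.rand(2, 16, 3)           # (t, x, y) query points to track

with torch.no_grad():                    # frozen teacher produces pseudo-labels
    pseudo_tracks = teacher(video, queries)

loss = F.l1_loss(student(video, queries), pseudo_tracks)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"self-training loss: {loss.item():.4f}")
```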
Speaker: Xinlong Wang
Title: Unifying multimodal learning at scale
Abstract: While next-token prediction is considered a promising path towards artificial general intelligence, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models and compositional approaches such as CLIP combined with LLMs. In recent work, we introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, videos and even actions into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. Emu3 outperforms several well-established task-specific models on both generation and perception tasks, while eliminating the need for diffusion or compositional architectures. Emu3 is also capable of generating high-fidelity video by predicting the next token in a video sequence. We simplify complex multimodal model design by converging on a singular focus, tokens, which unlocks great potential for scaling both training and inference. Our results demonstrate that next-token prediction can effectively unify multimodal learning, marking a promising path towards building general multimodal intelligence beyond language.
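A minimal sketch of the unifying idea, assuming a shared vocabulary in which discrete visual codes are simply offset past the text tokens and a single decoder-only transformer is trained with a standard next-token cross-entropy loss. The vocabulary sizes and toy model are placeholders, not Emu3's actual tokenizer or architecture.

```python
# Toy sketch: text tokens and discrete visual codes share one vocabulary and one
# decoder-only transformer trained with next-token prediction.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB = 1000      # placeholder text vocabulary size
VISUAL_CODES = 512     # placeholder visual codebook size (e.g. from a VQ tokenizer)
VOCAB = TEXT_VOCAB + VISUAL_CODES


class TinyDecoder(nn.Module):
    """Single causal transformer over the shared text+visual vocabulary."""

    def __init__(self, vocab, dim=256, depth=4, heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.shape[1])
        return self.head(self.blocks(self.embed(tokens), mask=causal))


# One interleaved sequence: text tokens followed by discrete image codes,
# offset so that both modalities live in the same vocabulary.
text = torch.randint(0, TEXT_VOCAB, (2, 12))
image = torch.randint(0, VISUAL_CODES, (2, 20)) + TEXT_VOCAB
seq = torch.cat([text, image], dim=1)

model = TinyDecoder(VOCAB)
logits = model(seq[:, :-1])                                # predict every next token
loss = F.cross_entropy(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
loss.backward()
print(f"next-token loss over mixed text/image tokens: {loss.item():.3f}")
```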
Speaker: Chen Change Loy
Title: From Segment Anything Efficiently to Matting Anyone Precisely
Abstract: I will present a unified view of recent advances in efficient segmentation and matting, beginning with EdgeSAM, a compact, prompt-aware variant of Segment Anything that achieves real-time segmentation on mobile devices via prompt-in-the-loop distillation. Building on this, EdgeTAM extends promptable segmentation to video with a spatial perceiver for memory compression and a two-stage distillation pipeline, enabling high-quality tracking at 16 FPS on smartphones. Finally, MatAnyone introduces a memory-based framework for video matting, achieving fine boundary details and temporal consistency through region-adaptive memory fusion and segmentation-guided supervision.
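A generic sketch of prompt-conditioned distillation in the spirit of EdgeSAM (not its exact recipe): teacher and student receive the same image and point prompts, and the student is trained to match the teacher's mask logits. The toy segmenter and prompt encoding below are placeholders.

```python
# Generic prompt-conditioned distillation sketch; not EdgeSAM's actual pipeline.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyPromptSegmenter(nn.Module):
    """Placeholder promptable segmenter: image + point prompts -> mask logits."""

    def __init__(self, channels):
        super().__init__()
        self.backbone = nn.Conv2d(3, channels, 3, padding=1)
        self.head = nn.Conv2d(channels + 1, 1, 1)

    def forward(self, image, points):
        B, _, H, W = image.shape
        feat = self.backbone(image)
        # Encode point prompts as an extra heatmap channel (illustrative only).
        heat = torch.zeros(B, 1, H, W)
        for b in range(B):
            for x, y in points[b]:
                heat[b, 0, int(y), int(x)] = 1.0
        return self.head(torch.cat([feat, heat], dim=1))   # (B, 1, H, W) logits


teacher, student = ToyPromptSegmenter(64), ToyPromptSegmenter(8)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

image = torch.randn(2, 3, 32, 32)
points = torch.randint(0, 32, (2, 3, 2))                   # 3 point prompts per image

with torch.no_grad():
    teacher_logits = teacher(image, points)

student_logits = student(image, points)
# Student mimics the teacher's mask probabilities for the same prompts; an
# "in the loop" variant would resample new prompts where the two disagree most.
loss = F.binary_cross_entropy_with_logits(student_logits, teacher_logits.sigmoid())
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"distillation loss: {loss.item():.4f}")
```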
Speaker: Tri Dao
Title: Designing Hardware-efficient Architectures for Sequence Modeling
Abstract: Thanks to scaling laws, training and inference efficiency now drive progress in AI, demanding a greater emphasis on hardware-aware architectures. Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. We describe recent progress on architectures such as structured state space models (SSMs) and attention variants to speed up training and inference. We identify three main ingredients for strong sub-quadratic architectures: large state size, expressive state updates, and efficient hardware-aware algorithms. We then discuss how to design attention variants to optimally use the memory and compute subsystems of modern accelerators. Finally, we describe how to combine these two classes of models to improve the efficiency of language and vision models.
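To make the complexity argument concrete, here is a toy comparison that assumes nothing about the specific architectures in the talk: self-attention materializes a T x T score matrix (quadratic in sequence length), while a diagonal state-space-style recurrence summarizes the sequence in a fixed-size state and runs in linear time.

```python
# Toy contrast: quadratic self-attention vs. a linear-time diagonal recurrence.
# The "SSM" here is a naive Python-loop sketch, not the models in the talk.
import torch

B, T, D, N = 1, 1024, 64, 16   # batch, sequence length, model dim, state size


def attention(q, k, v):
    # Materializes a (T, T) score matrix: O(T^2) time and memory.
    scores = torch.softmax(q @ k.transpose(-1, -2) / D**0.5, dim=-1)
    return scores @ v


def diagonal_ssm(x, a, b, c):
    # Recurrence h_t = a * h_{t-1} + b * x_t, output y_t = sum_n c * h_t:
    # O(T) time, with the whole past compressed into an N-dimensional state.
    h = torch.zeros(B, D, N)
    ys = []
    for t in range(T):
        h = a * h + b * x[:, t].unsqueeze(-1)
        ys.append((h * c).sum(-1))
    return torch.stack(ys, dim=1)


x = torch.randn(B, T, D)
q = k = v = x
print(attention(q, k, v).shape)          # torch.Size([1, 1024, 64])

a = torch.rand(D, N) * 0.99              # per-dimension decay of the state
b, c = torch.randn(D, N), torch.randn(D, N)
print(diagonal_ssm(x, a, b, c).shape)    # torch.Size([1, 1024, 64])
```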
Speaker: Ishan Misra
Title: Foundational models for video generation, editing and personalization
Abstract: Generative models for video predict in a high-dimensional spatiotemporal space. These models face enormous computational challenges in both training and inference. In this talk, I'll present our recent work, MovieGen, in which we show an efficient way to train such a large foundational model for video generation, editing and personalization. I'll also discuss ways in which inference for such models can be sped up significantly, making it both more efficient and higher quality.