Workshop location: TBD
Poster session location: TBD
8:50 - 17:30 PST
The Transformer architecture has catalyzed a paradigm shift, unifying the fields of computer vision, natural language processing, and beyond. Originally transformative in NLP, its principles now underpin the most powerful foundation models, including state-of-the-art models across nearly all vision tasks: image classification, sophisticated image and video generation, and a new generation of Multimodal LLMs (MLLMs) that seamlessly integrate vision, language, and other sensory inputs. These models are redefining the state of the art in tasks ranging from visual question answering and embodied AI to generative content creation.

However, this success has brought new challenges to the forefront. The quadratic complexity of the attention mechanism remains a bottleneck for high-resolution and long-sequence data, leading to excessive computational costs. Furthermore, the field is actively debating the future of visual backbones: Will Transformers continue to scale effectively? Are emerging alternatives, such as State Space Models (SSMs, e.g., Mamba), more efficient successors? How do we optimally design architectures for unified, multimodal understanding?

This workshop aims to bring together a diverse set of researchers to share cutting-edge insights, debate the limitations of current models, and explore the next generation of architectures for visual recognition.
MIT / Google DeepMind
Meta FAIR
University of Oxford
UPenn
Carnegie Mellon University
Yan-Bo Lin (UNC)
Han Yi (UNC)
Yue Yang (UNC)
Fuxiao Liu (NVIDIA)
Jaehun Jung (NVIDIA)
Di Zhang (Fudan University)
Ce Zhang (UNC)
Seongsu Ha (UNC)
Ziyang Wang (UNC)
Fu-En (Fred) Yang (NVIDIA)
Guo Chen (Nanjing University)
Tianyi Xiong (University of Maryland, College Park)
Baiqi Li (UNC)
Yulu Pan (UNC)
Le An (NVIDIA)
Ryo Hachiuma (NVIDIA)
Shihao Wang (The Hong Kong Polytechnic University)