Workshop location: TBD
Poster session location: TBD
8:50 - 17:30 PST
The Transformer architecture has catalyzed a paradigm shift, unifying the fields of computer vision, natural language processing, and beyond. Originally transformative in NLP, its principles now underpin the most powerful foundation models, including state-of-the-art models across nearly all vision tasks: image classification, sophisticated image and video generation, and a new generation of Multimodal LLMs (MLLMs) that seamlessly integrate vision, language, and other sensory inputs. These models are redefining the state of the art in tasks ranging from visual question answering and embodied AI to generative content creation.

However, this success has brought new challenges to the forefront. The quadratic complexity of the attention mechanism remains a bottleneck for high-resolution and long-sequence data, leading to excessive computational costs. Furthermore, the field is actively debating the future of visual backbones: Will Transformers continue to scale effectively? Are emerging alternatives, such as State Space Models (SSMs, e.g., Mamba), more efficient successors? How do we optimally design architectures for unified, multimodal understanding?

This workshop aims to bring together a diverse set of researchers to share cutting-edge insights, debate the limitations of current models, and explore the next generation of architectures for visual recognition.
MIT / Google DeepMind
Meta FAIR
University of Oxford
UPenn
Carnegie Mellon University
Yan-Bo Lin (UNC)
Han Yi (UNC)
Yue Yang (UNC)
Fuxiao Liu (NVIDIA)
Jaehun Jung (NVIDIA)
Di Zhang (Fudan University)
Ce Zhang (UNC)
Seongsu Ha (UNC)
Ziyang Wang (UNC)
Fu-En (Fred) Yang (NVIDIA)
Guo Chen (Nanjing University)
Tianyi Xiong (University of Maryland, College Park)
Baiqi Li (UNC)
Yulu Pan (UNC)
Le An (NVIDIA)
Ryo Hachiuma (NVIDIA)
Shihao Wang (The Hong Kong Polytechnic University)