8:20 - 8:30
Rohit Girdhar
Opening Remarks
9:00 - 9:30
Hila Chefer (Tel Aviv University)
Attention in Action: Exploring and Exploiting the Attention Mechanism in Transformers
Abstract: The capabilities of Transformer-based models far exceed our expectations, and in many cases, our understanding of these models lags behind their success. In this talk, we will explore the hidden world of the internal representations learned by Transformer models, with a specific emphasis on the mechanism at the heart of their architecture: attention. We will begin by showcasing a novel approach that transforms attention maps into reliable explanations, shedding light on the reasoning behind the predictions of various Transformer models. Shifting gears, we will demonstrate how these interpretations serve not only as valuable visualization aids but also as powerful tools for controlling and correcting model behavior. We will show how applying intuitive desiderata to Vision Transformer (ViT) interpretation maps can mitigate critical issues such as reliance on biases and spurious correlations. Next, we will illustrate how the same approach enhances generative models, significantly improving text fidelity, all without the need for additional data and with very limited computational resources.
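For readers new to attention-based explanations, here is a minimal sketch of one way per-layer attention maps can be aggregated into a token-level relevance map (an attention-rollout-style illustration, not the exact method presented in the talk; the function name and tensor shapes are assumptions):

import torch

def attention_rollout(attentions):
    """Aggregate per-layer attention maps into a relevance map for the [CLS] token.

    attentions: list of tensors of shape (num_heads, seq_len, seq_len), one per
    Transformer layer (an assumed input format).
    """
    seq_len = attentions[0].shape[-1]
    rollout = torch.eye(seq_len)
    for attn in attentions:
        attn_avg = attn.mean(dim=0)                               # average over heads
        attn_aug = attn_avg + torch.eye(seq_len)                  # account for residual connections
        attn_aug = attn_aug / attn_aug.sum(dim=-1, keepdim=True)  # re-normalize rows
        rollout = attn_aug @ rollout                              # compose layers
    return rollout[0, 1:]                                         # relevance of each patch token for [CLS] (assumed at index 0)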
9:30 - 10:00
Spotlights #1
Semantic Vision Transformers
GTA: Guided Transfer of Spatial Attention from Object-Centric Representations
EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens
State Space Models for Event Cameras
Parameter-Efficient Active Learning for Foundational Models
10:00 - 10:30
Coffee Break
10:30 - 11:00
Antoine Miech (Google DeepMind)
Past, Present and Future of Vision Language Models with Transformers
Abstract: Transformers are at the core of the advances in Vision Language Models. Initially designed as specialized models, they have evolved into powerful, foundational architectures that excel at a diverse range of multimodal tasks. This talk will trace the remarkable progression of transformer models for vision language models at Google DeepMind, highlighting key innovations such as Flamingo, Gemini 1.0, and the latest iteration, Gemini 1.5. The presentation will conclude by examining the current limitations of foundational models and discussing their potential future directions in the short term.
11:00 - 11:30
Chelsea Finn (Stanford University)
Transformers for Humanoids and Robot Generalists
11:30 - 12:00
Yutong Bai (Johns Hopkins University)
Sequential Modeling Enables Scalable Learning for Large Vision Models
Abstract: We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data. To do this, we define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources such as semantic segmentations and depth reconstructions without needing any meta-knowledge beyond the pixels. Once this wide variety of visual data (comprising 420 billion tokens) is represented as sequences, the model can be trained to minimize a cross-entropy loss for next token prediction. By training across various scales of model architecture and data diversity, we provide empirical evidence that our models scale effectively. Many different vision tasks can be solved by designing suitable visual prompts at test time.
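As a rough illustration of the training recipe described above (not the authors' code), the sketch below assumes a pretrained visual tokenizer vq_tokenizer that maps each image to a 1-D sequence of discrete token ids and an autoregressive Transformer model over that vocabulary; both names are hypothetical:

import torch
import torch.nn.functional as F

def lvm_training_step(model, vq_tokenizer, visual_sentence, optimizer):
    """One next-token-prediction step on a 'visual sentence'.

    visual_sentence: a list of images (raw frames, segmentation maps, depth maps, ...)
    that are tokenized and concatenated into a single discrete sequence.
    """
    tokens = torch.cat([vq_tokenizer(img) for img in visual_sentence], dim=-1)
    inputs, targets = tokens[:-1], tokens[1:]           # shift by one for next-token prediction
    logits = model(inputs.unsqueeze(0))                 # (1, seq_len, vocab_size)
    loss = F.cross_entropy(logits.squeeze(0), targets)  # cross-entropy over the visual vocabulary
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()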
12:00 - 12:30
Spotlights #2
Learning Visual Prompts for Guiding the Attention of Vision Transformers
ReduceFormer: Attention with Tensor Reduction by Summation
PC-LoRA: Low-Rank Adaptation for Progressive Model Compression with Knowledge Distillation
Mask4Former: Mask Transformer for 4D Panoptic Segmentation
Leveraging Camera Calibration Transformers Model using Line Mixed Queries
12:30 - 14:00
Lunch Break
14:00 - 14:30
Tim Brooks (OpenAI)
Sora
14:30 - 15:00
Xiaolong Wang (UCSD)
Learning to (Learn at Test Time): Expressive State Representations for LLMs
Abstract: Self-attention performs well in long context but has quadratic complexity. RNN layers have linear complexity, but their performance in long context is limited by the expressive power of their hidden state. We propose a new class of sequence modeling layers with linear complexity and an expressive hidden state. The key idea is to make the hidden state a machine learning model itself, and the update rule a step of self-supervised learning. Since the hidden state is updated by training itself even on test sequences, our new class of layers is called TTT layers, where TTT stands for test-time training. We consider two simple instantiations: TTT-Linear and TTT-MLP, whose hidden state is a linear model and a two-layer MLP, respectively. We evaluate our instantiations at the scale of 125M to 1.3B parameters, comparing with a state-of-the-art Transformer and one of the most advanced RNNs, Mamba. TTT-Linear and TTT-MLP scale better than the baselines. They can also effectively take advantage of 32k context like Transformers, while Mamba cannot. With preliminary systems optimization, TTT-Linear is already faster than Transformers and Mamba in wall-clock time at 2k context length. TTT-MLP still faces challenges in memory I/O, but shows larger potential in long context, pointing to a promising direction for future research.
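A toy sketch of the TTT-Linear idea, under simplifying assumptions (a single unbatched sequence and a plain reconstruction objective rather than the paper's corruption-based inner loss or dual form): the hidden state is a linear model whose weights are updated by one gradient step of a self-supervised loss at every token.

import torch

def ttt_linear_forward(x, lr=0.1):
    """Toy TTT-Linear layer. x: tensor of shape (seq_len, dim)."""
    seq_len, dim = x.shape
    W = torch.zeros(dim, dim)                # hidden state = weights of a linear model
    outputs = []
    for t in range(seq_len):
        xt = x[t]
        grad = torch.outer(W @ xt - xt, xt)  # gradient of 0.5 * ||W @ xt - xt||^2 w.r.t. W
        W = W - lr * grad                    # one self-supervised "test-time training" step
        outputs.append(W @ xt)               # the output token uses the updated hidden state
    return torch.stack(outputs)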
15:00 - 15:30
Lili Yu (Meta AI, FAIR)
Advancing Multimodal Modeling: Exploring Innovative Transformer Architectures
Abstract: In this presentation, I will explore three cutting-edge methods for multimodal modeling: Chameleon, MEGABYTE, and Transfusion. Chameleon, an early-fusion token-based model, excels in integrating and generating images and text, showcasing state-of-the-art performance in tasks such as image captioning and visual question answering. MEGABYTE, a multi-scale decoder architecture, overcomes the limitations of traditional autoregressive transformers by efficiently modeling long sequences, such as high-resolution images and audio, at the byte level. Lastly, Transfusion combines the strengths of autoregressive transformer language models for text with diffusion models for image generation, creating a highly efficient system for generating text and media. These models represent significant advancements in the field of multimodal modeling.
15:30 - 16:00
Piotr Bojanowski (Meta AI, FAIR)
Addressing the Pitfalls of Self-Supervised Learning with Transformers
Abstract: Large-scale training of self-supervised transformers has allowed us to obtain robust data representations on nearly any domain. Features computed with DINOv2 show solid performance on many benchmarks, matching the quality of CLIP-like models on categorization tasks and setting a new bar for dense prediction ones (segmentation, depth estimation). At the same time, these models show outstanding out-of-domain robustness, allowing one to run inference on drastically different inputs. In this talk, I will present two recent advances in this domain. First, I will discuss the surprising effect of using additional learnable input tokens (registers) to alleviate artifacts in feature maps. We observed that such tokens yield clean feature maps not only for models trained with DINO, but also for CLIP and even supervised objectives. Second, I will present our recent work challenging the importance of data augmentation for training self-supervised transformers. Our study shows that previous conclusions don't necessarily hold as long as one trains models on sufficiently large datasets.
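A minimal sketch of the register idea described above: a few extra learnable tokens are appended to the patch tokens at the input and simply discarded at the output. The class and argument names are assumptions, not the DINOv2 implementation.

import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    def __init__(self, encoder, dim, num_registers=4):
        super().__init__()
        self.encoder = encoder                                    # any Transformer encoder over token sequences
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))
        self.num_registers = num_registers

    def forward(self, patch_tokens):                              # (batch, num_patches, dim)
        reg = self.registers.expand(patch_tokens.shape[0], -1, -1)
        tokens = torch.cat([patch_tokens, reg], dim=1)            # append register tokens to the sequence
        out = self.encoder(tokens)
        return out[:, :-self.num_registers]                       # drop register outputs: keep the patch feature map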
16:00 - 17:00
Panel Discussion
Moderators: Lucas Beyer (Google Brain), Alaaeldin El-Nouby (Apple).
17:00 - 18:00
Posters
Please put up the posters in the Arch Building Exhibit Hall, boards #150 - #249.