June 19, 2022 (CDT)
9:30 - 10:00
Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer
Is Large-scale Pre-training always Necessary for Vision Transformers?
Self-Supervised Pre-training of Vision Transformers for Dense Prediction Tasks
MC-SSL: Towards Multi-Concept Self-Supervised Learning
Scaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
10:00 - 10:30
Coffee Break + Poster Session
10:30 - 11:00
Zhuowen Tu (UCSD)
Transformers for Structural Extraction
11:00 - 11:30
Jean-Baptiste Alayrac (DeepMind)
12:00 - 12:30
Enabling Faster Vision Transformers via Soft Token Pruning
GroupViT: Semantic Segmentation Emerges from Text Supervision
Learned Queries for Efficient Local Attention
Adversarial Token Attacks on Vision Transformers
GradViT: Gradient Inversion of Vision Transformers
Visual Attention Emerges from Recurrent Sparse Reconstruction
12:30 - 13:30
Lunch Break
14:30 - 15:00
Saining Xie (Meta)
Everything is All You Need: Vision Architectures for the 2020s
15:00 - 15:30
NEAT: Neural Attention Fields for End-to-End Autonomous Driving
BoxeR: Box-Attention for 2D and 3D Transformers
Depth Estimation with Simplified Transformer
M2F3D: Mask2Former for 3D Instance Segmentation
Helix4D: Online Semantic Segmentation of LiDAR Sequences
17:00 - 18:00
Poster Session