Accepted Papers
Spotlight Presentations
Video-kMaX: A Simple Unified Approach for Online and Near-Online Video Panoptic Segmentation. Inkyu Shin (KAIST); Dahun Kim (Google); Qihang Yu (Johns Hopkins University); Jun Xie (Google); Hong-Seok Kim (Google); Bradley Green (Google); In So Kweon (KAIST); Kuk-Jin Yoon (KAIST); Liang-Chieh Chen (ByteDance) [paper] [poster] [video] [supplementary]
Dual PatchNorm. Manoj Kumar (Google Brain); Mostafa Dehghani (Google Brain); Neil Houlsby (Google) [paper] [poster] [video] [supplementary]
Point2Vec for Self-Supervised Representation Learning on Point Clouds. Karim Abou Zeid (RWTH Aachen University); Jonas Schult (RWTH Aachen University); Alexander Hermans (RWTH Aachen University); Bastian Leibe (RWTH Aachen University) [paper] [poster] [video] [supplementary]
FlatFormer: Flattened Window Attention for Efficient Point Cloud Transformer. Zhijian Liu (MIT); Xinyu Yang (Shanghai Jiaotong University); Haotian Tang (MIT); Shang Yang (Tsinghua University); Song Han (MIT) [paper] [poster] [video] [supplementary]
SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer. Xuanyao Chen (Fudan University); Zhijian Liu (MIT); Haotian Tang (MIT); Li Yi (Tsinghua University); Hang Zhao (MIT); Song Han (MIT) [paper] [poster] [video] [supplementary]
RePAST: Relative Pose Attention Scene Representation Transformer. Aleksandr Safin (Skolkovo Institute of Science and Technology); Daniel Duckworth (Google Brain); Mehdi S. M. Sajjadi (Google Brain) [paper] [poster] [video] [supplementary]
OCTraN: 3D Occupancy Convolutional Transformer Network in Unstructured Traffic Scenarios. Aditya N Ganesh (PES University); Dhruval Pobbathi Badrinath (PES University); Harshith Mohan Kumar (PES University); Priya S S (PES University); Surabhi Narayan (PES University) [paper] [poster] [video] [supplementary]
Clicks as Queries: Interactive Transformer for Multi-instance Segmentation. Amit Kumar Rana (RWTH Aachen University); Sabarinath Mahadevan (RWTH Aachen University); Alexander Hermans (RWTH Aachen University); Bastian Leibe (RWTH Aachen University) [paper] [poster] [video] [supplementary]
Joint Adaptive Representations for Image-Language Learning. AJ Piergiovanni (Google DeepMind); Anelia Angelova (Google DeepMind) [paper] [poster] [video] [supplementary]
PaReprop: Fast Parallelized Reversible Backpropagation. Tyler Zhu (UC Berkeley); Karttikeya Mangalam (UC Berkeley) [paper] [poster] [video] [supplementary]
Linear Attention, Graph and Compact Interhand Reconstruction. Jintao Sun (Beijing Institute of Technology); Zheng Guan (Beijing Institute of Technology); Gangyi Ding (Beijing Institute of Technology) [paper] [poster] [video] [supplementary]
MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge. Wei Lin (Graz University of Technology); Leonid Karlinsky (IBM Research); Nina Shvetsova (Goethe University Frankfurt); Horst Possegger (Graz University of Technology); Mateusz Kozinski (ICG, TU Graz); Rameswar Panda (MIT-IBM Watson AI Lab); Rogerio Feris (MIT-IBM Watson AI Lab); Hilde Kuehne (Goethe University Frankfurt); Horst Bischof (Graz University of Technology) [paper] [poster] [video] [supplementary]
Poster Presentations
Lawin Transformer: Improving New-Era Vision Backbones with Multi-Scale Representations for Semantic Segmentation. Haotian Yan (Beijing University of Posts and Telecommunications); Chuang Zhang (Beijing University of Posts and Telecommunications); Ming Wu (Beijing University of Posts and Telecommunications) [paper] [poster] [supplementary]
Learning to Count Anything: Reference-less Class-agnostic Counting with Weak Supervision. Michael Hobley (University of Oxford); Victor Prisacariu (University of Oxford) [paper] [poster] [supplementary]
Efficient Vision Transformer for Human Pose Estimation via Patch Selection. Kaleab Kinfu (Johns Hopkins University); Rene Vidal (Johns Hopkins University) [paper] [poster] [supplementary]
Efficient Joint Detection and Multiple Object Tracking with Spatially Aware Transformer. Siddharth Sagar Nijhawan (Sony); Leo Hoshikawa (Sony); Atsushi Irie (Sony); Masakazu Yoshimura (Sony); Junji Otsuka (Sony); Takeshi Ohashi (Sony) [paper] [poster] [supplementary] [code]
Diversifying Joint Vision-Language Tokenization Learning. Vardaan Pahuja (The Ohio State University); AJ Piergiovanni (Google); Anelia Angelova (Google) [paper] [poster] [supplementary]
3D Clothed Human Reconstruction with Integration of Transformer and CNN. Liangjing Shao (Fudan University); Xiaokun Dai (Fudan University); Xinhan Di (Deepearthgo); Benshuang Chen (Fudan University); Xinrong Chen (Fudan University) [paper] [poster] [supplementary]
SimDETR: Simplifying self-supervised pretraining for DETR. Ioannis Maniadis Metaxas (Queen Mary University of London); Adrian Bulat (Samsung AI Center, Cambridge); Ioannis Patras (Queen Mary University of London); Brais Martinez (Samsung AI Center); Georgios Tzimiropoulos (Queen Mary University of London) [paper] [poster] [supplementary]
Waterfall Transformer for Multi-person Pose Estimation. Navin Ranjan (Rochester Institute of Technology); Bruno Artacho (Rochester Institute of Technology); Andreas Savakis (Rochester Institute of Technology) [paper] [poster] [supplementary]
Vision-Code Transformer for Screenshot-to-HTML/CSS Generation. Davit Soselia (University of Maryland, College Park); Khalid Saifullah (University of Maryland, College Park); Tianyi Zhou (University of Maryland, College Park) [paper] [poster] [supplementary]