CVPR 2022 Tutorial

Beyond Convolutional Neural Networks

Neil Houlsby, Alexander Kolesnikov, Alexey Dosovitskiy, Xiaohua Zhai


Recording here.

Slides for each talk linked below in Agenda.


Convolutional Neural Networks (CNNs) have been the go-to architecture for Computer Vision tasks for the last decade. However, in the past 18-24 months, Computer Vision has witnessed massive growth in the number of new architectural designs. This tutorial will focus on two new (related) classes of architectures: Transformer-based designs, such as DETR and Vision Transformer, and MLP-based designs, such as MLP-Mixer and ResMLP.

The tutorial will provide background on the emergence of these models, as well as a review of various improvements and extensions. We will review these architectures in the context of classification, other tasks such as detection, unsupervised learning, and multi-modal learning. We expect that the tutorial will be of interest to most CVPR participants, and that it will bring beginners and experts alike up to speed with the modern wave of non-convolutional approaches to architectural design.


Location: CVPR 2022, New Orleans. Hybrid in-person/remote.

Date: 20th June, 2022 (morning)


Agenda:

0825 - 0830 Introduction

0830 - 0910 History of non-convolutional layers [Alex Kolesnikov] [slides]

0910 - 1015 The emergence of new architecture designs [Neil Houlsby] [slides]

1015 - 1030 Break

1030 - 1115 Beyond image classification [Alexey Dosovitskiy] [slides]

1115 - 1215 Multi-modal and self-supervised learning [Xiaohua Zhai] [slides]

1215 - 1230 Q&A

Topics: The tutorial will consist of four talks covering the development and use of these new architectures, focusing primarily on recent Transformer and MLP-based designs:

  1. History of non-convolutional layers. This section will cover key advances in the non-local design of neural vision architectures that preceded the widespread adoption of the Vision Transformer model. We will provide context for the rest of the tutorial and cover the relevant prior research, such as non-local blocks, squeeze-and-excitation, stand-alone self-attention, and other related results.

  2. The emergence of new architecture designs. This section will cover the key recent innovations in architectures that depart significantly from traditional CNNs, including ViT and its variants, Transformer-CNN hybrids, and MLP-based designs. It will focus primarily on the prototypical task of supervised image classification. We will discuss the role of transfer learning, scale, and efficiency in the development of these architectures.

  3. Beyond image classification. This section will provide an overview of further developments of Vision Transformers, and other non-convolutional approaches, for applications beyond image classification, both in terms of the input modality and the task. Examples include: object detection and instance segmentation, semantic segmentation, matching and retrieval, point cloud modeling, depth estimation, image processing and generation, and video recognition.

  4. Multi-modal and self-supervised learning. This section will discuss how Transformer architectures bridge the gap between the vision and natural language processing domains. ViT architectures enable multi-modal learning with Transformer backbones, e.g. CLIP, LiT, and VATT. They also unlock self-supervised visual representation learning that follows masked language modeling ideas from NLP, e.g. BEiT and MAE.

