Universal Representations for Computer Vision

November 24 | BMVC 2022 Workshop | London, UK

Overview

In recent years, deep learning has achieved impressive results for many computer vision tasks. However, the best performance in each task is achieved by designing and learning an independent network per task for each domain/modality, e.g. image classification, depth estimation, audio classification, optical flow estimation. By contrast, humans in the early years of their development develop powerful internal visual representations that are subject to small refinements in response to later visual experience. Once these visual representations are formed, they are universal and later employed in many diverse vision tasks from reading text and recognizing faces to interpreting visual art forms and anticipating the movement of the car in front of us.

The presence of universal representations in computer vision has important implications. First, it means that vision has limited complexity. A growing number of visual domains and tasks can be modelled with a bounded number of representations. As a result, one can use a compact set of representations for learning multiple modalities, domains and tasks, and efficiently share features and computations across them, which is crucial in platforms with limited computational resources such as mobile devices and autonomous cars. Second, as we obtain more complete universal representations, learning new domains and tasks is made easier and performed more efficiently from only a few samples by transfer learning. Third, universal representations enable computer vision models with increased capabilities for scene understanding including semantics, geometry, motion, audio.

Learning universal representations requires addressing several challenges. These include improved architecture design for modelling diverse visual data and interface for allowing effective interaction between them, as well as tackling interference/dominance during optimization. Although existing universal representation learning strategies for architecture design and training algorithms have been explored, key difficulties (such as task interference) associated with learning compact and well-generalised universal representations over modalities, domains and tasks remain. For instance, currently we do not have established models like ResNet or Vision Transformers that can solve multiple problems from various modalities and domains. Our aim is to increase the awareness of the computer vision community in these new and effective solutions are likely to be needed for a breakthrough in learning universal representations.

News

Dec.17: Slides or video of talks are now released! See here for more details.

Sept. 19: The paper submission site is now opened.

Sept. 14: The paper submission site will be opened soon.

Prizes

We will be awarding the best paper award 1 Honor 70 sponsored by Huawei.

Submission

Paper submission: OpenReview (the review process is double-blind).

We invite the submission in two formats, 9 pages and 4 pages papers (excluding references). Papers should be in the BMVC 2022 camera ready format as per the instructions given here. For relevant topics and more details, see Call for Papers.