January 3rd, 2023

Pretraining Large Vision and Multimodal Models Workshop @ WACV2023

at Waikoloa, Hawaii

Workshop Description

Much machine learning research over the last five years has focused on developing and utilizing large models. As Richard Sutton famously declared in his 2019 essay The Bitter Lesson, "the biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin." The field of natural language processing pioneered the attention-based transformer, with impressive results from famously large models approaching multiple trillions of parameters. Computer vision, meanwhile, has leveraged transfer learning for many years, delivering state-of-the-art performance in countless applications. Given their ability to process sequential information in parallel and their self-attention mechanism, transformers increasingly shape vision-based research hypotheses. The abundance of interesting methods, datasets, use cases, and models that bridge both modalities continues to grow. However, many important questions about the optimal usage, design, and application of these models within computer vision remain unanswered.

Applications that leverage pre-trained vision and multimodal models encompass a variety of scenarios. A few specific examples are:

  • Generating new images based on textual inputs, such as OpenAI’s DALL-E and Google’s Imagen. These applications tend to support user creativity, such as designing new content for media and advertisements.

  • Multimodal search for online applications, such as Amazon's M5 model, which leverages multiple modalities to provide high-performance search.

  • Fashion generation. Given the plethora of data available in e-commerce scenarios for clothing distributors, applications are emerging that leverage large-scale multimodal foundation models to develop new fashion designs.

  • Translation enhancement. Researchers today combine visual and textual data for the explicit purpose of enhanced translation, and multimodal foundation models are a critical asset to this.

There is a strong possibility that the future of artificial intelligence research and applications will continue to hinge on the community's ability to use compute efficiently and at large scale. Distributed computing, such as model- and data-parallel training regimes, is increasingly table stakes for groups who wish to pretrain their own models on unique datasets. In addition, pretraining appears to offer advantages in leveraging unsupervised datasets, which tend to be far larger than what experts can label with limited time and resources. However, open questions remain about the merits of pretraining itself, particularly in relation to fine-tuning.

Call for Papers

In this workshop we welcome critical and diverse perspectives on the larger landscape of pretraining for vision and language models, including both computer vision and multimodal applications. Topics include, but are not limited to:

  • Evaluation of pre-training methods for vision and/or multimodal applications

  • Parallelization regimes that solve unique problems for vision and/or multimodal scenarios

  • Critical evaluation of pre-train and fine-tune regimes for common vision tasks like image classification, object detection, semantic segmentation, etc.

  • Exploring the impact of pre-train / fine-tune regimes for autonomous vehicles, biomedical applications, e-commerce applications, media, large-scale online search, etc.

  • Detecting and mitigating bias in vision and multimodal models

  • Exploring, evaluating, and extending the pre-training regime for modalities beyond vision and text, such as vision and audio, vision and seismic data, vision and gaming data, and robotics

  • Efficiency enhancements for pre-training vision / multimodal models, such as compilation

  • Enhancing the application of pre-trained models beyond their original domain and across modalities, such as methods to align vision and text models trained separately

  • Datasets, statistics, theory of pre-training and fine-tuning regimes and methods for computer vision and combined modalities

  • Research and development on topics above and related areas

Submission Deadlines

  • Paper submission deadline: 25th October, 2022

  • Notification deadline: 9th November, 2022

  • Camera-ready papers due (firm): 19th November, 2022

Industry poster track

If you are working on an application of large-scale modeling for vision and/or multimodal scenarios and would simply like to submit an abstract, we welcome poster submissions!

Diversity statement

This workshop strongly values diverse points of view, organizations, backgrounds, perspectives, and walks of life. Toward that end, we have strong representation from five industry organizations and five academic organizations, including multiple participants from diverse backgrounds.

Keynote Speakers

Abhinav Gupta

Associate Professor at the Robotics Institute at CMU

Phillip Isola

Associate Professor at MIT

Kate Saenko

Associate Professor at Boston University; MIT-IBM Watson AI Lab

R. Manmatha

Sr Principal Scientist at Amazon

The Venue

Waikoloa, Hawaii

Let us know if you'll be attending!