January 3rd, 2023
Pretraining Large Vision and Multimodal Models Workshop @ WACV2023
at Waikoloa, Hawaii
Workshop Description
Much machine learning research in the last five years has focused on developing and utilizing large models. As Richard Sutton famously declared in his 2019 Bitter Lesson, “the biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective by a large margin.” The field of natural language processing has pioneered the attention-based transformer, with impressive results from famously large models approaching multiple trillions of parameters. Computer vision, on the other hand, has leveraged transfer learning for many years, delivering state-of-the-art performance in countless applications. Given their ability to parallelize the processing of sequential information and their self-attention mechanism, transformers increasingly shape vision research. The set of methods, datasets, use cases, and models that bridge both modalities continues to grow rapidly. However, many important questions about the optimal usage, design, and application of these models within computer vision remain unanswered.
Applications that leverage pre-trained vision and multimodal models encompass a variety of scenarios. A few specific examples are:
Generating new images based on textual inputs, such as OpenAI’s DALL-E and Google’s Imagen. These applications tend to support user creativity, such as designing new content for media and advertisements.
Multimodal search for online applications, such as Amazon’s M5 model, which leverages multiple modalities to provide high performance search.
Fashion generation. Given the plethora of data available in e-commerce scenarios for clothing distributors, applications are emerging that leverage large scale multimodal foundation models to develop new fashion designs.
Translation enhancement. Researchers today combine visual and textual data explicitly to improve translation, and multimodal foundation models are a critical asset in this effort.
There is a strong possibility that the future of artificial intelligence research and applications will continue to hinge on the community's ability to use compute efficiently and at large scale. Distributed computing techniques, such as model- and data-parallel training regimes, are increasingly table stakes for groups who wish to pretrain their own models on unique sets of data. In addition, pretraining appears to offer advantages in leveraging unsupervised datasets, which tend to be far larger than what experts can label with limited time and resources. However, open questions remain about the value of pretraining itself, particularly in relation to fine-tuning, and about its merits and drawbacks.
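To make the data-parallel case concrete, here is a minimal sketch of a pretraining loop wrapped in PyTorch's DistributedDataParallel. The linear "model", random tensors, toy reconstruction loss, and hyperparameters are placeholders standing in for a real vision or multimodal backbone and its self-supervised objective; this is an illustrative example, not a method endorsed by the workshop.

# Minimal data-parallel pretraining sketch (placeholders throughout).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each spawned process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and unlabeled data; a real run would use a vision or
    # multimodal backbone with a self-supervised objective.
    model = DDP(torch.nn.Linear(1024, 1024).cuda(local_rank), device_ids=[local_rank])
    dataset = TensorDataset(torch.randn(10000, 1024))
    sampler = DistributedSampler(dataset)               # shards examples across ranks
    loader = DataLoader(dataset, batch_size=256, sampler=sampler)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for epoch in range(2):
        sampler.set_epoch(epoch)                         # reshuffle shards each epoch
        for (batch,) in loader:
            batch = batch.cuda(local_rank, non_blocking=True)
            loss = ((model(batch) - batch) ** 2).mean()  # toy reconstruction loss
            optimizer.zero_grad()
            loss.backward()                              # gradients are all-reduced across ranks
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()   # launch with: torchrun --nproc_per_node=<num_gpus> pretrain_sketch.py

With a DistributedSampler each rank sees a disjoint shard of the data, and the all-reduce inside backward() keeps replicas synchronized, which is the basic mechanism that makes pretraining on large unlabeled corpora tractable.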
Call for Papers
In this workshop we welcome critical and diverse perspectives on the larger landscape of pretraining for vision and language models, including both computer vision and multimodal applications. Topics include, but are not limited to:
Evaluation of pre-training methods for vision and/or multimodal applications
Parallelization regimes that solve unique problems for vision and/or multimodal scenarios
Critical evaluation of pre-train and fine-tune regimes for common vision tasks such as image classification, object detection, and semantic segmentation
Exploring the impact of pre-train / fine-tune regimes for autonomous vehicles, biomedical applications, e-commerce applications, media, large-scale online search, and other domains
Detecting and mitigating bias in vision and multimodal models
Exploring, evaluating, and extending the pre-training regime for modalities beyond vision and text, such as vision and audio, vision and seismic data, vision and gaming data, and robotics
Efficiency enhancements for pre-training vision / multimodal models, such as compilation
Enhancing the application of pre-trained models beyond their original domain and across modalities, such as methods to align vision and text models trained separately
Datasets, statistics, and theory of pre-training and fine-tuning regimes and methods for computer vision and combined modalities
Research and development on the topics above and related areas
Submission Deadlines
Paper submission deadline: 25th October, 2022
Notification deadline: 9th November, 2022
Camera-ready papers due (firm): 19th November, 2022
Industry poster track
If you are working on an application of large-scale modeling for vision and/or multimodal scenarios and would simply like to submit an abstract, we welcome poster submissions!
Diversity statement
This workshop strongly values diverse points of view, organizations, backgrounds, perspectives, and walks of life. Toward that end, we have strong representation from five industry organizations and five academic organizations, including multiple participants from diverse backgrounds.
Keynote Speakers
Abhinav Gupta
Associate Professor at the Robotics Institute at CMU
Phillip Isola
Associate Professor at MIT
Kate Saenko
Associate Professor at Boston University, MIT + IBM
R. Manmatha
Senior Principal Scientist at Amazon
The Venue
Waikoloa, Hawaii
Let us know if you'll be attending!