What is Next in Multimodal Foundation Models?

Link to the Second MMFM Workshop (CVPR 2024)

ICCV 2023 Workshop, Paris, France

October 02, 08:30am, Room Paris Sud (P01)

The term “Foundation Models” is generally used to denote large-scale models (e.g., with billions of parameters) pre-trained on massive datasets, which can then be adapted to a variety of downstream tasks with little or no supervision. In recent years, these models have taken the AI world by storm, significantly advancing the state of the art in computer vision, natural language processing, speech analysis, and other fields. In particular, multimodal foundation models, which are trained on multiple modalities simultaneously, have shown remarkable success in a wide range of applications, including text-to-image/video/3D generation, zero-shot classification, cross-modal retrieval, and many others.

The purpose of this workshop is to create a forum for discussion on what is next in multimodal foundation models, i.e., what the paths forward are and which fundamental problems still need to be addressed in this emerging research area. We will bring together a diverse set of leaders in the field to deliver talks, present their views, and engage in a discussion with our community on the various facets of multimodal foundation models, including, but not limited to, the models’ design, generalization properties, efficiency, ethics, fairness, scale, and open availability.