PixFoundation CVPR2025
1st Workshop on Pixel-level Vision Foundation Models
12 June 2025
Location: Nashville
Motivation
In recent years, foundation models have gained significant traction and success, particularly in natural language processing, as exemplified by the GPT series. These models are large-scale and trained on diverse datasets, primarily through self-supervised learning or vision-language modelling. Such foundation models have been shown to adapt effectively across various downstream tasks, with strong generalization capabilities, especially in zero-shot and few-shot scenarios. However, while language foundation models are well established, their counterparts in the vision domain and their adoption across tasks are still in the early to middle stages of development. Despite this, there is growing interest and progress in vision foundation models (VFMs). Some of the latest models include those trained with self-supervision, such as the DINO series, and those trained on image-text data, such as CLIP, Flamingo, and LLaVA. Various pixel-level vision foundation models have also emerged recently, such as OMG-LLaVA and the SAM series.
Our workshop aims to bring together researchers dedicated to developing and adapting vision foundation models for pixel-level understanding tasks, including image segmentation, video segmentation, tracking, actor-action segmentation, depth estimation, and motion estimation. We will explore major directions in pixel-level understanding with vision foundation models and discuss the opportunities they present, particularly in low-resource settings where they could have a positive societal impact. This is especially relevant for marginalized communities that lack access to large-scale labeled datasets tailored to their needs. Additionally, we will discuss the risks associated with these models and explore methods to mitigate them. The workshop features seven invited talks, mixing emerging and established researchers, along with two poster sessions and selective spotlight presentations. We encourage submissions related to any research or application of pixel-level understanding with vision foundation models.
Invited Speakers
News and Updates:
Jan 06, 2025: Paper submission is open at https://cmt3.research.microsoft.com/PixFoundation2025
Participation:
We encourage submissions on any of the topics of interest listed below, and we also welcome other interesting and relevant research on pixel-level understanding with vision foundation models.
Vision foundation models in pixel-level image and video understanding tasks, including: pixel-level grounding and reasoning, image segmentation, referring segmentation and its video counterpart, video segmentation, tracking, actor-action segmentation, depth estimation, motion estimation, etc.
Adaptation, generalization, and prompting of vision foundation models.
Interpretation and benchmarking of vision foundation models and their training data.
Real-world applications with a focus on the societal impact of vision foundation models.
Papers will be peer-reviewed under a double-blind policy; the submission deadline is 4 March 2025. Accepted papers will be presented in the poster sessions, with some selected as oral presentations, and one paper will receive the best paper award.
(Only students/researchers are eligible to participate. Government officials, public sector officials, and employees of entities that do business in the public sector are not eligible to participate.)