PixFoundation CVPR 2025

1st Workshop on Pixel-level Vision Foundation Models

12 June 2025

Location: Nashville

Motivation

In recent years, foundation models have gained significant traction and success, particularly in natural language processing, as exemplified by the GPT series. These models are large in scale and trained on diverse datasets, primarily through self-supervised learning or vision-language modeling. They have been shown to adapt effectively across various downstream tasks and to generalize strongly, especially in zero-shot and few-shot scenarios. However, while language foundation models are well established, their counterparts in the vision domain, and their adoption across tasks, are still at an early-to-mid stage of development. Despite this, there is growing interest and progress in vision foundation models (VFMs). Recent models include those trained with self-supervision, such as the DINO series, and those leveraging image-text data, such as CLIP, Flamingo, and LLaVA. Pixel-level vision foundation models have also emerged recently, such as OMG-LLaVA and the SAM series.

Our workshop aims to bring together researchers dedicated to developing and adapting vision foundation models for pixel-level understanding tasks, including image segmentation, video segmentation, tracking, actor-action segmentation, depth estimation, and motion estimation. We will explore major directions in pixel-level understanding with vision foundation models and discuss the opportunities they present, particularly in low-resource settings where they could have a positive societal impact. This is especially relevant for marginalized communities that lack access to large-scale labeled datasets tailored to their needs. We will also discuss the risks associated with these models and explore methods to mitigate them.

The workshop features seven invited talks from a mix of emerging and established researchers, along with two poster sessions and selected spotlight presentations. We encourage submissions related to any research or application of pixel-level understanding with vision foundation models.

Invited Speakers

Stanford University

Massachusetts Institute of Technology

Koc University

The Allen Institute for AI

University of California San Diego

News and Updates:


Participation:

We encourage submissions under one of the topics of interest, but we also welcome other relevant research on pixel-level understanding with vision foundation models.

Papers will be peer-reviewed under a double-blind policy, and the submission deadline is 4 March 2025. Accepted papers will be presented at the poster sessions, some as orals, and one paper will receive the best paper award.

(Only students and researchers are eligible to participate. Government officials, public-sector officials, and employees of entities that do business in the public sector are not eligible to participate.)

Contact:

For questions, you can contact us at: pixfoundation_chairs@googlegroups.com



Sponsors: