(EVAL-FoMo 2)
June 11 (1 PM - 6 PM)
Location: Room 210 (posters will be at ExHall D #171 - #194)
This workshop focuses on analysis and evaluations to understand and identify emerging visual capabilities and pinpoint visual limits in foundation models.
Visual information processing is being transformed by foundation models. Trained on massive datasets using self-supervised and generative methods, these models exhibit sophisticated emergent visual abilities, such as depth perception, object recognition, and part discovery, without explicit programming or supervision. This shift marks a new paradigm in which neural models derive visual understanding from the intrinsic structures and patterns present in the data rather than from supervisory signals tied to a specific visual task. However, questions remain about how to systematically analyze and evaluate these emergent capabilities. Recent studies have also highlighted the models' visual limitations, emphasizing the need for innovative evaluation methods to identify these shortcomings. By evaluating and understanding both the capabilities and the limits of these models, we can better compare different learning algorithms and architectures in terms of how they represent the visual world.
This workshop centers on two key areas:
Analysis of how foundation models process and understand visual information. For example:
To what extent does a model trained only on language (an LLM) comprehend the visual world? Is the output of a multimodal LLM (e.g., Pixtral, multimodal Llama 3.2, LLaVA, ...) grounded in the visual input, or is it primarily driven by the LLM's generative process?
How do visual representations differ across model families, such as language-supervised vision encoders (e.g., CLIP), purely visual self-supervised models (e.g., DINO, I-JEPA, MAE, ...), diffusion models, interleaved multimodal autoregressive generative models (e.g., Chameleon), and visual autoregressive models (e.g., LVM)? (A minimal comparison sketch follows this list.)
To what extent do different models show spatial understanding, such as relative depth or 3D spatial relationships between entities? (e.g., 🔗, 🔗)
To what extent can foundation models perform visual abstraction? For example, do they perceive shape patterns formed from scene elements? (e.g., 🔗)
...
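As a toy illustration of this first area, the sketch below compares image representations from a language-supervised encoder and a purely visual self-supervised model using linear CKA. The specific checkpoints (openai/clip-vit-base-patch32, facebook/dinov2-base), the choice of linear CKA, and the probe image paths are illustrative assumptions, not a prescribed protocol.

```python
# Minimal sketch: comparing visual representations from a language-supervised encoder
# (CLIP) and a purely visual self-supervised model (DINOv2) with linear CKA.
# Model names, metric, and image paths are assumptions for illustration only.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, CLIPImageProcessor, CLIPVisionModel

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> float:
    """Linear CKA between two feature matrices of shape (n_images, dim)."""
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (Y.T @ X).norm() ** 2   # ||Y^T X||_F^2
    norm_x = (X.T @ X).norm()      # ||X^T X||_F
    norm_y = (Y.T @ Y).norm()      # ||Y^T Y||_F
    return (hsic / (norm_x * norm_y)).item()

@torch.no_grad()
def clip_features(images):
    processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
    model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32").eval()
    inputs = processor(images=images, return_tensors="pt")
    return model(**inputs).pooler_output          # (n_images, 768)

@torch.no_grad()
def dino_features(images):
    processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
    model = AutoModel.from_pretrained("facebook/dinov2-base").eval()
    inputs = processor(images=images, return_tensors="pt")
    return model(**inputs).last_hidden_state[:, 0]  # CLS token, (n_images, 768)

if __name__ == "__main__":
    # Replace with a real set of probe images (hypothetical paths).
    images = [Image.open(p).convert("RGB") for p in ["img0.jpg", "img1.jpg", "img2.jpg"]]
    print("linear CKA:", linear_cka(clip_features(images), dino_features(images)))
```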
Evaluation of foundation models using novel benchmarks that encompass a wide range of visual abilities. These benchmarks are crucial for analyzing and assessing the models' capabilities while identifying their limitations. This evaluation will help pinpoint areas for improvement and guide the development of new algorithms and architectures to enhance visual representations. There is a pressing need for innovative, "vision-centric" benchmarks to achieve these goals effectively.
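For the second area, the sketch below outlines one possible shape for a vision-centric benchmark harness: each item pairs an image with a multiple-choice question targeting a single visual ability, and accuracy is reported per ability. All names, fields, and the multiple-choice format are hypothetical choices for illustration.

```python
# Minimal sketch of a vision-centric benchmark harness (all names are hypothetical):
# each item pairs an image with a multiple-choice question targeting one visual ability
# (relative depth, counting, shape abstraction, ...), and accuracy is reported per ability.
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class BenchmarkItem:
    image_path: str      # path to the probe image
    question: str        # vision-centric question about the image
    choices: List[str]   # candidate answers
    answer_idx: int      # index of the correct choice
    ability: str         # e.g., "relative_depth", "counting", "shape_abstraction"

def evaluate(model_choice_fn: Callable[[str, str, List[str]], int],
             items: List[BenchmarkItem]) -> dict:
    """model_choice_fn(image_path, question, choices) -> index of the chosen answer."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        pred = model_choice_fn(item.image_path, item.question, item.choices)
        correct[item.ability] += int(pred == item.answer_idx)
        total[item.ability] += 1
    return {ability: correct[ability] / total[ability] for ability in total}

# Usage: wrap any multimodal LLM (e.g., a LLaVA-style model) in a function with the
# signature above and pass it, together with the benchmark items, to `evaluate`.
```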
Please refer to the list of accepted papers from the previous workshop for more examples on these topics.
Submissions Open: Jan 31 23:59 GMT
Submission Deadline: March 19 23:59 GMT
Submission Portal: URL
Final Decisions: April 18 (List of accepted papers)
The workshop has no proceedings. As a result:
Papers already submitted or accepted elsewhere (e.g., ICCV25 main conference submissions, CVPR25 accepted papers) can be submitted to this workshop.
Papers submitted to this workshop can also be submitted to future venues.
Papers should be 4 to 8 pages and follow the CVPR25 formatting guidelines; references and appendix do not count towards the page limit.
There will be no supplementary materials.
Submissions will remain anonymous and will not be made public on OpenReview.
The review process is double-blind.
Accepted papers will be presented as posters during the workshop. Please follow the CVPR poster guidelines for style and layout, but feel free to use standard sizes (e.g., A0, A1) and materials that are most convenient and cost-effective.
We will publish the list of accepted papers.
UC Berkeley, BAIR
University of Oxford