FINE
Fine-grained Evaluation of Vision and Language Models
Co-located with Ai*iA 2024 conference,
November 28th, 2024,
D0.02, piazza Università, 1
Bolzano Bozen, Italy
The FINE workshop aims to bring together the Computer Vision and Natural Language Processing communities to brainstorm on the current strengths and weaknesses of VLMs.
Fostering awareness of each other's results will enhance cross-fertilization between the disciplines. In particular, we aim to highlight the importance of evaluating VLMs, which requires expertise in computational approaches to both language and vision.
In the same spirit, we also welcome studies from cognitive neuroscientists on perception in Natural vs. Artificial Intelligence. Current advances in recent AI technologies such as self-driving cars or brain-machine interfaces rely on the assumption that these technologies perceive and represent the world the way humans do. Recent studies have shown that this is not the case and that the gap to close is large. Cognitive Neuroscience studies on language or vision deficiencies in humans and on the role of multisensory communication could also be an interesting contribution to the workshop. Cross-talk between these research fields will be crucial for advancing various areas of artificial intelligence where human-like vision and thinking are fundamental.
Finally, VLMs could help make progress on tasks requiring visually grounded reasoning processes. Hence, we welcome contributions in this direction too. Results from CV, NLP, and NeuroAI will help create interesting intersections and exchanges of viewpoints, which can lead to more efficient and less data-hungry training methods for VLMs and can raise research questions on human multisensory perception.
Proposing new methods for the in-depth evaluation of VLMs will be a crucial step towards the development of trustworthy models.
Topics of interest include but are not limited to fine-grained evaluation on:
Visual Storytelling
Visual Question Answering
Temporal Reasoning in video
Compositionality in multimodal tasks
Face-to-face multimodal dialogues
Text-to-image/video
Visually grounded reasoning
Fine-grained Visual Language Matching
Evaluation Protocols
NeuroAI: human and AI Perception