Program

22.09.2023



Kick-off for LIMO Workshop     9.00 a.m.–9.10 a.m.

2-Minute Teasers for Accepted Papers     9.10 a.m.–9.30 a.m.

Invited Talk - Letitia Parcalabescu     9.30 a.m.–10.15 a.m.

Coffee Break     10.15 a.m.–10.35 a.m.

Poster Session     10.35 a.m.–12.15 p.m.

Invited Talk - Sandro Pezzelle     12.15 p.m.–1.00 p.m.


Invited Talk Titles and Abstracts:

Letitia Parcalabescu

Title: About Vision and Language (VL) models: What grounded linguistic phenomena do they understand? To what extent do they use the image and text modalities?

Abstract: In this talk, we will introduce Vision and Language (VL) models, which perform very well on VL tasks such as image-sentence alignment or visual question answering. While performance on these tasks is important, task-centered evaluation does not disentangle the fine-grained linguistic capabilities of VL models. Therefore, we present our work on VALSE, a VL benchmark comprising a suite of six specific linguistic phenomena grounded in the visual modality. Our zero-shot experiments on VALSE with five widely used pretrained VL models – CLIP, LXMERT, VisualBERT, ViLBERT and ViLBERT 12-in-1 – suggest that current VL models have considerable difficulty addressing most phenomena.

In the second part, we ask how much a VL model uses the image and text modality in each sample or dataset. To measure the contribution of each modality in a VL model, we present MM-SHAP, which we developed and applied in two ways: (1) to compare VL models for their average degree of multimodality, and (2) to measure, for individual models, the contribution of individual modalities for different tasks and datasets. Experiments with six VL models – LXMERT, CLIP and four ALBEF variants – on four VL tasks highlight that unimodal collapse can occur to different degrees and in different directions, contradicting the widespread assumption that unimodal collapse is one-sided.

Sandro Pezzelle

Title: From Word Representation to Communicative Success: The Need for Semantic and Pragmatic Analysis in Visually Grounded Language Processing

Abstract: By grounding language in vision, multimodal NLP models have a key advantage over purely textual ones: they can leverage signals in one or both modalities and potentially combine this information in any way required by a given communicative context. This ranges from representing single words in a way that takes into account their multimodal semantics, to resolving semantically underspecified image descriptions, to adapting how they refer to images in order to achieve communicative success with a given audience. Moving from word-level semantics to real-life communicative scenarios, I will present work investigating the abilities of current language and vision models to account for and deal with semantic and pragmatic aspects of human multimodal communication. I will argue that these abilities are necessary for models to successfully interact with human speakers.

Accepted Papers: