Oral Presentations
09:16 - 09:28 : Sign Language Sense Disambiguation
10:15 - 10:27 : CV-Probes: Studying the interplay of lexical and world knowledge in visually grounded verb understanding
12:15 - 12:27 : Investigating Occupational Stereotypes on Multimodal text-to-image Models: A Linguistic Analysis
12:28 - 12:40 : Evaluating Semantic Relations in Predicting Textual Labels for Images of Abstract and Concrete Concepts
Note: Each oral presentation will be 8 minutes, with an additional 4 minutes for discussion.
Invited Talk Titles and Abstracts
Anna Rohrbach
Title: Learning from Language
Abstract: Humans rely on multiple modalities to perceive the world and communicate with each other, most importantly vision and language. We also use language to teach each other about new things or point out mistakes. Recently, we have witnessed a paradigm shift toward leveraging language *for* vision. Language is being used to enhance visual models by, e.g., enabling zero-shot capabilities, improving generalization, or mitigating bias. I will present my recent efforts toward building vision models that can ingest language advice to improve their behavior across a range of scenarios.
Ece Takmaz
Title: Human Signals in the Integration of Vision and Language in Deep Neural Networks
Abstract: There is an intricate relation between the properties of an image and how humans behave while describing it. This behavior shows ample variation, as manifested in human signals such as eye movements and the moment at which speakers start to describe the image. Despite the value of such signals of visuo-linguistic variation, they are virtually disregarded in the training of current pretrained models, which motivates further investigation. Using a corpus of image descriptions with concurrently collected eye-tracking data, we explore the nature of the variation in visuo-linguistic signals and whether image representations encoded by pretrained vision encoders can capture such variation. This work opens up the possibility of using pretrained multimodal encoders to quantify patterns in human data and shed light on the underlying cognitive mechanisms, as well as to identify the shortcomings of such encoders.
Submission Talks
Sign Language Sense Disambiguation (Tanalp Agustoslu, Jana Grimm, Oliver Kraus, Miriam Winkler)
CV-Probes: Studying the interplay of lexical and world knowledge in visually grounded verb understanding (Ivana Beňová, Michal Gregor, Albert Gatt)
Investigating Occupational Stereotypes on Multimodal text-to-image Models: A Linguistic Analysis (Hermine Kleiner, Lena Altinger, Sebastian Loftus, Sarah Anna Uffelmann)
Evaluating Semantic Relations in Predicting Textual Labels for Images of Abstract and Concrete Concepts (Tarun Tater, Sabine Schulte im Walde, Diego Frassinelli)