A novel benchmark to leverage the complementarity of language and vision
BD2BB is a novel language and vision benchmark that requires multimodal models to combine complementary information from the two modalities. Recently, impressive progress has been made in developing universal multimodal encoders suitable for virtually any language and vision task. However, current approaches often require them to combine redundant information provided by language and vision. Inspired by real-life communicative contexts, we propose a novel task where either modality is necessary but not sufficient to make a correct prediction. To do so, we first build a dataset of images and corresponding sentences provided by human participants. Second, we design a multiple-choice task where only one choice is correct. Third, we evaluate state-of-the-art models and compare their performance against human speakers. We show that, while the task is relatively easy for humans, the best-performing models struggle to achieve similar results.
Given a natural IMAGE, 5 unique participants provided: 1) an INTENTION describing how they might feel/behave if they were in that situation; 2) an ACTION describing what they would do based on that feeling/behavior. Intentions and actions were typed in free form by participants in two separate text boxes. By instruction, their sentences had to complete the provided opening words If I... and I will..., respectively. One <intention, action> tuple from one participant is shown below.
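As a minimal sketch, one collected record might be organized as follows; the field names, file name, and schema here are assumptions for illustration, not the actual BD2BB data format:

```python
# Hypothetical sketch of one collected record (field names and image file
# name are illustrative assumptions, not the actual BD2BB schema).
record = {
    "image_id": "example_image.jpg",  # natural image shown to participants
    "annotations": [                  # 5 unique participants per image
        {
            # free-form text completing the opening words "If I..."
            "intention": "If I have tons of energy",
            # free-form text completing the opening words "I will..."
            "action": "I will play a game of tennis with the man",
        },
        # ... 4 more <intention, action> tuples from other participants
    ],
}

# By instruction, every sentence must start with its required opening words.
assert all(a["intention"].startswith("If I") for a in record["annotations"])
assert all(a["action"].startswith("I will") for a in record["annotations"])
```

The opening-word constraint makes the collected sentences uniform enough to be recombined across images when building the multiple-choice task.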
Multiple-choice problem with 5 candidate options [data & code]
Given an IMAGE depicting, e.g., a tennis player during a match and the INTENTION “If I have tons of energy”, the task involves choosing, from a list of 5 candidate actions, the TARGET ACTION that unequivocally applies to the combined multimodal input: “I will play a game of tennis with the man”. Two decoy actions were selected to be plausible based on the intention only (language); the other two, to be plausible based on the image only (vision). One sample is given below.
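A minimal sketch of how one multiple-choice item could be represented, using the sample above; the field names and the four decoy sentences are illustrative assumptions (only the intention and target action come from the example):

```python
# Hypothetical sketch of one multiple-choice item. Field names and the
# decoy sentences are illustrative assumptions; intention and target
# action come from the example in the text.
sample = {
    "image": "tennis_player.jpg",                     # visual input (assumed name)
    "intention": "If I have tons of energy",          # linguistic input
    "candidates": [
        "I will play a game of tennis with the man",  # TARGET: fits image + intention
        "I will go for a long run",                   # decoy: plausible from intention only
        "I will dance all night",                     # decoy: plausible from intention only
        "I will watch the match from the stands",     # decoy: plausible from image only
        "I will buy a racket like his",               # decoy: plausible from image only
    ],
    "target_index": 0,
}

def is_correct(predicted_index: int, item: dict) -> bool:
    """A prediction counts as correct only if it picks the target action."""
    return predicted_index == item["target_index"]
```

Because two decoys fit the language alone and two fit the image alone, a model that ignores either modality cannot reliably separate the target from the matching pair of decoys.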
Relatively easy for humans (~80% accuracy), challenging for SOTA models (~60%) [data & code]
Multimodal integration is key: Language + Vision (multimodal) outperforms unimodal models
Pre-trained is better: Only pre-trained models (grey) clearly outperform baseline models
Models are far from human performance: Best-performing model (LXMERT, multimodal) scores ~17 points below human accuracy
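With 5 candidate actions per item, random guessing yields 20% accuracy, against ~80% for humans and ~60% for the best model. A minimal sketch of how per-model accuracy could be computed (function names and the prediction values are illustrative):

```python
def accuracy(predictions, targets):
    """Fraction of items where the chosen candidate index matches the target."""
    assert len(predictions) == len(targets)
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)

# Illustrative values: indices of the chosen candidate per item.
preds = [0, 2, 1, 0]   # hypothetical model choices
golds = [0, 2, 3, 0]   # hypothetical target indices
print(accuracy(preds, golds))  # → 0.75 (3 of 4 correct)
```

Chance level with 5 candidates is 1/5 = 20%, so both humans and models are well above chance, but a large gap remains between them.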
*Greta and Eleonora worked on the project as MSc students at CIMeC, University of Trento
Sponsors & Acknowledgments
The project Be Different to Be Better was financed by SAP SE (DE-2018-019) -- University of Trento.
Thanks to Moin Nabi and Tassilo Klein (SAP AI Research) for valuable discussions in the early stages of the project. We thank all the participants who took part in the human evaluation, and the attendees of the SiVL2018 workshop for their feedback on a preliminary version of the task and data collection pipeline. We kindly acknowledge the support of NVIDIA Corporation with the donation of the GPUs used in our research. Sandro is funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 819455).
Data & Code
Contact s dot pezzelle at uva dot nl to get more information on the project