MEWL: Few-shot multimodal word learning with referential uncertainty
Guangyuan Jiang1,2,✉️, Manjie Xu2,3, Shiji Xin1, Wei Liang3, Yujia Peng1,2, Chi Zhang2,✉️, Yixin Zhu1,✉️
1Peking University 2Beijing Institute for General Artificial Intelligence 3Beijing Institute of Technology
ICML 2023
Overview
Without explicit feedback, humans can rapidly learn the meaning of words. Children can acquire a new word after just a few passive exposures, a process known as fast mapping. This word learning capability is believed to be the most fundamental building block of multimodal understanding and reasoning. Despite recent advances in multimodal learning, a systematic and rigorous evaluation of human-like word learning in machines is still missing. To fill this gap, we introduce the MachinE Word Learning (MEWL) benchmark to assess how machines learn word meaning in grounded visual scenes.
MEWL covers humans’ core cognitive toolkits in word learning: cross-situational reasoning, bootstrapping, and pragmatic learning. Specifically, MEWL is a few-shot benchmark suite of nine tasks for probing various word learning capabilities. These tasks are carefully designed to align with children’s core abilities in word learning and echo theories in the developmental literature. By evaluating multimodal and unimodal agents and comparing their performance with that of humans, we observe a sharp divergence between human and machine word learning. We further discuss these differences and call for human-like few-shot word learning in machines.
Motivation: How children learn words
Learn words from cross-situational information across multiple contexts.
Leverage semantic and syntactic cues to bootstrap novel word learning.
Comprehend word meanings through pragmatics, a social account of word learning that draws on cues from other speakers.
Word learning facilitates critical downstream tasks:
learning new object categories
forming abstractions of conceptual structures
making generalizations
developing the ability to communicate
It is the very first step in language learning:
born multimodal: children need to ground every word in visual perceptions and relations.
closely related to the learning of concepts.
children can use many cues to facilitate word learning: cross-situational statistics, bootstrapping, and pragmatics.
Our contributions
Highlight the significance of human-like word learning in machines.
Devise and benchmark MEWL’s nine tasks, all directly inspired by established findings in human word learning, to probe and compare few-shot word learning capabilities in machines and humans.
MEWL Tasks
Dataset
We design nine unique tasks in MEWL to comprehensively evaluate alignment between humans and machines:
shape, color, material, object, composite, relation, bootstrap, number, pragmatic
They probe the ability to:
Learn novel words or phrases that represent basic object attributes (i.e., shape, color, and material), the objects per se (i.e., object), and compositions of basic attributes (i.e., composite).
Use familiar words to bootstrap learning novel (spatial) relational words (i.e., relation) or vice versa (i.e., bootstrap).
Learn counting and numerical words from one to six (i.e., number).
Use pragmatic cues to learn novel words by assuming the speaker is informative (i.e., pragmatic).
Each few-shot problem is an episode of seven context images; each image contains a few randomly positioned objects and is paired with an utterance of a novel word/phrase describing it.
After seeing the context images, a query image is presented together with five candidate utterances, exactly one of which correctly describes the scene.
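Concretely, an episode could be represented with a structure like the following (a minimal sketch; the class and field names are hypothetical, not MEWL's actual schema):

```python
from dataclasses import dataclass


@dataclass
class Panel:
    """One context panel: an image paired with an utterance of a novel word/phrase."""
    image_path: str
    utterance: str


@dataclass
class Episode:
    """A MEWL few-shot problem: seven context panels, a query image,
    and five candidate utterances, exactly one of which is correct."""
    task: str            # one of the nine tasks, e.g. "shape" or "pragmatic"
    context: list        # seven Panel objects
    query_image: str
    candidates: list     # five candidate utterances for the query image
    answer: int          # index of the correct candidate


def validate(ep: Episode) -> None:
    """Sanity-check the episode layout described above."""
    assert len(ep.context) == 7, "seven context panels per episode"
    assert len(ep.candidates) == 5, "five candidate utterances"
    assert 0 <= ep.answer < len(ep.candidates)
```

A learner sees `context`, then must pick `candidates[answer]` for `query_image`.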
27,000 problems for training
5,400 problems for validation
5,400 problems for testing
Evenly divided among the nine tasks
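Given the even nine-way split, each task contributes 3,000 training, 600 validation, and 600 test episodes:

```python
# Per-task episode counts implied by the even split across nine tasks.
splits = {"train": 27_000, "val": 5_400, "test": 5_400}
NUM_TASKS = 9
per_task = {name: total // NUM_TASKS for name, total in splits.items()}
print(per_task)  # {'train': 3000, 'val': 600, 'test': 600}
```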
Baselines
Models:
CLIP
Aloe
Flamingo-1.1B
BERT
GPT-3.5
Human
Captioning for unimodal models: text-only models (BERT, GPT-3.5) receive captions of the scenes in place of raw images.
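To let text-only models attempt the benchmark, each image must first be converted to text. A hypothetical sketch of such a pipeline (the caption source and prompt template here are illustrative assumptions, not the paper's exact format):

```python
def episode_to_prompt(captions, utterances, query_caption, candidates):
    """Flatten a multimodal episode into one text prompt for a language model.

    `captions` are textual descriptions standing in for the context images;
    `utterances` are the novel words/phrases paired with them.
    """
    lines = []
    for cap, utt in zip(captions, utterances):
        lines.append(f"Scene: {cap} -> Utterance: {utt}")
    lines.append(f"Query scene: {query_caption}")
    # Present the five candidate utterances as lettered choices.
    for i, cand in enumerate(candidates):
        lines.append(f"({chr(ord('A') + i)}) {cand}")
    lines.append("Which utterance describes the query scene?")
    return "\n".join(lines)
```

The model then answers with one of the lettered choices.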
Discussion:
Multimodal vs. unimodal
Humans vs. machines
Efficacy of MEWL
Failure of learning models
Why should machines have human-like word learning capabilities?
Citation
If you find MEWL useful, please cite us:
@inproceedings{jiang2023mewl,
  title={MEWL: Few-shot multimodal word learning with referential uncertainty},
  author={Jiang, Guangyuan and Xu, Manjie and Xin, Shiji and Liang, Wei and Peng, Yujia and Zhang, Chi and Zhu, Yixin},
  booktitle={ICML},
  year={2023}
}