ISLE

Seeing Through Their Eyes: Evaluating Visual Perspective Taking in Vision Language Models

What is VPT?
Visual Perspective-Taking (VPT) is the ability to understand and predict actions based on another person's viewpoint.

Why it Matters:
Humans develop VPT early in life. Understanding what others can see is crucial in everyday situations, such as avoiding accidents. But can Vision Language Models (VLMs) do the same?

Our Contribution:
We introduce two new datasets, Isle-Bricks and Isle-Dots, to test VPT in VLMs, and we evaluate 12 commonly used models on them.

Key Findings:

Performance Drop: All models showed significant performance drops when VPT was required.
No Strong Correlation: Success in object detection does not predict VPT performance, suggesting current benchmarks are insufficient.
Multi-Person Scenes: Models struggle with VPT in scenes with more than one person.
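
As a rough illustration of the correlation finding, the sketch below computes a Pearson correlation between per-model object-detection accuracy and per-model VPT accuracy. The choice of Pearson correlation and the function name `pearson` are assumptions made for this example; the page does not state which correlation measure was used, and the numbers are hypothetical placeholders, not results from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-model accuracies (one entry per model), for illustration only.
detection_acc = [0.90, 0.80, 0.95, 0.70]
vpt_acc = [0.50, 0.55, 0.45, 0.60]

r = pearson(detection_acc, vpt_acc)
print(f"Pearson r between detection and VPT accuracy: {r:.2f}")
```

A coefficient near zero would indicate that detection accuracy carries little information about VPT accuracy; with only a dozen models, any such estimate is noisy and should be read with care.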

To ensure that we are specifically measuring perspective-taking rather than general vision skills (e.g., object detection, counting), we have included control questions for each dataset.

In this example, the model incorrectly states that the minifigure can see the umbrella, showing that it fails to take the minifigure's perspective.

Compared with the control task, which does not require perspective-taking, the models suffer average performance drops of 32% on the Isle-Bricks dataset and 38% on the Isle-Dots dataset. Performance on the VPT task is often close to random chance.
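
The comparison between control and VPT questions can be sketched as follows. This is a minimal example under stated assumptions: the `Result` record, the task labels `"control"` and `"vpt"`, and the toy answers are hypothetical, and because the page does not say whether the reported drops are absolute (percentage points) or relative, the sketch computes both.

```python
from dataclasses import dataclass


@dataclass
class Result:
    task: str      # hypothetical label: "control" or "vpt"
    correct: bool  # whether the model answered this question correctly


def accuracy(results, task):
    """Fraction of correct answers among the questions of one task type."""
    subset = [r for r in results if r.task == task]
    if not subset:
        raise ValueError(f"no results for task {task!r}")
    return sum(r.correct for r in subset) / len(subset)


def performance_drop(results):
    """Accuracy on each task plus the absolute and relative drop."""
    control = accuracy(results, "control")
    vpt = accuracy(results, "vpt")
    absolute = control - vpt
    relative = absolute / control if control > 0 else float("nan")
    return {
        "control": control,
        "vpt": vpt,
        "absolute_drop": absolute,
        "relative_drop": relative,
    }


# Toy data for illustration only: 9/10 control and 5/10 VPT answers correct.
toy = (
    [Result("control", True)] * 9
    + [Result("control", False)]
    + [Result("vpt", True)] * 5
    + [Result("vpt", False)] * 5
)
summary = performance_drop(toy)
print(summary)
```

For yes/no questions, random chance corresponds to 50% accuracy, so a VPT accuracy near that level suggests the model is effectively guessing.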

We report performance on data slices with varying counts of persons (P), objects (O), and obstacles (S).
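
Slicing results by scene composition could look like the sketch below. The record layout (keys `persons`, `objects`, `obstacles`, `correct`) and the function name `accuracy_by_slice` are assumptions made for illustration, and the records are toy values rather than data from either dataset.

```python
from collections import defaultdict


def accuracy_by_slice(records, key):
    """Group records by the value of `key` and return accuracy per group."""
    totals = defaultdict(lambda: [0, 0])  # slice value -> [correct, total]
    for rec in records:
        bucket = totals[rec[key]]
        bucket[0] += int(rec["correct"])
        bucket[1] += 1
    return {value: c / n for value, (c, n) in sorted(totals.items())}


# Toy records for illustration only.
toy_records = [
    {"persons": 1, "objects": 2, "obstacles": 0, "correct": True},
    {"persons": 1, "objects": 1, "obstacles": 1, "correct": True},
    {"persons": 2, "objects": 2, "obstacles": 1, "correct": False},
    {"persons": 2, "objects": 1, "obstacles": 0, "correct": True},
]

by_persons = accuracy_by_slice(toy_records, "persons")
print(by_persons)
```

The same function can be called with `"objects"` or `"obstacles"` as the key to produce the other slices.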

To Cite:
