What is VPT?
Visual Perspective-Taking (VPT) is the ability to understand and predict actions based on another person's viewpoint.
Why it Matters:
Humans develop VPT early in life, and it is crucial for avoiding accidents because it lets us understand what others can see. But can Vision Language Models (VLMs) do the same?
Our Contribution:
We introduce two new datasets, Isle-Brick and Isle-Dots, to test VPT in VLMs and evaluate 12 commonly used models.
Key Findings:
Performance Drop: All models showed a significant drop in performance when VPT was required.
No Strong Correlation: Success in object detection does not predict VPT performance, suggesting current benchmarks are insufficient.
Models struggle with VPT in scenes with more than one person.
We report performance on data slices with varying counts of persons (P), objects (O), and obstacles (S).
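As a rough illustration of what reporting performance per data slice can look like, the sketch below groups per-example results by the number of persons (P), objects (O), or obstacles (S) and computes accuracy for each group. The record layout, the `accuracy_by` helper, and every value in `records` are hypothetical placeholders chosen for this example; they are not taken from Isle-Brick, Isle-Dots, or the reported results.

```python
from collections import defaultdict

# Hypothetical per-example records: counts of persons (P), objects (O),
# obstacles (S), and whether the model answered correctly.
# These rows are made-up placeholders, not results from the evaluation.
records = [
    {"P": 1, "O": 1, "S": 0, "correct": True},
    {"P": 1, "O": 2, "S": 1, "correct": True},
    {"P": 1, "O": 1, "S": 1, "correct": False},
    {"P": 2, "O": 1, "S": 0, "correct": False},
    {"P": 2, "O": 2, "S": 1, "correct": True},
    {"P": 2, "O": 1, "S": 1, "correct": False},
]


def accuracy_by(records, key):
    """Return {slice value: accuracy} for the given slice key (e.g. "P")."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        hits[r[key]] += int(r["correct"])
    return {value: hits[value] / totals[value] for value in sorted(totals)}


# Print one accuracy table per slicing dimension.
for key in ("P", "O", "S"):
    print(key, accuracy_by(records, key))
```

Slicing along one dimension at a time keeps each group reasonably large; slicing on combinations such as (P, S) would work the same way with a tuple key, at the cost of fewer examples per group.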
To Cite: