X-VoE: Measuring eXplanatory Violation of Expectation in Physical Events

Bo Dai¹,², Linge Wang³, Baoxiong Jia², Zeyu Zhang², Song-Chun Zhu¹,²,³, Chi Zhang²,🖂, Yixin Zhu⁴,🖂

¹School of Intelligence Science and Technology, Peking University ²Beijing Institute for General Artificial Intelligence

³Department of Automation, Tsinghua University ⁴Institute for Artificial Intelligence, Peking University

ICCV 2023 Oral

[arXiv] [github] [datasets] [PKU CoRe Lab]

Abstract

Intuitive physics is pivotal for human understanding of the physical world, enabling prediction and interpretation of events even in infancy. Nonetheless, replicating this level of intuitive physics in artificial intelligence (AI) remains a formidable challenge. This study introduces X-VoE, a comprehensive benchmark dataset, to assess AI agents’ grasp of intuitive physics. Built on the Violation of Expectation (VoE) paradigm rooted in developmental psychology, X-VoE establishes a higher bar for the explanatory capacities of intuitive physics models. Each VoE scenario within X-VoE encompasses three distinct settings, probing models’ comprehension of events and their underlying explanations. Beyond model evaluation, we present an explanation-based learning system that captures physics dynamics and infers occluded object states solely from visual sequences, without explicit occlusion labels. Experimental results show that our model aligns with human commonsense when tested against X-VoE. Notably, our model can visually explain VoE events by reconstructing the concealed scenes. We conclude by discussing the implications of these findings and outlining future research directions. Through X-VoE, we aim to catalyze the advancement of AI endowed with human-like intuitive physics capabilities.

Figure 1. Evaluation settings in the ball-blocking example scenario of X-VoE. The explanation video illustrates potential hidden dynamics. Circles denote no surprise, and exclamation marks indicate surprise. In the predictive setup (S1), the pair can be solved without any explanation: predicting the dynamics of the observed entities suffices to reason about the outcome. In the hypothetical setup (S2), the direction of the outgoing balls might seem surprising, yet alternative explanations exist, e.g., a hidden blocker behind the wall that makes the ball rebound. Because a random agent's scores would likewise show negligible difference between the two videos, S2 alone cannot confirm explanatory ability. The explicative setup (S3) is therefore needed to discern surprises; it demands an explanatory ability that predictive-only and random agents lack.
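
One way to read the three settings in Figure 1 is as comparisons between the surprise scores a model assigns to the two videos of a pair. The sketch below illustrates that reading only; it is not the benchmark's official metric. The function names, the convention that a lower score means less surprise, and all score values are hypothetical assumptions made for illustration.

```python
from statistics import mean


def pairwise_accuracy(pairs):
    """Fraction of pairs where the expected video scores lower than the surprising one.

    Each pair is (score_expected, score_surprising). This is the kind of check
    that applies to S1 and S3, where one video of the pair should be surprising.
    """
    return mean(1.0 if expected < surprising else 0.0 for expected, surprising in pairs)


def mean_abs_gap(pairs):
    """Average absolute score difference within each pair.

    In S2 both videos admit an explanation, so a small gap is the desired
    outcome. A random agent also produces a small gap on average, which is
    why S2 by itself cannot separate explanatory from random behaviour.
    """
    return mean(abs(a - b) for a, b in pairs)


# Hypothetical surprise scores, invented purely for this example.
s1_pairs = [(0.1, 0.9), (0.2, 0.7)]    # predictive setup
s2_pairs = [(0.2, 0.25), (0.5, 0.4)]   # hypothetical setup
s3_pairs = [(0.1, 0.8), (0.4, 0.3)]    # explicative setup

print("S1 accuracy:", pairwise_accuracy(s1_pairs))
print("S2 mean gap:", mean_abs_gap(s2_pairs))
print("S3 accuracy:", pairwise_accuracy(s3_pairs))
```

Under this reading, an agent with real explanatory ability would need both a small gap in S2 and a high pairwise accuracy in S3, whereas a random agent would satisfy only the first condition.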

Demo
