Dataset

Our X-VoE dataset comprises four scenarios: ball collision, ball blocking, object permanence, and object continuity. To evaluate a range of intuitive physics principles, each scenario except object permanence includes three settings: predictive, hypothetical, and explicative, as illustrated in Fig. 2. For each setting, we procedurally generate 1,000 scene pairs using Unreal Engine 4. X-VoE is intended primarily as a test suite for evaluating intuitive physics understanding; it places no constraints on the data used to train models.
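To make the layout concrete, the scenario-by-setting grid can be sketched as below. This is only an illustrative sketch, not the dataset's actual file format or loader: the names `SETTINGS_BY_SCENARIO` and `iter_specified_pairs` are hypothetical, and because the text does not state which settings the object permanence scenario includes, that entry is left unspecified rather than guessed.

```python
from typing import Dict, Iterator, Optional, Tuple

# Scenario and setting names as given in the text.
SCENARIOS = ("ball collision", "ball blocking", "object permanence", "object continuity")
SETTINGS = ("predictive", "hypothetical", "explicative")
PAIRS_PER_SETTING = 1000  # scene pairs generated per setting

# Hypothetical mapping from scenario to its settings. The text only says that
# object permanence does not comprise all three settings; which ones it has is
# not stated, so it is marked None (unspecified) instead of being filled in.
SETTINGS_BY_SCENARIO: Dict[str, Optional[Tuple[str, ...]]] = {
    name: (None if name == "object permanence" else SETTINGS)
    for name in SCENARIOS
}


def iter_specified_pairs() -> Iterator[Tuple[str, str, int]]:
    """Yield (scenario, setting, pair_index) for fully specified scenarios."""
    for scenario, settings in SETTINGS_BY_SCENARIO.items():
        if settings is None:
            continue  # settings not stated in the text; skip rather than guess
        for setting in settings:
            for pair_index in range(PAIRS_PER_SETTING):
                yield scenario, setting, pair_index


# Three scenarios x three settings x 1,000 pairs, excluding object permanence.
specified_pair_count = sum(1 for _ in iter_specified_pairs())
print(specified_pair_count)  # 9000
```

The count printed here covers only the three scenarios whose settings are fully specified above; the object permanence scene pairs come in addition to it.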

Figure 2. Testing scenarios in X-VoE: ball collision, ball blocking, object permanence, and object continuity. Within each scenario, frames in a testing video are linked by the same setup identification number (e.g., S1). Black links denote non-surprising videos, while red links indicate surprising ones. Notably, certain videos become non-surprising only once an explanation is inferred. For example, in the right S2 branch of the object permanence scenario, three cubes on the floor are non-surprising because the preceding observation of two cubes dropping suggests that a third cube was hidden behind the wall.