Method

Our proposed XPL model draws inspiration from theories of infant cognition in developmental psychology. As depicted in Fig. 3, the XPL model comprises three key components: (1) a perception module that extracts object-centric representations for downstream processing, (2) a reasoning module that infers the states of occluded objects from both spatial and temporal context, and (3) a dynamics module that learns intuitive physical knowledge and evaluates the inferred states of occluded objects.

Figure 3. Overview of the XPL model for explanation-based physics learning. The model comprises three key modules: (i) the perception module, which extracts object-centric representations from RGBD videos and segmentation masks; (ii) the reasoning module, which uses two Transformer networks to infer representations of occluded objects; (iii) the dynamics module, which acquires intuitive physical knowledge and refines reasoning outcomes to align with intuitive physics. Additionally, the inferred object representations can be visualized using the decoder from the perception module, offering a visual explanation of events occurring behind the wall. Wavy curves indicate masking. See the text for details.
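To make the data flow between the three modules concrete, the following is a minimal toy sketch and not the XPL implementation. All names (`perceive`, `reason`, `dynamics_error`, `run_pipeline`) are hypothetical. The learned components are replaced by simple stand-ins chosen purely for illustration: the perception module is an identity pass-through, the two Transformer networks are replaced by linear interpolation over time, and the dynamics module is replaced by a constant-velocity consistency score. Occluded frames are represented as `None`, playing the role of the masked inputs in Fig. 3.

```python
from typing import List, Optional

Latent = List[float]              # hypothetical per-object latent vector
Track = List[Optional[Latent]]    # one object over time; None = occluded frame


def perceive(observations: List[Optional[List[float]]]) -> Track:
    """Perception stand-in: copy visible features, keep occluded frames as None."""
    return [None if obs is None else list(obs) for obs in observations]


def reason(track: Track) -> List[Latent]:
    """Reasoning stand-in: fill each occluded frame by linear interpolation
    between the nearest visible frames (assumption: replaces the Transformers)."""
    visible = [t for t, z in enumerate(track) if z is not None]
    if not visible:
        raise ValueError("at least one visible frame is required")
    filled: List[Latent] = []
    for t, z in enumerate(track):
        if z is not None:
            filled.append(list(z))
            continue
        prev = max((v for v in visible if v < t), default=None)
        nxt = min((v for v in visible if v > t), default=None)
        if prev is None:        # occluded at the start: copy next visible frame
            filled.append(list(track[nxt]))
        elif nxt is None:       # occluded at the end: copy last visible frame
            filled.append(list(track[prev]))
        else:                   # occluded in between: interpolate in time
            w = (t - prev) / (nxt - prev)
            filled.append([(1 - w) * a + w * b
                           for a, b in zip(track[prev], track[nxt])])
    return filled


def dynamics_error(track: List[Latent]) -> float:
    """Dynamics stand-in: mean squared residual of a constant-velocity
    prediction z[t+1] ~= 2*z[t] - z[t-1] (assumption: a toy physical prior).
    Lower values mean the inferred track is more consistent with that prior."""
    errors = []
    for t in range(1, len(track) - 1):
        for prev, cur, nxt in zip(track[t - 1], track[t], track[t + 1]):
            errors.append((nxt - (2 * cur - prev)) ** 2)
    return sum(errors) / len(errors) if errors else 0.0


def run_pipeline(observations: List[Optional[List[float]]]):
    """Perception -> reasoning -> dynamics, mirroring the module order in Fig. 3."""
    track = perceive(observations)
    inferred = reason(track)
    return inferred, dynamics_error(inferred)


# An object moving at constant speed that is hidden for two frames.
inferred, error = run_pipeline([[0.0], [1.0], None, None, [4.0]])
```

In this toy setting the score only evaluates the inferred track. In the model described above, the dynamics module additionally refines the reasoning outcomes so that they align with intuitive physics.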