Method

Our proposed XPL model is inspired by theories of infancy from developmental psychology. As shown in Fig. 3, XPL comprises three components: (1) a perception module that extracts object-centric representations for downstream processing, (2) a reasoning module that infers the states of occluded objects from both spatial and temporal context, and (3) a dynamics module that learns physical knowledge and evaluates the states inferred for occluded objects.
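
To make the data flow between the three components concrete, the following is a minimal sketch and not the XPL implementation. Every name in it (`ObjectState`, `perceive`, `reason`, `predict_next`, `evaluate`, `run`) is hypothetical. As simplifying assumptions, each frame is given as a mapping from object id to a 2D position (or `None` when the object is occluded), a constant-velocity rule stands in for the learned dynamics, and the reasoning step uses temporal context only, leaving out spatial context.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Vec = Tuple[float, float]
# Hypothetical input format: object id -> visible position, or None if occluded.
Frame = Dict[str, Optional[Vec]]


@dataclass
class ObjectState:
    position: Optional[Vec]   # None while the object's state is unknown
    inferred: bool = False    # True if filled in by the reasoning step


def perceive(frame: Frame) -> Dict[str, ObjectState]:
    """Stand-in for the perception module: one state per object."""
    return {oid: ObjectState(pos) for oid, pos in frame.items()}


def predict_next(prev: Vec, curr: Vec) -> Vec:
    """Stand-in for the dynamics module: constant-velocity extrapolation (assumption)."""
    return (2 * curr[0] - prev[0], 2 * curr[1] - prev[1])


def reason(history: List[Dict[str, ObjectState]],
           current: Dict[str, ObjectState]) -> Dict[str, ObjectState]:
    """Stand-in for the reasoning module: fill occluded states from the last two steps."""
    out: Dict[str, ObjectState] = {}
    for oid, state in current.items():
        if state.position is None and len(history) >= 2:
            prev = history[-2].get(oid)
            curr = history[-1].get(oid)
            if prev and curr and prev.position is not None and curr.position is not None:
                state = ObjectState(predict_next(prev.position, curr.position), inferred=True)
        out[oid] = state
    return out


def evaluate(prev: Vec, curr: Vec, inferred: Vec) -> float:
    """Score an inferred position by its squared distance to the dynamics prediction."""
    px, py = predict_next(prev, curr)
    return (px - inferred[0]) ** 2 + (py - inferred[1]) ** 2


def run(frames: List[Frame]) -> List[Dict[str, ObjectState]]:
    """Chain perception and reasoning over a sequence of frames."""
    history: List[Dict[str, ObjectState]] = []
    for frame in frames:
        history.append(reason(history, perceive(frame)))
    return history
```

In this sketch, calling `run` on a sequence in which an object is visible for two frames and then occluded yields an extrapolated state for the third frame with `inferred=True`, and `evaluate` returns a lower score the more closely an inferred position agrees with the dynamics prediction.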