Accepted Papers

  • Training Multi-Exit Architectures via Block-Dependent Losses for Anytime Inference: Dong-Jun Han (KAIST), Jungwuk Park (KAIST), Seokil Ham (KAIST), Namjin Lee (KAIST), Jaekyun Moon (KAIST). [PDF] [Spotlight Video]


Abstract: Multi-exit architectures are a promising solution for making adaptive predictions via early exits, depending on the current test-time budget, which may vary over time in practice (e.g., self-driving cars with dynamically changing speeds). In contrast to previous works, where each block is optimized to minimize the losses of all exits simultaneously, we propose a new method for training multi-exit architectures that imposes different objectives on individual blocks. Our key idea is to design block-dependent losses based on grouping and overlapping strategies, which enables the k-th block to focus more on reducing the loss of the adjacent k-th exit while not degrading the prediction performance at later exits. This improves the prediction performance at the earlier exits, making our scheme more suitable for low-latency applications with a tight test-time budget. Experimental results on both image classification and semantic segmentation confirm the advantage of our approach for anytime prediction.
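
Below is a minimal, hypothetical sketch (not the authors' released code) of how one training step with block-dependent losses could be wired up in PyTorch. It assumes the network is exposed as a list of `blocks` with one classifier head per block in `exits`, and it uses a simple overlapping grouping that pairs block k with exits {k, k+1} purely for illustration; the paper's actual grouping and overlapping strategies may differ.

```python
# Hypothetical sketch of block-dependent training: block k is optimized only on
# the losses of its overlapping exit group {k, k+1}, while each exit head is
# trained on its own exit's loss. All names here are illustrative assumptions.
import torch
import torch.nn as nn

def block_dependent_step(blocks, exits, x, y, optimizer,
                         criterion=nn.CrossEntropyLoss()):
    optimizer.zero_grad()

    # Forward once through all blocks, collecting the loss at every exit.
    feats, exit_losses = x, []
    for block, exit_head in zip(blocks, exits):
        feats = block(feats)
        exit_losses.append(criterion(exit_head(feats), y))

    num_blocks = len(blocks)
    for k, block in enumerate(blocks):
        # Block k's objective: its adjacent exit k plus the next exit (overlap),
        # so improving exit k does not degrade predictions at later exits.
        group = exit_losses[k:min(k + 2, num_blocks)]
        group_loss = sum(group) / len(group)
        # Route gradients of the group loss to this block's parameters only.
        grads = torch.autograd.grad(group_loss, list(block.parameters()),
                                    retain_graph=True, allow_unused=True)
        for p, g in zip(block.parameters(), grads):
            if g is not None:
                p.grad = g if p.grad is None else p.grad + g

    # Each exit head is trained on its own exit's loss.
    for k, exit_head in enumerate(exits):
        grads = torch.autograd.grad(exit_losses[k], list(exit_head.parameters()),
                                    retain_graph=True, allow_unused=True)
        for p, g in zip(exit_head.parameters(), grads):
            if g is not None:
                p.grad = g if p.grad is None else p.grad + g

    optimizer.step()
    return [loss.item() for loss in exit_losses]
```

The per-block gradient routing via `torch.autograd.grad` is what gives each block its own objective: block k is pulled mainly toward its adjacent exit, while the overlap with the next exit keeps later predictions from degrading.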


  • ComPhy: Compositional Physical Reasoning of Objects and Events from Videos: Zhenfang Chen (MIT-IBM Watson AI Lab), Kexin Yi (Harvard University), Yunzhu Li (MIT), Mingyu Ding (The University of Hong Kong), Antonio Torralba (MIT), Joshua Tenenbaum (MIT), Chuang Gan (MIT-IBM Watson AI Lab). [PDF] [Supp]


Abstract: Objects’ motions in the physical world are governed by complex interactions between the objects and their properties. While some properties, such as shape and material, can be identified from an object’s visual appearance, others, such as mass and electric charge, are not directly visible. The compositionality between the visible and hidden properties poses unique challenges for AI models reasoning about the physical world, whereas humans can effortlessly infer them from limited observations. Existing studies on video reasoning mainly focus on visually observable elements such as object appearance, movement, and contact interaction. In this paper, we take an initial step toward highlighting the importance of inferring hidden physical properties that are not directly observable from visual appearances, by introducing the Compositional Physical Reasoning (ComPhy) dataset. For a given set of objects, ComPhy includes a few videos of them moving and interacting under different initial conditions. The model is evaluated on its capability to unravel the compositional hidden properties, such as mass and charge, and to use this knowledge to answer a set of questions posed about one of the videos. Evaluation of several state-of-the-art video reasoning models on ComPhy shows unsatisfactory performance, as they fail to capture these hidden properties. We further propose an oracle neural-symbolic framework named Compositional Physics Learner (CPL), which combines visual perception, physical property learning, dynamic prediction, and symbolic execution into a unified framework. CPL can effectively identify objects’ physical properties from their interactions and predict their dynamics to answer questions.
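
As an illustration only, the sketch below shows one way a neuro-symbolic pipeline in the spirit of CPL could be organized: perception produces object tracks, a property learner fills in hidden attributes such as mass and charge from the few reference videos, and a small symbolic executor answers questions over them. All class and function names are invented for this sketch and are not the authors' implementation.

```python
# Hypothetical sketch of a CPL-style neuro-symbolic pipeline; every name and
# data structure here is an assumption made for illustration.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ObjectTrack:
    object_id: int
    appearance: Dict[str, str]                  # visible properties, e.g. {"shape": "cube"}
    trajectory: List[Tuple[int, float, float]]  # (frame, x, y) from the perception module
    hidden: Dict[str, float] = field(default_factory=dict)  # e.g. {"mass": 2.0, "charge": -1.0}

def infer_hidden_properties(reference_videos: List[List[ObjectTrack]]) -> Dict[int, Dict[str, float]]:
    """Aggregate per-object hidden properties inferred from the few reference videos.
    A real property learner would fit these so a physics model explains the observed
    interactions; here we simply collect whatever was already inferred."""
    properties: Dict[int, Dict[str, float]] = {}
    for video in reference_videos:
        for obj in video:
            properties.setdefault(obj.object_id, dict(obj.hidden))
    return properties

def charge_sign(charge: float) -> str:
    return "positive" if charge > 0 else "negative" if charge < 0 else "neutral"

def execute_program(program: List[str], objects: List[ObjectTrack],
                    properties: Dict[int, Dict[str, float]]) -> str:
    """Run a tiny symbolic program, e.g. ["filter_charge:negative", "count"],
    over the perceived objects and their inferred hidden properties."""
    selected = objects
    for op in program:
        if op.startswith("filter_charge:"):
            wanted = op.split(":", 1)[1]
            selected = [o for o in selected
                        if charge_sign(properties[o.object_id].get("charge", 0.0)) == wanted]
        elif op == "count":
            return str(len(selected))
    return "unknown"
```

The point of the sketch is the division of labor: visible attributes and trajectories come from perception, hidden properties are inferred from the reference videos, and the question is answered by executing a symbolic program over both.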