SAVE: Spatial-Attention Visual Exploration
Xinyi Yang*, Chao Yu*, Jiaxuan Gao*, Yu Wang, Huazhong Yang
Tsinghua University
Visual indoor exploration requires an agent to explore a room within a limited time budget. We introduce Spatial Attention Visual Exploration (SAVE), which builds on Active Neural SLAM (ANS). Specifically, we propose a novel RL-based global planner named Spatial Global Policy (SGP) that exploits spatial information to promote efficient exploration through global goal guidance. SGP has two major components: a transformer-based spatial-attention module that encodes the spatial interrelations between the agent and different regions to perform spatial reasoning, and a hierarchical spatial action selector that infers global goals for fast training. The map representations are aligned through our spatial adjustor. Experiments on the photo-realistic Habitat simulator demonstrate that SAVE outperforms current planning-based methods and RL variants, reducing exploration steps by at least 10% and the repeat ratio by 15%, while running 2x to 4x faster than planning-based methods.
Fig.1: Visual Exploration
Spatial Attention Visual Exploration (SAVE) consists of 4 components:
Neural SLAM
Spatial Adjustor
Spatial Global Policy
Local Planner & Local Policy
The SAVE framework is illustrated in Fig. 2. The agent passes its pose sensor readings and RGB image to the neural SLAM module, which produces an agent-centric local map and a pose estimate. The spatial adjustor then splices the local map into the global map and normalizes it, and SGP applies a transformer-based spatial attention module and a spatial action selector to generate a global goal. Next, the local planner computes a trajectory and a short-term goal on the global map, and finally the local policy outputs an action; a minimal sketch of this per-step control flow follows Fig. 2. Note that this work focuses on global goal selection, so we directly reuse the neural SLAM module and the local policy from ANS.
Fig.2: Overview of the Spatial Attention Visual Exploration.
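To make the control flow concrete, below is a minimal per-step sketch of the pipeline. All module names and signatures (neural_slam, spatial_adjustor, sgp, local_planner, local_policy, goal_interval) are illustrative placeholders, not the released code.

```python
# Minimal sketch of SAVE's per-step control flow. Module interfaces are
# hypothetical placeholders, not the released API.

def save_step(rgb, pose_sensor, state):
    # 1. Neural SLAM: agent-centric local map and corrected pose estimate
    local_map, pose = state.neural_slam(rgb, pose_sensor)

    # 2. Spatial adjustor: splice the local map into the global map and
    #    align/normalize the coordinate frame
    state.global_map = state.spatial_adjustor(local_map, pose, state.global_map)

    # 3. Spatial Global Policy: sample a long-term global goal, re-planned
    #    every `goal_interval` steps and reused in between (as in ANS)
    if state.t % state.goal_interval == 0:
        state.global_goal = state.sgp(state.global_map, pose)

    # 4. Local planner: path on the global map -> short-term goal
    short_term_goal = state.local_planner(state.global_map, pose, state.global_goal)

    # 5. Local policy: low-level action toward the short-term goal
    action = state.local_policy(rgb, short_term_goal)

    state.t += 1
    return action
```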
The SGP module, shown in Fig. 3, has 3 parts: a Global Feature Extractor, a Transformer-based Spatial Attention Module, and a Spatial Action Selector; a hedged code sketch of the full module follows Fig. 3.
Global Feature Extractor: We employ a 5-layer CNN to extract spatial features from the global maps, outputting a 32x8x8 feature map.
Transformer-based Spatial Attention Module (SAM): A simplified transformer block serves as a spatial attention module over the 8x8 spatial feature map, extracting the interrelations between the agent and different regions.
Spatial Action Selector (SAS): We adopt a two-level action space to ease learning and capture spatial information: a discrete region head selects a grid cell g from the 8x8 feature map, and two continuous point heads choose the x and y coordinates within the cell chosen by the region head.
Fig.3: SGP workflow involving a global feature extractor, a spatial attention module and a spatial action selector.
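As a concrete reference, here is a hedged PyTorch sketch of SGP under stated assumptions: a 128x128 global map input so that four stride-2 convolutions yield an 8x8 grid, an assumed input channel count, and simple sigmoid point heads. The exact layer sizes and head parameterizations in the paper may differ.

```python
import torch
import torch.nn as nn

# Hedged sketch of the Spatial Global Policy (SGP): a 5-layer CNN feature
# extractor, a simplified transformer block as the spatial attention module
# (SAM), and the two-level spatial action selector (SAS).

class SGP(nn.Module):
    def __init__(self, in_channels=8, dim=32):
        super().__init__()
        # Global feature extractor: 5 conv layers -> 32 x 8 x 8 feature map
        # (assumes a 128x128 input; four stride-2 convs give 128 -> 8)
        chans = [in_channels, 32, 64, 64, 32, dim]
        layers = []
        for i in range(5):
            layers += [nn.Conv2d(chans[i], chans[i + 1], 3,
                                 stride=2 if i < 4 else 1, padding=1),
                       nn.ReLU()]
        self.cnn = nn.Sequential(*layers)

        # SAM: one simplified transformer block over the 64 = 8x8 grid tokens
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))

        # SAS: a discrete region head over the 64 grid cells, plus a
        # continuous head for the (x, y) offset inside the chosen cell
        self.region_head = nn.Linear(dim, 1)   # per-cell logit
        self.point_head = nn.Linear(dim, 2)    # per-cell (x, y) in [0, 1]

    def forward(self, global_maps):
        f = self.cnn(global_maps)                # (B, 32, 8, 8)
        tokens = f.flatten(2).transpose(1, 2)    # (B, 64, 32)
        h = self.norm1(tokens + self.attn(tokens, tokens, tokens)[0])
        h = self.norm2(h + self.mlp(h))          # (B, 64, 32)

        # Level 1: sample a coarse 8x8 grid cell
        region_logits = self.region_head(h).squeeze(-1)           # (B, 64)
        region = torch.distributions.Categorical(logits=region_logits).sample()

        # Level 2: continuous offset within the selected cell
        xy = torch.sigmoid(self.point_head(h))                    # (B, 64, 2)
        offset = xy[torch.arange(xy.size(0)), region]             # (B, 2)

        # Compose the global goal: coarse cell index + fine offset
        gy, gx = region // 8, region % 8
        goal = (torch.stack([gx, gy], dim=-1).float() + offset) / 8.0
        return goal    # normalized map coordinates in [0, 1)
```

Factorizing the goal into a coarse cell choice plus a fine offset keeps the discrete action space small (64 cells) while retaining continuous precision inside the cell.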
The spatial adjustor first splices the agent-centric local map into the agent-view global map, then transforms it into an aligned coordinate system. We also crop the map to enlarge the navigable region, helping SGP concentrate on useful regions (a sketch of these steps follows Fig. 4).
Fig.4: Workflow of the Spatial Adjustor.
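Below is a minimal NumPy sketch of the adjustor's three steps (splice, align, crop). The (C, H, W) map layout, the convention that channel 1 marks explored cells, and poses given in map cells with a heading in degrees are all assumptions; boundary handling is omitted for brevity.

```python
import numpy as np
from scipy.ndimage import rotate

# Hedged sketch of the spatial adjustor: splice the agent-centric local
# map into the global map, align orientation, then crop around the
# explored region. Layout conventions here are assumptions.

def spatial_adjust(local_map, global_map, pose, pad=10):
    """local_map: (C, l, l) agent-centric; global_map: (C, G, G);
    pose: (row, col, heading_deg) in global-map cells."""
    r, c, heading = pose
    l = local_map.shape[-1]

    # Align: rotate the agent-centric map into the global frame
    aligned = rotate(local_map, angle=-heading, axes=(1, 2),
                     reshape=False, order=0)

    # Splice: write the aligned window centered at the agent pose, taking
    # the element-wise max so previously mapped cells are not erased
    # (boundary clipping omitted for brevity)
    r0, c0 = int(r) - l // 2, int(c) - l // 2
    window = global_map[:, r0:r0 + l, c0:c0 + l]
    global_map[:, r0:r0 + l, c0:c0 + l] = np.maximum(window, aligned)

    # Crop: bound the explored area (channel 1 assumed to mark explored
    # cells) plus padding, so the policy concentrates on useful regions
    rows, cols = np.nonzero(global_map[1] > 0.5)
    r1, r2 = max(rows.min() - pad, 0), min(rows.max() + pad, global_map.shape[1])
    c1, c2 = max(cols.min() - pad, 0), min(cols.max() + pad, global_map.shape[2])
    return global_map, global_map[:, r1:r2, c1:c2]
```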
To demonstrate our method's effectiveness, we compare it against a utility-maximizing method (Utility), a nearest-frontier method (Nearest), an artificial potential field method (APF), and a rapidly-exploring-random-tree-based method (RRT). For a fair comparison, all methods share the same modules except the global planner, where we substitute SGP with the methods above.
Tab.1 reports the average performance of all competitors on the training maps, revealing that RRT attains the best results among the planning-based solutions, while SAVE outperforms RRT substantially. Specifically, SAVE has an on-par coverage ratio with RRT but achieves 8% higher AUC-R, requires roughly 40 fewer steps to reach 90% coverage, and affords a 15% lower repeat ratio, indicating that SAVE efficiently seeks out unexplored areas.
Tab.1: Training results compared with planning-based methods.
The planning-based and SAVE-TD results on unseen maps are reported in Tab.2, highlighting that SAVE-TD achieves the highest AUC-R (above 13), the fewest steps to 90% coverage, and the lowest repeat ratio, revealing that SAVE-TD has better generalization ability.
Tab.2: Evaluation results compared with planning-based methods.
Tab.3 demonstrates that SAVE requires at most half the run time of the competitors. In particular, RRT, the best-performing planning-based method, runs roughly 4x slower than SAVE, indicating that SAVE is more appealing for real-world systems.
Tab.3: Average performance on Run Time.
To evaluate the effect of each component within our architecture, we conduct two ablation experiments on the SAVE modules and three on the SGP components. We remark that ANS is the same as SAVE w.o. SA and SGP.
The training curves illustrated in Fig.5 focus on the Steps and Repeat Ratio, while Tab.4 reveals the final performance. Overall, SAVE performs best, while ANS and w.o. SA degrade the most. Compared with ANS, SAVE achieves 25+ higher AUC-R, 40+ fewer steps to 90% coverage, and a 17% lower repeat ratio. Moreover, w.o. SAM performs poorly due to the lack of spatial information derived from the attention mechanism. Additionally, although w.o. SAS (2 Heads) and w.o. SAS (Multi-Dis) reach comparable final performance thanks to spatial attention, their training converges much more slowly, suggesting that our hierarchical goal selection enables a more effective goal search.
Fig.5: Ablation studies on two typical training maps.
Tab.4: Training results compared with RL variants.
Tab.5 presents the performance on the unseen maps using a training-and-distillation scheme, revealing that SAVE-TD has the best generalization ability, with the highest AUC-R (above 11), the fewest steps, and the lowest repeat ratio.
Tab.5: Evaluation results compared with RL variants.
Fig.6: SAVE vs. RRT on the trained map Colebrook.
Fig.7: SAVE vs. RRT on the unseen map Rancocas.