SAVE: Spatial-Attention Visual Exploration
Xinyi Yang*, Chao Yu*, Jiaxuan Gao*, Yu Wang, Huazhong Yang
Tsinghua University
Visual indoor exploration requires an agent to explore a room within a limited time budget. We introduce Spatial Attention Visual Exploration (SAVE), which builds on Active Neural SLAM (ANS). Specifically, we propose a novel RL-based global planner named Spatial Global Policy (SGP) that exploits spatial information to promote efficient exploration through global goal guidance. SGP has two major components: a transformer-based spatial-attention module that encodes the spatial interrelations between the agent and different regions to perform spatial reasoning, and a hierarchical spatial action selector that infers global goals for fast training. The map representations are aligned through our spatial adjustor. Experiments on the photo-realistic Habitat simulator demonstrate that SAVE outperforms current planning-based methods and RL variants, reducing exploration steps by at least 10% and the repeat ratio by 15%, while running 2x to 4x faster than planning-based methods.
Fig.1: Visual Exploration
Spatial Attention Visual Exploration (SAVE) consists of 4 components:
Neural SLAM
Spatial Adjustor
Spatial Global Policy
Local Planner & Local Policy
The SAVE framework is illustrated in Fig. 2. The agent passes its pose sensor readings and RGB image to the neural SLAM module, which produces an agent-centric local map and a pose estimate. The spatial adjustor then splices the local map into the global map and normalizes it, and SGP applies a transformer-based spatial attention module and a spatial action selector to generate a global goal. Next, the local planner computes a trajectory and a short-term goal on the global map, and finally the local policy outputs an action; a minimal sketch of this per-step control flow follows Fig. 2. Note that this work focuses on global goal selection, so we directly reuse the neural SLAM module and the local policy from ANS.
Fig.2: Overview of the Spatial Attention Visual Exploration.
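To make the control flow concrete, below is a minimal per-step sketch of the pipeline. All module names and signatures (neural_slam, spatial_adjustor, sgp, local_planner, local_policy, goal_interval) are illustrative placeholders, not the released code.

```python
# Minimal sketch of SAVE's per-step control flow. Module interfaces are
# hypothetical placeholders, not the released API.

def save_step(rgb, pose_sensor, state):
    # 1. Neural SLAM: agent-centric local map and corrected pose estimate
    local_map, pose = state.neural_slam(rgb, pose_sensor)

    # 2. Spatial adjustor: splice the local map into the global map and
    #    align/normalize the coordinate frame
    state.global_map = state.spatial_adjustor(local_map, pose, state.global_map)

    # 3. Spatial Global Policy: sample a long-term global goal, re-planned
    #    every `goal_interval` steps and reused in between (as in ANS)
    if state.t % state.goal_interval == 0:
        state.global_goal = state.sgp(state.global_map, pose)

    # 4. Local planner: path on the global map -> short-term goal
    short_term_goal = state.local_planner(state.global_map, pose, state.global_goal)

    # 5. Local policy: low-level action toward the short-term goal
    action = state.local_policy(rgb, short_term_goal)

    state.t += 1
    return action
```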
The SGP module, shown in Fig. 3, has 3 parts: a Global Feature Extractor, a Transformer-based Spatial Attention Module, and a Spatial Action Selector; a hedged code sketch of the full module follows Fig. 3.
Global Feature Extractor: We employ a 5-layer CNN to extract spatial features from the global maps, outputting a 32x8x8 feature map.
Transformer-based Spatial Attention Module (SAM): A simplified transformer block serves as a spatial attention module over the 8x8 spatial feature map, extracting the interrelations between the agent and different regions.
Spatial Action Selector (SAS): We adopt a two-level action space to ease learning and capture spatial information: a discrete region head selects a grid cell g from the 8x8 feature map, and two continuous point heads choose the x and y coordinates within the cell chosen by the region head.
Fig.3: SGP workflow involving a global feature extractor, a spatial attention module and a spatial action selector.
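As a concrete reference, here is a hedged PyTorch sketch of SGP under stated assumptions: a 128x128 global map input so that four stride-2 convolutions yield an 8x8 grid, an assumed input channel count, and simple sigmoid point heads. The exact layer sizes and head parameterizations in the paper may differ.

```python
import torch
import torch.nn as nn

# Hedged sketch of the Spatial Global Policy (SGP): a 5-layer CNN feature
# extractor, a simplified transformer block as the spatial attention module
# (SAM), and the two-level spatial action selector (SAS).

class SGP(nn.Module):
    def __init__(self, in_channels=8, dim=32):
        super().__init__()
        # Global feature extractor: 5 conv layers -> 32 x 8 x 8 feature map
        # (assumes a 128x128 input; four stride-2 convs give 128 -> 8)
        chans = [in_channels, 32, 64, 64, 32, dim]
        layers = []
        for i in range(5):
            layers += [nn.Conv2d(chans[i], chans[i + 1], 3,
                                 stride=2 if i < 4 else 1, padding=1),
                       nn.ReLU()]
        self.cnn = nn.Sequential(*layers)

        # SAM: one simplified transformer block over the 64 = 8x8 grid tokens
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))

        # SAS: a discrete region head over the 64 grid cells, plus a
        # continuous head for the (x, y) offset inside the chosen cell
        self.region_head = nn.Linear(dim, 1)   # per-cell logit
        self.point_head = nn.Linear(dim, 2)    # per-cell (x, y) in [0, 1]

    def forward(self, global_maps):
        f = self.cnn(global_maps)                # (B, 32, 8, 8)
        tokens = f.flatten(2).transpose(1, 2)    # (B, 64, 32)
        h = self.norm1(tokens + self.attn(tokens, tokens, tokens)[0])
        h = self.norm2(h + self.mlp(h))          # (B, 64, 32)

        # Level 1: sample a coarse 8x8 grid cell
        region_logits = self.region_head(h).squeeze(-1)           # (B, 64)
        region = torch.distributions.Categorical(logits=region_logits).sample()

        # Level 2: continuous offset within the selected cell
        xy = torch.sigmoid(self.point_head(h))                    # (B, 64, 2)
        offset = xy[torch.arange(xy.size(0)), region]             # (B, 2)

        # Compose the global goal: coarse cell index + fine offset
        gy, gx = region // 8, region % 8
        goal = (torch.stack([gx, gy], dim=-1).float() + offset) / 8.0
        return goal    # normalized map coordinates in [0, 1)
```

Factorizing the goal into a coarse cell choice plus a fine offset keeps the discrete action space small (64 cells) while retaining continuous precision inside the cell.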
The spatial adjustor first splices the agent-centric local map into the agent-view global map, then transforms it into an aligned coordinate system. We also crop the map to enlarge the navigable region, helping SGP concentrate on useful regions (a sketch of these steps follows Fig. 4).
Fig.4: Workflow of the Spatial Adjustor.
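Below is a minimal NumPy sketch of the adjustor's three steps (splice, align, crop). The (C, H, W) map layout, the convention that channel 1 marks explored cells, and poses given in map cells with a heading in degrees are all assumptions; boundary handling is omitted for brevity.

```python
import numpy as np
from scipy.ndimage import rotate

# Hedged sketch of the spatial adjustor: splice the agent-centric local
# map into the global map, align orientation, then crop around the
# explored region. Layout conventions here are assumptions.

def spatial_adjust(local_map, global_map, pose, pad=10):
    """local_map: (C, l, l) agent-centric; global_map: (C, G, G);
    pose: (row, col, heading_deg) in global-map cells."""
    r, c, heading = pose
    l = local_map.shape[-1]

    # Align: rotate the agent-centric map into the global frame
    aligned = rotate(local_map, angle=-heading, axes=(1, 2),
                     reshape=False, order=0)

    # Splice: write the aligned window centered at the agent pose, taking
    # the element-wise max so previously mapped cells are not erased
    # (boundary clipping omitted for brevity)
    r0, c0 = int(r) - l // 2, int(c) - l // 2
    window = global_map[:, r0:r0 + l, c0:c0 + l]
    global_map[:, r0:r0 + l, c0:c0 + l] = np.maximum(window, aligned)

    # Crop: bound the explored area (channel 1 assumed to mark explored
    # cells) plus padding, so the policy concentrates on useful regions
    rows, cols = np.nonzero(global_map[1] > 0.5)
    r1, r2 = max(rows.min() - pad, 0), min(rows.max() + pad, global_map.shape[1])
    c1, c2 = max(cols.min() - pad, 0), min(cols.max() + pad, global_map.shape[2])
    return global_map, global_map[:, r1:r2, c1:c2]
```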
To demonstrate our method's effectiveness, we compare it against a utility-maximizing method (Utility), a nearest-frontier method (Nearest), an artificial potential field method (APF), and a rapidly-exploring-random-tree-based method (RRT). For a fair comparison, all methods share the same modules except the global planner, where we substitute SGP with the methods above.
Tab.1 reports the average performance of all competitors on the training maps, revealing that RRT attains the best results among the planning-based solutions, while SAVE outperforms RRT substantially. Specifically, SAVE has an on-par coverage ratio with RRT but achieves 8% higher AUC-R, requires roughly 40 fewer steps to reach 90% coverage, and affords a 15% lower repeat ratio, indicating that SAVE efficiently seeks out unexplored areas.
Tab.1: Training results compared with planning-based methods.
The planning-based and SAVE-TD results on unseen maps are reported in Tab.2, highlighting that SAVE-TD achieves the highest AUC-R (above 13), the fewest steps to 90% coverage, and the lowest repeat ratio, revealing that SAVE-TD has better generalization ability.
Tab.2: Evaluation results compared with planning-based methods.
Tab.3 demonstrates that SAVE requires at most half the run time of the competitors. In particular, RRT, the best-performing planning-based method, runs roughly 4x slower than SAVE, indicating that SAVE is more appealing for real-world systems.
Tab.3: Average performance on Run Time.
To evaluate the effect of each component within our architecture, we conduct two ablation experiments on the SAVE modules and three on the SGP components. We remark that ANS is the same as SAVE w.o. SA and SGP.
The training curves illustrated in Fig.5 focus on the Steps and Repeat Ratio, while Tab.4 reveals the final performance. Overall, SAVE performs best, while ANS and w.o. SA degrade the most. Compared with ANS, SAVE achieves 25+ higher AUC-R, 40+ fewer steps to 90% coverage, and a 17% lower repeat ratio. Moreover, w.o. SAM performs poorly due to the lack of spatial information derived from the attention mechanism. Additionally, although w.o. SAS (2 Heads) and w.o. SAS (Multi-Dis) reach comparable final performance thanks to spatial attention, their training converges much more slowly, suggesting that our hierarchical goal selection enables a more effective goal search.
Fig.5: Ablation studies on two typical training maps.
Tab.4: Training results compared with RL variants.
Tab.5 presents the performance on the unseen maps using a training-and-distillation scheme, revealing that SAVE-TD has the best generalization ability, with the highest AUC-R (above 11), the fewest steps, and the lowest repeat ratio.
Tab.5: Evaluation results compared with RL variants.
Fig.6: SAVE vs. RRT on the trained map Colebrook.
Fig.7: SAVE vs. RRT on the unseen map Rancocas.