Learning Efficient Multi-Agent Cooperative Visual Exploration
Accepted by ECCV
Chao Yu*, Xinyi Yang*, Jiaxuan Gao*, Huazhong Yang, Yu Wang, Yi Wu
Tsinghua University
Fig. 1 Multi-agent Visual Exploration
We consider the task of visual indoor exploration with multiple agents, shown in Fig. 1, where the agents need to cooperatively explore the entire indoor region using as few steps as possible. We extend the state-of-the-art single-agent RL solution, Active Neural SLAM (ANS), to the multi-agent setting by introducing a novel RL-based global-goal planner, the Multi-agent Spatial Planner (MSP), which leverages spatial information from each individual agent in an end-to-end manner and effectively guides the agents to navigate towards different spatial goals with high exploration efficiency. Our solution, Multi-Agent Active Neural SLAM (MAANS), substantially outperforms a variety of planning-based methods and RL baselines in the photo-realistic physical testbed Habitat.
Multi-Agent Active Neural SLAM (MAANS) consists of 4 components:
Neural SLAM
Map Refiner & Map Merger
Multi-agent Spatial Planner
Local Planner & Local Policy
An overview of MAANS is shown in Fig. 2, where each agent follows a modular structure. When an agent receives visual and pose sensory signals from the environment, the Neural SLAM module corrects the sensor error and performs SLAM to build a top-down 2D occupancy map that includes the explored area and discovered obstacles. A Map Refiner then rotates each agent's egocentric local map to a global coordinate system. We augment these refined maps with each agent's trajectory information and feed these spatial inputs, along with other agent-specific information, to our core planning module, the Multi-agent Spatial Planner (MSP), which generates a global goal as the long-term navigation target for each individual agent. To effectively reach a global goal, the agent first plans a path to this long-term goal in a manually merged global map using the Local Planner and generates a sequence of short-term sub-goals. Finally, given a short-term sub-goal, a Local Policy outputs the final navigation action based on the visual input, the relative spatial distance, and the relative angle to the sub-goal.
Fig.2 Overview of Multi-Agent Active Neural SLAM (MAANS)
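To make the modular workflow above concrete, the following is a minimal per-step sketch in Python. All module interfaces (neural_slam, map_refiner, map_merger, msp, local_planner, local_policy) and the re-planning interval are hypothetical placeholders for illustration, not the released implementation.

```python
from typing import Callable, List, Sequence

def maans_step(
    obs: Sequence,            # per-agent visual + pose sensor readings
    goals: List,              # per-agent long-term global goals (updated in place)
    neural_slam: Callable,    # obs_i -> (local_map_i, pose_i)
    map_refiner: Callable,    # (local_map_i, pose_i) -> global_map_i
    map_merger: Callable,     # [global_map_i] -> merged team map
    msp: Callable,            # (global_maps, poses, agent_id) -> global goal
    local_planner: Callable,  # (merged_map, pose_i, goal_i) -> short-term sub-goal
    local_policy: Callable,   # (obs_i, pose_i, sub_goal_i) -> navigation action
    t: int,
    goal_interval: int = 15,  # assumed re-planning interval for MSP
):
    # 1. Each agent corrects its sensor readings and updates its egocentric map.
    local_maps, poses = zip(*[neural_slam(o) for o in obs])
    # 2. The map refiner normalizes every map into a shared global coordinate frame.
    global_maps = [map_refiner(m, p) for m, p in zip(local_maps, poses)]
    # 3. The map merger fuses the refined maps into one team map for local planning.
    merged = map_merger(list(global_maps))
    actions = []
    for i in range(len(obs)):
        # 4. MSP periodically assigns a new long-term global goal to each agent.
        if t % goal_interval == 0:
            goals[i] = msp(global_maps, poses, agent_id=i)
        # 5. The local planner produces a short-term sub-goal towards the global goal,
        #    and the local policy outputs the low-level navigation action.
        sub_goal = local_planner(merged, poses[i], goals[i])
        actions.append(local_policy(obs[i], poses[i], sub_goal))
    return actions
```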
The MSP module, shown in Fig. 3, has 3 parts, i.e., a CNN-based Feature Extractor, a Spatial-TeamFormer for representation learning and an Action Generator.
CNN-based Feature Extractor: We adopt a weight-sharing CNN network to process each agent’s input map, which produces a G × G feature map with D channels. Besides the CNN spatial maps, we also introduce additional features, including an agent-specific embedding of the agent’s current position and grid features, i.e., embeddings of each grid cell’s coordinate relative to the agent’s position, as well as an embedding of the previous global goal.
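Below is a minimal PyTorch sketch of such a weight-sharing extractor. The layer sizes, the number of map channels, and the use of adaptive pooling to reach the G × G resolution are our own illustrative assumptions (the previous-goal embedding is omitted for brevity).

```python
import torch
import torch.nn as nn

class MapFeatureExtractor(nn.Module):
    """Sketch of the weight-sharing CNN feature extractor.
    Sizes (in_channels=4, D=32, G=8) and layer choices are assumptions."""
    def __init__(self, in_channels: int = 4, D: int = 32, G: int = 8):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, D, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(G),           # -> (B, D, G, G)
        )
        self.pos_embed = nn.Linear(2, D)        # embedding of the agent's position
        self.grid_embed = nn.Linear(2, D)       # embedding of each grid's relative coordinate

    def forward(self, agent_map, agent_pos, grid_rel_coords):
        # agent_map: (B, C, H, W); agent_pos: (B, 2); grid_rel_coords: (B, G, G, 2)
        feat = self.cnn(agent_map)                                   # (B, D, G, G)
        feat = feat + self.pos_embed(agent_pos)[:, :, None, None]    # add position embedding
        feat = feat + self.grid_embed(grid_rel_coords).permute(0, 3, 1, 2)  # add grid features
        return feat
```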
Spatial-TeamFormer: A Spatial-TeamFormer block consists of two layers, i.e., an Individual Spatial Encoder for capturing spatial features for each agent and a Team Relation Encoder for reasoning across agents. The Individual Spatial Encoder focuses only on spatial information by performing spatial self-attention over each agent’s own G × G spatial map without any cross-agent computation. By contrast, the Team Relation Encoder focuses entirely on capturing team-wise interactions without leveraging any spatial information. We can further stack multiple Spatial-TeamFormer blocks for even richer interaction representations.
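The sketch below illustrates one such block under the assumption that both encoders can be approximated with standard nn.TransformerEncoderLayer modules; the hyperparameters are illustrative rather than the paper's exact choices.

```python
import torch
import torch.nn as nn

class SpatialTeamFormerBlock(nn.Module):
    """Sketch of one Spatial-TeamFormer block (assumed sizes D=32, 4 heads)."""
    def __init__(self, D: int = 32, n_heads: int = 4):
        super().__init__()
        self.spatial_enc = nn.TransformerEncoderLayer(D, n_heads, batch_first=True)
        self.team_enc = nn.TransformerEncoderLayer(D, n_heads, batch_first=True)

    def forward(self, x):
        # x: (B, N, G, G, D) -- per-agent spatial feature maps
        B, N, G, _, D = x.shape
        # Individual Spatial Encoder: self-attention over the G*G cells of each agent,
        # with no information flow between agents.
        x = self.spatial_enc(x.reshape(B * N, G * G, D)).reshape(B, N, G, G, D)
        # Team Relation Encoder: self-attention across the N agents at each grid cell,
        # with no information flow between cells.
        x = x.permute(0, 2, 3, 1, 4).reshape(B * G * G, N, D)
        x = self.team_enc(x).reshape(B, G, G, N, D).permute(0, 3, 1, 2, 4)
        return x
```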
Action Generator: The Action Generator is the final part of MSP, which outputs a long-term global goal over the reconstructed map. Since the Spatial-TeamFormer produces N rich spatial representations, which can be denoted as N × G × G, we take the first G × G grid, which is the feature map of the current agent, to derive a single team-size-invariant representation.
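A sketch of this team-size-invariant head is given below, assuming a simple per-cell linear scoring head and categorical sampling over the G × G grid; the actual head design in MSP may differ.

```python
import torch
import torch.nn as nn

class ActionGenerator(nn.Module):
    """Sketch of the Action Generator: score the current agent's G*G cells and
    sample a long-term global goal. The linear scoring head is an assumption."""
    def __init__(self, D: int = 32):
        super().__init__()
        self.score = nn.Linear(D, 1)

    def forward(self, team_feat):
        # team_feat: (B, N, G, G, D); index 0 is the current agent, so the head
        # is invariant to the number of teammates N.
        own = team_feat[:, 0]                              # (B, G, G, D)
        logits = self.score(own).squeeze(-1)               # (B, G, G)
        B, G, _ = logits.shape
        dist = torch.distributions.Categorical(logits=logits.reshape(B, G * G))
        goal_idx = dist.sample()                           # flat cell index
        # The (row, col) cell would then be rescaled to a coordinate on the full map.
        return torch.stack((goal_idx // G, goal_idx % G), dim=-1)
```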
Fig.3 Workflow of the Multi-agent Spatial Planner (MSP), including a CNN-based Feature Extractor, a Spatial-TeamFormer for representation learning, and an Action Generator.
Map Refiner: The map refiner first composes all the past agent-centric local maps to recover the agent-centric global map. Then, we transform the coordinate system based on the pose estimates to normalize the global maps from all the agents w.r.t. the same coordinate system.
Map Merger: After obtaining N enlarged global maps via the map refiner, the map merger simply integrates all these maps by applying a max-pooling operator for each pixel location.
Fig.4: Computation workflow of map refiner and map merger
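The following sketch captures the spirit of both components, assuming maps are stored as PyTorch tensors and poses as (x, y, theta) triples in normalized units; the exact warping convention is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def refine_map(local_map: torch.Tensor, pose: torch.Tensor) -> torch.Tensor:
    """Map refiner sketch: warp an agent-centric global map into the shared
    coordinate frame using the estimated pose (x, y, theta)."""
    # local_map: (B, C, H, W); pose: (B, 3) with (x, y) in [-1, 1] and theta in radians
    x, y, theta = pose[:, 0], pose[:, 1], pose[:, 2]
    cos, sin = torch.cos(theta), torch.sin(theta)
    mat = torch.stack([
        torch.stack([cos, -sin, x], dim=-1),
        torch.stack([sin,  cos, y], dim=-1),
    ], dim=1)                                            # (B, 2, 3) affine matrices
    grid = F.affine_grid(mat, list(local_map.shape), align_corners=False)
    return F.grid_sample(local_map, grid, align_corners=False)

def merge_maps(global_maps: list) -> torch.Tensor:
    """Map merger: pixel-wise max over the N refined global maps."""
    # global_maps: list of N tensors, each (B, C, H, W)
    return torch.stack(global_maps, dim=0).max(dim=0).values
```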
Our experiments are conducted in selected scenes from the Gibson dataset using the Habitat simulator. The goal of the task is to coordinate a team of agents to explore unknown areas as fast as possible within a limited horizon. We report 3 behavior statistics, including the Coverage Ratio, the number of Steps to reach 90% coverage, and the Mutual Overlap, only for analysis purposes. We use 3 single-agent planning-based baselines, i.e., RRT, Utility and Nearest, as well as 3 multi-agent planning-based baselines, i.e., APF, WMA-RRT and Voronoi.
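For clarity, the snippet below shows one plausible way to compute the Coverage Ratio and Mutual Overlap from per-agent explored masks; the paper's exact definitions may differ in detail.

```python
import numpy as np

def coverage_ratio(explored_masks: np.ndarray, explorable_mask: np.ndarray) -> float:
    """Fraction of the explorable area covered by the union of all agents' explored cells.
    explored_masks: (N, H, W) boolean per-agent maps; explorable_mask: (H, W) boolean GT."""
    union = explored_masks.any(axis=0) & explorable_mask
    return union.sum() / explorable_mask.sum()

def mutual_overlap(explored_masks: np.ndarray, explorable_mask: np.ndarray) -> float:
    """Overlap between agents' explored areas relative to the union
    (one plausible formalization, not necessarily the paper's exact metric)."""
    counts = explored_masks.astype(int).sum(axis=0) * explorable_mask
    union = (counts > 0).sum()
    overlap = (counts > 1).sum()
    return overlap / max(union, 1)
```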
We report the performance of MAANS and all the baseline methods with a fixed team size of N = 2 agents on both training maps and unseen testing maps in Tab. 1. MAANS outperforms all the planning-based baselines by a clear margin in the Steps and Coverage metrics on both training and testing scenes; the only exception is the Mutual Overlap on the testing scenes, where MAANS is slightly worse than Voronoi. More concretely, MAANS requires 20.56% fewer exploration steps on training scenes and 7.99% fewer exploration steps on testing scenes than the best planning-based competitor.
Table 1: Performance of MAANS and baselines with a fixed size of N=2 agents on both training and unseen testing scenes.
We apply the policies trained with N = 2 agents to scenarios with N = 3, 4 agents respectively. The zero-shot generalization performance of MAANS compared with all the baselines on training and testing scenes is shown in Tab. 2. Note that the experiments on testing scenes are extremely challenging since MAANS is never trained on these scenes nor with team sizes of N = 3, 4. Although MAANS is only trained with a team size of 2 on training scenes, it achieves much better performance than the best planning-based method with every novel team size on training scenes (17.56% fewer Steps with N = 3 and 18.36% fewer Steps with N = 4) and comparable performance on testing scenes (< 3 more Steps with N = 3, 4).
Table 2: We compare the zero-shot generalization performance of MAANS to novel team sizes on training and unseen testing scenes with baselines.
We summarize the zero-shot generalization performance of MAANS compared with the planning-based baselines on training scenes in Tab. 3. We remark that MAANS is trained with a fixed team size of N = 2. We use "N1 ⇒ N2" to denote that each episode starts with N1 agents and the team size immediately switches to N2. In scenarios where the team size increases, though RRT still performs the best among the planning-based baselines, MAANS outperforms RRT by a clear margin, taking about 25 fewer Steps in the 2 ⇒ 4 setting and about 35 fewer Steps in the others. When the team size decreases, the gap between MAANS and the classical methods grows even wider: MAANS takes about 35 fewer Steps than the best baseline, RRT. Besides, MAANS achieves the best results in the Mutual Overlap and Coverage metrics. We consider the case where the team size decreases to be more challenging, since the information shared within the team becomes sparser and the baselines cannot adjust their strategies immediately.
Table 3: Performance of different methods with a varying team size on training scenes.
We summarize the zero-shot generalization performance of MAANS compared with the planning-based baselines on testing scenes in Tab. 4. In cases where the team size increases, MAANS still produces substantially better performance, especially in Steps, which suggests that MAANS has the capability to adaptively adjust its strategy. Regarding the cases where the team size decreases, MAANS shows slightly worse performance than the best single-agent baseline, RRT, and the best multi-agent baseline, Voronoi. We remark that decreasing the team size is particularly challenging since the absence of some agents might immediately leave a large part of the house unexplored and, consequently, the team should immediately update their original plan with drastically different goal assignments.
Table 4: Performance of different methods with a varying team size on testing scenes.
We implemented several planning-based methods as baselines, including Nearest, Utility, APF and RRT, as well as a Voronoi-based method and a multi-agent variant of RRT, WMA-RRT. While these methods were originally tested in environments with idealized assumptions, we empirically found that some of them do not work well in our setting with realistic perception and noise.
Sensitive to realistic noise. Nearest, Utility, APF and the Voronoi-based method all apply cell-level goal searching. We empirically found that they share the same failure mode: trying to approach a cell that is wrongly estimated as explorable, even when the map contains only minor cell-level errors. The planning-based baseline RRT does not perform cell-level planning but plans in a geometric way, so it is robust to mapping noise.
Naive estimation of future value. Both Utility and RRT select the candidate frontier with the highest utility, which is computed as the number of unexplored cells within a small distance. Such an estimation of future value is very short-sighted. In contrast, MAANS, by using a neural network, can perform more sophisticated inference about the utility of exploring a part of the map.
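For reference, the snippet below sketches this greedy utility rule under an assumed cell encoding and a square window; it is our own illustration rather than the baselines' exact code.

```python
import numpy as np

def frontier_utility(occ_map: np.ndarray, frontier: tuple, radius: int = 10) -> int:
    """Utility of a candidate frontier: the number of unexplored cells within a small
    window around it. Cell encoding (0 = unexplored, 1 = free, 2 = obstacle) and the
    window size are illustrative assumptions."""
    r, c = frontier
    h, w = occ_map.shape
    window = occ_map[max(r - radius, 0):min(r + radius + 1, h),
                     max(c - radius, 0):min(c + radius + 1, w)]
    return int((window == 0).sum())

# The baseline then greedily picks the frontier with the highest utility:
# best = max(candidate_frontiers, key=lambda f: frontier_utility(occ_map, f))
```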
Strict restrictions on robot behaviors. Among the planning-based baselines, both APF and WMA-RRT adopt a formally designed cooperation system. However, their cooperation schemes impose strong restrictions on the agents’ behaviors. APF introduces a resistance force among agents to avoid duplicated exploration. In WMA-RRT, agents share the same tree and follow a locking-and-searching scheme to cooperate. However, in cases where it is better for agents to pass through the same place simultaneously, which is pretty common, the resistance force in APF and the locking mechanism in WMA-RRT prevent agents from adopting the optimal strategy. Meanwhile, WMA-RRT restricts agents to follow paths in the shared tree, losing the benefit of accidental coverage brought by random search.
We provide case studies with corresponding graphics in the following part. The red angles indicate the agents and their directions. The blue points indicate the global goals selected by the agents.
Sensitive to Realistic Noise
Utility, Nearest, APF and the Voronoi-based method perform cell-level goal searching. The picture above demonstrates a case where the agent selects a cell that is estimated as "unexplored" because of mapping error.
Faulty Estimation of Value
In the above situation, Utility selects a cell at the corner of the room as a global goal since its nearby space is unexplored. However, with prior knowledge about building structures, one should infer that such a point should not be chosen.
Poor Cooperation
Since RRT performs planning independently for each agent, it is possible that agents choose close frontiers as goals, leading to duplicated exploration.
APF Resistance Force
APF achieves cooperation by introducing a resistance force among agents. In the above situation, the yellow lines show the agents’ paths to their corresponding global goals. Due to the resistance force, the agent on the left is forced to choose a suboptimal global goal instead of the frontiers in the more promising area indicated by the purple triangle.
WMA-RRT Locking Mechanism
In WMA-RRT, agents cooperatively maintain a tree and use a locking mechanism to avoid duplicated exploration. The picture above shows a case where the agent on the right has to stay where it is since the path is locked by another agent.
Incompatible with Active Mapping
In particular, the WMA-RRT algorithm is inherently incompatible with the active mapping process. When new cells are classified as obstacles, some nodes and edges in the tree become invalid, and WMA-RRT cannot adapt to such changes. In the picture above, agents still try to approach the selected points even though the tree edges and nodes are no longer valid.
MAANS vs. RRT with 2 agents on the training middle map Colebrook
MAANS vs. RRT with 2 agents on the unseen middle map Haxtun
MAANS vs. RRT zero-shot to 4 agents on the training middle map Delton
MAANS vs. RRT with an increasing team size (2 to 4 agents) on the training middle map Delton
MAANS vs. RRT with a decreasing team size (4 to 2 agents) on the training middle map Delton