We consider the problem of cooperative exploration, where multiple robots need to cooperatively explore an unknown region as fast as possible, as shown in Fig. 1. Multi-agent reinforcement learning (MARL) has recently become a trending paradigm for this challenge. However, existing MARL-based methods adopt action-making steps as the metric for exploration efficiency by assuming all agents act in a fully synchronous manner: every agent produces an action simultaneously and every action is executed instantaneously at each time step. In the real world, however, different robots typically take slightly different wall-clock times to accomplish an atomic action, or even periodically get lost due to hardware issues. Simply waiting for every robot to be ready for the next action can be particularly time-inefficient.
Therefore, we propose an asynchronous MARL solution, Asynchronous Coordination Explorer (ACE), to tackle this real-world challenge. We first extend a classical MARL algorithm, multi-agent PPO (MAPPO), to the asynchronous setting and additionally apply action-delay randomization to encourage the learned policy to generalize better to varying action delays in the real world. Moreover, each navigation agent is represented as a team-size-invariant CNN-based policy, which greatly benefits real-robot deployment by handling possible robot loss and allows bandwidth-efficient inter-agent communication through low-dimensional CNN features.
We first validate our approach in a grid-based scenario. Both simulation and real-robot results show that ACE reduces actual exploration time by over 10% compared with classical approaches. We also apply our framework to a high-fidelity vision-based environment, Habitat, achieving a 28% improvement in exploration efficiency.
Fig. 1 Schematic of the multi-agent exploration task: each agent maintains a local map, and the local maps are fused into a merged map.
As shown in Fig. 2, ACE consists of three major components:
(1) Async-MAPPO for MARL training;
(2) action-delay randomization for zero-shot generalization;
(3) a multi-tower-CNN-based policy representation for efficient communication.
Fig. 2 Overview of Asynchronous Coordination Explorer (ACE)
In Async-MAPPO, different agents may not take actions at the same time, which makes it infeasible for the trainer to collect transitions in the original synchronized manner. Therefore, we let each agent store its transition data in a separate cache and periodically push the cached data to a centralized data buffer. The pseudo-code of Async-MAPPO is shown in Alg. 1. Compared with MAPPO, neither policy execution nor data collection needs to be time-aligned among different agents, so we implement an asynchronous action-making scheme and an asynchronous replay buffer.
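A minimal sketch of this per-agent caching scheme is given below; it is not the actual Alg. 1 implementation, and the class names (Transition, AgentCache, CentralBuffer) are illustrative.

```python
# Minimal sketch of asynchronous data collection: each agent caches its own
# transitions and pushes them to the centralized buffer when ready, so no
# agent ever waits for slower teammates. Class names are illustrative.
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class Transition:
    obs: Any
    action: Any
    reward: float
    done: bool
    timestamp: float  # wall-clock time when the macro action finished

@dataclass
class AgentCache:
    """Per-agent local storage, filled at the agent's own pace."""
    agent_id: int
    transitions: List[Transition] = field(default_factory=list)

    def add(self, tr: Transition) -> None:
        self.transitions.append(tr)

    def flush(self) -> List[Transition]:
        data, self.transitions = self.transitions, []
        return data

class CentralBuffer:
    """Centralized buffer; transitions from different agents need not be
    time-aligned, which is what Async-MAPPO trains on."""
    def __init__(self) -> None:
        self.storage: Dict[int, List[Transition]] = {}

    def push(self, agent_id: int, transitions: List[Transition]) -> None:
        self.storage.setdefault(agent_id, []).extend(transitions)

# Collection round outline: agents act and cache asynchronously, caches are
# periodically flushed into CentralBuffer, then the MAPPO-style centralized
# critic and shared actor are updated from the buffer.
```

Keeping the caches local means an agent never blocks on slower teammates during execution; the trainer only needs the union of all caches at update time.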
In reality, issues such as hardware failures and network congestion delay agents' action execution, creating a large gap between simulation and reality. To reduce this gap, we apply action-delay randomization during training to simulate the real-world challenge of action delay, as shown in Fig. 3.
Fig. 3 Asynchronous action-making
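As a concrete illustration, action-delay randomization can be implemented by adding a random extra delay to every atomic action during training. The sketch below uses the atomic-action durations from our simulation timing model, while the randomization interval bound is a placeholder rather than the exact training setting.

```python
import random

# Nominal atomic-action durations (turning 0.5 s, stepping forward 1 s) match
# the simulation timing model; max_extra_delay is a placeholder interval bound,
# not the exact value used in training.
ATOMIC_DURATION = {"turn_left": 0.5, "turn_right": 0.5, "forward": 1.0}

def randomized_duration(action: str, max_extra_delay: float = 0.5) -> float:
    """Simulated wall-clock time of one atomic action with a random delay
    added, mimicking hardware- or network-induced execution lag."""
    return ATOMIC_DURATION[action] + random.uniform(0.0, max_extra_delay)
```

Because the delay is resampled for every atomic action, agents drift out of sync during training just as they do on real robots, which is exactly what the deployed policy must tolerate.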
As shown in Fig. 2, the Multi-tower-CNN-based Policy (MCP) is used to generate the global goal. MCP consists of three parts: a CNN-based local feature extractor, an attention-based relation encoder, and an action decoder.
The local feature extractor: a weight-sharing 3-layer CNN that extracts a G × G × 4 feature map from each agent's S × S × 7 local information; exchanging these compact features instead of raw maps greatly reduces communication traffic.
The attention-based relation encoder: aggregates the extracted feature maps from different agents to better capture inter-agent interactions. More importantly, it adopts a simplified Transformer block whose multi-head cross-attention derives a single team-size-invariant representation of size G × G × 4, as shown in Fig. 4.
The action decoder: predicts the agent's policy from the aggregated representation as a multi-variable Categorical distribution that selects a global goal on the map plane (a minimal code sketch of the full MCP pipeline follows Fig. 4).
Fig.4 Workflow of Multi-tower-CNN-based Policy (MCP), including a CNN-based local feature extractor, a relation encoder and an action decoder.
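The following is a minimal PyTorch-style sketch of MCP under the stated shapes (S × S × 7 local input, G × G × 4 features); the channel widths, number of attention heads, and the single categorical over the G × G goal grid are simplifying assumptions, not the exact architecture.

```python
import torch
import torch.nn as nn

class MCP(nn.Module):
    """Sketch of the Multi-tower-CNN-based Policy: a weight-sharing CNN
    extractor, a cross-attention relation encoder, and an action decoder.
    Channel widths, head count, and the single categorical over the G x G
    goal grid are illustrative simplifications."""
    def __init__(self, G: int = 8, heads: int = 2):
        super().__init__()
        self.G = G
        # Local feature extractor: 3-layer CNN shared across agents, mapping
        # each S x S x 7 local map to a G x G x 4 feature map. Only this
        # compact feature (G*G*4 floats) is exchanged between robots.
        self.extractor = nn.Sequential(
            nn.Conv2d(7, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(G),
            nn.Conv2d(32, 4, 1),
        )
        # Relation encoder: multi-head cross-attention from the ego agent's
        # feature tokens (queries) to all agents' tokens (keys/values); the
        # output stays G x G x 4 regardless of team size.
        self.attn = nn.MultiheadAttention(embed_dim=4, num_heads=heads,
                                          batch_first=True)
        # Action decoder: logits over the G x G global-goal grid.
        self.decoder = nn.Linear(G * G * 4, G * G)

    def forward(self, ego_map, peer_feats):
        # ego_map: (B, 7, S, S); peer_feats: (B, N, 4, G, G), the compact
        # features received from N teammates (extracted by their own copies
        # of the weight-sharing CNN).
        B, N = peer_feats.shape[0], peer_feats.shape[1]
        ego_feat = self.extractor(ego_map)                               # (B, 4, G, G)
        ego_tok = ego_feat.flatten(2).transpose(1, 2)                    # (B, G*G, 4)
        all_feat = torch.cat([ego_feat.unsqueeze(1), peer_feats], dim=1)
        all_tok = (all_feat.flatten(0, 1).flatten(2).transpose(1, 2)
                   .reshape(B, (N + 1) * self.G * self.G, 4))
        fused, _ = self.attn(ego_tok, all_tok, all_tok)                  # (B, G*G, 4)
        logits = self.decoder(fused.reshape(B, -1))                      # (B, G*G)
        return torch.distributions.Categorical(logits=logits)           # goal cell
```

Because the cross-attention aggregates however many teammate features arrive, the representation stays the same size for any team size, and at deployment each robot only transmits its G × G × 4 feature rather than its raw S × S × 7 local map.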
Baselines: We consider four popular planning-based competitors: a utility-maximizing method (Utility), a search-based nearest-frontier method (Nearest), a rapidly-exploring-random-tree-based method (RRT), and an artificial-potential-field method (APF).
Metric: The most important metric in our experiments is Time, the running time for the agents to reach a C% coverage ratio. We report wall-clock time in the real world and an estimated statistical running time in simulation: turning left or right takes 0.5 s and stepping forward takes 1 s. Policy inference time is fixed to 0.1 s for both RL-based and planning-based methods, so the results better reflect the difference between asynchronous and synchronous settings.
Setting: In the synchronous action-making setting, agents make decisions at the same time and wait for all other agents to finish. In the asynchronous setting, agents do not wait for others and perform both macro and atomic actions independently.
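To make the Time metric concrete, the sketch below replays per-agent traces of macro steps (each a policy inference plus its atomic actions) under both settings; the trace format and function names are illustrative.

```python
# Estimated running time from per-agent traces of macro steps, using the
# costs stated above: turn 0.5 s, step forward 1 s, policy inference 0.1 s.
# The trace format and function names are illustrative.
TURN, FORWARD, INFER = 0.5, 1.0, 0.1

def macro_step_time(atomic_actions):
    """One macro step = one policy inference plus its atomic actions."""
    return INFER + sum(TURN if a.startswith("turn") else FORWARD
                       for a in atomic_actions)

def sync_time(traces):
    """Synchronous: at each macro step every agent waits for the slowest."""
    return sum(max(macro_step_time(step) for step in steps)
               for steps in zip(*traces))

def async_time(traces):
    """Asynchronous: agents proceed independently; exploration ends when the
    last agent finishes its own trace."""
    return max(sum(macro_step_time(step) for step in trace) for trace in traces)

# Two agents, two macro steps each: synchronization alone costs extra time.
traces = [
    [["forward", "forward"], ["turn_left", "forward"]],
    [["forward"], ["turn_right", "forward", "forward"]],
]
print(sync_time(traces), async_time(traces))  # 4.7 s (sync) vs 3.7 s (async)
```

Because the synchronous total is a sum of per-step maxima while the asynchronous total is the maximum of per-agent sums, the asynchronous schedule can never be slower under this timing model.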
Experimental results of the baselines and ACE under synchronous and asynchronous action-making are provided in Tab. 1. In both settings, ACE outperforms the planning-based baselines with ≥ 10% less Time, full Coverage, and higher ACS. Compared with MAPPO, which is trained in a synchronous manner, ACE achieves similar ACS with less Time and lower Overlap, indicating its robustness to realistic execution with randomized action delays.
Tab. 1 Performance of different methods under 2-agent synchronous and asynchronous action-making settings
Tab. 2 Performance of different methods with decreased team size on 25 × 25 maps
We also set up a 15 × 15 real-world grid map matching the grid-based simulation, where each grid cell is 0.31 m long, as shown in Fig. 5. Our robots are equipped with Mecanum wheels and an NVIDIA Jetson Nano processor; their locations and poses are tracked by OptiTrack cameras with the Motive motion-capture software. After training a policy in the grid-based simulator on 15 × 15 maps with random rooms, we directly deploy it to the real-world robot system. Each real robot executes in a distributed and asynchronous manner: upon finishing all of its atomic actions, it uses a request-send mechanism over ROS topics to obtain the newest feature embeddings of the other agents.
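A minimal sketch of such a request-send exchange over ROS topics is shown below; the topic names, message types, and the flattened feature payload are assumptions for illustration, not the actual on-robot code.

```python
# Sketch of the request-send exchange of CNN feature embeddings over ROS
# topics. Topic names, message types, and the flattened G x G x 4 payload
# are illustrative assumptions, not the actual on-robot implementation.
import rospy
from std_msgs.msg import Empty, Float32MultiArray

class FeatureExchange:
    def __init__(self, agent_id, peer_ids, get_latest_feature):
        self.get_latest_feature = get_latest_feature  # callable -> flat list of floats
        self.peer_features = {}                       # peer_id -> newest embedding

        # Serve this robot's embedding whenever a teammate requests it.
        self.feat_pub = rospy.Publisher("/agent_%d/feature" % agent_id,
                                        Float32MultiArray, queue_size=1)
        rospy.Subscriber("/agent_%d/feature_request" % agent_id, Empty,
                         self._on_request)

        # Request channels to, and feature channels from, every teammate.
        self.req_pubs = {p: rospy.Publisher("/agent_%d/feature_request" % p,
                                            Empty, queue_size=1)
                         for p in peer_ids}
        for p in peer_ids:
            rospy.Subscriber("/agent_%d/feature" % p, Float32MultiArray,
                             lambda msg, pid=p: self.peer_features.update(
                                 {pid: list(msg.data)}))

    def _on_request(self, _msg):
        self.feat_pub.publish(Float32MultiArray(data=self.get_latest_feature()))

    def request_all(self):
        """Called once the robot has finished all atomic actions of a macro
        step, before the next global-goal inference."""
        for pub in self.req_pubs.values():
            pub.publish(Empty())
```

Only the low-dimensional feature embeddings cross the network, so the exchange stays bandwidth-efficient even as the team grows.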
We present the running time of different methods with 2 agents in real-world exploration tasks. As shown in Tab. 3, the two RL-based methods, MAPPO and ACE, outperform the planning-based baselines by a large margin in total exploration time. Moreover, ACE reduces running time by 10.07% compared with MAPPO, showing that combining action-delay randomization with Async-MAPPO indeed improves the efficiency of multi-agent exploration.
Fig. 5 Illustration of the grid-based simulator and the real-world robot system
Tab. 3 Running time of different methods in real-world tasks
We extend ACE to a vision-based environment, Habitat. Tab. 4 shows the performance of different methods under the 2-agent asynchronous action-making setting. Despite having higher Overlap due to more exhaustive exploration, ACE outperforms the planning-based baselines with ≥ 28% less Time, higher Coverage, and higher ACS. Compared with synchronous MAPPO, ACE still achieves higher Coverage and ACS with less Time, demonstrating its effectiveness in more complicated vision-based tasks.
We also consider decreased team sizes in Habitat, following the same experimental setup as in the grid-based simulations. Tab. 5 shows the performance of different methods when the team size decreases (2 ⇒ 1). ACE requires 5.3% less Time than the other baselines and obtains the highest Coverage and ACS with comparable Overlap, which indicates ACE's ability to generalize to agent loss.
Tab. 4 Performance of different methods under the 2-agent asynchronous action-making setting in Habitat
Tab. 5 Performance of different methods with decreased team size in Habitat
Tab. 6 summarizes the performance under different communication-traffic budgets with 2 agents on 25 × 25 maps. More communication between agents generally leads to better exploration efficiency, as shown by the decreasing Time and increasing ACS from “No Comm.” to “Comm. (0.25x)”, “Comm. (0.5x)”, and “Perf. Comm.”. Notably, ACE performs even better than “Perf. Comm.” with strictly less communication, demonstrating the effectiveness of the feature embedding extracted by our CNN policy for decision-making.
To better illustrate the effect of different action-delay choices, we present results in the “4 ⇒ 3” setting, an extreme scenario with agent loss. As shown in Tab. 7, ACE consumes the least Time and achieves the highest ACS. The results show that action-delay randomization works best with a properly sized randomization interval: an overly large interval adds high uncertainty during training and hurts the final performance.
Tab. 6 Performance with different communication traffic
Tab. 7 Performance with different action-delay intervals
Note: for the convenience of demonstration, the demo videos (real-world and simulator runs of Utility, Nearest, RRT, APF, MAPPO, and ACE) are played at 10x speed.