This paper investigates the multi-agent navigation problem, which requires multiple agents to reach their target goals within a limited time. Multi-agent reinforcement learning (MARL) has shown promising results for this problem. However, it is inefficient for MARL to directly explore the (nearly) optimal policy in a large search space, a difficulty that is exacerbated as the number of agents increases (e.g., 10+ agents) or the environment becomes more complex (e.g., a 3D simulator). Goal-conditioned hierarchical reinforcement learning (HRL) provides a promising direction for tackling this challenge by introducing a hierarchical structure that decomposes the search space, where the low-level policy predicts primitive actions under the guidance of goals derived from the high-level policy. In this paper, we propose Multi-Agent Graph-Enhanced Commander-Executor (MAGE-X), a graph-based goal-conditioned hierarchical method for multi-agent navigation tasks. MAGE-X comprises a high-level Goal Commander and a low-level Action Executor. The Goal Commander predicts the probability distribution over goals and leverages it to assign the most appropriate final target to each agent. The Action Executor utilizes graph neural networks (GNN) to construct a subgraph for each agent that contains only its crucial partners, improving cooperation. Additionally, the Goal Encoder in the Action Executor captures the relationship between the agent and its designated goal to encourage the agent to reach the final target. The results show that MAGE-X outperforms state-of-the-art MARL baselines, achieving a 100% success rate with only 3 million training steps in multi-agent particle environments (MPE) with 50 agents, and at least a 12% higher success rate and 2x higher data efficiency in a more complicated quadrotor 3D navigation task.
Fig. 1: Multi-agent Environment.
Multi-Agent Graph-Enhanced Commander-Executor (MAGE-X) consists of two components:
Goal Commander
Action Executor
In this section, we introduce the proposed framework, MAGE-X, which improves sample efficiency and cooperation in multi-agent navigation tasks. An overview of our framework is shown in Fig. 2. MAGE-X comprises two components, the Goal Commander and the Action Executor. The high-level Goal Commander follows the principle of centralized-training-centralized-execution (CTCE). In contrast, the low-level Action Executor operates in a decentralized setting with partial observations of N agents, where each agent k receives only a local observation. Agent k learns a policy that produces a distribution over actions at each time step. A navigation task begins with the Goal Scheduler in the Goal Commander receiving the spawn locations of all agents and the target goals. The scheduler predicts a probability distribution over the target goals, and each agent is assigned the most appropriate final target goal rather than a subgoal. The multi-agent navigation task is thus converted into multiple single-agent navigation tasks within the multi-agent environment, where each agent is required to reach its given goal as quickly as possible while avoiding collisions. Take agent k as an example. The Subgraph Extractor takes in the observations of agent k and the other agents and produces a subgraph for agent k that includes only its crucial neighbors, improving cooperation. The Goal Encoder extracts the relationship between agent k and its target goal, which is then combined in the State Extractor with agent k's feature from the subgraph to endow the team representation with strong goal guidance. Finally, agent k executes the action produced by the Action Generator to complete the navigation task.
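To make this control flow concrete, the sketch below walks through one episode under such a hierarchy. All class and method names (e.g., commander.assign_goals, executors[k].act) are hypothetical placeholders rather than the authors' implementation: the centralized Commander assigns one final goal per agent once at the start of the episode, and the decentralized Executors then act from local observations at every step.

```python
# A hedged sketch of one episode under a commander-executor hierarchy.
# All names below (env, commander, executors, assign_goals, act) are
# illustrative assumptions, not the paper's actual interfaces.
def run_episode(env, commander, executors, max_steps=100):
    obs = env.reset()
    # High level (CTCE): one final goal per agent, assigned from spawn
    # locations and target goals, not a sequence of subgoals.
    goals = commander.assign_goals(env.agent_positions, env.goal_positions)
    for _ in range(max_steps):
        # Low level (decentralized): agent k acts from its local observation
        # and its own assigned goal only.
        actions = [executors[k].act(obs[k], goals[k]) for k in range(env.n_agents)]
        obs, rewards, done, info = env.step(actions)
        if done:
            break
    return info
```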
Fig. 2: Overview of Multi-Agent Graph-enhanced Commander-Executor (MAGE-X)
Goal assignment is a long-studied matching problem, especially in scenarios with large-scale agents. The Goal Commander builds upon a Goal Scheduler module for target-goal assignment, which produces a probability distribution over the target goals. The designed reward is based on the distance cost computed by the Hungarian algorithm, a classical method for solving the assignment problem in combinatorial optimization and graph theory.
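As a reference point, the snippet below sketches how such a distance-based assignment cost can be computed with the Hungarian algorithm via SciPy's linear_sum_assignment. The pairwise Euclidean cost matrix and the helper name are illustrative assumptions; in the paper this matching cost is used to shape the Goal Scheduler's reward, not as the scheduler itself.

```python
# A minimal sketch of Hungarian-algorithm goal assignment used as a
# distance-cost reference. Shapes and the helper name are assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_assignment(agent_pos, goal_pos):
    """agent_pos, goal_pos: (N, d) arrays of positions.
    Returns goal index per agent and the minimal total distance cost."""
    # Pairwise Euclidean distances form the cost matrix.
    cost = np.linalg.norm(agent_pos[:, None, :] - goal_pos[None, :, :], axis=-1)
    row_ind, col_ind = linear_sum_assignment(cost)
    return col_ind, cost[row_ind, col_ind].sum()

# Example: 3 agents and 3 goals in 2D.
agents = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
goals  = np.array([[1.0, 1.0], [0.0, 0.9], [1.1, 0.1]])
assign, total_cost = hungarian_assignment(agents, goals)
print(assign, total_cost)  # -> [0 2 1], minimal summed distance (~1.66)
```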
The Action Executor is designed for high cooperation, enabling agents to quickly reach their designated goals with few collisions in the multi-agent environment. The workflow of the Action Executor is illustrated in Fig. 3. It consists of the Subgraph Extractor, the Goal Encoder, the State Extractor, and the Action Generator. The Subgraph Extractor encodes the observations of all agents and yields a subgraph for agent k containing only its crucial teammates. The Goal Encoder extracts the correlation between agent k and its assigned goal, which is then fed into the State Extractor together with agent k's feature from the subgraph to endow the team representation with target-goal guidance. Finally, the Action Generator produces the action for agent k.
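The following is a minimal PyTorch sketch of how a Subgraph Extractor of this kind might select and aggregate an agent's crucial teammates. The k-nearest-neighbor criterion, layer sizes, and single message-passing step are assumptions made for illustration and may differ from the paper's GNN design.

```python
# A hedged sketch of a subgraph extractor: each agent attends only to its
# k nearest teammates. Selection rule and layer sizes are assumptions.
import torch
import torch.nn as nn

class SubgraphExtractor(nn.Module):
    def __init__(self, obs_dim, hid_dim=64, k=3):
        super().__init__()
        self.k = k
        self.obs_enc = nn.Linear(obs_dim, hid_dim)   # per-agent observation encoder
        self.msg = nn.Linear(2 * hid_dim, hid_dim)   # edge/message MLP

    def forward(self, obs, pos):
        """obs: (N, obs_dim) observations; pos: (N, 2) positions.
        Returns (N, hid_dim) features aggregated over each agent's subgraph."""
        h = torch.relu(self.obs_enc(obs))                         # (N, H)
        dist = torch.cdist(pos, pos)                              # (N, N) pairwise distances
        dist.fill_diagonal_(float('inf'))                         # exclude self
        knn = dist.topk(self.k, largest=False).indices            # (N, k) crucial teammates
        neigh = h[knn]                                            # (N, k, H) neighbor features
        ego = h.unsqueeze(1).expand_as(neigh)                     # (N, k, H)
        msgs = torch.relu(self.msg(torch.cat([ego, neigh], -1)))  # messages within subgraph
        return h + msgs.mean(dim=1)                               # aggregated per-agent feature

# Usage: 5 agents, 10-dim observations, 2-D positions.
obs, pos = torch.randn(5, 10), torch.rand(5, 2)
feat = SubgraphExtractor(obs_dim=10)(obs, pos)  # -> (5, 64)
```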
The environmental reward for each agent is a linear combination of a completion bonus, a distance penalty, and a collision penalty.
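A hedged sketch of such a reward is given below; the weights, radii, and exact form of each term are illustrative assumptions rather than the paper's coefficients.

```python
# Illustrative per-agent reward: completion bonus + distance penalty +
# collision penalty. All coefficients and thresholds are assumptions.
import numpy as np

def navigation_reward(agent_pos, goal_pos, other_pos,
                      w_bonus=10.0, w_dist=0.1, w_coll=1.0,
                      reach_radius=0.05, collide_radius=0.1):
    dist_to_goal = np.linalg.norm(agent_pos - goal_pos)
    bonus = w_bonus if dist_to_goal < reach_radius else 0.0    # completion bonus
    dist_penalty = -w_dist * dist_to_goal                      # distance penalty
    n_collisions = sum(
        np.linalg.norm(agent_pos - p) < collide_radius for p in other_pos)
    coll_penalty = -w_coll * n_collisions                      # collision penalty
    return bonus + dist_penalty + coll_penalty
```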
Fig. 3: Workflow of Action Executor, including a Subgraph Extractor, a Goal Encoder, a State Extractor for representation learning and an Action Generator.
To evaluate the effectiveness of our algorithm in large search spaces, we consider MPE with a massive number of agents and gym-pybullet-drones in 3D space as the experimental environments, as shown in Fig. 1. We select three typical tasks from these two environments: Simple Spread, Push Ball, and Drone. We conduct experiments with N = 5, 20, 50 agents in Simple Spread and N = 5, 20 agents in Push Ball in MPE. Drone is a more complicated quadrotor navigation task in gym-pybullet-drones, which adopts aerodynamic models of quadrotors to narrow the gap between simulation and the real world. Therefore, quadrotors in Drone require stronger cooperation to avoid crashes. We conduct experiments with 2 and 4 quadrotors in this 3D simulator.
We show the training curves in Fig. 4 and the evaluation performance in Tab. 1. The results suggest that MAGE-X performs the best, especially in the scenario with 50 agents, where it quickly converges to a near-optimal solution with a 100% success rate, attaining more than 10x higher training efficiency than the other competitors. This graph-enhanced hierarchical framework helps agents quickly learn to reach their targets with high cooperation.
Fig. 4: Comparison between MAGE-X and other baselines in Simple Spread with N=5, 20, 50.
Tab. 1: Evaluation performance of the success rate between MAGE-X and baselines in Simple Spread.
As shown in Fig. 5 and Tab. 2, we conduct experiments with N = 5, 20 agents in Push Ball. The difficulty of Push Ball lies in its two-stage goal assignment: the agents must first reach their designated balls and then push them to the target landmarks. Nevertheless, MAGE-X still achieves a 100% success rate with only a few training timesteps. The results indicate that the high-level Goal Commander in MAGE-X has the potential to tackle multiple goal assignment problems simultaneously, since it breaks an M-stage goal assignment into M independent tasks, each requiring only the information of the assigned targets.
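The sketch below illustrates this decomposition for the two-stage case (M = 2, as in Push Ball): agents are matched to balls and balls to landmarks as two independent assignments. The function name and inputs are illustrative assumptions, reusing the Hungarian matching shown earlier.

```python
# A hedged sketch of decomposing a 2-stage goal assignment into two
# independent matchings. Names and inputs are assumptions for illustration.
import numpy as np
from scipy.optimize import linear_sum_assignment

def two_stage_assignment(agent_pos, ball_pos, landmark_pos):
    # Stage 1: match each agent to a ball.
    cost_ab = np.linalg.norm(agent_pos[:, None] - ball_pos[None, :], axis=-1)
    _, agent_to_ball = linear_sum_assignment(cost_ab)
    # Stage 2: match each ball to a landmark, independently of stage 1.
    cost_bl = np.linalg.norm(ball_pos[:, None] - landmark_pos[None, :], axis=-1)
    _, ball_to_landmark = linear_sum_assignment(cost_bl)
    # Agent i pushes ball agent_to_ball[i] toward landmark
    # ball_to_landmark[agent_to_ball[i]].
    return agent_to_ball, ball_to_landmark
```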
Fig. 5: Comparison between MAGE-X and other baselines in Push Ball with N=5, 20.
Tab. 2: Evaluation performance of the success rate between MAGE-X and baselines in Push Ball.
We report the training performance in Fig. 6 and the evaluation results in Fig. 7 and Tab. 3. The experiments show that MAGE-X is superior to the other competitors in the physically realistic 3D environment Drone, with a success rate of 95% for N=2 and 79% for N=4. Specifically, MAGE-X outperforms MAPPO with 12% and 15% higher success rates for N=2 and N=4, respectively. Furthermore, MAGE-X exhibits high coordination with a low collision rate of 0.02% for N=2 and 0.12% for N=4, demonstrating that MAGE-X successfully assigns target goals to agents and captures the interactions among them.
Fig. 6: Comparison between MAGE-X and other baselines in Drone.
Fig. 7: Evaluation performance of the success rate and the collision rate between MAGE-X and baselines in Drone. Points toward the upper left corner of the figure indicate better performance.
Tab. 3: Evaluation performance of the success rate and the collision rate between MAGE-X and baselines in Drone.
To illustrate the effectiveness of each component of MAGE-X, we consider three variants of our method in Simple Spread with 50 agents.
MAGE-X w. RG: We substitute the Goal Scheduler in the Goal Commander with random sampling without replacement, which assigns each agent a random target goal (a minimal sketch of this variant is given after this list).
MAGE-X-Atten: We remove the GCN in the Obs. Encoder and replace the Graph Encoder in the Action Executor with an attention module to extract the relationships among agents.
MAGE-X-MLP: We use an MLP layer as an alternative to the Action Executor to capture the correlation between agents and goals.
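For contrast with the Hungarian-based scheduler sketched earlier, the MAGE-X w. RG variant amounts to the following random assignment; the function name is a hypothetical placeholder.

```python
# A minimal sketch of the MAGE-X w. RG ablation: goals are assigned by
# random sampling without replacement (a random permutation) instead of
# the learned Goal Scheduler. The function name is a placeholder.
import numpy as np

def random_goal_assignment(n_agents, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    # Agent i is assigned goal perm[i]; no goal is assigned twice.
    return rng.permutation(n_agents)
```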
Fig. 8 and Tab. 4 summarize the training and evaluation performance of MAGE-X and its variants, respectively. MAGE-X excels in data efficiency and final performance with a 100% success rate. MAGE-X-MLP degrades the most, implying that an MLP layer is incapable of distinguishing the correlation between agents and target goals from the given information. MAGE-X w. RG lacks an appropriate Goal Scheduler in the high-level Goal Commander, which may lead to agents being assigned distant goals; it therefore shows worse performance, with an 83% success rate. MAGE-X-Atten is slightly inferior to MAGE-X, with a 7% lower success rate. We hypothesize that MAGE-X-Atten provides each agent with attention weights over all of its neighbors, so irrelevant teammates may distract the agent. In contrast, each MAGE-X agent concentrates only on the crucial neighbors in its subgraph, which carry useful information.
Fig. 8: Ablation study on MAGE-X in Simple Spread with 50 agents.
Tab. 4: Evaluation performance of the success rate between MAGE-X and its variants in Simple Spread.