MASP: Scalable Graph-based Planning towards Multi-UAV Navigation
Xinyi Yang*, Xinting Yang*, Chao Yu+, Jiayu Chen, Wenbo Ding, Huazhong Yang and Yu Wang+
*Equal Contribution, +Corresponding Authors
Tsinghua University
We investigate multi-UAV navigation tasks where multiple drones need to reach initially unassigned goals in a limited time. Reinforcement learning (RL) has recently become a popular approach for such tasks. However, RL struggles with low sample efficiency when directly exploring (nearly) optimal policies in a large exploration space, especially with an increased number of drones (e.g., 10+ drones) or in complex environments (e.g., a 3-D quadrotor simulator). To address these challenges, we propose Multi-UAV Scalable Graph-based Planner (MASP), a goal-conditioned hierarchical planner that reduces space complexity by decomposing the large exploration space into multiple goal-conditioned subspaces. MASP consists of a high-level policy that optimizes goal assignment and a low-level policy that promotes goal navigation. We use a graph-based representation and introduce an attention-based mechanism as well as a group division mechanism to enhance cooperation between drones and adaptability to varying team sizes. The results demonstrate that MASP outperforms RL and planning-based baselines in task and execution efficiency. Compared to planning-based competitors, MASP improves task efficiency by over 27.92% in a 3-D continuous quadrotor environment with 20 drones.
Fig. 1: Substantial numbers of drones
Fig. 2: 3D complex environment
Multi-UAV Scalable Graph-based Planner (MASP) consists of two components:
Goal Matcher (GM)
Coordinated Action Executor (CAE)
As the joint state and action spaces grow exponentially with the number of drones, multi-UAV navigation tasks face a critical challenge of low sample efficiency in policy learning. To address this challenge, we propose Multi-UAV Scalable Graph-based Planner (MASP), a goal-conditioned HRL framework. MASP consists of a high-level policy for goal assignment and a low-level policy for goal navigation, each parameterized by its own neural network. As each drone perceives an assigned goal rather than all possible goals, the hierarchical framework decomposes the state and action spaces into multiple goal-conditioned subspaces, effectively improving sample efficiency.
We adopt the centralized training with decentralized execution (CTDE) paradigm, equipping each drone with its own MASP and ensuring scalability to an arbitrary number of drones. Take drone k as an example. The high-level policy, referred to as Goal Matcher (GM), assigns a goal to drone k at every high-level step. To achieve scalability with the number of drones, we model drones and goals as two dynamically expanding graphs. We then apply attention-based mechanisms to compute matching scores between drone k and the goals in these graphs, effectively extracting meaningful features from the variable-sized, high-dimensional graph representation to facilitate goal assignment. Once drone k receives its assigned goal, the low-level policy, termed Coordinated Action Executor (CAE), outputs its environmental action at each timestep. CAE employs a group division mechanism to extract relationships among a varying number of drones while reducing the input dimension by focusing only on drones within the same group at a time. Notably, this mechanism is not applicable to GM, as goal assignment requires global matching across all drones and goals. Incorporating the extracted drone relationships and goal positional information, drone k navigates toward its designated goal while maintaining cooperative behavior.
We optimize both GM and CAE using multi-agent proximal policy optimization (MAPPO), an extension of proximal policy optimization (PPO) specifically designed for multi-agent settings.
Fig. 3: Overview of Multi-UAV Scalable Graph-based Planner. Here, we take Drone k as an example.
The high-level policy handles goal assignment, which is a well-known NP-hard maximum matching problem. To address this challenge, we introduce an RL-based solution, Goal Matcher, which operates in a discrete action space consisting of goal candidates. Its state space includes the positional information of all drones and goals. To ensure a scalable representation, we construct two dynamically expandable fully connected graphs: the drone graph, G_A(X, E), and the goal graph, G_T(X, E). The node set X in R^{{N,M} x L} represents the N drones or M goals, where each node feature encodes positional information with L dimensions. The edge set E is initialized with all connections set to 1.
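To make the graph construction concrete, below is a minimal sketch under our own assumptions (plain NumPy arrays and hypothetical shapes); it is an illustration rather than the authors' implementation.

```python
import numpy as np

def build_fully_connected_graph(positions: np.ndarray):
    """Build a fully connected graph from entity positions.

    positions: (num_nodes, L) array of positional features
               (drones for G_A, goals for G_T).
    Returns node features X and an all-ones edge matrix E.
    """
    num_nodes, _ = positions.shape
    X = positions.copy()                    # node features, shape (num_nodes, L)
    E = np.ones((num_nodes, num_nodes))     # fully connected: all edges set to 1
    return X, E

# Hypothetical usage: N = 5 drones and M = 5 goals in a 3-D workspace (L = 3).
drone_pos = np.random.rand(5, 3)
goal_pos = np.random.rand(5, 3)
X_A, E_A = build_fully_connected_graph(drone_pos)   # drone graph G_A
X_T, E_T = build_fully_connected_graph(goal_pos)    # goal graph G_T
```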
As illustrated in Fig. 3, we introduce a Self-Encoder to update the node features in G_A or G_T by capturing the spatial relationships among drones or goals. Inspired by the effectiveness of self-attention mechanisms in handling long and variable-length sequences in text and images, we apply self-attention over each graph to compute X_attn in R^{{N,M} x L} for G_A or G_T.
The updated node features Y in R^{{N,M} x L} are obtained by adding X_attn back to X, allowing each node to encode spatial relationships with its neighbors. The formulation is as follows:
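A plausible reconstruction of the omitted equation, assuming the Self-Encoder concatenates X with the self-attention output before the MLP:

\[
X_{\mathrm{attn}} = \mathrm{SelfAttn}(X) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,
\qquad
Y = X + \mathrm{MLP}\big([X;\, X_{\mathrm{attn}}]\big),
\]

where Q, K, V are linear projections of X and [·; ·] denotes concatenation.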
Here, the MLP embeds X together with the self-attention output X_attn and generates an embedding in R^{{N,M} x L}. This output embedding is then added to X to obtain the updated node features Y.
Then, we introduce a Cross-Encoder to capture the relationship between drone k and the goals, computing a matching score for goal assignment. Since navigation distance is a key factor in multi-UAV navigation and obstacle avoidance is handled by the low-level policy, we simply use the L2 distance between drone k and each goal as the navigation distance, denoted as D in R^{M x 1}. We pass Y_A^k, Y_T, and D through a multi-layer perceptron (MLP) to extract the spatial relationships between drone k and the goals. A softmax operation is then applied to the MLP output, yielding the matching score S^k in R^{M x 1}, which represents the probability of assigning each goal to drone k. The formulation is as follows:
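A plausible reconstruction of the omitted equation, assuming the Cross-Encoder concatenates its three inputs before the MLP, together with the greedy goal selection described below:

\[
S^k = \mathrm{softmax}\big(\mathrm{MLP}\big([Y_A^k;\, Y_T;\, D]\big)\big) \in \mathbb{R}^{M \times 1},
\qquad
g^k = \arg\max_{m \in \{1,\dots,M\}} S^k_m .
\]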
Here, the MLP embeds Y_A^k, Y_T, and D, and outputs an embedding in R^{M x 1}. Then we apply a softmax operation over the output embedding to compute the matching scores of all candidate goals.
Finally, drone k selects the goal with the highest matching score at each high-level step.
Coordinated Action Executor takes the goal assigned by GM and navigates drone k toward it while minimizing collisions. In CAE, the action space consists of the environmental actions available to drone k, while the state space includes the locations and velocities of all drones, along with the location of the designated goal.
Group Division
During goal navigation, it is essential to extract drone relationships to ensure cooperation and collision avoidance. However, as the number of drones increases, directly using the full state information of all drones results in high input dimensionality, making training inefficient. To address this, we introduce a group division mechanism that focuses only on drones within the same group at a time. As illustrated in Fig. 4, we assume each group contains N_g drones and divide all drones, excluding drone k, into groups of (N_g - 1) drones. If the drones cannot be evenly divided, some drones are randomly assigned to multiple groups to complete the group formation. Finally, drone k is added to each group, ensuring it maintains connections with all other drones across groups.
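As a concrete, hypothetical illustration, here is a minimal Python sketch of the group division step as we understand it; the function name, the shuffling, and the random reuse of leftover drones are our own assumptions rather than the authors' implementation.

```python
import random

def divide_into_groups(drone_ids, k, group_size):
    """Divide all drones except drone k into groups of (group_size - 1),
    then add drone k to every group so it stays connected to all others."""
    others = [i for i in drone_ids if i != k]
    random.shuffle(others)

    per_group = group_size - 1
    num_groups = -(-len(others) // per_group)  # ceiling division

    # If the drones cannot be evenly divided, randomly reuse some drones so
    # every group is full (those drones then belong to multiple groups).
    shortfall = num_groups * per_group - len(others)
    padded = others + random.choices(others, k=shortfall)

    groups = [padded[i * per_group:(i + 1) * per_group] for i in range(num_groups)]
    return [[k] + g for g in groups]  # drone k joins every group

# Hypothetical usage: 10 drones, groups of N_g = 4, from drone k = 0's perspective.
print(divide_into_groups(list(range(10)), k=0, group_size=4))
```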
The Workflow
In CAE, we partition the drones into groups and represent each group as a fully connected subgraph, G_S(X, E). In each subgraph, the node set X in R^{N_g x L} encodes the positions and velocities of the drones, while the edges E are set to 1. Notably, each subgraph is newly constructed and distinct from the graphs used in GM. To capture drone relationships efficiently, we introduce Group Information Fusion, applying a graph convolutional network (GCN) to each subgraph and then aggregating the node features of drone k across all subgraphs using a mean operation over the graph-number dimension. This process yields an updated node feature of drone k, denoted as Y_S^k in R^{1 x L}, which encodes the relationships between drone k and all other drones.
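A plausible reconstruction of the omitted equation, with G denoting the number of subgraphs (the symbol G is our notational assumption):

\[
X_{\mathrm{GCN}}^{i} = \mathrm{GCN}\big(G_S^{i}(X, E)\big),
\qquad
Y_S^{k} = \frac{1}{G} \sum_{i=1}^{G} X_{\mathrm{GCN}}^{i,k} .
\]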
Here, X_GCN^i in R^{N_g x L} denotes the updated node features of the i-th subgraph after applying the GCN, and X_GCN^{i,k} in R^{1 x L} denotes the updated node feature of drone k in the i-th subgraph.
Next, we receive the positional embedding of the designated goal, T^k in R^{ 1 x L}, along with the updated node feature of drone k. An MLP layer is then applied to extract the relationship between drone k and its assigned goal.
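A plausible reconstruction of the omitted equation, assuming the two inputs are concatenated before the MLP:

\[
H^k = \mathrm{MLP}\big([T^k;\, Y_S^k]\big) \in \mathbb{R}^{1 \times D_A} .
\]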
Here, the MLP embeds T^k and Y_S^k, outputting H^k in R^{1 x D_A}, where D_A denotes the action dimension in CAE.
Finally, based on H^k, we generate an environmental action to guide drone k toward its goal.
Reward Design
The reward R_CAE^k for drone k in CAE encourages goal-conditioned navigation while minimizing collisions. R_CAE^k is a linear combination of the following three components (combined as sketched after the list):
(1) a complete bonus for successfully reaching the assigned goal, R_b: 1 if drone k reaches its assigned goal, otherwise 0.
(2) a distance penalty for task efficiency, R_d: the negative L2 distance between drone k and its assigned goal.
(3) a collision penalty for collision avoidance, R_c: computed based on the number of collisions at each time step.
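A sketch of the combined reward under the description above; the weighting coefficients w_b, w_d, and w_c are placeholders of our own, as their exact values are not given here:

\[
R_{\mathrm{CAE}}^{k} = w_b R_b + w_d R_d + w_c R_c,
\qquad
R_b = \mathbb{1}\{\text{drone } k \text{ reaches its goal}\},
\quad
R_d = -\lVert p^k - g^k \rVert_2,
\quad
R_c \propto -\,(\text{number of collisions at the current step}).
\]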
Fig. 4: The workflow of the Group Information Fusion in the Coordinated Action Executor. Take Drone k as an example.
We adopt the centralized training with decentralized execution (CTDE) paradigm, where agents make independent decisions during execution. We train GM and CAE simultaneously using multi-agent proximal policy optimization (MAPPO), a multi-agent variant of proximal policy optimization (PPO). The training hyperparameters can be found in Tab. 1.
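For reference, the clipped surrogate objective that MAPPO inherits from PPO and optimizes per agent, with advantages estimated by a centralized value function (MASP-specific settings are listed in Tab. 1):

\[
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid o_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid o_t)} .
\]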
Tab. 1: Training Hyperparameters
We compare MASP with three representative planning-based approaches (ORCA, RRT*, Voronoi) and four prominent RL-based solutions (MAPPO, DARL1N, HTAMP, MAGE-X). For the planning-based approaches, we utilize the Hungarian algorithm, denoted as '(H)', for goal assignment.
ORCA: ORCA is an obstacle avoidance algorithm that excels in multi-agent scenarios. It predicts the movements of surrounding obstacles and other drones, and then infers a collision-free velocity for each drone.
RRT*: RRT* is a sample-based planning algorithm that builds upon the RRT algorithm. It first finds a feasible path by sampling points in the space and then iteratively refines the path to achieve an asymptotically optimal solution.
Voronoi: A Voronoi diagram comprises a set of contiguous polygons whose boundaries are the perpendicular bisectors of the lines connecting adjacent points. By partitioning the map into Voronoi cells, a path can be planned from the starting point to the destination at a safe distance.
MAPPO: MAPPO is a straightforward extension of PPO to the multi-agent setting, where all agents share a single policy with a common set of parameters. We update this shared policy based on the aggregated trajectories from all drones. We additionally apply an attention mechanism to enhance the model's performance.
DARL1N: DARL1N operates under an independent decision-making setting designed for large-scale agent scenarios. It breaks the curse of dimensionality by restricting drone interactions to one-hop neighborhoods.
HTAMP: This hierarchical method for task and motion planning via reinforcement learning integrates high-level task generation with low-level action execution.
MAGE-X: This is a hierarchical approach to multi-UAV navigation tasks. It first centrally allocates target goals to the drones at the beginning of the episode and then utilizes a GNN to construct a subgraph containing only important neighbors to enhance cooperation.
MPE: We present the training curves in Fig. 5 and the evaluation performance in Tab. 2. MASP outperforms all methods with the least training data, and its advantage in Steps and Success Rate becomes more evident as the number of drones increases. MASP also achieves the lowest Collision Rate, reflecting strong drone cooperation. In particular, in scenarios with 50 drones, MASP achieves nearly 100% Success Rate, while the other RL baselines fail to complete the task in any episode. Among the planning-based methods, (H)RRT* and (H)Voronoi show comparable performance: (H)RRT* relaxes complex constraints, making it suitable for scenarios with obstacles, and (H)Voronoi partitions the map to promote coordination. However, they still require around 19.12% more Steps than MASP at N = 50.
Fig. 5: Comparison between MASP and other baselines in MPE with N=5, 20, 50.
Tab. 2: Performance of MASP, planning-based baselines, and RL-based baselines with N = 5, 20, 50 drones in MPE. The backslash in Steps denotes that the agents cannot reach a 100% Success Rate in any episode.
OmniDrones: We report the training performance in Fig. 6 and the evaluation results in Tab. 3. MASP is superior to all competitors, achieving a 100% Success Rate at N = 5 and 96% at N = 20 in this complex 3-D environment. As for the Collision Rate, MASP outperforms the other baselines due to high agent cooperation. In contrast to the RL baselines, most planning-based baselines achieve over 90% Success Rate, primarily owing to a goal assignment that avoids duplicate assignments. Thus, we focus more on Steps for comparison: the planning-based baselines require at least 27.92% more Steps than MASP at N = 20.
Fig. 6: Comparison between MASP and other baselines in OmniDrones with N = 5, 20.
Tab. 3: Performance of MASP, planning-based baselines, and RL-based baselines with N = 5, 20 drones in OmniDrones. The backslash in Steps denotes that the drones fail to reach a 100% Success Rate in any episode.
Due to unstable communication or agent loss, we further consider scenarios where the team size decreases during an episode. With more goals than drones, some drones may need to reach one goal first and then adjust their plans to navigate toward the remaining goals. The Success Rate is defined as the ratio of landmarks reached during navigation to the total number of landmarks. We use "N1 ⇒ N2" to denote that an episode starts with N1 agents and switches to N2 agents after one-third of the total timesteps; the number of landmarks is N1. Since MASP is trained with a fixed team size, this setup presents a zero-shot generalization challenge, and we use the MASP model trained with N = N1. For the Hungarian algorithm, which centrally assigns goals to agents, we randomly select a subset of unreached goals matching the agent count at each global step. As the RL baselines lack generalization in both network architecture and final performance, we compare MASP only with the planning-based methods.
MPE: As shown in Tab. 4, the advantage of MASP becomes even more apparent than in the fixed-team-size setting. MASP consumes around 47.87% fewer Steps than (H)RRT* and (H)Voronoi in the 50 ⇒ 20 scenario. This indicates that Hungarian-based goal assignment struggles to handle varying team sizes.
Tab. 4: Performance of MASP and planning-based baselines with a varying team size in MPE.
OmniDrones: Tab. 5 presents the evaluation results for zero-shot generalization in OmniDrones. The planning-based baselines fail to reach a 100% Success Rate in any episode. This challenge stems from the absence of certain agents: the remaining drones must dynamically adjust their plans to reach more than one goal, which is particularly difficult in such a complex 3-D environment. Conversely, MASP maintains an average Success Rate of 94% thanks to its flexible and adaptive strategy.
Tab. 5: Performance of MASP and planning-based baselines with a varying team size in OmniDrones. The backslash in Steps denotes that the drones cannot reach a 100% Success Rate in any episode.
To evaluate the execution efficiency, we compute the per-step time of the baselines and MASP in MPE with 50 drones, excluding environmental execution time. As shown in Tab. 6, MASP is at least 3x more computationally efficient than the planning-based competitors; compared with the best planning-based competitor, (H)RRT*, its per-step computation is 32x faster. While MASP has execution efficiency comparable to the RL-based methods, it outperforms them in task performance.
Tab. 6: Average performance on the per-step time.