Online Planning for Multi-UAV Pursuit-Evasion in Unknown Environments Using Deep Reinforcement Learning
Jiayu Chen*, Chao Yu*, Guosheng Li, Wenhao Tang, Shilong Ji,
Xinyi Yang, Botian Xu, Huazhong Yang, Yu Wang
In this paper, we address these challenges by integrating calibrated UAV dynamics models into policy training for multi-UAV pursuit-evasion tasks. The RL-based policy generates collective thrust and body rates (CTBR) control commands, balancing flexibility, cooperative decision-making, and sim-to-real transfer. The overall algorithm is named OPEN (Online planning for multi-UAV pursuit-evasion in unknown environments). OPEN introduces an attention-based, evader-prediction-enhanced network that integrates predictive information about the evader's movements into the RL policy inputs, improving the ability to cooperatively capture the evader under partial observation. Additionally, we introduce an adaptive environment generator into MARL training, which automatically generates diverse and appropriately challenging curricula, boosting sample efficiency and enhancing policy generalization to unseen scenarios. Finally, we employ a two-stage reward refinement process to regularize the policy output and ensure stable behavior during real-world deployment.
We demonstrate another pursuit-evasion behavior in the Passage scenario. Visualization results show that the evader, starting from the center of the arena, attempts to flee toward the right side while maintaining distance from the nearest quadrotor.
Multi-UAV pursuit-evasion has important applications in both military and civilian contexts, including UAV defense systems, adversarial drone engagements, and search-and-rescue. This paper aims to learn an RL policy for multi-UAV pursuit-evasion that performs online planning in unknown environments and can be deployed on real UAVs. The key challenges are:
Joint optimization of planning and control: UAV actions must be coordinated to capture the evader under partial observation, while avoiding environmental obstacles, preventing collisions, and adhering to the dynamics model and physical constraints for safe and feasible flight.
Large exploration space: The 3D nature of UAVs, combined with varying scenarios, significantly expands the state space, so RL requires a large number of samples to find viable strategies.
Policy generalization: RL strategies that are optimized for specific scenarios often fail to generalize to new environments.
Sim-to-real transfer: The sim-to-real gap, a common issue in RL, is particularly pronounced in multi-UAV pursuit-evasion tasks due to the physical constraints of UAVs and the need for agile, precise maneuvers.
Fig. 1 introduces the pipeline of our method. To minimize the sim-to-real gap, we first calibrate the parameters of the quadrotor dynamics model via system identification and integrate them into a GPU-parallel simulator to construct the multi-UAV pursuit-evasion task. We use collective thrust and body rates (CTBR) as control commands and train an RL policy to output them with MAPPO, a state-of-the-art MARL algorithm. To enhance active exploration under partial observability, we design an Evader Prediction-Enhanced Network that leverages an attention-based architecture to capture the interrelations within observations and a trajectory prediction network to forecast the evader's movement. To further enhance sample efficiency and policy generalization, we propose an Adaptive Environment Generator to automatically generate curricula, enabling efficient exploration of the entire task space. Finally, with two-stage reward refinement, the learned policy is applied directly to real quadrotors without any real-data fine-tuning.
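As a concrete illustration of the action interface, the snippet below sketches how a squashed policy output could be mapped to a CTBR command. The `CTBRCommand` container, the normalization ranges, and `MAX_BODY_RATE` are illustrative assumptions rather than the exact interface used in our simulator.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class CTBRCommand:
    """Collective thrust and body rates (illustrative container, not the real interface)."""
    thrust: float            # normalized collective thrust in [0, 1]
    body_rates: np.ndarray   # desired roll, pitch, yaw rates in rad/s

# Assumed limit; the real bound would come from the calibrated dynamics model.
MAX_BODY_RATE = 3.0  # rad/s

def action_to_ctbr(action: np.ndarray) -> CTBRCommand:
    """Map a squashed policy action in [-1, 1]^4 to a CTBR command."""
    a = np.clip(action, -1.0, 1.0)
    thrust = 0.5 * (a[0] + 1.0)       # rescale [-1, 1] -> [0, 1]
    rates = a[1:] * MAX_BODY_RATE     # scale the remaining entries to rad/s
    return CTBRCommand(thrust=float(thrust), body_rates=rates)
```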
The Evader Prediction-Enhanced Network consists of two parts, the Evader Prediction Network and the Actor-Critic Network; their structure is shown in Fig. 2.
The Evader Prediction Network uses an LSTM to predict the evader's path for the next K timesteps, based on historical data of quadrotor positions and the evader’s states. If the evader is blocked and undetected, a marker value replaces its positions and velocities. We collect n+K timesteps during training, using the first n as inputs and the next K as labels.
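A minimal PyTorch sketch of such a predictor is given below. The hidden size, the prediction horizon, and the marker value `MASK_VALUE` are illustrative assumptions; only the overall structure (an LSTM over the last n steps regressing the next K evader positions) follows the description above.

```python
import torch
import torch.nn as nn

MASK_VALUE = -10.0  # assumed marker substituted when the evader is occluded/undetected

class EvaderPredictionNet(nn.Module):
    """LSTM that predicts the evader's positions for the next K steps
    from the last n steps of quadrotor positions and evader states."""

    def __init__(self, input_dim: int, hidden_dim: int = 128, horizon_k: int = 5):
        super().__init__()
        self.horizon_k = horizon_k
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 3 * horizon_k)  # K future xyz positions

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, n, input_dim); occluded evader entries hold MASK_VALUE
        _, (h_n, _) = self.lstm(history)
        pred = self.head(h_n[-1])                 # (batch, 3 * K)
        return pred.view(-1, self.horizon_k, 3)   # (batch, K, 3)

# Training: from each rollout of n + K steps, the first n steps form the input
# sequence and the evader positions of the next K steps form the regression labels.
```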
The actor and critic networks are built on the Attention-based Observation Encoder. Each of the three components of the quadrotor's observation is encoded separately into a 128-dimensional embedding, and a multi-head self-attention module captures the relationships between these embeddings. The self-embedding is combined with the attended features and passed through an MLP to produce the feature h. The actor network uses h to parameterize a Gaussian action distribution, while the critic network estimates the state value from h with an MLP.
(a) Evader Prediction Network
(b) Attention-based Observation Encoder
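The sketch below illustrates one possible realization of the attention-based observation encoder and the Gaussian actor head. The number of attention heads, the MLP widths, and the exact way the self-embedding is fused with the attended features are assumptions; the critic would apply a similar MLP value head to the same feature h.

```python
import torch
import torch.nn as nn

class AttentionObsEncoder(nn.Module):
    """Embed each observation component and relate them via multi-head self-attention."""

    def __init__(self, comp_dims, embed_dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.embeds = nn.ModuleList([nn.Linear(d, embed_dim) for d in comp_dims])
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(2 * embed_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 256), nn.ReLU())

    def forward(self, components):
        # components: one tensor of shape (batch, comp_dims[i]) per observation part
        tokens = torch.stack([e(c) for e, c in zip(self.embeds, components)], dim=1)
        attended, _ = self.attn(tokens, tokens, tokens)          # (batch, parts, embed)
        # Fuse the self embedding (assumed to be the first token) with the attended features.
        fused = torch.cat([tokens[:, 0], attended.mean(dim=1)], dim=-1)
        return self.mlp(fused)                                   # feature h

class GaussianActor(nn.Module):
    """Actor head: parameterizes a diagonal Gaussian over CTBR actions from h."""

    def __init__(self, encoder: AttentionObsEncoder, act_dim: int = 4):
        super().__init__()
        self.encoder = encoder
        self.mu = nn.Linear(256, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, components):
        h = self.encoder(components)
        return torch.distributions.Normal(self.mu(h), self.log_std.exp())
```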
We leverage two inductive biases: (1) different obstacle configurations require distinct pursuit strategies, and (2) under the same obstacle setup, the initial positions of UAVs and the evader significantly affect capture difficulty. Based on these insights, we decompose the task space into two modules: Local Expansion, which explores UAV and evader positions under fixed obstacles, and Global Exploration, which investigates diverse obstacle configurations to generate simulation environments (Fig. 4).
Local Expansion: The Local Expansion module enhances the quadrotors’ ability to capture the evader from any initial position within a fixed obstacle setup.
Global Exploration: The Global Exploration module explores unseen obstacle configurations by randomly sampling task parameters from the entire parameter space W, including the number and positions of obstacles, as well as the initial positions of the UAVs and the evader.
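The following sketch shows how these two modules might alternate when generating training tasks. The `Task` fields, the perturbation scale, and the mixing probability `p_local` are illustrative assumptions, not the exact curriculum procedure.

```python
import random
from dataclasses import dataclass

@dataclass
class Task:
    """Illustrative task description: obstacle layout plus initial positions."""
    obstacles: list       # obstacle positions (their number may vary)
    uav_starts: list      # initial UAV positions
    evader_start: tuple   # initial evader position

def local_expansion(task: Task, sigma: float = 0.3) -> Task:
    """Local Expansion: keep the obstacle layout fixed; perturb start positions."""
    jitter = lambda p: tuple(x + random.gauss(0.0, sigma) for x in p)
    return Task(obstacles=task.obstacles,
                uav_starts=[jitter(p) for p in task.uav_starts],
                evader_start=jitter(task.evader_start))

def next_training_task(solved_tasks, sample_from_W, p_local: float = 0.7) -> Task:
    """Mix local expansion around solved tasks with global exploration of W."""
    if solved_tasks and random.random() < p_local:
        return local_expansion(random.choice(solved_tasks))
    return sample_from_W()  # Global Exploration: fresh obstacle layout and starts
```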
Our experiments are conducted using OmniDrones, a high-speed UAV simulator for RL policy training. We design four test scenarios (Fig. 4), illustrated in a top-down view. The within-distribution scenarios, Wall and Narrow Gap, are intended to challenge the pursuit strategy. The two out-of-distribution scenarios, Random and Passage, are designed to evaluate the method’s generalization to unseen environments.
We compare our method against three heuristic methods (Angelani, Janosov, and APF) and two RL-based methods (DACOOP and MAPPO).
Angelani: Pursuers are attracted to the nearest particles of the opposing group, i.e., the evader.
Janosov: Janosov designs a greedy chasing strategy and collision avoidance system that accounts for inertia, time delay, and noise.
APF: APF guides pursuers to a target position by combining attractive, repulsive, and inter-individual forces. By setting the pursuers' target to the evader's position and adjusting the hyperparameters of these forces, the pursuers can navigate toward the evader while avoiding obstacles (a minimal force sketch follows this list).
DACOOP: DACOOP employs RL to adjust the primary hyperparameters of APF, enabling effective adaptation to diverse scenarios.
MAPPO: MAPPO is a state-of-the-art multi-agent reinforcement learning algorithm for general cooperative problems.
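For reference, here is a textbook-style sketch of an artificial potential field of the kind used by the APF baseline; the exact force shaping and hyperparameters of that baseline may differ.

```python
import numpy as np

def apf_velocity(pursuer, evader, obstacles, teammates,
                 k_att=1.0, k_rep=0.5, k_ind=0.3, d0=1.0):
    """Textbook-style APF: attraction toward the evader, repulsion from nearby
    obstacles, and an inter-individual term that keeps pursuers spread out."""
    p = np.asarray(pursuer, dtype=float)
    force = k_att * (np.asarray(evader, dtype=float) - p)   # attractive term

    for obs in obstacles:                                    # repulsive terms
        diff = p - np.asarray(obs, dtype=float)
        d = np.linalg.norm(diff)
        if 1e-6 < d < d0:
            force += k_rep * (1.0 / d - 1.0 / d0) * diff / d**3

    for mate in teammates:                                   # inter-individual terms
        diff = p - np.asarray(mate, dtype=float)
        d = np.linalg.norm(diff)
        if 1e-6 < d < d0:
            force += k_ind * diff / d**2

    return force  # desired velocity direction (typically clipped to a max speed)
```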
Table 1 shows the performance across the four test scenarios, focusing on corner cases. In both within-distribution and out-of-distribution scenarios, our method significantly outperforms the chosen baselines.
Within-distribution scenario: In Wall, our method achieves over a 98% capture rate, with the lowest collision rate and the fewest capture steps. In Narrow Gap, it maintains a 100% capture rate and the lowest collision rate, despite requiring slightly more capture steps than MAPPO.
Out-of-distribution scenario: In Random and Passage, our method achieves 100% capture rates, outperforming baselines (78.3% and 81.8%), and requires the fewest capture steps. While the collision rate in Passage is slightly higher (0.6%), the results demonstrate strong generalization to unseen scenarios.
Given the poor baseline performance, we further test the algorithms in a simpler, obstacle-free scenario. As shown in Fig. 6a and 6b, the baselines exhibit a sharp decline in capture rate as the capture radius decreases, underscoring the task's difficulty. A smaller radius demands more agile UAV behavior, requiring rapid formation adjustments to block all of the evader's escape routes. However, the heuristic algorithms and DACOOP, which rely on force-based adjustments for obstacle avoidance and pursuit, inherently limit UAV agility. When the capture radius is small (e.g., 0.3), MAPPO struggles to learn an optimal capture strategy, with the success rate converging to approximately 80%. In contrast, our method maintains high capture rates with only a slight increase in capture steps as the radius shrinks, demonstrating the superior cooperative capture capability of pure RL strategies.
MAPPO + EPN: MAPPO with the Evader Prediction Network.
MAPPO + AEG: MAPPO with the Adaptive Environment Generator.
MAPPO: removes both modules.
As shown in Fig. 7, our method demonstrates the highest capture performance and improves sample efficiency by more than 50% compared to MAPPO. MAPPO achieves only a 90% capture rate even with over 2.0 billion samples, highlighting how challenging it is to derive an RL policy that accounts for UAV dynamics and performs well across diverse scenarios. As shown in Tab. 1, MAPPO's average performance on the training tasks is also suboptimal, leading to significantly worse performance in corner cases such as the Wall and Narrow Gap scenarios.
We further evaluate our method and its variants in out-of-distribution (OOD) scenarios. As shown in Tab. 2, our method achieves the highest capture rate and the lowest collision rate in the Random scenario, with capture timesteps comparable to "MAPPO + EPN". In the Passage scenario, our method consistently outperforms all variants. The EPN module provides UAVs with the capability to predict the evader's strategy, allowing them to plan captures based on the evader's future trajectory. This significantly reduces the impact of OOD scenarios on policy performance. Additionally, the AEG module enhances the policy's ability to address corner cases, promoting better generalization across diverse environments. By integrating these two modules, our method surpasses all baseline variants.
As shown in Fig. 8, we evaluate the sensitivity of OPEN and baselines to varying evader speeds in the Passage scenario. Unlike obstacle-free settings, Passage allows the evader to exploit obstacle occlusion and speed advantages.
As the evader's speed increases, the heuristic baselines and DACOOP show significant drops in capture rate due to poor early interception. MAPPO's capture rate first drops and then rises, as it learns a three-way interception strategy similar to OPEN's: at higher speeds, the evader is more likely to be caught in its blind spot during turns, lacking time to adjust direction.
In contrast, our method maintains robust performance across speeds, achieving high capture rates for faster evaders without retraining, despite being trained only for v = 1.3.
In simulation, we observe four emergent behaviors across the test scenarios, which further illustrate the cooperative pursuit capabilities of our strategy. To visualize the behavior patterns more clearly, we provide a top view of our 3D scenarios and use different colors to distinguish the states of the pursuer UAVs.
We explain the emergent behaviors in detail and compare our method with the best baseline (DACOOP) in the four test scenarios. The simulation videos are shown in Fig. 8.
Wall
OPEN
Best baseline (DACOOP)
Narrow Gap
OPEN
Best baseline (DACOOP)
Random
OPEN
Best baseline (DACOOP)
Passage
OPEN
Best baseline (DACOOP)
In the Wall scenario, our approach achieves a double-sided surround strategy, where one quadrotor maintains surveillance while the other two approach the evader from both flanks. In contrast, the baseline methods struggle to find a path to the evader quickly because of the obstacles directly ahead.
In the Narrow Gap scenario, unlike the baselines, which continuously follow the evader, our approach learns to take a shortcut and intercept the evader along its inevitable path.
In the Random scenario, none of the drones detect the evader initially. Guided by the predicted evader trajectory, our quadrotors quickly navigate behind the obstacles and successfully locate the evader hidden there.
In the Passage scenario, our approach divides the quadrotors into three groups to block all possible escape routes for the evader. In contrast, the baseline methods tend to greedily approach the evader, leaving an escape route open.
We further deploy the policy, trained with the system-identified dynamics, on three real Crazyflie 2.1 quadrotors, each with a maximum speed limited to 1.0 m/s. A motion capture system provides the quadrotors' states. We use a virtual evader and feed its true states to the actor and the evader prediction network whenever the evader is detected. The actor and the evader prediction network run on a local computer, which sends CTBR control commands at 100 Hz to the quadrotors via radio. Real-world experiments demonstrate that our method produces strategies consistent with those in simulation, validating the feasibility of the capture policy on real quadrotors. Notably, we use a single actor and evader prediction network throughout the real-world experiments and do not train for any specific scenario.
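To make the deployment loop concrete, the sketch below shows a roughly 100 Hz ground-station loop. `get_mocap_states`, `send_ctbr`, `virtual_evader`, and the `actor`/`predictor` calls are hypothetical placeholders for the motion-capture interface, the radio link to the Crazyflies, the simulated evader, and the trained networks; this is not the actual deployment code.

```python
import time

CONTROL_HZ = 100
DT = 1.0 / CONTROL_HZ

def run_pursuit(actor, predictor, get_mocap_states, send_ctbr, virtual_evader):
    """Ground-station loop: read states, query the policy, send CTBR commands.

    All arguments are hypothetical callables/objects standing in for the trained
    networks, the motion-capture interface, the radio link, and the virtual evader.
    """
    history = []
    while True:
        t0 = time.perf_counter()
        quad_states = get_mocap_states()                     # from motion capture
        evader_state = virtual_evader.step(DT)               # simulated evader
        detected = virtual_evader.visible_from(quad_states)  # occlusion check (assumed)
        history.append((quad_states, evader_state if detected else None))

        evader_pred = predictor(history)                 # predicted future evader path
        actions = actor(quad_states, evader_pred)        # one CTBR command per quadrotor
        for uav_id, cmd in enumerate(actions):
            send_ctbr(uav_id, cmd)                       # thrust + body rates over radio

        time.sleep(max(0.0, DT - (time.perf_counter() - t0)))  # hold ~100 Hz
```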
We conduct real-world experiments in the four test scenarios and provide real-world footage. Using the real flight data, we reproduce the trajectories of the quadrotors and the virtual evader in top view and side view to better understand their behaviors. To visualize the real-world experiments more clearly, we use circles in two colors to indicate whether or not the quadrotors have captured the evader.
Wall
Top-view
Side-view
Real-world footage
Narrow Gap
Top-view
Side-view
Real-world footage
Random
Top-view
Side-view
Real-world footage
Passage
Top-view
Side-view
Real-world footage