This paper tackles the challenging task of maintaining formation among multiple unmanned aerial vehicles (UAVs) while avoiding both static and dynamic obstacles during directed flight. The complexity of the task arises from its multi-objective nature, the large exploration space, and the sim-to-real gap. To address these challenges, we propose a two-stage reinforcement learning (RL) pipeline. In the first stage, we randomly search for a reward function that balances key objectives: directed flight, obstacle avoidance, formation maintenance, and zero-shot policy deployment. The second stage applies this reward function to more complex scenarios and utilizes curriculum learning to accelerate policy training. Additionally, we incorporate an attention-based observation encoder to improve formation maintenance and adaptability to varying obstacle densities. Experimental results in both simulation and real-world environments demonstrate that our method outperforms both planning-based and RL-based baselines in terms of collision-free rates and formation maintenance across static, dynamic, and mixed obstacle scenarios. Ablation studies further confirm the effectiveness of our curriculum learning strategy and attention-based encoder.
We propose a two-stage RL training pipeline. In the first stage, we randomly search within the reward function space to find the best weight vector that balances all objectives. The second stage applies the reward function to a more complex scenario and solves the transformed single-objective task with curriculum learning.
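For concreteness, the first-stage search can be summarized by the sketch below. Here `train_and_evaluate` is a placeholder callable that trains a policy under a given weight vector and returns its satisfaction rate; the helper name and the Dirichlet sampling are illustrative assumptions, not our exact implementation.

```python
import numpy as np

def search_reward_weights(train_and_evaluate, num_trials=32, num_objectives=4, seed=0):
    """Stage 1: random search over reward weight vectors.

    `train_and_evaluate` is a user-supplied (hypothetical) callable that maps a
    weight vector to the satisfaction rate of the policy trained with it.
    """
    rng = np.random.default_rng(seed)
    best_w, best_rate = None, -np.inf
    for _ in range(num_trials):
        # One weight per objective: directed flight, obstacle avoidance,
        # formation maintenance, and action smoothing.
        w = rng.dirichlet(np.ones(num_objectives))
        rate = train_and_evaluate(w)  # train with reward = w . r, measure satisfaction rate
        if rate > best_rate:
            best_w, best_rate = w, rate
    return best_w, best_rate
```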
We address the task of maintaining behavior-based UAV formation while avoiding both static and dynamic obstacles during directed flight. The task aims to achieve the following four objectives.
Our task is more challenging than those in previous works, because in our setting, static obstacles are very dense, leaving only small gaps for the drones to pass through (in extreme cases, about 0.2 m), and dynamic obstacles move at high speeds, requiring the drones to respond quickly and make sharp turns. Additionally, our task scenario is partially observable: the drones can only observe obstacles within a 2.5 m range, posing a greater challenge to traditional control-based methods. Our default experiment scenario (mixed obstacles, 2 balls and 10 columns) looks as follows.
Metrics:
Collision-Free Rate (CFR). An episode is considered successful if drones avoid collisions with all obstacles and with each other while reaching a specified area in time. The collision-free rate is defined as the proportion of successful episodes out of the total episodes. This metric evaluates the performance of directed flight and obstacle avoidance. The higher, the better.
Formation Maintenance (FM). In successful episodes, we compute the unnormalized Laplacian distance between the target formation and the actual swarm configuration averaged over the episode. Unlike the normalized metric used during training, this measurement is rotation-invariant but sensitive to size. Hence, a smaller unnormalized Laplacian distance indicates a more desirable formation in terms of both shape and size.
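As an illustration, a formation distance of this kind can be computed as in the sketch below. The construction assumes a complete graph over the drones with pairwise-distance edge weights, which may differ in detail from the exact Laplacian used in our evaluation.

```python
import numpy as np

def unnormalized_laplacian(positions):
    """Graph Laplacian of the complete graph over drone positions,
    with edge weights given by pairwise Euclidean distances."""
    diff = positions[:, None, :] - positions[None, :, :]
    weights = np.linalg.norm(diff, axis=-1)   # W_ij = ||p_i - p_j||
    degree = np.diag(weights.sum(axis=1))
    return degree - weights                   # L = D - W

def formation_distance(actual, target):
    """Frobenius distance between the Laplacians of the actual swarm
    configuration and the target formation. Because it depends only on
    pairwise distances, it is translation- and rotation-invariant but
    sensitive to the formation's size."""
    return np.linalg.norm(unnormalized_laplacian(actual) - unnormalized_laplacian(target))
```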
Performance of all methods in mixed scenario (2 balls + 10 columns).
Varying the number of static obstacles.
Varying the number of dynamic obstacles.
Our method outperforms all baseline methods regarding both CFR and FM in the mixed obstacle scenario, and demonstrates comparable, if not superior, performance to the best baseline in extremely confined environments with 20 columns or 5 balls. The results demonstrate our method's strong capability to maintain effective formation while evading various obstacles, underscoring its ability to accomplish multi-objective tasks.
To demonstrate the advantages of our method and the weaknesses of the baseline methods, we select special test cases in static and dynamic scenarios. We further analyze the issues with the baselines in the "Appendix: Baseline Failure Analysis" section. These examples show that our method can achieve agile obstacle avoidance while maintaining the desired formation.
Our method, static scenario
Swarm-Formation, static scenario
R-Mader, static scenario
Swarm-RL, static scenario
Our method, dynamic scenario
Swarm-Formation, dynamic scenario
R-Mader, dynamic scenario
Swarm-RL, dynamic scenario
Weight vectors with similar satisfaction rates can yield diverse behaviors. Here we show the trajectories of 3 drones dodging a dynamic obstacle, guided by policies with similar satisfaction rates. This suggests that our method offers strong flexibility and policy diversity, allowing humans to choose behaviors that best align with their preferences.
Observation Encoder
Only our method achieves satisfactory success rates on all objectives. The variants either fail to reach the destination or collide with obstacles.
Curriculum Learning
Trained with the same amount of data, our curriculum achieves superior overall performance.
We deploy our RL policy and the planning-based baselines on Crazyflie 2.0. Through the zero-shot Sim2Real deployment, we validate the efficacy of the action smoothing objective.
Our method, static scenario
Swarm-Formation, static scenario
R-Mader, static scenario
Our method, dynamic scenario
Swarm-Formation, dynamic scenario
R-Mader, dynamic scenario
Our method, mixed scenario
Swarm-Formation, mixed scenario
R-Mader, mixed scenario
Swarm-Formation is designed to handle formation maintenance around static obstacles. We integrate a trajectory prediction module to facilitate dynamic obstacle avoidance. Upon detecting a dynamic obstacle, the module captures two sequential frames of the obstacle and fits a parabolic curve to estimate its trajectory. This estimated parabolic curve is then treated as a static obstacle and sent to the planner.
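For reference, a minimal sketch of such a two-frame parabolic extrapolation is shown below. It assumes the obstacle follows a ballistic (free-fall) trajectory, and the function name is illustrative rather than taken from the Swarm-Formation codebase.

```python
import numpy as np

GRAVITY = np.array([0.0, 0.0, -9.81])

def predict_ballistic_trajectory(p0, p1, dt, horizon, num_points=20):
    """Fit a parabola from two consecutive obstacle observations p0, p1
    (taken dt seconds apart) and extrapolate it over `horizon` seconds."""
    v1 = (p1 - p0) / dt                      # finite-difference velocity at the second frame
    ts = np.linspace(0.0, horizon, num_points)[:, None]
    # p(t) = p1 + v1 * t + 0.5 * g * t^2, the ballistic parabola
    return p1 + v1 * ts + 0.5 * GRAVITY * ts ** 2
```

The sampled points along the predicted curve can then be handed to the planner as if they were static obstacles.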
R-Mader focuses on avoiding both static and dynamic obstacles for multiple drones without considering formation. To incorporate formation, we set the desired formation as the target positions of the drones. For dynamic obstacle avoidance, R-Mader requires knowledge of the obstacles' trajectories. Therefore, we fit parabolic curves to the obstacles and provide the curve parameters to R-Mader.
Swarm-RL focuses on collision avoidance for static obstacles and inter-drone collisions but does not handle dynamic obstacles or formation maintenance. To address this, we incorporate the relative position and velocity of dynamic obstacles into its observation. Additionally, we introduce a velocity tracking reward to aid in training. As with R-Mader, we set the desired formation as the final positions of each drone.
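As an illustration, the added velocity tracking term could take a form like the following; the exponential shaping and scale are assumptions rather than the exact reward used in our adaptation.

```python
import numpy as np

def velocity_tracking_reward(velocity, target_velocity, scale=1.0):
    """Reward that is largest when the drone's velocity matches the desired
    (forward) velocity and decays smoothly with the tracking error."""
    error = np.linalg.norm(velocity - target_velocity)
    return scale * np.exp(-error)
```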
Swarm-Formation has a low CFR because it is designed for large-scale environments rather than the fine-grained, dense obstacle distributions in our setting. It also lacks proper trajectory prediction for high-speed dynamic obstacles. Therefore, when a dynamic obstacle suddenly appears in front of a drone, the algorithm prematurely assumes a collision and ceases to control the drone.
R-Mader, which inherently incorporates dynamic obstacle avoidance, performs better in dynamic scenarios compared to Swarm-Formation, but still falls short of our method's performance. Similar to Swarm-Formation, R-Mader struggles to navigate through dense distributions of fine-grained obstacles, resulting in low CFR in static scenarios.
While Swarm-RL maintains a high CFR in static obstacle scenarios, its formation breaks easily: when one drone changes its behavior to avoid obstacles, the other drones simply ignore it and keep flying forward. In dynamic settings, the CFR drops significantly as the number of obstacles increases, indicating inadequate scalability.
In the first stage, an episode is considered satisfying if the drones:
Fly forward (move more than 40 m along the y axis in 900 time steps);
Avoid collision (keep at least 0.15 m from other drones and obstacles);
Maintain formation (keep unnormalized Laplacian cost under 5).
In the second stage, the success criterion for each objective is as follows.
Directed flight: move more than 20 m along the y axis in 900 time steps;
Formation: keep unnormalized Laplacian cost under 5;
Obstacle avoidance: keep at least 0.15 m from other drones and obstacles;
Action smoothing: keep throttle difference between two consecutive time steps under 0.005. This threshold is defined empirically.
An episode is considered satisfying in the second stage if it satisfies all four criteria simultaneously.
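Putting the thresholds above together, the second-stage check amounts to the sketch below; the argument names describe quantities logged per episode and are illustrative.

```python
def episode_satisfied(y_displacement, min_clearance, laplacian_cost, max_throttle_diff):
    """Second-stage satisfaction check: all four criteria must hold simultaneously."""
    return (
        y_displacement > 20.0           # directed flight: > 20 m along y in 900 steps
        and laplacian_cost < 5.0        # formation: unnormalized Laplacian cost under 5
        and min_clearance >= 0.15       # obstacle avoidance: at least 0.15 m clearance
        and max_throttle_diff < 0.005   # action smoothing: throttle change per step under 0.005
    )
```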
We adopt collective thrust and body rates (CTBR) commands as policy actions to ensure robust sim-to-real transfer and agile control. These CTBR commands are subsequently converted to motor thrust commands using PID controllers. Concretely, the action for a single drone is expressed as a = (c, ω_roll, ω_pitch, ω_yaw), where c ∈ [0, 1] indicates the collective thrust, and ω ∈ [-π, π] signifies the body rates for the corresponding axes.
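A minimal sketch of this CTBR-to-motor mapping is given below, assuming per-axis body-rate PID loops followed by a standard X-configuration mixer; the gains, sign convention, and normalization are illustrative assumptions, not the controller used on the Crazyflie.

```python
import numpy as np

class RatePID:
    """One PID loop per body axis, tracking the commanded body rates."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = np.zeros(3)
        self.prev_error = np.zeros(3)

    def update(self, rate_cmd, rate_meas):
        error = rate_cmd - rate_meas
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def ctbr_to_motor_thrusts(action, rate_meas, pid, max_thrust=1.0):
    """Map a CTBR action a = (c, w_roll, w_pitch, w_yaw), with c in [0, 1] and
    body rates in [-pi, pi], to four normalized motor thrusts via an
    X-configuration mixer (illustrative sign convention)."""
    c = np.clip(action[0], 0.0, 1.0)
    torque = pid.update(np.asarray(action[1:]), rate_meas)   # roll, pitch, yaw corrections
    mixer = np.array([
        [1.0, -1.0, -1.0, -1.0],   # motor 1
        [1.0, -1.0,  1.0,  1.0],   # motor 2
        [1.0,  1.0,  1.0, -1.0],   # motor 3
        [1.0,  1.0, -1.0,  1.0],   # motor 4
    ])
    cmd = mixer @ np.concatenate(([c], torque))
    return np.clip(cmd, 0.0, max_thrust)
```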
We test another action space, direct motor thrusts, in the Swarm-RL setting, and find that policies using CTBR and thrust actions yield comparable performance, although the CTBR policy converges faster. This suggests that the choice of action space does not significantly influence the task difficulty. For better real-world deployment, we use CTBR actions in our setting.