Decentralized Control of Quadrotor Swarms for Collision Avoidance in Obstacle-rich Environments using End-to-end Deep Reinforcement Learning

Abstract

Reinforcement learning for quadrotor control promises many benefits -- generalization across tasks, real-time executability, agility, and maneuverability. Prior methods have showcased the ability to deploy learned controllers onto single quadrotors or quadrotor teams in simple, obstacle-free environments. However, adding obstacles increases the number of possible interactions exponentially and makes the long-horizon planning task more difficult, a setting where RL algorithms are known to struggle. In this work, we demonstrate the ability to learn neighbor-avoiding and obstacle-avoiding policies trained with end-to-end reinforcement learning and transferred zero-shot to real micro quadrotors. We provide our agents with a curriculum and a replay buffer of the hardest episodes to stabilize training in these difficult scenarios. We implement an attention mechanism to attend to the many obstacle and neighbor-drone interactions and, to the best of our knowledge, are the first to employ this mechanism in policies deployed on real, compute-constrained hardware.
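For concreteness, below is a minimal sketch of the kind of single-head attention encoder described above, pooling per-neighbor and per-obstacle tokens into one interaction feature for the ego drone. The layer sizes, names, and wiring are illustrative assumptions, not the released architecture.

```python
# Minimal sketch (PyTorch) of single-head attention over neighbor/obstacle tokens.
# All dimensions and module names are illustrative assumptions.
import torch
import torch.nn as nn

class NeighborObstacleAttention(nn.Module):
    def __init__(self, token_dim=16, embed_dim=32):
        super().__init__()
        self.q_proj = nn.Linear(token_dim, embed_dim)  # query from the ego drone's state
        self.k_proj = nn.Linear(token_dim, embed_dim)  # keys from neighbor/obstacle tokens
        self.v_proj = nn.Linear(token_dim, embed_dim)  # values from neighbor/obstacle tokens

    def forward(self, ego, tokens):
        # ego: (B, token_dim); tokens: (B, N, token_dim), one token per neighbor/obstacle
        q = self.q_proj(ego).unsqueeze(1)                # (B, 1, E)
        k, v = self.k_proj(tokens), self.v_proj(tokens)  # (B, N, E)
        scores = q @ k.transpose(1, 2) / k.shape[-1] ** 0.5
        attn = torch.softmax(scores, dim=-1)             # (B, 1, N) weights over interactions
        return (attn @ v).squeeze(1)                     # (B, E) pooled interaction feature
```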

Experimental Setup

The experimental setup is as follows: 8-drone swarms navigate rooms containing obstacles of 0.6 m diameter and height equal to the room height, with the obstacle density of the room set to 20%. The learned policy allows the drones to maintain dense formations and reach static targets while effectively avoiding collisions with both neighboring drones and obstacles.
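As a rough illustration, this setup corresponds to an environment configuration along the following lines; the parameter names are hypothetical, not the project's actual config keys.

```python
# Hypothetical environment configuration mirroring the setup described above.
env_config = {
    "num_drones": 8,             # 8-drone swarm
    "obstacle_diameter_m": 0.6,  # obstacles of 0.6 m diameter
    "obstacle_height": "room",   # obstacles span the full room height
    "obstacle_density": 0.20,    # 20% of the room occupied by obstacles
    "goal_type": "static",       # drones fly to static targets
}
```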

Baseline Experiments

The video below displays our method's performance in comparison to the state of the art.
The first is SBC, a control-based method built on safety barrier certificates that computes safe acceleration commands via real-time optimization.
The second is GLAS, a learning-based method that uses imitation learning to mimic a centralized planner and combines it with a safety module to keep the computed actions safe.
The physics of the agile nano-quadrotors is simulated at 1.0x timescale, so our method completes the task roughly 2x faster than GLAS and 4x faster than SBC.

Drone Scalability

With further training, our models can adapt to larger drone swarm sizes. In the video below, we display a swarm of 32 drones. Note that the number of neighbor drones seen by each agent (K=2) remains constant, which means our method can scale to larger swarms without increasing per-drone computational demands. A minimal sketch of this fixed-size neighbor selection is shown below.
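The constant per-drone cost follows from the fixed-size neighbor encoding: each agent only ever observes its K=2 nearest neighbors, regardless of swarm size. The function and variable names below are illustrative, not the project's code.

```python
# Sketch (NumPy) of fixed-size observation construction: each drone attends only to its
# K nearest neighbors, so per-drone input size stays constant as the swarm grows.
import numpy as np

def k_nearest_neighbor_obs(positions, ego_idx, k=2):
    """Return relative positions of the K nearest neighbors of drone `ego_idx`."""
    rel = positions - positions[ego_idx]    # (N, 3) positions relative to the ego drone
    dist = np.linalg.norm(rel, axis=1)
    dist[ego_idx] = np.inf                  # exclude the ego drone itself
    nearest = np.argsort(dist)[:k]          # indices of the K closest neighbors
    return rel[nearest]                     # (K, 3), independent of swarm size N

positions = np.random.uniform(-5, 5, size=(32, 3))  # e.g. a 32-drone swarm
obs = k_nearest_neighbor_obs(positions, ego_idx=0)  # shape (2, 3) regardless of N
```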

Obstacle Scalability

Similarly, the task can also be scaled in terms of the density of obstacles in the environment. Our model scales to 80% obstacle density without issue.

Generalization

Though trained on only two scenarios, our model generalizes to unseen scenarios such as goal swapping and pursuit-evasion.

Real World Experiments

These policies have been transferred zero-shot to the CrazyFlie 2.1. Deploying a single-headed attention model on a device with a 168 MHz CPU and 192 KB of RAM, the controller runs at ~1 kHz.
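A back-of-the-envelope check makes this budget plausible; the parameter count below is an assumed round number for a small single-headed attention policy, not the deployed network's exact size.

```python
# Illustrative budget check for running a small policy at ~1 kHz on a 168 MHz MCU
# with 192 KB of RAM. All numbers except the clock rate, RAM, and control rate
# are assumptions for the sake of the estimate.
CPU_HZ = 168e6
CONTROL_HZ = 1000
cycles_per_step = CPU_HZ / CONTROL_HZ      # ~168k cycles available per control step

params = 10_000                            # assumed parameter count of a small policy
weight_ram_kb = params * 4 / 1024          # float32 weights -> ~39 KB, well under 192 KB
flops_per_step = 2 * params                # ~2 FLOPs per weight for one forward pass

print(f"{cycles_per_step:.0f} cycles/step, "
      f"{weight_ram_kb:.0f} KB weights, "
      f"{flops_per_step} FLOPs/step")
```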