We study the coordination of a team of two pursuer UAVs tasked with capturing an evader UAV in an urban environment. Each agent has a limited field of view and must navigate around obstacles that constrain both accessibility and visibility. Such settings allow the evader to employ sophisticated strategies to deceive pursuers and evade interception, while the pursuers remain agnostic to the evader’s origin, destination, and policy. This leads to two key challenges: (i) intermittent observability of the evader necessitates active search by the pursuers, and (ii) the evader’s unknown strategy may exploit occlusions and deviate from shortest paths to remain unintercepted. To address these challenges, we generate a family of evader behaviors using level-K reasoning from behavioral game theory. The level-0 evader follows the shortest path to its destination; for each k >= 0 we (i) compute a pursuer policy that is a best response to the level-k evader and then (ii) compute the level-(k+1) evader as a best response to that pursuer. Generating each policy requires solving a partially observable Markov decision process (POMDP), and the resulting family exhibits a diverse set of evasive and deceptive trajectories. Using this family of evaders, we propose two complementary approaches for designing pursuer policies that adapt at test time. The Primitive Action Policy (PAP) is trained directly over the canonical action space against evaders sampled uniformly at random from the level-K set. The Level-K Switching Policy (LKSP) casts pursuit as a meta-decision problem in which the pursuer's action set is restricted to selecting among the pre-trained level-K pursuer policies at each timestep. LKSP trains substantially faster and achieves near-optimal capture performance, whereas PAP achieves optimal capture performance but generally demands substantially longer training time.
For robustness, pursuers in our environment must be tested against evaders whose behavior deviates from the shortest path.
Evaders may plan against the pursuers, who must overcome occlusion by gathering information about evader movements over long time horizons.
Training Pipeline for Best Response Pursuer Policies
Iterative Best Response Training
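The alternating construction described in the abstract can be sketched as a short loop. This is a minimal illustration and not the authors' implementation: the function `build_level_k_family`, the two solver callables, and the convention that `pursuers[k]` is the best response to `evaders[k]` are all assumptions made here. In practice each solver call would involve solving a POMDP; the string stubs below only make the nesting of best responses visible.

```python
from typing import Callable, List, Tuple, TypeVar

E = TypeVar("E")  # evader policy type (hypothetical placeholder)
P = TypeVar("P")  # pursuer policy type (hypothetical placeholder)


def build_level_k_family(
    level0_evader: E,
    best_response_pursuer: Callable[[E], P],
    best_response_evader: Callable[[P], E],
    max_level: int,
) -> Tuple[List[E], List[P]]:
    """Alternate best responses, starting from the level-0 evader.

    Returns evaders for levels 0..max_level and one pursuer per level
    0..max_level-1, where pursuers[k] responds to evaders[k]
    (an indexing convention assumed for this sketch).
    """
    if max_level < 0:
        raise ValueError("max_level must be non-negative")
    evaders: List[E] = [level0_evader]
    pursuers: List[P] = []
    for k in range(max_level):
        # Step (i): pursuer best response to the level-k evader.
        pursuer_k = best_response_pursuer(evaders[k])
        pursuers.append(pursuer_k)
        # Step (ii): level-(k+1) evader as a best response to that pursuer.
        evaders.append(best_response_evader(pursuer_k))
    return evaders, pursuers


# Stub solvers that only record which policy they responded to.
evaders, pursuers = build_level_k_family(
    level0_evader="E0",  # stands in for the shortest-path evader
    best_response_pursuer=lambda e: f"BR_P({e})",
    best_response_evader=lambda p: f"BR_E({p})",
    max_level=2,
)
print(evaders)
print(pursuers)
```

Passing the solvers as callables keeps the loop independent of how each best response is computed, so any POMDP solver or reinforcement learning routine could be substituted without changing the iteration itself.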