Adaptive Algorithms for Pursuit–Evasion Games with Sophisticated Evaders

icra_intro_new.mp4

Abstract

We study the coordination of a team of two pursuer UAVs tasked with capturing an evader UAV in an urban environment. Each agent has a limited field of view and must navigate around obstacles that constrain both accessibility and visibility. Such settings allow the evader to employ sophisticated strategies to deceive pursuers and evade interception, while the pursuers remain agnostic to the evader’s origin, destination, and policy. This leads to two key challenges: (i) intermittent observability of the evader necessitates active search by the pursuers, and (ii) the evader’s unknown strategy may exploit occlusions and deviate from shortest paths to remain unintercepted. To address these challenges, we generate a family of evader behaviors using level-K reasoning from behavioral game theory. The level-0 evader follows the shortest path to its destination; for each k ≥ 0 we (i) compute a pursuer policy that is a best response to the level-k evader and then (ii) compute the level-(k+1) evader as a best response to that pursuer. Each policy-generation step requires solving a partially observable Markov decision process (POMDP), and the procedure as a whole produces a diverse set of evasive and deceptive trajectories. Using this family of evaders, we propose two complementary approaches for designing pursuer policies that adapt at test time. The Primitive Action Policy (PAP) is learned directly over the canonical action space by training against evaders sampled uniformly at random from the level-K set. The Level-K Switching Policy (LKSP) casts pursuit as a meta-decision problem in which the pursuer's action set is restricted to selecting among the pre-trained level-K pursuer policies at each timestep. LKSP achieves substantially faster training and near-optimal capture performance. In contrast, PAP achieves optimal capture performance but generally demands substantially longer training time.

Environment

Training environment in a 2D grid world

Visualization in our 3D environment in AirSim

Diverse obstacles within the environment that restrict the movement of specific drones

Challenges

Challenge 1 - Modeling Evader Behavior

For robustness, pursuers must be tested against evaders whose behavior deviates from simply following the shortest path.

Challenge 2 - Responding to Behavior of Evader in Real-Time

Pursuers operate with no initial knowledge of the evader's policy. Pursuer policies must develop behaviors that allow them to adapt to unknown evader strategies in real time.

Challenge 3 - Unknown Evader Start and Goal Positions

The evader's start and goal positions are also unknown to the pursuers, which further affects the evader's strategy.

Challenge 4 - Long-Term Planning under Partial Observability

Evaders may plan against the pursuers, which must overcome occlusion by gathering information about evader movements over long time horizons.

Methodology - Addressing the Challenge Problems

Solution 1 - K-Levels of Diversely Intelligent Evaders

We developed a diverse set of evader policies by iteratively training evaders against the emergent best-response pursuer policies.
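The iterative construction can be outlined in a few lines. This is only a sketch of the alternation described in the abstract, not the training code: `build_level_k`, `best_response_pursuer`, and `best_response_evader` are hypothetical names, and the two best-response callables stand in for whatever POMDP solver or training routine is actually used.

```python
from typing import Any, Callable, List, Tuple

Policy = Any  # assumption: a policy is an opaque object produced by some trainer


def build_level_k(
    level0_evader: Policy,
    best_response_pursuer: Callable[[Policy], Policy],
    best_response_evader: Callable[[Policy], Policy],
    num_levels: int,
) -> Tuple[List[Policy], List[Policy]]:
    """Return (evaders, pursuers) where pursuers[k] best-responds to evaders[k]
    and evaders[k + 1] best-responds to pursuers[k]."""
    if num_levels < 1:
        raise ValueError("num_levels must be at least 1")
    evaders: List[Policy] = [level0_evader]
    pursuers: List[Policy] = []
    for k in range(num_levels):
        # Step (i): pursuer best response to the level-k evader.
        pursuers.append(best_response_pursuer(evaders[k]))
        # Step (ii): level-(k+1) evader as a best response to that pursuer.
        if k + 1 < num_levels:
            evaders.append(best_response_evader(pursuers[k]))
    return evaders, pursuers


# Toy stand-ins: policies are labels, "training" just derives the next label.
evaders, pursuers = build_level_k(
    "E0",
    best_response_pursuer=lambda e: "P" + e[1:],
    best_response_evader=lambda p: "E" + str(int(p[1:]) + 1),
    num_levels=3,
)
```

In the toy run the labels only make the dependency chain visible; each real call would involve solving a POMDP.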

Training Pipeline for Best Response Pursuer Policies

Iterative Best Response Training

Solution 2 - Adaptive Pursuer Policies vs. Classification

Level-K Switching Policy (LKSP)

LKSP solves a partially observable MDP by taking actions from a set derived from the base level-k pursuer policies.

Primitive Action Policy (PAP)

PAP is a policy trained against evaders randomized over the K levels, solving the same partially observable MDP that is used when training against a single, fixed evader level.

Canonical LLP Action Space

Canonical HLP Action Space

PAP and LKSP differ in how their action spaces are limited. PAP acts over an unconstrained action space, whereas LKSP's action space is limited to the actions of the level-K pursuer policies.
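The difference in action spaces can be illustrated with a small sketch. Everything here is an assumption made for illustration: primitive actions are integers, observations are tuples, and `select_level` stands in for the learned switching policy.

```python
from typing import Callable, Sequence, Set, Tuple

Observation = tuple  # assumption: an observation is a plain tuple
Action = int         # assumption: primitive actions are integer move indices
BasePolicy = Callable[[Observation], Action]


def lksp_action_set(obs: Observation, base_policies: Sequence[BasePolicy]) -> Set[Action]:
    """Primitive actions reachable by LKSP at this step: only those that some
    pre-trained level-k pursuer policy would take."""
    return {policy(obs) for policy in base_policies}


def lksp_step(
    obs: Observation,
    base_policies: Sequence[BasePolicy],
    select_level: Callable[[Observation], int],
) -> Tuple[int, Action]:
    """The meta-policy picks a level; the chosen base policy picks the action."""
    level = select_level(obs)
    if not 0 <= level < len(base_policies):
        raise ValueError(f"level {level} is outside the trained range")
    return level, base_policies[level](obs)


def pap_step(obs: Observation, primitive_policy: BasePolicy) -> Action:
    """PAP outputs a primitive action directly, with no restriction to base policies."""
    return primitive_policy(obs)


# Toy example: three base policies, two of which agree on the action.
base = [lambda o: 0, lambda o: 1, lambda o: 1]
reachable = lksp_action_set((2, 3), base)                 # {0, 1}
level, action = lksp_step((2, 3), base, lambda o: 2)      # (2, 1)
```

With this framing, the switching policy chooses among as many options as there are base policies, whatever the size of the primitive action space.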

Methodology - Adaptive Pursuer Policies

LKSP - The Level-K Switching Policy

lskp_new.mp4

LKSP tends to favor the highest-level trained pursuer policy, level 4.

PAP - Primitive Action Policy

pap_new.mp4

To quantify how PAP's behavior differs from LKSP's, we measure its similarity to the base pursuer policies by taking, at each timestep, the maximum L2 norm of the difference between the PAP action and the actions of the level-K pursuer policies.
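A minimal sketch of this per-timestep metric follows. It assumes each action can be written as a real-valued vector (for example a 2D displacement); the function names are hypothetical, not taken from the project code.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def max_l2_to_base(pap_action: Vector, base_actions: Sequence[Vector]) -> float:
    """Largest Euclidean distance from the PAP action to any base-policy action."""
    if not base_actions:
        raise ValueError("at least one base-policy action is required")
    return max(math.dist(pap_action, a) for a in base_actions)


def similarity_trace(
    pap_actions: Sequence[Vector],
    base_actions_per_step: Sequence[Sequence[Vector]],
) -> List[float]:
    """Apply the metric at every timestep of a rollout."""
    return [
        max_l2_to_base(pap, base)
        for pap, base in zip(pap_actions, base_actions_per_step)
    ]


# Toy rollout with two timesteps and two base policies.
trace = similarity_trace(
    pap_actions=[(0.0, 0.0), (1.0, 0.0)],
    base_actions_per_step=[[(3.0, 4.0), (1.0, 0.0)], [(1.0, 0.0), (1.0, 0.0)]],
)
```

A value of zero at a timestep means the PAP action coincides with the action of every base policy at that step.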

Results

Pursuer vs Evader - First, we analyze the quality of the set of pursuer teams against the quality of the set of evaders

For every level k evader policy, the level k pursuer policy is the best response

Trained level-k policies respond differently from one another given the same game history

Pursuer Policy Comparison - Next, we analyze how well the adaptive pursuers, PAP and LKSP, adapt

While PAP took substantially longer to train, both LKSP and PAP achieved the same win rate against evader policies drawn uniformly at random from levels 1...K
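One way such a win rate could be estimated is sketched below. The `play_episode` callback is a hypothetical stand-in for a full pursuit rollout that returns True on capture; the episode count and seed are illustrative.

```python
import random
from typing import Callable, Dict, Sequence, Tuple


def estimate_win_rates(
    play_episode: Callable[[int], bool],
    evader_levels: Sequence[int],
    num_episodes: int,
    seed: int = 0,
) -> Tuple[float, Dict[int, float]]:
    """Sample an evader level uniformly at random per episode and record wins.

    Returns the overall win rate and the win rate for each level that was sampled.
    """
    if num_episodes < 1 or not evader_levels:
        raise ValueError("need at least one episode and one evader level")
    rng = random.Random(seed)
    levels = list(evader_levels)
    wins = {k: 0 for k in levels}
    games = {k: 0 for k in levels}
    for _ in range(num_episodes):
        k = rng.choice(levels)  # uniform over the evader levels
        games[k] += 1
        if play_episode(k):
            wins[k] += 1
    overall = sum(wins.values()) / num_episodes
    per_level = {k: wins[k] / games[k] for k in levels if games[k] > 0}
    return overall, per_level


# Toy pursuer that always captures: every rate is 1.0.
overall, per_level = estimate_win_rates(lambda k: True, [1, 2, 3, 4], num_episodes=200)
```

Keeping the per-level breakdown alongside the overall figure allows two pursuer policies to be compared both in aggregate and level by level.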

Comparing the win rates of the adaptive policies PAP and LKSP, PAP outperforms LKSP against every evader level except level 0

Best Response - Last, we show how the best response relates to trajectory step and average path length

Across timesteps in a rollout, the policy selected by LKSP against evader level-k is not necessarily the best response

We gauge the sophistication of each evader level by the ratio of its generated path length to the shortest possible path length. For every level above the trivial level-0 case, the evader's path length increased, indicating the emergence of deceptive and evasive strategies.
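The path-length ratio can be computed on a grid world roughly as follows. This sketch assumes a 4-connected occupancy grid (1 marks an obstacle) and measures path length in moves; both are illustrative choices rather than details confirmed by the project.

```python
from collections import deque
from typing import Optional, Sequence, Tuple

Cell = Tuple[int, int]
Grid = Sequence[Sequence[int]]  # assumption: 0 = free cell, 1 = obstacle


def shortest_path_length(grid: Grid, start: Cell, goal: Cell) -> Optional[int]:
    """Breadth-first search; returns the number of moves, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None


def path_length_ratio(grid: Grid, trajectory: Sequence[Cell]) -> float:
    """Moves actually taken divided by the shortest possible number of moves."""
    shortest = shortest_path_length(grid, trajectory[0], trajectory[-1])
    if not shortest:
        raise ValueError("goal is unreachable or coincides with the start")
    return (len(trajectory) - 1) / shortest


grid = [[0, 0, 0],
        [0, 1, 0],
        [0, 0, 0]]
direct = [(0, 0), (0, 1), (0, 2)]
detour = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2)]
ratios = (path_length_ratio(grid, direct), path_length_ratio(grid, detour))
```

A ratio of 1 corresponds to a shortest-path trajectory, and larger values indicate longer detours.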
