Learning to Play Pursuit-Evasion with Visibility Constraints
Abstract
We study the problem of pursuit-evasion for a single pursuer and an evader in polygonal environments where the players have visibility constraints. The pursuer is tasked with catching the evader as quickly as possible while the evader tries to avoid being captured. We formalize this problem as a zero-sum game where the players have private observations and conflicting objectives.
A central challenge in this game stems from limited visibility. When a player, say the pursuer, does not see the evader, it must reason about all possible locations of the evader. As a result, the size of the state space grows exponentially with the size of the arena. To overcome the challenges associated with such large state spaces, we introduce a new learning-based method that compresses the game state and uses the compressed representation to plan actions for the players. The results indicate that our method outperforms existing reinforcement learning methods and performs competitively against the current state-of-the-art randomized strategy in complex environments.
Pursuit-Evasion with Belief States
PEBS overview: At each iteration of the game, each agent receives a partial observation o_t and an image I_t. The agent employs a belief network F that encodes I_t into a belief state b_t. The belief state, together with o_t, is fed to the policy network π_θ to compute the action a_t. To supervise the belief states, we decode b_t with D into a set of 2D points X, which is then rendered with R to generate an estimated image of the contaminated region.
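The pipeline above can be sketched end-to-end in a few lines. This is a minimal illustration, not the paper's implementation: the dimensions, the linear layers standing in for F, π_θ, and D, and the simple rasterizer standing in for R are all assumptions; only the data flow (image → belief → action, belief → points → rendered estimate) follows the caption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- the paper does not specify them.
IMG_SHAPE = (64, 64, 3)  # 3-channel image I_t
OBS_DIM = 8              # partial observation o_t
BELIEF_DIM = 16          # compressed belief state b_t
ACT_DIM = 2              # action a_t
N_POINTS = 32            # decoded 2D points approximating the contaminated set

# Random weights stand in for learned parameters.
W_F = rng.normal(scale=0.01, size=(BELIEF_DIM, int(np.prod(IMG_SHAPE))))
W_pi = rng.normal(scale=0.1, size=(ACT_DIM, BELIEF_DIM + OBS_DIM))
W_D = rng.normal(scale=0.1, size=(N_POINTS * 2, BELIEF_DIM))

def belief_network(I_t):
    """F: encode the image I_t into a belief state b_t."""
    return np.tanh(W_F @ I_t.ravel())

def policy(b_t, o_t):
    """pi_theta: map (b_t, o_t) to an action a_t."""
    return np.tanh(W_pi @ np.concatenate([b_t, o_t]))

def decoder(b_t):
    """D: map b_t to a set of 2D points X that supervise the belief."""
    return np.tanh(W_D @ b_t).reshape(N_POINTS, 2)

def render(X, grid=16):
    """R: rasterize points in [-1, 1]^2 into a binary estimate
    of the contaminated region."""
    img = np.zeros((grid, grid))
    ij = np.clip(((X + 1) / 2 * grid).astype(int), 0, grid - 1)
    img[ij[:, 1], ij[:, 0]] = 1.0
    return img

# One iteration of the loop described in the caption.
I_t = rng.random(IMG_SHAPE)
o_t = rng.random(OBS_DIM)
b_t = belief_network(I_t)
a_t = policy(b_t, o_t)
est = render(decoder(b_t))
print(a_t.shape, est.shape)  # (2,) (16, 16)
```

In training, the rendered estimate would be compared against the true contaminated-region image to supervise b_t, while the policy is trained on the game's reward.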
Snapshots of a sample game at various time instants, along with the corresponding images (bottom row). After clearing the contaminated regions and locating the evader (blue cross), the pursuer (colored circle) captures its opponent.
Learned pursuer vs. rash evader in polygon environments
Results on indoor maps
Failure modes of the method
Appendix
A1. Baseline descriptions
For the players, we consider several options for the observation input.
Lidar: The first is a two-channel lidar-like array: the first channel stores the depth to the nearest object (either a wall or the opponent), and the second is a semantic segmentation, with 0 for walls and 1 for the opponent. We use 32 rays for the input observations and, when the opponent is visible, set the semantic label of the ray closest to it to 1.
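The two-channel array can be computed by casting rays from the agent. The sketch below is a simplified, assumed version: it uses an empty axis-aligned square room rather than the paper's polygonal arenas (so the opponent is always visible), but it reproduces the described format of 32 rays, a depth channel, and a semantic channel where only the ray closest to the opponent is labeled 1.

```python
import numpy as np

def lidar_observation(agent, opponent, n_rays=32, size=10.0):
    """Two-channel lidar-like observation in an empty square room
    [0, size]^2. Channel 0: depth to the nearest object; channel 1:
    semantic label, 1 only on the ray closest to the opponent."""
    agent = np.asarray(agent, dtype=float)
    angles = np.linspace(0, 2 * np.pi, n_rays, endpoint=False)
    depth = np.empty(n_rays)
    sem = np.zeros(n_rays)
    for i, a in enumerate(angles):
        d = np.array([np.cos(a), np.sin(a)])
        # Positive distances to the four axis-aligned walls along the ray.
        ts = []
        for axis in (0, 1):
            if abs(d[axis]) > 1e-12:
                for bound in (0.0, size):
                    t = (bound - agent[axis]) / d[axis]
                    if t > 0:
                        ts.append(t)
        depth[i] = min(ts)
    # Mark the ray whose direction is closest to the opponent's bearing.
    rel = np.asarray(opponent, dtype=float) - agent
    bearing = np.arctan2(rel[1], rel[0]) % (2 * np.pi)
    diff = np.abs((angles - bearing + np.pi) % (2 * np.pi) - np.pi)
    j = int(np.argmin(diff))
    dist_opp = float(np.linalg.norm(rel))
    if dist_opp < depth[j]:
        depth[j] = dist_opp  # the opponent occludes the wall on this ray
        sem[j] = 1.0
    return np.stack([depth, sem])  # shape (2, n_rays)

obs = lidar_observation(agent=(5.0, 5.0), opponent=(8.0, 5.0))
print(obs.shape)     # (2, 32)
print(obs[1].sum())  # 1.0 -- exactly one ray labeled as the opponent
print(obs[0][0])     # 3.0 -- depth of the ray hitting the opponent
```

In a polygonal environment, the wall-distance loop would be replaced by ray-segment intersection against the polygon edges, and the opponent would only be labeled when no wall lies between the two players.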
Sampling: The second option is a sampling-based approach that computes a belief state by sampling k particles from the contaminated area. In addition, whenever the opponent is visible, its position is stored in a queue of last-seen positions. The queue is initialized with particles drawn from the unseen area. We use k = 5 and store the last 5 coordinates.
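A minimal sketch of this baseline, assuming the contaminated area is given as a list of grid cells (the class name, grid representation, and resampling scheme are illustrative, not from the paper):

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)

class SamplingBelief:
    """Sampling-based belief: k particles drawn from the contaminated
    area plus a bounded queue of the opponent's last-seen positions."""

    def __init__(self, k=5, queue_len=5):
        self.k = k
        self.last_seen = deque(maxlen=queue_len)

    def reset(self, contaminated_cells):
        # Initialize the queue with particles from the unseen area.
        idx = rng.choice(len(contaminated_cells), size=self.last_seen.maxlen)
        for i in idx:
            self.last_seen.append(tuple(contaminated_cells[i]))

    def update(self, contaminated_cells, opponent_pos=None):
        # Store the opponent's position whenever it is visible.
        if opponent_pos is not None:
            self.last_seen.append(tuple(opponent_pos))
        # Resample k particles from the current contaminated area.
        idx = rng.choice(len(contaminated_cells), size=self.k)
        particles = [tuple(contaminated_cells[i]) for i in idx]
        return particles + list(self.last_seen)

# Toy 10x10 grid where every cell is initially contaminated.
cells = [(x, y) for x in range(10) for y in range(10)]
belief = SamplingBelief(k=5, queue_len=5)
belief.reset(cells)
state = belief.update(cells, opponent_pos=(3.0, 7.0))
print(len(state))  # 10 = k particles + 5 queued positions
```

The fixed-length output (k + queue_len coordinates) gives the policy network a constant-size input regardless of how large the contaminated region is.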
Image: Finally, we can use the three-channel image I_t directly to compute actions. In this case, the policy network π_θ is a CNN. In our experiments, we also report results for a CNN+RNN policy that uses a long short-term memory (LSTM) block.