Learning to Play Pursuit-Evasion with Visibility Constraints

Abstract

We study the problem of pursuit-evasion for a single pursuer and an evader in polygonal environments where both players have visibility constraints. The pursuer is tasked with catching the evader as quickly as possible, while the evader tries to avoid capture. We formalize this problem as a zero-sum game in which the players have private observations and conflicting objectives.

One of the challenging aspects of this game is limited visibility. When a player, for example the pursuer, does not see the evader, it must reason about all possible locations of its opponent. This causes the state space to grow exponentially with the size of the arena. To overcome the challenges associated with large state spaces, we introduce a new learning-based method that compresses the game state and uses it to plan actions for the players. The results indicate that our method outperforms existing reinforcement learning methods and performs competitively against the current state-of-the-art randomized strategy in complex environments.

Pursuit-Evasion with Belief States

PEBS overview: At each step of the game, each agent receives a partial observation o_t and an image I_t. The agent employs a belief network F that encodes I_t into a belief state b_t. The belief state is used together with o_t as input to the policy network π_θ to compute the action a_t. To supervise the belief states, we decode b_t with D into a set of 2D points X, which is then rendered with R to generate an estimated image of the contaminated region.
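For concreteness, here is a minimal PyTorch sketch of this pipeline. The module names mirror the caption (F, D, π_θ), but all layer sizes, the belief dimension, the number of decoded points, and the observation and action dimensions are illustrative assumptions; the renderer R and the supervision loss are omitted.

```python
# Minimal sketch of the PEBS pipeline; all sizes are illustrative assumptions.
import torch
import torch.nn as nn

class BeliefNetwork(nn.Module):
    """F: encodes the image I_t into a belief state b_t."""
    def __init__(self, belief_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(belief_dim),
        )

    def forward(self, image):
        return self.encoder(image)

class PointDecoder(nn.Module):
    """D: decodes b_t into a set X of 2D points for belief supervision."""
    def __init__(self, belief_dim=64, num_points=32):
        super().__init__()
        self.num_points = num_points
        self.mlp = nn.Sequential(
            nn.Linear(belief_dim, 128), nn.ReLU(),
            nn.Linear(128, 2 * num_points),
        )

    def forward(self, belief):
        return self.mlp(belief).view(-1, self.num_points, 2)

class Policy(nn.Module):
    """pi_theta: maps (o_t, b_t) to action logits."""
    def __init__(self, obs_dim, belief_dim=64, num_actions=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(obs_dim + belief_dim, 128), nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, obs, belief):
        return self.mlp(torch.cat([obs, belief], dim=-1))

# One step of the loop: encode the image, act, decode for supervision.
F_net, D_net, pi = BeliefNetwork(), PointDecoder(), Policy(obs_dim=4)
I_t = torch.randn(1, 3, 64, 64)   # rendered game image
o_t = torch.randn(1, 4)           # agent's partial observation
b_t = F_net(I_t)
a_t = pi(o_t, b_t)                # action logits
X = D_net(b_t)                    # points rendered by R (omitted) into the
                                  # estimated contaminated-region image
```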

Snapshots of a sample game at various time instants, along with the corresponding images (bottom row). After clearing the contaminated regions and locating the evader (blue cross), the pursuer (colored circle) captures its opponent.

Learned pursuer vs. rash evader in polygonal environments

vid_0088.mp4
vid_0047.mp4
vid_0048.mp4
vid_0001.mp4
vid_0010.mp4
vid_0029.mp4
vid_0072.mp4
vid_0035.mp4
vid_0040.mp4

Results on indoor maps

vid_0067.mp4
vid_0077.mp4
vid_0063.mp4

Failure modes of the method

vid_0016.mp4
vid_0027.mp4
vid_0075.mp4

Appendix

A1. Baseline descriptions

We consider a few options for the players' observations.

Lidar: The first is a two-channel lidar-like array: the first channel holds the depth to the nearest object along each ray (either a wall or the opponent), and the second is a semantic label with 0 for walls and 1 for the opponent. We use 32 rays for the input observations, and set the semantic label of the ray closest to the opponent to 1 when the opponent is visible.
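For illustration, a minimal sketch of this construction is given below, assuming the environment exposes a `cast_ray` helper that returns the distance to the nearest wall along a direction; that interface and the angular bookkeeping are our assumptions, not the paper's exact implementation.

```python
# Sketch of the two-channel lidar-like observation; `cast_ray` is a
# hypothetical environment helper (positions are 2D numpy arrays).
import numpy as np

NUM_RAYS = 32

def lidar_observation(agent_pos, opponent_pos, opponent_visible, cast_ray):
    obs = np.zeros((2, NUM_RAYS), dtype=np.float32)
    angles = np.linspace(0.0, 2.0 * np.pi, NUM_RAYS, endpoint=False)
    for i, theta in enumerate(angles):
        direction = np.array([np.cos(theta), np.sin(theta)])
        obs[0, i] = cast_ray(agent_pos, direction)  # depth to nearest wall
    if opponent_visible:
        delta = opponent_pos - agent_pos
        bearing = np.arctan2(delta[1], delta[0]) % (2.0 * np.pi)
        # Angular difference with wraparound; label the closest ray as 1.
        diff = (angles - bearing + np.pi) % (2.0 * np.pi) - np.pi
        nearest = int(np.argmin(np.abs(diff)))
        obs[0, nearest] = min(obs[0, nearest], float(np.linalg.norm(delta)))
        obs[1, nearest] = 1.0
    return obs
```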

Sampling: The second option is a sampling-based approach that computes a belief state by sampling k particles from the contaminated area. In addition, it stores the last seen positions of the opponent in a queue whenever the opponent is visible. The queue is initialized with particles from the unseen area. We use k = 5 and store the last 5 coordinates.
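A minimal sketch of this baseline, assuming a hypothetical `sample_contaminated` function that draws one point uniformly from the contaminated (unseen) area:

```python
# Sketch of the sampling-based belief; `sample_contaminated` is a
# hypothetical helper returning one (x, y) point from the unseen area.
from collections import deque
import numpy as np

K = 5          # particles sampled per step
QUEUE_LEN = 5  # last-seen opponent positions kept

class SamplingBelief:
    def __init__(self, sample_contaminated):
        self.sample_contaminated = sample_contaminated
        # The queue is initialized with particles from the unseen area.
        self.last_seen = deque(
            (sample_contaminated() for _ in range(QUEUE_LEN)),
            maxlen=QUEUE_LEN,
        )

    def update(self, opponent_pos=None):
        """Return the belief: k fresh particles plus the last-seen queue."""
        if opponent_pos is not None:  # opponent is currently visible
            self.last_seen.append(opponent_pos)
        particles = [self.sample_contaminated() for _ in range(K)]
        return np.asarray(particles + list(self.last_seen), dtype=np.float32)
```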

Image: Finally, we can use the 3-channel image I_t to compute actions directly. In this case, the policy network π_θ is a CNN. In our experiments, we also report results from a CNN+RNN policy that uses a long short-term memory (LSTM) block.
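Below is a hedged sketch of both image-based policies; the layer sizes, sequence handling, and action-space size are illustrative assumptions rather than the exact networks used in the experiments.

```python
# Illustrative CNN and CNN+LSTM policies over the 3-channel image I_t.
import torch
import torch.nn as nn

class CNNPolicy(nn.Module):
    def __init__(self, num_actions=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.LazyLinear(256), nn.ReLU(),
            nn.Linear(256, num_actions),
        )

    def forward(self, image):  # image: (B, 3, H, W)
        return self.head(self.features(image))

class CNNRNNPolicy(nn.Module):
    """CNN features passed through an LSTM to integrate past frames."""
    def __init__(self, num_actions=8, hidden=128):
        super().__init__()
        self.features = CNNPolicy().features
        self.proj = nn.LazyLinear(hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_actions)

    def forward(self, images, state=None):  # images: (B, T, 3, H, W)
        b, t = images.shape[:2]
        x = self.features(images.flatten(0, 1))  # (B*T, feat)
        x = self.proj(x).view(b, t, -1)          # (B, T, hidden)
        out, state = self.lstm(x, state)
        return self.head(out), state             # logits: (B, T, actions)
```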