Continuously Discovering Novel Strategies via Reward-Switching Policy Optimization

Abstract

We present a novel paradigm, Reward-Switching Policy Optimization (RSPO), for discovering diverse strategies in complex RL environments by iteratively finding novel policies that are sufficiently different from existing ones. To encourage the learning policy to consistently converge towards a previously undiscovered local optimum, RSPO switches between extrinsic and intrinsic rewards via a trajectory-based novelty measurement during optimization. For trajectories that are sufficiently distinct from those of existing policies, RSPO performs standard policy optimization with extrinsic rewards, while for trajectories with high likelihood under existing policies, it uses an intrinsic diversity reward instead. Experiments show that RSPO discovers a wide spectrum of strategies in a variety of domains, ranging from single-agent particle-world tasks and MuJoCo-based continuous control to multi-agent stag-hunt games and StarCraft Multi-Agent Challenge (SMAC) tasks.
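The reward-switching rule can be summarized with a short sketch. The Python snippet below is a minimal illustration of the idea only, assuming each previously discovered policy exposes a `log_prob(state, action)` method and that trajectory novelty is measured as the negative average log-likelihood under the closest existing policy; the function name, threshold, and intrinsic bonus are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def rspo_trajectory_rewards(trajectory, prior_policies, novelty_threshold,
                            intrinsic_coef=1.0):
    """Sketch of per-trajectory reward switching.

    trajectory: list of (state, action, extrinsic_reward) tuples.
    prior_policies: previously discovered policies with a log_prob(s, a) method
    (a hypothetical interface used here for illustration).
    """
    # Trajectory-based novelty: how unlikely the whole trajectory is under the
    # most similar previously discovered policy (higher = more novel).
    novelty = min(
        -np.mean([pi.log_prob(s, a) for s, a, _ in trajectory])
        for pi in prior_policies
    ) if prior_policies else np.inf

    rewards = []
    for s, a, r_ext in trajectory:
        if novelty >= novelty_threshold:
            # Sufficiently distinct trajectory: keep the task (extrinsic) reward.
            rewards.append(r_ext)
        else:
            # Trajectory too similar to existing policies: switch to an
            # intrinsic diversity bonus that pushes away from prior policies.
            r_int = min(-pi.log_prob(s, a) for pi in prior_policies)
            rewards.append(intrinsic_coef * r_int)
    return rewards
```

In the iterative procedure described above, each outer iteration would train a new policy with these switched rewards and then add it to `prior_policies` before the next iteration begins.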

MuJoCo Tasks

Half-Cheetah

Iteration #1: Normal running

Iteration #2: Handstand running

Iteration #3: Flipped running

Iteration #4: Jumping and running

Iteration #5: Leaning forward

Hopper

Iteration #1: Normal hopping

Iteration #2: Normal hopping

Iteration #3: Charged hopping

Iteration #4: Small-step hopping

Iteration #5: Kneeling

Walker2d

Iteration #1

Iteration #2

Iteration #3

Iteration #4

Iteration #5

Humanoid

Iteration #1: Two-feet mincing

Iteration #2: Stretching across

Iteration #3: Striding

Iteration #4: Balancing by raising a hand

Iteration #5: High-knee lifting

SMAC

2c_vs_64zg

Iteration #1: Aggressive left-wave cleanup

Iteration #2: Cliff-sniping and smart blocking

Iteration #3: Corner

Iteration #4: Fire attractor and distant sniper

Iteration #5: Aggressive right-wave cleanup

Iteration #6: Cliff walk

2m_vs_1z

Iteration #1: Swinging

Iteration #2: Parallel hit-and-run

Iteration #3: One-sided vertical swinging

Iteration #4: Alternative distraction