Continuously Discovering Novel Strategies via Reward-Switching Policy Optimization
Abstract
We present a novel paradigm, Reward-Switching Policy Optimization (RSPO), for discovering diverse strategies in complex RL environments by iteratively finding novel policies that are sufficiently different from existing ones. To encourage the learning policy to consistently converge towards a previously undiscovered local optimum, RSPO switches between extrinsic and intrinsic rewards via a trajectory-based novelty measure during optimization. For sufficiently distinct trajectories, RSPO performs standard policy optimization with extrinsic rewards, while for trajectories with high likelihood under existing policies, RSPO uses an intrinsic diversity reward instead. Experiments show that RSPO discovers a wide spectrum of strategies across a variety of domains, ranging from single-agent particle-world tasks and MuJoCo-based continuous control to multi-agent stag-hunt games and StarCraft challenges.
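The switching rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the average negative log-likelihood as the novelty score, and the threshold are all assumptions made for the example.

```python
import numpy as np

def switched_rewards(extrinsic, intrinsic, ref_log_probs, threshold):
    """Hypothetical sketch of RSPO-style reward switching.

    extrinsic / intrinsic: per-step reward lists for one trajectory.
    ref_log_probs: log-likelihoods of the trajectory's actions under
        existing (reference) policies.
    threshold: novelty cutoff (an illustrative hyperparameter).
    """
    # Trajectory-level novelty: low likelihood under existing policies
    # means high novelty (here, mean negative log-likelihood).
    novelty = -np.mean(ref_log_probs)
    if novelty >= threshold:
        # Sufficiently distinct trajectory: optimize the task reward.
        return extrinsic
    # Too similar to an existing policy: optimize the diversity reward.
    return intrinsic
```

With a toy trajectory, low likelihood under the reference policies (very negative log-probabilities) keeps the extrinsic reward, while high likelihood swaps in the intrinsic diversity reward.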
MuJoCo Tasks
Half-Cheetah
Iteration #1: Normal running
Iteration #2: Hand-standing running
Iteration #3: Flipped running
Iteration #4: Jumping and running
Iteration #5: Leaning forward
Hopper
Iteration #1: Normal hopping
Iteration #2: Normal hopping
Iteration #3: Charged hopping
Iteration #4: Small-step hopping
Iteration #5: Kneeling
Walker2d
Iteration #1
Iteration #2
Iteration #3
Iteration #4
Iteration #5
Humanoid
Iteration #1: Mincing on two feet
Iteration #2: Stretching across
Iteration #3: Striding
Iteration #4: Balancing by raising a hand
Iteration #5: High knee lifting
SMAC
2c_vs_64zg
Iteration #1: aggressive left-wave cleanup
Iteration #2: cliff-sniping and smart blocking
Iteration #3: corner
Iteration #4: fire attractor and distant sniper
Iteration #5: aggressive right-wave cleanup
Iteration #6: cliff walk
2m_vs_1z
Iteration #1: swinging
Iteration #2: parallel hit-and-run
Iteration #3: one-sided vertical swinging
Iteration #4: alternative distraction