We present a novel paradigm, Reward-Switching Policy Optimization (RSPO), for discovering diverse strategies in complex RL environments by iteratively finding novel policies that are sufficiently different from existing ones. To encourage the learning policy to consistently converge towards a previously undiscovered local optimum, RSPO switches between extrinsic and intrinsic rewards during optimization via a trajectory-based novelty measurement. For trajectories that are sufficiently distinct from existing policies, RSPO performs standard policy optimization on the extrinsic reward, while for trajectories with high likelihood under existing policies, it uses an intrinsic diversity reward instead. Experiments show that RSPO is able to discover a wide spectrum of strategies in a variety of domains, ranging from single-agent particle-world tasks and MuJoCo-based continuous control to multi-agent stag-hunt games and StarCraft challenges.
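As a rough illustration of the reward-switching rule described above, the following Python sketch selects a per-trajectory training signal based on how novel the trajectory is relative to previously discovered policies. The novelty measure (average negative log-likelihood under the reference policies), the log_prob interface, and the novelty_threshold value are illustrative assumptions, not the paper's exact formulation.

import numpy as np

def switched_rewards(trajectory, prev_policies, extrinsic_rewards,
                     novelty_threshold=1.0):
    # Sketch of per-trajectory reward switching; the novelty measure,
    # `log_prob` interface, and threshold are assumptions made for
    # illustration rather than the reference implementation.
    if not prev_policies:
        # No reference policies yet: plain policy optimization.
        return extrinsic_rewards
    # Trajectory-based novelty: negative log-likelihood of the sampled
    # state-action pairs under the closest previously discovered policy.
    novelty = min(
        -np.mean([pi.log_prob(s, a) for s, a in trajectory])
        for pi in prev_policies
    )
    if novelty >= novelty_threshold:
        # Sufficiently distinct trajectory: optimize the extrinsic reward.
        return extrinsic_rewards
    # Trajectory is too likely under an existing policy: substitute an
    # intrinsic diversity reward that penalizes actions the reference
    # policies would also take.
    return [-max(pi.log_prob(s, a) for pi in prev_policies)
            for s, a in trajectory]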
[Figure: strategies discovered by successive RSPO iterations. Iteration #1: normal running; Iteration #2: hand-standing running; Iteration #3: flipped running; Iteration #4: jump and running; Iteration #5: leaning forward.]
[Figure: strategies discovered by successive RSPO iterations. Iteration #1: normal hopping; Iteration #2: normal hopping; Iteration #3: charged hopping; Iteration #4: small-step hopping; Iteration #5: kneeling.]
[Figure: strategies discovered by successive RSPO iterations. Iteration #1: two-feet mincing; Iteration #2: stretching across; Iteration #3: striding; Iteration #4: balance by raising hand; Iteration #5: high knee lifting.]
[Figure: strategies discovered by successive RSPO iterations. Iteration #1: aggressive left-wave cleanup; Iteration #2: cliff-sniping and smart blocking; Iteration #3: corner; Iteration #4: fire attractor and distant sniper; Iteration #5: aggressive right-wave cleanup; Iteration #6: cliff walk.]
[Figure: strategies discovered by successive RSPO iterations. Iteration #1: swinging; Iteration #2: parallel hit-and-run; Iteration #3: one-sided vertical swinging; Iteration #4: alternative distraction.]