Our study examines two design choices for discovering diverse RL strategies: the diversity measure and the computation framework. First, to accurately capture behavioral differences, we propose to incorporate state-space distance information into the diversity measure. In addition, we show that although population-based training (PBT) is the precise problem formulation, iterative learning (ITR) can achieve comparable diversity scores with higher computational efficiency, leading to improved solution quality in practice. Based on our analysis, we further combine ITR with two tractable realizations of the state-distance-based diversity measure and develop a novel diversity-driven RL algorithm, State-based Intrinsic-reward Policy Optimization (SIPO), with provable convergence properties. SIPO consistently produces strategically diverse and human-interpretable policies that cannot be discovered by existing baselines.
Action-based measures do not faithfully reflect behavioral differences: policies that take different actions may still reach similar states and thus behave almost identically.
State occupancy measures do not quantify the degree of dissimilarity between states.
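To make the role of state-space distances concrete, here is a minimal, hypothetical sketch (not the exact measure used in SIPO) that scores the diversity of two policies by how far the states visited in one rollout lie from the states visited in the other; the function and variable names are purely illustrative.

```python
# Hypothetical sketch of a state-distance-based diversity score: compare
# policies by distances between the states they visit, not by their actions.
import numpy as np

def state_distance_diversity(states_a, states_b):
    """Average distance from each state in rollout A to its nearest state in rollout B.

    states_a, states_b: arrays of shape (T, state_dim) collected from two policies.
    """
    # Pairwise Euclidean distances between all states of the two rollouts.
    diffs = states_a[:, None, :] - states_b[None, :, :]   # (T_a, T_b, d)
    dists = np.linalg.norm(diffs, axis=-1)                 # (T_a, T_b)
    # For each state of policy A, how far is the closest state of policy B?
    return dists.min(axis=1).mean()

# Dummy rollouts: two policies visiting different regions of a 2-D state space.
rng = np.random.default_rng(0)
rollout_a = rng.normal(loc=0.0, size=(100, 2))
rollout_b = rng.normal(loc=3.0, size=(100, 2))
print(state_distance_diversity(rollout_a, rollout_b))  # large value => behaviorally distinct
```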
Empirically, PBT does NOT uniformly converge to different landmarks: the required computation is either too costly or the training becomes unstable.
By contrast, ITR excludes one discovered mode at each iteration, so that the policy in the next iteration can keep exploring until a novel mode is discovered.
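As a rough illustration of this iterative scheme, the skeleton below adds an intrinsic bonus for staying far (in state space) from the states visited by previously discovered policies. It is an assumption-laden sketch, not SIPO's actual training code: `collect_rollout` and `policy_update` are placeholders standing in for a real environment and a standard RL update.

```python
# Hypothetical skeleton of iterative learning (ITR) with a state-distance
# intrinsic bonus; placeholders are used where real env/RL code would go.
import numpy as np

rng = np.random.default_rng(0)

def collect_rollout(policy_id, horizon=50, state_dim=2):
    # Placeholder: stands in for running the current policy in the environment.
    return rng.normal(loc=float(policy_id), size=(horizon, state_dim))

def intrinsic_reward(state, archive, clip=1.0):
    # Reward being far (in state space) from states visited by earlier policies.
    if not archive:
        return 0.0
    nearest = min(np.linalg.norm(states - state, axis=-1).min() for states in archive)
    return float(min(nearest, clip))  # clipped so it cannot dominate the task reward

def policy_update(policy_id, states, rewards):
    # Placeholder for a standard RL update (e.g., PPO) on task + intrinsic reward.
    pass

archive = []                   # state buffers of previously discovered policies
for policy_id in range(3):     # discover 3 distinct strategies, one per iteration
    for epoch in range(10):
        states = collect_rollout(policy_id)
        bonuses = np.array([intrinsic_reward(s, archive) for s in states])
        task_rewards = np.zeros(len(states))           # placeholder task reward
        policy_update(policy_id, states, task_rewards + bonuses)
    archive.append(collect_rollout(policy_id))         # freeze this mode's states
```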
Our algorithm can efficiently discover a wide spectrum of strategies in humanoid locomotion and challenging multi-agent environments.
All rendering files for our GRF experiments are available here!