Discovering Diverse Multi-agent Strategic Behavior via Reward Randomization

We propose a simple, general, and effective technique, Reward Randomization, for discovering diverse strategic policies in complex multi-agent games. Combining reward randomization and policy gradient, we derive a new algorithm, Reward-Randomized Policy Gradient (RPG). RPG is able to discover multiple distinctive human-interpretable strategies in challenging temporal trust dilemmas, including grid-world games (Monster-Hunt and Escalation) and a real-world web game, where multiple equilibria exist but standard multi-agent policy gradient algorithms always converge to a fixed one with a sub-optimal payoff for every player, even when using state-of-the-art exploration techniques (including RND, DIAYN, and MAVEN). Furthermore, with the set of diverse strategies from RPG, we can (1) achieve higher payoffs by fine-tuning the best policy in the set; and (2) obtain an adaptive agent by using this set of strategies as its training opponents.


Most recent successes in games are based on decentralized multi-agent learning, where agents compete against each other and optimize their own rewards to gradually improve their strategies. In this framework, the Nash Equilibrium (NE), in which no player can benefit from altering its strategy unilaterally, provides a general solution concept and serves as a goal for policy learning. It has attracted increasing interest from AI researchers: many existing works study how to design practical multi-agent reinforcement learning (MARL) algorithms that provably converge to an NE in Markov games, particularly in the zero-sum setting.

Despite the empirical success of these algorithms, a fundamental question remains largely unstudied: even if an MARL algorithm converges to an NE, which equilibrium does it converge to? The existence of multiple NEs is extremely common in multi-agent games. Discovering as many NE strategies as possible is particularly important in practice, not only because different NEs can produce drastically different payoffs, but also because, when facing unknown players trained to play an NE strategy, we can gain an advantage by identifying which NE strategy the opponent is playing and choosing the most appropriate response. Unfortunately, in many games where multiple distinct NEs exist, the popular decentralized policy gradient algorithm (PG), which has led to great successes in numerous games including Dota 2 and StarCraft, always converges to a particular NE with non-optimal payoffs and fails to explore more diverse modes in the strategy space.

Consider an extremely simple example, the 2-by-2 matrix game Stag-Hunt, which has two pure-strategy NEs: a "risky" cooperative equilibrium with the highest payoff for both agents and a "safe" non-cooperative equilibrium with strictly lower payoffs. We show, from both theoretical and practical perspectives, that even in this simple matrix-form game, PG fails with high probability to discover the high-payoff "risky" NE. The intuition is that the neighborhood from which policies converge to the "risky" NE can be substantially small compared to the entire policy space, so an exponentially large number of exploration steps is needed to ensure PG discovers the desired mode. We propose a simple technique, Reward Randomization (RR), which can help PG discover the "risky" cooperation strategy in the Stag-Hunt game with theoretical guarantees. The core idea of RR is to directly perturb the reward structure of the multi-agent game of interest, which is typically low-dimensional. RR directly alters the landscape of the different strategy modes in the policy space, making it possible to easily discover novel behavior in the perturbed game. We call this new PG variant Reward-Randomized Policy Gradient (RPG).
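The matrix-game case can be sketched concretely. Below is a minimal illustration (the payoff values, learning rate, perturbation range, and all function names are our own assumptions, not the paper's algorithm verbatim): independent policy-gradient self-play is run on randomly perturbed payoff matrices, and each learned joint policy is then evaluated in the original game.

```python
import numpy as np

rng = np.random.default_rng(0)

# An illustrative symmetric Stag-Hunt payoff matrix (values are our choice):
# A[i, j] is a player's payoff for action i against an opponent playing j;
# action 0 = Stag (risky cooperation), action 1 = Hare (safe).
A = np.array([[4.0, -1.0],
              [3.0,  1.0]])

def pg_selfplay(payoff, steps=2000, lr=0.5):
    """Independent policy gradient with softmax policies on a symmetric 2x2 game."""
    theta = rng.normal(size=(2, 2))                 # one logit row per player
    for _ in range(steps):
        p = np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True)
        for i in range(2):
            g = payoff @ p[1 - i]                   # expected payoff per own action
            theta[i] += lr * p[i] * (g - p[i] @ g)  # softmax policy gradient
    return np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True)

def reward_randomized_pg(payoff, n_samples=5):
    """Reward Randomization: train PG in games with perturbed payoffs, then
    evaluate every learned joint policy in the ORIGINAL game and keep the best."""
    best_p, best_val = None, -np.inf
    for _ in range(n_samples):
        perturbed = rng.uniform(payoff.min(), payoff.max(), size=(2, 2))
        p = pg_selfplay(perturbed)                  # train in the perturbed game
        val = p[0] @ payoff @ p[1]                  # evaluate in the original game
        if val > best_val:
            best_p, best_val = p, val
    return best_p, best_val
```

With enough samples, some perturbed game makes the "risky" action dominant, so self-play converges to it there, and that policy scores the high payoff when evaluated in the original Stag-Hunt; plain `pg_selfplay(A)` instead converges to whichever NE basin its random initialization happens to fall into.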

To further illustrate the effectiveness of RPG, we introduce three Markov games: two gridworld games and a real-world online game. All these games have multiple NEs, including both "risky" cooperation strategies and "safe" non-cooperative strategies. We empirically show that even with state-of-the-art exploration techniques, PG fails to discover the "risky" cooperation strategies. In contrast, RPG discovers a surprisingly diverse set of human-interpretable strategies in all these games, including some non-trivial emergent behavior. Importantly, among this set are policies achieving much higher payoffs for each player than those found by PG. This "diversity-seeking" property of RPG also makes it feasible to build adaptive policies: by re-training an RL agent against the diverse opponents discovered by RPG, the agent learns to dynamically alter its strategy between different modes, e.g., cooperate or compete, depending on its test-time opponent's behavior.


In Monster-Hunt, there is a monster and two apples. The monster keeps moving towards its closest agent, while the apples are static. When a single agent meets the monster, it suffers a penalty of 2; if the two agents catch the monster at the same time, they each earn a bonus of 5. Eating an apple always gives an agent a bonus of 2. Whenever an apple is eaten or the monster meets an agent, the apple or the monster respawns at a random location. The monster may move over an apple during the chase; in this case, an agent that catches the monster and the apple at once gains the sum of the two rewards. A visualization of the trained policy is shown below.
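The per-step reward rules can be summarized in a short sketch (function and argument names are ours; movement and respawning are omitted):

```python
def monster_hunt_rewards(a1_pos, a2_pos, monster_pos, apple_positions):
    """Per-step rewards for the two agents under the Monster-Hunt rules.

    Positions are grid coordinates (tuples). Rewards simply sum when events
    coincide, e.g. catching the monster on top of an apple.
    """
    r1 = r2 = 0.0
    a1_meets, a2_meets = a1_pos == monster_pos, a2_pos == monster_pos
    if a1_meets and a2_meets:      # catching the monster together: +5 each
        r1 += 5; r2 += 5
    elif a1_meets:                 # meeting the monster alone: penalty of 2
        r1 -= 2
    elif a2_meets:
        r2 -= 2
    if a1_pos in apple_positions:  # eating an apple: +2
        r1 += 2
    if a2_pos in apple_positions:
        r2 += 2
    return r1, r2
```

For example, an agent that meets the monster alone on an apple nets -2 + 2 = 0, while two agents catching the monster on an apple each collect the corresponding sum.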


In Escalation, the two agents appear at random positions and one grid cell lights up at initialization. If the two agents step on the lit grid simultaneously, each gains 1 point, the lit grid goes out, and an adjacent grid lights up. Both agents gain 1 point again if they step on the next lit grid together. But if one agent steps off the path, the other agent loses 0.9L points, where L is the current length of their joint path, and the game is over. Alternatively, if the two agents step off the path simultaneously, neither is punished and the game continues.
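The reward logic above can be sketched as follows (a simplification that tracks only rewards, termination, and the joint path length, not grid positions; names are ours):

```python
def escalation_step(a1_on_lit, a2_on_lit, L):
    """One step of Escalation's reward logic.

    L is the current length of the agents' joint path on lit grids.
    Returns (r1, r2, done, new_L).
    """
    if a1_on_lit and a2_on_lit:      # both on the lit grid: +1 each, path grows
        return 1.0, 1.0, False, L + 1
    if a1_on_lit != a2_on_lit:       # exactly one agent stays on the path:
        # the agent still on the lit grid loses 0.9 * L, and the game ends
        r1 = -0.9 * L if a1_on_lit else 0.0
        r2 = -0.9 * L if a2_on_lit else 0.0
        return r1, r2, True, L
    return 0.0, 0.0, False, L        # both step off: no punishment, game continues
```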


Agar is a popular multiplayer online game. Players control one or more cells in a Petri dish. The goal is to gain as much mass as possible by eating cells smaller than the player's cells while avoiding being eaten by larger ones. Larger cells move more slowly. Each player starts with one cell but can split a sufficiently large cell into two, allowing them to control multiple cells. Control is performed by mouse motion: all of a player's cells move towards the mouse position. We transformed the Free-For-All mode of Agar into an RL environment and open-sourced it as a new MARL testbed for a wide range of problems, such as cooperation, team formation, and intention modeling.

We consider a simplified scenario with 2 players (agents) and script cells with small mass, which automatically run away when an agent comes close. There is a low-risk non-cooperative strategy: the two agents stay away from each other and hunt script cells independently. Since the script cells move faster, it is challenging for a single agent to hunt them. By contrast, two agents can cooperate to encircle the script cells, which significantly boosts hunting efficiency. However, cooperation is extremely risky for the agent with less mass: the two agents need to stay close to cooperate, but the larger agent may defect by eating the smaller one for an immediate big bonus.

3.1 Split, Hunt, Attack, Cooperate

To quantify how cooperative or competitive the agents' behaviors are, we count 4 types of behaviors for each policy (visualized below):

- S(plit): catching a script cell by splitting;
- H(unt): catching a script cell without splitting;
- A(ttack): catching a learning-based agent's cell;
- C(ooperate): catching a script cell while the other learning-based agent is close by.

Roughly, cooperative agents mostly cooperate, while competitive agents exhibit more of the other 3 behaviors.
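A minimal sketch of the counting rule for a single catching event (the precedence among overlapping labels, e.g. a split catch with the partner nearby, is our assumption):

```python
def classify_catch(target_is_script, used_split, partner_nearby):
    """Label one catching event with the behavior statistics S/H/A/C.

    target_is_script: the caught cell belongs to a script agent (not a learner)
    used_split:       the catch was performed by splitting
    partner_nearby:   the other learning-based agent was close by
    """
    if not target_is_script:
        return "A"        # Attack: caught a learning-based agent's cell
    if used_split:
        return "S"        # Split: caught a script cell by splitting
    if partner_nearby:
        return "C"        # Cooperate: script cell caught with the partner close
    return "H"            # Hunt: caught a script cell alone, without splitting
```

Averaging these labels over evaluation episodes yields the #S/#H/#A/#C statistics reported below.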

All (learning-based) agents are grey or black, while script agents are colorful (the same below).





3.2 Diverse policies in standard settings

In the standard setting, an agent gets a reward of x when it gains x mass and a penalty of -x when it loses x mass. Throughout, w = [w0, w1] denotes the reward configuration: w0 scales the reward for eating the other learning-based agent, and w1 is the reward-sharing coefficient, i.e., each agent additionally receives w1 times its partner's reward.
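Under our reading of the configuration w = [w0, w1] (w0 scales the reward for mass eaten from the partner, w1 shares the partner's reward; function and argument names are ours), the per-step reward can be sketched as:

```python
def agar_rewards(gain_script, gain_partner, mass_lost, w, aggressive=False):
    """Per-agent reward under configuration w = [w0, w1].

    gain_script[i]:  mass agent i gained from script cells
    gain_partner[i]: mass agent i gained by eating the partner
    mass_lost[i]:    mass agent i lost (ignored in the aggressive setting)
    """
    w0, w1 = w
    base = []
    for i in range(2):
        r = gain_script[i] + w0 * gain_partner[i]
        if not aggressive:          # standard setting: penalty -x for losing mass x
            r -= mass_lost[i]
        base.append(r)
    # reward sharing: each agent also receives w1 times its partner's base reward
    return [base[0] + w1 * base[1], base[1] + w1 * base[0]]
```

For example, in the original game w = [1, 0], an agent that eats 3 mass of script cells plus 2 mass of its partner earns 5, while the eaten partner gets -2; with w = [0, 1] the same episode yields 1 for each, removing the incentive to attack.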

In the original game (w = [1, 0]), agents often split, hunt script cells, and attack each other, while they hardly ever cooperate. In other words, they are competitive. Because eating the other agent yields a positive reward, the agents are encouraged to kill each other rather than to cooperate; the risk of being eaten makes them mistakenly judge cooperation as not worthwhile compared with catching script cells alone. In fact, split and hunt are far less efficient than cooperation, so these agents ultimately earn lower rewards.

But when w = [1, 1], the agents cooperate most of the time and only occasionally perform the other 3 behaviors, so they are cooperative. Sharing rewards gives each agent an incentive to keep its partner alive, so cooperation is sustained when the policies converge.

w = [1, 0] (competitive)

#S:0.300 #H:0.361 #A:0.291 #C:0.483

w = [1, 1] (cooperative)

#S:0.859 #H:0.411 #A:0.526 #C:2.203

3.3 Diverse policies in aggressive settings

In the aggressive setting, an agent gets a reward of x when it gains x mass but receives no penalty when losing mass.

Reward sharing (w = [1, 1]) is effective in the standard setting but fails to encourage cooperation in the aggressive setting. Instead, it consistently produces sacrifice policies, where the smaller agent feeds itself to the larger one: it is too small to catch script cells efficiently, and staying alive yields less shared reward than sacrificing itself to give its partner a huge gain.

In the original game (w = [1, 0]), the result is similar to the standard setting: the agents are competitive.

When w = [0, 1] or w = [0, 0], cooperative policies are learned. In these settings, neither sacrificing to nor killing the partner yields any positive reward, so cooperation forms stably. Notably, when w = [0, 0] there is no reward sharing between the 2 agents at all, yet they still learn to cooperate.

w = [1, 1] (Sacrifice)

#S:0.836 #H:0.934 #A:0.552 #C:0.028

w = [1, 0] (Competitive)

#S:0.717 #H:0.361 #A:0.291 #C:0.483

w = [0, 1] (Cooperative)

#S:0.785 #H:0.344 #A:0.346 #C:2.327

w = [0, 0] (Cooperative aggressively)

#S:1.195 #H:0.699 #A:0.517 #C:1.603

When w = [0.5, 1], the agents occasionally learn to hack this setting. Since eating the partner still yields a 0.5x positive reward and both agents share rewards, they can exploit the reward function as a reward pump: the agents split into smaller balls and eat part of each other (without killing) to gain reward; the smaller agent then leaves to catch script cells until it is large enough to split again, and the process repeats. We call this behavior perpetual, by analogy with a perpetual-motion machine.
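Under our reading of the configuration (w0 scales the eat-partner reward, w1 shares the partner's reward, and the aggressive setting drops the mass-loss penalty), a single pump exchange is net positive for both agents, as this small arithmetic sketch shows:

```python
def pump_step(mass_transferred, w):
    """Rewards when one agent eats `mass_transferred` from its partner in the
    aggressive setting (no loss penalty for the victim). A sketch under our
    reading of w = [w0, w1]; names are ours."""
    w0, w1 = w
    r_eater = w0 * mass_transferred   # direct reward for eating the partner
    r_victim = 0.0                    # aggressive setting: losing mass costs nothing
    # reward sharing: each agent also receives w1 times the other's reward
    return r_eater + w1 * r_victim, r_victim + w1 * r_eater

pump_step(2.0, [0.5, 1])  # both agents profit although no script cell was eaten
pump_step(2.0, [0.0, 1])  # with w0 = 0 the pump yields nothing, so no incentive
```

With w = [0.5, 1], transferring 2 mass pays each agent 1 reward while no mass enters the system from script cells, which is exactly why the loop can repeat indefinitely; with w0 = 0 (as in w = [0, 1]) the exploit disappears.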

Interestingly, the perpetual policy gains the most reward among all 5 joint policies. But the agents do not discover the hack every time: sometimes they learn to cooperate normally (as with w = [0, 1]), and sometimes they learn to sacrifice (as with w = [1, 1]).

w = [0.5, 1] (Perpetual, visualized here | Cooperative | Sacrifice)

#S:0.678 #H:0.617 #A:0.670 #C:0.550