Stackelberg Games for Learning Emergent Behaviors
During Competitive Autocurricula
Paper Abstract
Autocurricular training is an important sub-area of multi-agent reinforcement learning (MARL) that allows multiple agents to learn emergent skills in an unsupervised, co-evolving scheme. The robotics community has experimented with autocurricular training on physically grounded problems, such as robust control and interactive manipulation tasks. However, the asymmetric nature of these tasks makes the generation of sophisticated policies challenging. Indeed, the asymmetry in the environment may implicitly or explicitly provide an advantage to a subset of agents, which could, in turn, lead to a low-quality equilibrium. This paper proposes a novel game-theoretic algorithm, Stackelberg Multi-Agent Deep Deterministic Policy Gradient (ST-MADDPG), which formulates a two-player MARL problem as a Stackelberg game with one player as the `leader' and the other as the `follower' in a hierarchical interaction structure wherein the leader has an advantage. We first demonstrate that the leader's advantage from ST-MADDPG can be used to alleviate the inherent asymmetry in the environment. By exploiting the leader's advantage, ST-MADDPG improves the quality of a co-evolution process and results in more sophisticated and complex strategies that work well even against an unseen strong opponent.
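The abstract does not spell out the update rule, but a common way to realize a Stackelberg leader in gradient-based MARL is to give the leader a total-derivative gradient that anticipates the follower's (regularized) best response. The sketch below is a minimal, hypothetical PyTorch illustration of that idea, not the authors' released implementation; the function names, the conjugate-gradient solve, and the way the regularization constant enters (cf. the Reg500/Reg50 settings referenced in the fencing demos) are all assumptions made for illustration.

```python
# Minimal sketch (assumptions, not the authors' code) of a Stackelberg-style
# leader gradient:
#   dJ_L/dθ_L = ∂J_L/∂θ_L - ∂²J_F/∂θ_L∂θ_F (∂²J_F/∂θ_F² + reg·I)^{-1} ∂J_L/∂θ_F
# Assumes leader_loss and follower_loss are built on the same batch with both
# policies in the computation graph (as with centralized critics), so that
# follower_loss is differentiable w.r.t. the leader's parameters.
import torch


def flat_grad(loss, params, create_graph=False):
    grads = torch.autograd.grad(loss, params, create_graph=create_graph,
                                retain_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])


def stackelberg_leader_grad(leader_loss, follower_loss,
                            leader_params, follower_params,
                            reg=500.0, cg_iters=10):
    g_L = flat_grad(leader_loss, leader_params)           # ∂J_L/∂θ_L
    g_LF = flat_grad(leader_loss, follower_params)         # ∂J_L/∂θ_F
    g_F = flat_grad(follower_loss, follower_params,         # kept in the graph
                    create_graph=True)                      # for second derivatives

    def hvp(v):
        # (∂²J_F/∂θ_F² + reg·I) v via a Hessian-vector product
        Hv = torch.autograd.grad(g_F, follower_params, grad_outputs=v,
                                 retain_graph=True)
        return torch.cat([h.reshape(-1) for h in Hv]) + reg * v

    # Conjugate gradient: solve (H + reg·I) w = ∂J_L/∂θ_F for w
    w = torch.zeros_like(g_LF)
    r, p = g_LF.clone(), g_LF.clone()
    for _ in range(cg_iters):
        Ap = hvp(p)
        alpha = (r @ r) / (p @ Ap + 1e-12)
        w = w + alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r + 1e-12)
        p, r = r_new + beta * p, r_new

    # Mixed second derivative applied to w: ∂²J_F/∂θ_L∂θ_F · w
    mixed = torch.autograd.grad(g_F, leader_params, grad_outputs=w,
                                retain_graph=True)
    mixed = torch.cat([m.reshape(-1) for m in mixed])
    return g_L - mixed   # flat leader gradient; split back per-parameter to apply
```

In this sketch, a larger `reg` value shrinks the implicit best-response term, so the leader's update interpolates back toward an independent-gradient (MADDPG-style) step; the follower would still be trained with its ordinary gradient.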
Competitive-Cartpoles Demonstration
MADDPG: The players resulting from the symmetric environment are able to keep their own poles upright, yet fail to break the other agent's balance and win the game in most of the competitions.
ST-MADDPG: Some leaders in Stackelberg games learned to pull the follower out of the frame to win the game.
Robust Control for Hopper Demonstration
This hopper agent was trained by MADDPG with adversarial disturbance. Under random disturbances of significantly higher intensity than those of the training environment, this agent was not able to keep its balance for very long.
This hopper agent was trained by ST-MADDPG with adversarial disturbance. The hopper agent was set to be the leader of the Stackelberg game, and the disturbance generator was the follower. Under intense random disturbances, this agent performed significantly better than the one trained with MADDPG.
Fencing Game Demonstration
Heuristic Baseline Protector vs. Attacker from LA0_Reg500 #3
This video demonstrates the behavior of the best-performing attacker from the ST-MADDPG settings with the protector as the leader (i.e. LA0_Reg500 #3 in Fig.3). This attacker is less aggressive: it engages less in the game and does not constantly try to attack the target area.
Heuristic Baseline Protector vs. Attacker from Normal #9
This video demonstrates the behavior of the best-performing attacker from the non-Stackelberg setting (i.e. Normal #9 in Fig.3).
Heuristic Baseline Protector vs. Attacker from LA1_Reg500 #3
The top two attackers come from two ST-MADDPG settings where the attacker was the leader, with regularization values of 500 and 50, respectively (i.e. LA1_Reg500 #3 and LA1_Reg50 #9 in Fig.3). They both learned to trick the protector into moving to a less manipulable joint configuration, and then attack the target area at low risk while the protector is partially trapped and busy moving out of that configuration.
Heuristic Baseline Protector vs. Attacker from LA1_Reg500 #5
This video demonstrates the behavior of the worst-performing attacker from the whole population (i.e. LA1_Reg500 #5 in Fig.3). This is an example of an attacker that became too aggressive, over-engaging and performing poorly as a result. The protector was able to heavily penalize the attacker by constantly making contact with it within the target area.
The fencing game simulation environment we implemented in this work is physically accurate, and all the learned policies for this environment support zero-shot transfer to our real PR2 robot. We added a new demonstration video to the project website showing the real robot running an ST-MADDPG-trained defender policy and playing the fencing game with a real human user.