Paper Abstract
Using competitive multi-agent reinforcement learning (MARL) methods to solve physically grounded problems, such as robust control and interactive manipulation tasks, has become more popular in the robotics community. However, the asymmetric nature of these tasks makes the generation of sophisticated policies challenging. Indeed, the asymmetry in the environment may implicitly or explicitly provide an advantage to a subset of agents which could, in turn, lead to a low-quality equilibrium. This paper proposes a novel game-theoretic MARL algorithm, Stackelberg Multi-Agent Deep Deterministic Policy Gradient (ST-MADDPG), which formulates a two-player MARL problem as a Stackelberg game with one player as the "leader" and the other as the "follower" in a hierarchical interaction structure wherein the leader has an advantage. In three asymmetric competitive robotics environments, we demonstrate how the ST-MADDPG algorithm can be used to improve the quality of co-evolution and result in more sophisticated and complex autonomous agents.
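As a rough illustration of the leader-follower structure described above, the sketch below shows a Stackelberg-style gradient update on a toy two-player objective: the leader differentiates through a one-step approximation of the follower's response, while the follower takes an ordinary gradient step. This is a minimal sketch, not the ST-MADDPG implementation; the toy losses, learning rates, and the one-step unrolled best response are illustrative assumptions (the actual algorithm operates on deep deterministic policy gradients with critics and replay buffers).

```python
# Minimal sketch of a Stackelberg-style gradient update on a toy two-player
# objective. This is NOT the ST-MADDPG implementation; it only illustrates
# the idea that the leader's gradient anticipates the follower's response.
# The toy losses, learning rates, and the one-step unrolled best response
# are illustrative assumptions.
import torch

torch.manual_seed(0)

# Leader parameters x and follower parameters y (stand-ins for policy weights).
x = torch.randn(2, requires_grad=True)
y = torch.randn(2, requires_grad=True)

eta_follower = 0.1   # follower (inner) learning rate -- assumed value
eta_leader = 0.05    # leader (outer) learning rate -- assumed value


def leader_loss(x, y):
    # Toy smooth objective; in ST-MADDPG this role is played by the leader's
    # negative expected return as estimated by its critic.
    return (x - 1.0).pow(2).sum() + (x * y).sum()


def follower_loss(x, y):
    # Toy follower objective; the follower reacts to the leader's parameters.
    return (y + x).pow(2).sum()


for step in range(200):
    # 1) Approximate the follower's best response with one unrolled gradient
    #    step, keeping the graph so the leader can differentiate through it.
    g_y = torch.autograd.grad(follower_loss(x, y), y, create_graph=True)[0]
    y_response = y - eta_follower * g_y

    # 2) Leader update: total derivative of the leader's loss evaluated at
    #    the follower's anticipated response -- this is the leader's advantage.
    g_x = torch.autograd.grad(leader_loss(x, y_response), x)[0]
    with torch.no_grad():
        x -= eta_leader * g_x

    # 3) Follower update: ordinary gradient step on its own loss.
    g_y_plain = torch.autograd.grad(follower_loss(x, y), y)[0]
    with torch.no_grad():
        y -= eta_follower * g_y_plain

print("leader params:", x.detach().numpy())
print("follower params:", y.detach().numpy())
```

The key point is step 2: the leader's gradient accounts for how the follower will react, which encodes the hierarchical leader-follower structure of the Stackelberg game.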
Competitive-Cartpoles Demonstration
MADDPG: The players trained in the symmetric environment are able to keep their own poles upright, yet in most competitions they fail to break the other agent's balance and win the game.
ST-MADDPG: Some leaders in the Stackelberg games learned to pull the follower out of the frame to win the game.
Robust Control for Hopper Demonstration
This hopper agent was trained with MADDPG under adversarial disturbance. Under random disturbance of significantly higher intensity than that of the training environment, this agent was not able to keep its balance for very long.
This hopper agent was trained with ST-MADDPG under adversarial disturbance. The hopper agent was set as the leader of the Stackelberg game, and the disturbance generator was the follower. Under intense random disturbance, this agent performed significantly better than the one trained with MADDPG.
Fencing Game Demonstration
Attacker from LA1_Reg500 #3 vs. Heuristic Baseline Protector
The top two attackers come from two ST-MADDPG settings in which the attacker was the leader, with regularization values of 500 and 50, respectively (i.e. LA1_Reg500 #3 and LA1_Reg50 #9 in Fig.3). Both learned to trick the protector into moving to a less manipulable joint configuration and then attack the target area at low risk while the protector is partially trapped and busy moving out of that configuration.
Attacker from Normal #9 vs. Heuristic Baseline Protector
This video demonstrates the behavior of the best-performing attacker from the non-Stackelberg setting (i.e. Normal #9 in Fig.3).
Attacker from LA0_Reg500 #3 vs. Heuristic Baseline Protector
This video demonstrates the behavior of the best-performing attacker from the ST-MADDPG settings with the protector as the leader (i.e. LA0_Reg500 #3 in Fig.3). This attacker is less aggressive: it engages less in the game and does not constantly try to attack the target area.
Attacker from LA1_Reg500 #5 vs. Heuristic Baseline Protector
This video demonstrates the behavior of the worst-performing attacker in the whole population (i.e. LA1_Reg500 #5 in Fig.3). This is an example of an attacker that became too aggressive and overly engaged, resulting in poor performance: the protector was able to heavily penalize the attacker by constantly making contact with it within the target area.