MASRL (Competitive Multi-Agent Self-supervised Reinforcement Learning)
Training
The RL loss comes from the policy and value outputs, while the self-supervised loss comes from comparing the predicted opponent policy against the opponent's actual policy.
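A minimal sketch of how these two terms could be combined into one training objective, assuming a PPO-style policy/value loss and a cross-entropy term between the predicted opponent policy and the opponent's actual actions; all names and coefficients here are illustrative, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def masrl_loss(policy_logits, values, returns, advantages,
               old_log_probs, actions,
               pred_opp_logits, opp_actions,
               clip_eps=0.2, value_coef=0.5, ssl_coef=1.0):
    """Combined objective: PPO-style RL loss plus self-supervised
    opponent-prediction loss (illustrative coefficients)."""
    # RL loss from the policy output (clipped surrogate) and value output (regression).
    log_probs = torch.distributions.Categorical(logits=policy_logits).log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages)
    policy_loss = -surrogate.mean()
    value_loss = F.mse_loss(values, returns)

    # Self-supervised loss: predicted opponent policy vs. the opponent's actual actions
    # (a KL divergence against the opponent's full policy would also fit the description).
    ssl_loss = F.cross_entropy(pred_opp_logits, opp_actions)

    return policy_loss + value_coef * value_loss + ssl_coef * ssl_loss
```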
Here, Player 1 is trained while Player 2 is held fixed. Every m steps the roles alternate, so the two players are trained in an interleaved manner.
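A schematic of this interleaved schedule, assuming a hypothetical update_fn that performs one training update for the learning player against the frozen opponent:

```python
def train_alternating(players, total_updates, m, update_fn):
    """Interleaved self-play: one player learns for m updates while the
    other is held fixed, then the roles swap (schematic, not the authors' code)."""
    learner, fixed = 0, 1
    for step in range(total_updates):
        # Only the learner's parameters are updated; the fixed player
        # acts with its frozen policy as the opponent.
        update_fn(players[learner], players[fixed])
        if (step + 1) % m == 0:
            learner, fixed = fixed, learner  # alternate roles every m steps
```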
Learned Representation Visualization
A. Game of Pong with Pixel-PPO (baseline) on the left and MASRL (ours) on the right.
B. CNN attention visualization from the POV of Pixel-PPO (baseline), showing that it focuses only on itself and the ball.
C. CNN attention visualization from the POV of MASRL (ours), showing that it focuses not only on itself and the ball but also on its opponent. Knowing the opponent's position is a necessary ingredient for opponent exploitation; one way such attention maps can be computed is sketched below.
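The captions above do not fix the visualization method; one common way to obtain such per-pixel attention maps is gradient saliency, sketched here under that assumption:

```python
import torch

def saliency_map(policy_net, obs):
    """Per-pixel gradient saliency for the chosen action: one common way
    to produce attention maps like panels B and C (the paper's exact
    visualization method may differ)."""
    obs = obs.clone().detach().requires_grad_(True)   # (1, C, H, W) pixel frame
    logits = policy_net(obs)                          # action logits
    logits.max(dim=-1).values.sum().backward()        # gradient of the top action's logit
    return obs.grad.abs().max(dim=1).values           # (1, H, W) importance map
```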
Best Response Plot
Training MASRL (ours, blue) and Pixel-PPO (baseline, orange) from scratch against an expert opponent.
MASRL's self-supervised loss allows it to learn a strategy that attains higher reward at a faster rate.
Full Table of Results against MAPPO, Pixel-PPO, MAA2C, MAACKTR, and MAACER
Boxing
MASRL (dark agent)
Pixel-PPO (light agent)
Tennis
MASRL (pink agent)
Pixel-PPO (blue agent)