In multi-agent settings with mixed incentives, methods developed for zero-sum games have been shown to lead to detrimental outcomes. To address this issue, opponent shaping (OS) methods explicitly learn to influence the learning dynamics of co-players and empirically lead to improved individual and collective outcomes. However, OS methods have only been evaluated in low-dimensional environments due to the challenges associated with estimating higher-order derivatives or scaling model-free meta-learning. Alternative methods that scale to more complex settings either converge to undesirable solutions or rely on unrealistic assumptions about the environment or co-players. In this paper, we successfully scale an OS-based approach to general-sum games with temporally-extended actions and long time horizons for the first time. After analysing the representations of the meta-state and history used by previous algorithms, we propose a simplified version called SHAPER. We show empirically that SHAPER leads to improved individual and collective outcomes in a range of challenging settings from the literature. Lastly, we show that previous evaluation environments, such as the CoinGame, are inadequate for analysing temporally-extended general-sum interactions.
SHAPER is a meta-Reinforcement Learning (RL) based Opponent Shaping (OS) method. It learns a best response to co-players' learning dynamics by retaining its hidden state across episodes. SHAPER architecturally simplifies a previous meta-RL OS method and empirically improves its scalability.
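The sketch below illustrates the core architectural idea described above, under stated assumptions: a recurrent policy whose hidden state is carried across episode boundaries within a trial (and only reset between trials), so the agent can condition on the co-player's learning dynamics. The environment, GRU sizes, and the co-player's naive-learner update are illustrative placeholders, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the authors' code) of a recurrent shaping agent
# whose hidden state persists across episodes within a trial.
import numpy as np

rng = np.random.default_rng(0)

HIDDEN, OBS, ACTIONS = 16, 4, 2

# Tiny single-layer GRU cell standing in for the recurrent policy core.
W_z = rng.normal(scale=0.1, size=(HIDDEN, OBS + HIDDEN))
W_r = rng.normal(scale=0.1, size=(HIDDEN, OBS + HIDDEN))
W_h = rng.normal(scale=0.1, size=(HIDDEN, OBS + HIDDEN))
W_pi = rng.normal(scale=0.1, size=(ACTIONS, HIDDEN))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h, obs):
    x = np.concatenate([obs, h])
    z = sigmoid(W_z @ x)                                   # update gate
    r = sigmoid(W_r @ x)                                   # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([obs, r * h]))  # candidate state
    return (1 - z) * h + z * h_tilde

def policy(h):
    logits = W_pi @ h
    p = np.exp(logits - logits.max())
    return p / p.sum()

def run_trial(num_episodes=5, episode_len=10):
    """One meta-episode (trial): the co-player resets and learns within the trial;
    the shaping agent's hidden state is NOT reset between episodes, only between trials."""
    h = np.zeros(HIDDEN)                      # reset once per trial
    coplayer_params = np.zeros(ACTIONS)       # placeholder naive learner, resets per trial
    for _episode in range(num_episodes):
        obs = rng.normal(size=OBS)            # placeholder environment reset
        for _ in range(episode_len):
            h = gru_step(h, obs)              # hidden state persists across episodes
            _action = rng.choice(ACTIONS, p=policy(h))
            obs = rng.normal(size=OBS)        # placeholder environment transition
        # Placeholder for the co-player's learning update between episodes;
        # carrying `h` over episode boundaries lets the agent track this drift.
        coplayer_params += 0.1 * rng.normal(size=ACTIONS)
    return h

if __name__ == "__main__":
    final_hidden = run_trial()
    print("final hidden-state norm:", np.linalg.norm(final_hidden))
```

In this sketch, only the placement of the hidden-state reset distinguishes the shaping agent from a standard recurrent RL agent: resetting `h` per episode would discard the inter-episode information about the co-player's learning.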
[Figure: per-method snapshots at Episodes 0, 10, 50, 150, 300, and 500 for SHAPER, GOOD SHEPHERD, MFOS (ES), and MFOS (RL).]