Competitive Policy Optimization

A novel policy gradient approach that exploits the game-theoretic nature of competitive games to derive policy updates

Welcome to the project page of the competitive policy optimization (CoPO) paradigm for multi-agent reinforcement learning (MARL). With this blog, I (Manish) will try to convince you that CoPO is the right approach for MARL. The code and paper can be found in the links tab. This work is part of my master's thesis at ETH Zurich under the supervision of Dr. Alex Liniger, and was conducted while I was a visiting scholar in the CMS department at Caltech under Prof. Yisong Yue and Prof. Anima Anandkumar. I am thankful to all my co-authors for this work, with a very special thanks to my friend and supervisor Dr. Kamyar Azizzadenesheli for his consistent support and direction throughout the project.

Links: [Paper Code Slides]

Motivation

Before diving into competitive policy optimization, let's take a minute to see why multi-agent reinforcement learning and games are interesting, yet challenging. Almost all real-world scenarios involve multiple agents, which behave competitively, cooperatively, or in a mix of the two. Games help model such multi-agent scenarios, e.g., social dilemmas. Another reason why games are interesting is that simple rules lead to deep concepts, and intelligence is required to solve them. As an example, consider soccer, which is guided by a simple rule, namely to score goals, yet it requires the players to be smart and interactive to beat the opponent. This motivates researchers to develop intelligence, perform learning in games, and potentially come up with smart and interactive players. Moreover, recent advancements in games have shown that machines can not only compete but also learn superhuman strategies.

But in general, optimization in games is challenging. In chess or the game of Go, many actions are possible at every step, so the search space is very large and obtaining an optimal policy is practically infeasible. GANs are another example, where a generator and a discriminator are involved in a zero-sum game. They face issues such as mode collapse, and one has to train carefully to keep the two agents balanced. Autonomous driving is another application of MARL; it is challenging due to the dilemma that arises at turns and junctions. We used autonomous racing as one of our experimental setups, an extreme case where this dilemma exists throughout the track.

Even for simple games such as rock-paper-scissors, many MARL algorithms fail to converge to the Nash equilibrium (NE) and instead show cyclic behavior, because they optimize assuming that the opponent is fixed. Non-stationarity is another root cause of failure in MARL: each agent optimizes its policy with respect to its own action set, but the environment is also altered by the other agent. For this exact reason, single-agent RL algorithms often fail in multi-agent settings.
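As a toy illustration of this cycling (not from the paper, and not CoPO itself), the sketch below lets two softmax-parameterized players run simultaneous gradient ascent on rock-paper-scissors, each treating the other as fixed. The payoff matrix, step size, and initialization are chosen purely for illustration.

```python
import numpy as np

# Toy example: simultaneous gradient ascent in rock-paper-scissors, each
# player assuming the opponent is fixed (illustration only, not CoPO).
A = np.array([[0., -1., 1.],
              [1., 0., -1.],
              [-1., 1., 0.]])  # payoff of player 1; player 2 receives -A

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
theta1 = rng.normal(scale=0.1, size=3)  # logits of player 1's mixed strategy
theta2 = rng.normal(scale=0.1, size=3)  # logits of player 2's mixed strategy
eta = 0.5

for t in range(2000):
    p1, p2 = softmax(theta1), softmax(theta2)
    # gradients of each player's expected payoff w.r.t. its own logits
    # (chain rule through the softmax Jacobian diag(p) - p p^T)
    g1 = (np.diag(p1) - np.outer(p1, p1)) @ (A @ p2)
    g2 = (np.diag(p2) - np.outer(p2, p2)) @ (-A.T @ p1)
    theta1 += eta * g1  # each player ascends its own objective,
    theta2 += eta * g2  # assuming the other player stays fixed
    if t % 500 == 0:
        print(t, softmax(theta1).round(3))

# The printed strategies cycle around the uniform Nash equilibrium
# (1/3, 1/3, 1/3) instead of converging to it.
```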

Overview

In this work, we attempt to address such problems by revisiting the principal equations of multi-agent reinforcement learning. We propose a paradigm called competitive policy optimization, which leads to two algorithms: competitive policy gradient (CoPG) and trust-region competitive policy optimization. We incorporate the game-theoretic interactions between agents into their policy parameter update step and thereby mitigate the issues discussed above. Let's formally define what we mean by games and multi-agent reinforcement learning.


Multi-agent reinforcement learning

The Game formulation

How to solve the game?

Recursive Reasoning

Gradient and Bilinear terms in the form of Score function

The figure on the left shows the policy trajectory of player 1 vs. player 2 for a bilinear game while training with CoPO.

The figure on the right compares the gradient flow of GDA with that of CoPO. The gradient flow of GDA cycles or points outward, while the CoPO flow, which takes the players' future moves into account, points toward the Nash equilibrium and converges.
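To get a feel for this difference, here is a minimal numerical sketch (not the paper's implementation) on the bilinear game f(x, y) = xy, where x minimizes, y maximizes, and the Nash equilibrium is (0, 0). The competitive update is the closed-form competitive-gradient step for this particular game, in which each player accounts for the opponent's simultaneous move; the step size and starting point are arbitrary.

```python
import numpy as np

# Compare plain gradient descent-ascent (GDA) with a competitive update on
# the bilinear game f(x, y) = x * y (x minimizes, y maximizes; NE at (0, 0)).
eta = 0.2

def gda_step(x, y):
    # each player follows its own gradient, ignoring the opponent's move
    return x - eta * y, y + eta * x

def competitive_step(x, y):
    # closed-form competitive-gradient step for f(x, y) = x * y:
    # each player reacts to the opponent's anticipated simultaneous update
    x_new = x - eta * (y + eta * x) / (1 + eta**2)
    y_new = y + eta * (x - eta * y) / (1 + eta**2)
    return x_new, y_new

x_g, y_g = 1.0, 1.0
x_c, y_c = 1.0, 1.0
for _ in range(200):
    x_g, y_g = gda_step(x_g, y_g)
    x_c, y_c = competitive_step(x_c, y_c)

print("GDA distance to NE:        ", np.hypot(x_g, y_g))  # grows (spirals out)
print("Competitive distance to NE:", np.hypot(x_c, y_c))  # shrinks toward 0
```

With the same step size and starting point, the GDA iterate spirals away from the equilibrium while the competitive iterate contracts toward it, mirroring the two flows in the figure.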

What's next?

A closer look at CoPO shows that this paradigm does not require knowledge of the environment if sampled trajectories are available. It requires neither full observability of the states nor any structural assumption on the CoMDP, but the resulting Monte Carlo estimates suffer from high variance.
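As a single-agent toy example (not from the paper), the following sketch hints at why score-function (REINFORCE-style) Monte Carlo gradient estimates tend to be noisy: even for a one-dimensional Gaussian "policy" and a quadratic reward, the per-sample estimates have a standard deviation several times larger than the gradient itself. The reward function and sample size are chosen purely for illustration.

```python
import numpy as np

# Score-function (REINFORCE) Monte Carlo estimate of
# d/dmu E_{a ~ N(mu, 1)}[r(a)] with r(a) = -(a - 3)^2.
# The exact gradient is 2 * (3 - mu), i.e. 6 at mu = 0.
rng = np.random.default_rng(0)
mu, n_samples = 0.0, 1000

a = rng.normal(mu, 1.0, size=n_samples)   # sampled "actions"
r = -(a - 3.0) ** 2                       # returns
score = a - mu                            # d/dmu log N(a; mu, 1)
grad_samples = r * score                  # per-sample score-function terms

print("exact gradient:    ", 2 * (3 - mu))
print("MC estimate:       ", grad_samples.mean())
print("per-sample std dev:", grad_samples.std())  # far larger than the mean
```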

In the next tab, CoPG, we explicitly take the CoMDP structure into account to develop efficient algorithms and mitigate these variance-related issues.