Estimating the bilinear term with a score-function estimator suffers from high variance, especially for long trajectories, which results in slow learning or no learning at all. One solution is to exploit the structure of the environment and use a value function for the estimation. However, the value function has to be learned in parallel with exploring the environment. Another challenge is that the value function depends on the policy parameters but is not directly parameterized by them. In GANs, for example, applying this kind of method is straightforward because the loss function is directly parameterized by the generator and discriminator parameters, which is not the case in reinforcement learning. We extended the policy gradient theorem to the multi-agent case, and the bilinear term is then obtained by carefully differentiating this gradient with respect to the other agent's parameters, using a few calculus tricks (Appendix C). This gives rise to the competitive policy gradient theorem.
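To fix ideas, here is a rough sketch of the objects involved (the symbols f^i, γ, r^i, and K are my shorthand, not necessarily the paper's notation): player i's expected discounted return depends on both players' parameters, and the competitive policy gradient theorem gives expressions for both its gradient and the mixed second derivative with respect to the two parameter vectors.

```latex
f^i(\theta^1, \theta^2)
  = \mathbb{E}_{\tau \sim \pi^1_{\theta^1},\, \pi^2_{\theta^2}}
    \Big[ \sum_{k=0}^{K} \gamma^{k}\, r^i(s_k, a^1_k, a^2_k) \Big],
\qquad
\underbrace{\nabla_{\theta^i} f^i}_{\text{gradient term}},
\qquad
\underbrace{D^2_{\theta^i \theta^j} f^i
  = \nabla_{\theta^i} \big( \nabla_{\theta^j} f^i \big)^{\top}}_{\text{bilinear term}}
```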
Okay, this might look quite scary, so let's take a minute to parse it. The gradient term is similar to the one in the policy gradient theorem: it can be estimated from data and requires the gradient of the log-policy with respect to the policy parameters, multiplied by the value function.
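For reference, the single-agent policy gradient theorem has the familiar score-function-times-Q form; the gradient term here is its two-player analogue (standard notation assumed):

```latex
\nabla_{\theta} J(\theta)
  = \mathbb{E}_{\tau \sim \pi_{\theta}}
    \Big[ \sum_{k=0}^{K} \gamma^{k}\,
          \nabla_{\theta} \log \pi_{\theta}(a_k \mid s_k)\,
          Q^{\pi_{\theta}}(s_k, a_k) \Big]
```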
The bilinear term is what captures the interaction between the agents. What do we expect from a truly competitive player? The player would act while keeping in mind how the other player has been playing. This competitive essence is encapsulated in the bilinear term.
In the bilinear term, term 1 is the immediate interaction between the players at time step k; term 2 is the interaction of player i's behavior up to time step k with player j's reaction at time step k and the environment; and term 3 is the interaction of player j's behavior up to time step k with player i's reaction at time step k and the environment.
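Schematically, this is my reading of the three terms just described (notation assumed here; the exact statement and derivation are in the paper's Appendix C): each term pairs score functions of the two players, at step k or accumulated before it, with player i's Q function.

```latex
\begin{aligned}
D^2_{\theta^i \theta^j} f^i
 = \mathbb{E}_{\tau}\Bigg[ \sum_{k=0}^{K} \gamma^k \Big(
   & \underbrace{\nabla_{\theta^i}\log\pi^i(a^i_k \mid s_k)\,
                 \nabla_{\theta^j}\log\pi^j(a^j_k \mid s_k)^{\top}}_{\text{term 1: immediate interaction at step } k} \\
 +\; & \underbrace{\Big(\sum_{t=0}^{k-1}\nabla_{\theta^i}\log\pi^i(a^i_t \mid s_t)\Big)
                 \nabla_{\theta^j}\log\pi^j(a^j_k \mid s_k)^{\top}}_{\text{term 2: } i\text{'s play up to } k \,\times\, j\text{'s reaction at } k} \\
 +\; & \underbrace{\nabla_{\theta^i}\log\pi^i(a^i_k \mid s_k)
                 \Big(\sum_{t=0}^{k-1}\nabla_{\theta^j}\log\pi^j(a^j_t \mid s_t)\Big)^{\top}}_{\text{term 3: } j\text{'s play up to } k \,\times\, i\text{'s reaction at } k}
   \Big)\, Q^i(s_k, a^1_k, a^2_k) \Bigg]
\end{aligned}
```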
As with the policy gradient theorem, this estimator may perform poorly and suffer from high variance because it lacks a notion of relative improvement. To overcome that in the single-agent case, one subtracts a baseline that is a function of the state from Q. The same thing can be done here: we can use the advantage function instead of the Q function, and it can be proved that this does not alter the value of the bilinear term (Appendix D). It also extends to variants of the return such as Monte Carlo, 1-step temporal difference, or generalized advantage estimation, as sketched below.
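As a concrete illustration of those return variants, here is a minimal sketch (a generic implementation I am assuming, not the paper's code) of computing advantage estimates for one recorded trajectory using Monte Carlo returns, 1-step TD errors, or generalized advantage estimation (GAE):

```python
import numpy as np

def advantages(rewards, values, gamma=0.99, lam=0.95, mode="gae"):
    """Advantage estimates for one trajectory (hypothetical helper).

    rewards: r_0 .. r_{K-1}; values: V(s_0) .. V(s_K), with a bootstrap value last.
    mode: "mc"  -> discounted Monte Carlo return minus the value baseline,
          "td"  -> 1-step temporal-difference error,
          "gae" -> generalized advantage estimation with parameter lam.
    """
    rewards, values = np.asarray(rewards, float), np.asarray(values, float)
    K = len(rewards)
    # 1-step TD residuals: delta_k = r_k + gamma * V(s_{k+1}) - V(s_k)
    deltas = rewards + gamma * values[1:] - values[:-1]
    if mode == "td":
        return deltas
    if mode == "mc":
        # discounted return-to-go (bootstrapped from the final value) minus baseline
        returns = np.zeros(K)
        running = values[-1]
        for k in reversed(range(K)):
            running = rewards[k] + gamma * running
            returns[k] = running
        return returns - values[:-1]
    # GAE: exponentially weighted sum of future TD residuals
    adv = np.zeros(K)
    running = 0.0
    for k in reversed(range(K)):
        running = deltas[k] + gamma * lam * running
        adv[k] = running
    return adv
```

With lam = 0 the GAE estimate reduces to the 1-step TD error, and as lam approaches 1 it approaches the Monte Carlo estimate, which is the usual bias-variance trade-off between these variants.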
We conducted experiments on a set of six games, ranging from finite discrete state-action spaces to infinite continuous state-action spaces, and from single-shot repeated games to finite-horizon sequential games. We compare performance in terms of convergence behavior, stability at high learning rates, and policy evaluation. Detailed results are documented in the paper. Here are videos comparing the performance of policies learned by training CoPG agents with that of GDA agents.