Multi-agent games in dynamic nonlinear settings are challenging due to the time-varying interactions among the agents and the non-stationarity of the Nash equilibria. We consider model-free games, where agent transitions and costs are observed without knowledge of the underlying transition and cost functions. We propose a policy gradient approach that learns distributed policies respecting the communication structure of multi-team games with multiple agents per team. We model the policies as nonlinear feedback gains, parameterized by self-attention layers that account for the time-varying multi-agent communication topology.
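As a concrete illustration of this parameterization, the following is a minimal sketch (not the authors' implementation) of a policy whose per-agent feedback gain is produced by a self-attention layer masked by the communication graph. The module names, dimensions, and masking convention are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionFeedbackPolicy(nn.Module):
    """Sketch: per-agent nonlinear feedback gains from masked self-attention."""

    def __init__(self, state_dim, ctrl_dim, embed_dim=64, num_heads=4):
        super().__init__()
        self.embed = nn.Linear(state_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # Maps each agent's attended feature to a flattened gain matrix K_i.
        self.gain_head = nn.Linear(embed_dim, ctrl_dim * state_dim)
        self.ctrl_dim, self.state_dim = ctrl_dim, state_dim

    def forward(self, x, adjacency):
        # x: (batch, n_agents, state_dim) regulation errors.
        # adjacency: (batch, n_agents, n_agents) communication graph at the
        # current time step, with self-loops so each agent attends to itself.
        h = torch.relu(self.embed(x))
        # Boolean mask: True blocks attention between non-neighbors.
        mask = (adjacency == 0).repeat_interleave(self.attn.num_heads, dim=0)
        h, _ = self.attn(h, h, h, attn_mask=mask)
        K = self.gain_head(h).view(x.shape[0], x.shape[1],
                                   self.ctrl_dim, self.state_dim)
        # Nonlinear feedback: u_i = -K_i(neighborhood) @ x_i.
        return -torch.einsum('bnij,bnj->bni', K, x)

# Toy usage: 4 agents, 6-dim states, 2-dim controls, ring communication graph.
x = torch.randn(1, 4, 6)
ring = (torch.eye(4) + torch.roll(torch.eye(4), 1, dims=0)
        + torch.roll(torch.eye(4), -1, dims=0))
policy = AttentionFeedbackPolicy(state_dim=6, ctrl_dim=2)
u = policy(x, ring.unsqueeze(0))  # (1, 4, 2) control inputs
```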
Distributed Regulation Games
There is an optimality gap between distributed methods and the centralized optimal method in linear quadratic games (top left). Our method attains the same gap under network constraints in multi-agent linear quadratic games, even though cost and dynamics are unknown. Given known nonlinear costs and nonlinear multi-agent dynamics, DP-iLQR finds optimal open-loop trajectories for nonlinear multi-agent games that admit a formulation as a potential game (top right). Nevertheless, after 100 iterations the performance of our policy is nearly indistinguishable from this baseline. In terms of collision avoidance and goal-reaching, both methods obtain comparable results despite the differences in available information. Qualitatively, in a multi-agent nonlinear game, our method (bottom left) recovers the optimality of baseline policies (bottom right) computed with known cost and dynamics, despite this knowledge gap.
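To make the centralized-versus-distributed gap concrete, here is a hedged toy example (not the paper's benchmark): the centralized LQR gain is computed in closed form, then naively sparsified to respect a line communication graph, and both gains are rolled out on the same initial condition. The dynamics, graph, and cost weights are placeholder assumptions.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

n_agents, d = 3, 2                                  # three 2-D single integrators
A = np.eye(n_agents * d)
B = 0.1 * np.eye(n_agents * d)
# Coupled cost: relative-state penalty on a complete graph plus a regulation term.
L = n_agents * np.eye(n_agents) - np.ones((n_agents, n_agents))
Q = np.kron(L, np.eye(d)) + np.eye(n_agents * d)
R = 0.1 * np.eye(n_agents * d)

P = solve_discrete_are(A, B, Q, R)
K_cent = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # centralized optimal gain

# Line topology 1-2-3: zero every block that would use a non-neighbor's state.
mask = np.kron(np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]]), np.ones((d, d)))
K_dist = K_cent * mask                                   # naive distributed projection

def rollout_cost(K, x0, T=200):
    x, J = x0.copy(), 0.0
    for _ in range(T):
        u = -K @ x
        J += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u
    return J

x0 = np.random.default_rng(0).standard_normal(n_agents * d)
print(rollout_cost(K_cent, x0), rollout_cost(K_dist, x0))  # distributed >= centralized
```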
Multi-Agent Pursuit and Evasion with RL
We study a pursuit-and-evasion game in which two teams compete to optimize their own conflicting costs. We compare our policy with an MLP and an attention-based GNN. To assess performance, we make the two teams compete against each other using all possible policy combinations. As shown in the tables, our method is the strongest competitor for both the evaders and the pursuers. When the minimum distance is high with low standard deviation, the evaders are much better than the pursuers; when the number of catches is high and the minimum distance is low with low standard deviation, the pursuers are much better than the evaders. Hence, if the policies of both the evaders and the pursuers are strong competitors, the metrics should exhibit a minimum distance that is neither too high nor too low, with fairly high variance: a broad range of behaviors in which sometimes the evaders win and sometimes the pursuers win. Strong policies should outperform the others when competing against each other.
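For concreteness, the snippet below sketches how the cross-play metrics discussed above could be computed from logged trajectories. The array shapes, the catch radius, and the per-step catch-counting rule are assumptions rather than the benchmark's exact definitions.

```python
import numpy as np

def pursuit_metrics(evader_xy, pursuer_xy, catch_radius=0.2):
    """evader_xy: (T, n_evaders, 2), pursuer_xy: (T, n_pursuers, 2)."""
    # Pairwise evader-pursuer distances at every time step.
    diff = evader_xy[:, :, None, :] - pursuer_xy[:, None, :, :]
    dist = np.linalg.norm(diff, axis=-1)       # (T, n_evaders, n_pursuers)
    per_step_min = dist.min(axis=(1, 2))       # closest pair at each step
    return {
        "min_distance_mean": per_step_min.mean(),
        "min_distance_std": per_step_min.std(),
        # Count steps in which an evader is within the catch radius of some pursuer.
        "catches": int((dist.min(axis=2) < catch_radius).sum()),
    }

# Toy usage: random trajectories standing in for one evader/pursuer matchup;
# in the actual evaluation, one such table cell is filled per policy pairing.
rng = np.random.default_rng(0)
print(pursuit_metrics(rng.uniform(-1, 1, (100, 2, 2)),
                      rng.uniform(-1, 1, (100, 3, 2))))
```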
Experiments
We test our policy parameterization in a real-robot deployment, zero-shot transferring the policies trained in the pursuit-and-evasion game to the Georgia Tech Robotarium. An important aspect to consider is the presence of control barrier functions that ensure safety during robot operation. These are imposed by the Robotarium platform and cannot be modified by the user or by our method; they act on the boundaries of the arena and between robots, preventing collisions. Despite the gap in realism between the gym-based environment used to train the policy and the real-robot setting (including different agent dynamics and safety constraints), the policies display complex pursuit-and-evasion behaviors: the evaders deceive the pursuers and anticipate their attempts to catch them, while the pursuers counteract and eventually corner the evaders. These behaviors, first observed in simulation, carry over to the real deployment.
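To illustrate the kind of safety filter involved, here is a simplified, hedged sketch of a pairwise-distance control barrier function applied to commanded velocities. The Robotarium's actual filter solves a QP over all constraints (and also handles the arena boundary); this single-constraint closed-form projection, applied pair by pair with assumed gains and radii, is only a stand-in.

```python
import numpy as np

def cbf_filter(p, u_nom, d_safe=0.2, gamma=10.0):
    """p: (n, 2) robot positions, u_nom: (n, 2) nominal velocity commands."""
    u = u_nom.copy()
    n = len(p)
    for i in range(n):
        for j in range(i + 1, n):
            # Barrier h_ij >= 0 encodes "robots i and j are at least d_safe apart".
            h = np.dot(p[i] - p[j], p[i] - p[j]) - d_safe**2
            # Gradient of h with respect to the stacked velocities (u_i, u_j).
            a = np.concatenate([2 * (p[i] - p[j]), -2 * (p[i] - p[j])])
            u_ij = np.concatenate([u[i], u[j]])
            # CBF condition: dh/dt + gamma * h = a @ u_ij + gamma * h >= 0.
            violation = -(gamma * h + a @ u_ij)
            if violation > 0:
                # Minimal-norm correction (closed-form single-constraint QP).
                u_ij = u_ij + violation * a / (a @ a + 1e-12)
                u[i], u[j] = u_ij[:2], u_ij[2:]
    return u

# Toy usage: two robots commanded straight at each other get slowed/deflected.
p = np.array([[0.0, 0.0], [0.3, 0.0]])
u_nom = np.array([[0.2, 0.0], [-0.2, 0.0]])
print(cbf_filter(p, u_nom))
```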