Promoting Coordination through Policy Regularization in Multi-Agent Deep Reinforcement Learning

Julien Roy*, Paul Barde*, Félix G. Harvey, Derek Nowrouzezahrai, Christopher Pal

* indicates equal contribution

Accepted for poster presentation at the NeurIPS 2020 main conference

Link to Paper

We present here some rollouts of agents trained with CoachReg on the tasks defined in Figure 4 of the submitted paper, along with a short reminder of each task the agents learned to solve. In these demonstrations, we are interested in showing that, in addition to succeeding at each task, the agents successfully synchronise by selecting the same sub-policies at the same time.
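The full CoachReg method is described in the paper, not on this page; as a rough, hypothetical sketch of the mechanism being visualised, assume each agent picks a one-hot mask over a small set of sub-policies from its own observation, and the colors in the videos below show which mask is currently active. The linear "networks" here are stand-ins for the learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)

N_AGENTS, N_SUB, OBS_DIM, ACT_DIM = 2, 4, 8, 2

# Hypothetical linear stand-ins for the learned mask-selector and sub-policy networks.
selector_w = rng.normal(size=(N_AGENTS, OBS_DIM, N_SUB))
subpolicy_w = rng.normal(size=(N_AGENTS, N_SUB, OBS_DIM, ACT_DIM))

def act(agent, obs):
    """Select a one-hot sub-policy mask from the observation, then act with it."""
    logits = obs @ selector_w[agent]                  # (N_SUB,) scores per sub-policy
    mask = np.eye(N_SUB)[np.argmax(logits)]           # one-hot mask (a "color" below)
    outputs = np.einsum("d,kda->ka", obs, subpolicy_w[agent])  # every sub-policy's action
    return mask, mask @ outputs                       # only the selected sub-policy acts

# Agents are "synchronised" at a timestep when they select the same mask
# from their (shared) view of the situation, e.g. the landmark layout.
obs = rng.normal(size=OBS_DIM)
masks = [act(a, obs)[0] for a in range(N_AGENTS)]
agreement = float(np.argmax(masks[0]) == np.argmax(masks[1]))
```

In this sketch, a regularizer that rewards mask agreement across agents would push `agreement` toward 1 over training, which is the synchronisation the rollouts illustrate.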

SPREAD task

(original colors)
(here, colors represent which sub-policy each agent has selected)

On SPREAD, the agents learn to recognise different situations based on how the landmarks are placed in the environment. Here, we see that the agents use the RED sub-policy for vertically aligned landmarks, GREEN for horizontally aligned landmarks, and PURPLE and GREEN for less structured landmark placements.

BOUNCE task

(original colors)
(here, colors represent which sub-policy each agent has selected)

On BOUNCE, the agents have identified two distinct situations: either the target is on the left, in which case they use their GREEN sub-policy, or the target is on the right, in which case they use their PURPLE sub-policy.

COMPROMISE task

(original colors)
(here, colors represent which sub-policy each agent has selected)

On COMPROMISE, the agents mainly use two sub-policies (PURPLE and GREEN) in a synchronised fashion. While it is more difficult here to interpret what the mask changes mean, it is interesting to note that the agents change masks at precisely the same time, and have therefore learned some common representation of the task.

CHASE task

(original colors)
(here, colors represent which sub-policy each agent has selected)

Finally, on CHASE, the optimal strategy is to trap the prey in a corner. The agents have specialised in trapping it in the bottom-left corner and synchronise their sub-policies to push it there, using their PURPLE sub-policy when they are far from the corner and their BLUE sub-policy to keep the prey in place once it is trapped.