Promoting Coordination through Policy Regularization in Multi-Agent Deep Reinforcement Learning
Julien Roy*, Paul Barde*, Félix G. Harvey, Derek Nowrouzezarhai, Christopher Pal
* indicates equal contribution
Accepted for poster presentation at the NeurIPS 2020 main conference
We present here some rollouts of agents trained with CoachReg on the tasks defined in Figure 4 of the submitted paper. As a reminder, here is a short description of each task that the agents learned to solve. In this demonstration, we are interested in showing how, in addition to succeeding at each task, the agents synchronise by using the same sub-policies at the same time.
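To make the notion of synchronisation concrete, here is a minimal toy sketch of how one can measure it: each agent has a small "mask head" that picks one of several sub-policies per step, and synchronisation is the fraction of steps on which all agents pick the same sub-policy index. The `choose_mask` helper and the linear mask heads are hypothetical stand-ins for illustration only; in the paper, this selection mechanism is learned and regularized by CoachReg.

```python
import numpy as np

rng = np.random.default_rng(0)

N_AGENTS, N_SUBPOLICIES, N_STEPS, OBS_DIM = 3, 4, 100, 5

def choose_mask(agent_weights, obs):
    """Toy mask head: pick the sub-policy index with the highest logit.

    Hypothetical helper; the paper's agents learn this choice."""
    logits = agent_weights @ obs
    return int(np.argmax(logits))

# Random per-agent mask heads over a shared observation (for illustration).
weights = [rng.normal(size=(N_SUBPOLICIES, OBS_DIM)) for _ in range(N_AGENTS)]

# Synchronisation rate: fraction of steps on which every agent
# activates the same sub-policy index.
sync = 0
for _ in range(N_STEPS):
    obs = rng.normal(size=OBS_DIM)  # shared observation of the scene
    masks = [choose_mask(w, obs) for w in weights]
    sync += all(m == masks[0] for m in masks)

sync_rate = sync / N_STEPS
print(f"synchronisation rate: {sync_rate:.2f}")
```

With random, independent mask heads this rate stays low; the rollouts below illustrate that trained CoachReg agents instead switch sub-policies in unison.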
SPREAD task
On SPREAD, the agents learn to recognise different situations based on how the landmarks are placed in the environment. Here, we see that the agents use the RED sub-policy for vertically aligned landmarks, the GREEN sub-policy for horizontally aligned landmarks, and the PURPLE and GREEN sub-policies for less structured landmark placements.
BOUNCE task
On BOUNCE, the agents have identified two distinct situations: either the target is on the left, in which case they use their GREEN sub-policy, or the target is on the right, in which case they use their PURPLE sub-policy.
COMPROMISE task
On COMPROMISE, the agents mainly use two sub-policies (PURPLE and GREEN) in a synchronised fashion. While this example is harder to interpret in terms of what the mask changes mean, it is nevertheless interesting to note that the agents change masks at precisely the same time, and have therefore learned some common representation of the task.
CHASE task
Finally, on CHASE, the optimal strategy is to trap the prey in a corner. The agents have specialised in trapping it in the bottom-left corner and synchronise their sub-policies to push it there: they use their PURPLE sub-policy when they are far from the corner and their BLUE sub-policy to keep the prey in place once it is trapped.