R-MADDPG for Partially Observable Environments and Limited Communication
Rose E. Wang*, Michael Everett, Jonathan P. How
- * corresponding author: rewang [at] mit [dot] edu
Several real-world tasks would benefit from multiagent reinforcement learning (MARL) algorithms, including coordination among self-driving cars. The real world imposes challenging conditions on multiagent learning systems, such as partial observability and nonstationarity. Moreover, if agents must share a limited resource (e.g., network bandwidth), they must all learn how to coordinate its use. This paper introduces a deep recurrent multiagent actor-critic framework (R-MADDPG) for handling multiagent coordination under partially observable settings and limited communication. We investigate the effects of recurrency on the performance and communication use of a team of agents. We demonstrate that the resulting framework learns time dependencies for sharing missing observations, handling resource limitations, and developing different communication patterns among agents.
This paper proposes three recurrent multiagent actor-critic models for partially observable and limited communication settings. The models take in only a single frame at each timestep. Because the agents cannot communicate all the time, they need a way to remember the last message they received from their team, when they last transmitted a message, and how their actions affect the communication budget over time. Recurrency acts as an explicit mechanism to do just that. Our models extend the multiagent actor-critic framework proposed in MADDPG to enable learning in a multiagent, partially observable, and limited communication domain.
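As a rough sketch of this idea (the layer sizes, names, and message encoding below are illustrative assumptions, not the exact architecture from the paper), a recurrent actor can carry both the current single-frame observation and the last received message through its hidden state:

```python
import torch
import torch.nn as nn

class RecurrentActor(nn.Module):
    """Illustrative recurrent actor: an LSTM over (observation, last message)
    lets the policy remember what it last heard from teammates across timesteps."""

    def __init__(self, obs_dim, msg_dim, act_dim, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim + msg_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, act_dim)

    def forward(self, obs, msg, hidden=None):
        # obs: (batch, 1, obs_dim) -- a single frame per timestep
        # msg: (batch, 1, msg_dim) -- last message received (zeros if none)
        x = torch.cat([obs, msg], dim=-1)
        out, hidden = self.lstm(x, hidden)   # hidden state carries memory forward
        action = torch.tanh(self.head(out))  # continuous action in [-1, 1]
        return action, hidden
```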
Shown on the left are the recurrent actor-critic models used in the experiments. The top row shows the models during training, and the bottom row shows them during execution. Actors communicate with each other and share information (m). If an actor decides not to communicate or has no communication budget left, an empty message is sent.
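A minimal sketch of the message gating described above (the function name and the zero-vector encoding of an "empty message" are our assumptions, not the paper's exact implementation):

```python
import numpy as np

def gated_message(send, message, budget, msg_dim=4):
    """If the agent chooses not to send, or the team budget is exhausted,
    broadcast an empty (zero) message; otherwise spend one unit of budget."""
    if send and budget > 0:
        return message, budget - 1
    return np.zeros(msg_dim), budget
```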
This illustrates the simultaneous arrival task. Multiple agents (in color) and a goal location (in black) are initialized at random locations, and the agents must coordinate to arrive at the goal simultaneously. This rollout assumes full observability using R-MADDPG.
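One way such a simultaneous-arrival objective could be scored is sketched below; the distance threshold, shaping term, and bonus value are illustrative assumptions, not the paper's reward function:

```python
import numpy as np

def arrival_reward(agent_positions, goal, threshold=0.1):
    """Illustrative team reward: shaping toward the goal, plus a bonus
    only when all agents are at the goal at the same timestep."""
    dists = [np.linalg.norm(np.asarray(p) - np.asarray(goal)) for p in agent_positions]
    shaping = -sum(dists)                                  # pull everyone toward the goal
    bonus = 10.0 if all(d < threshold for d in dists) else 0.0
    return shaping + bonus
```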
This rollout assumes a communication budget of 50 messages. The policy was learned by R-MADDPG (recurrent actor and recurrent critic) under partially observable settings.
Under fully observable settings (top row), MADDPG (red) and the recurrent variants (green, blue, orange) perform similarly. Under partially observable settings (bottom row), the recurrent actor (orange) and MADDPG (red) are unable to learn how to arrive simultaneously (d), or even how to move towards the goal (c). This demonstrates the importance of the recurrent critic in partially observable settings. For partial observability, the communication budget is set to 20 messages, shared between 2 agents over ∼100 timesteps per episode.
Cite:
@article{wang2019r,
  title={R-MADDPG for Partially Observable Environments and Limited Communication},
  author={Wang, Rose E and Everett, Michael and How, Jonathan P},
  year={2019}
}