Coordinated Exploration via Intrinsic Rewards for Multi-Agent Reinforcement Learning

Anonymous NeurIPS 2019 Submission 5028

In order to understand the effects of our various intrinsic reward functions, we visualize each agent's intrinsic reward values over the whole map, alongside its recently visited cell distribution, over the course of training (500,000 steps). Intrinsic rewards (plotted in purple) are darker in regions with higher rewards, and the cell distribution (plotted in the respective agent's color) is darker for cells where the agent has spent more time in the most recent 1500 steps. We use the 2-agent version of task 1 for these visualizations, where agents must cooperatively collect both treasures on the map.
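For concreteness, the sketch below shows one way the recently-visited-cell heatmap could be maintained. The 1500-step window comes from the description above; the grid dimensions and the RecentVisitMap helper are placeholder assumptions for illustration, not the actual visualization code.

```python
import numpy as np
from collections import deque

GRID_H, GRID_W = 16, 16   # hypothetical map dimensions
WINDOW = 1500             # most recent steps shown in each agent's heatmap

class RecentVisitMap:
    """Tracks a sliding window of one agent's visited cells."""

    def __init__(self, grid_shape=(GRID_H, GRID_W), window=WINDOW):
        self.grid_shape = grid_shape
        self.recent = deque(maxlen=window)  # oldest cells drop out automatically

    def record(self, cell):
        """cell is a (row, col) index of the agent's current position."""
        self.recent.append(cell)

    def heatmap(self):
        """Fraction of the last `window` steps spent in each cell (darker = more time)."""
        counts = np.zeros(self.grid_shape)
        for r, c in self.recent:
            counts[r, c] += 1
        return counts / max(len(self.recent), 1)
```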

rewmap_indepexplore.mp4

Independent Rewards

  • Each agent explores the whole map independently, as expected
  • Even after 500,000 steps, agents are still exploring and have not learned to solve the task
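As a rough illustration of this case, the sketch below assumes a count-based novelty of the form 1/sqrt(n + 1) over per-agent visit counts; the novelty function and the independent_reward_maps helper are assumptions for illustration, not the exact definitions used in the paper.

```python
import numpy as np

def novelty(counts):
    # Assumed count-based novelty: high for rarely visited cells.
    return 1.0 / np.sqrt(counts + 1.0)

def independent_reward_maps(visit_counts):
    """visit_counts: (n_agents, H, W) per-agent visit counts.
    Each agent's reward map depends only on its own counts, so nothing
    discourages two agents from exploring the same cells."""
    return novelty(visit_counts)
```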
rewmap_minexplore.mp4

Minimum Rewards

  • Agents avoid redundant exploration and generally explore only areas that no agent has visited yet
  • This particular run is a failure case: one agent happens to explore both treasure regions first, so the other agent never bothers to explore those areas
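Under the same assumed count-based novelty, a minimum-style reward map could be sketched as below; minimum_reward_maps is a hypothetical helper, not the actual implementation.

```python
import numpy as np

def novelty(counts):
    # Same assumed count-based novelty as in the independent sketch.
    return 1.0 / np.sqrt(counts + 1.0)

def minimum_reward_maps(visit_counts):
    """All agents share the minimum novelty across agents: once any agent has
    visited a cell, it stops paying intrinsic reward for everyone. This avoids
    redundant exploration, but also explains the failure case above: the agent
    that reaches the treasure regions first removes them from the other
    agent's reward map."""
    min_map = novelty(visit_counts).min(axis=0)          # (H, W)
    return np.broadcast_to(min_map, visit_counts.shape)  # same map for all agents
```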
rewmap_meanexplore.mp4

Mean Rewards

  • With mean rewards, agents can visit the same cells repeatedly and continue to receive high rewards as long as the other agent never goes there. As a result, this run leads to degenerate behavior where agents exploit the intrinsic rewards immediately available to them rather than exploring
  • Mean rewards may be more effective when paired with a reward that encourages policy similarity, so that agents explore the same areas
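Again under the assumed count-based novelty, a mean-style reward map could look like the sketch below (mean_reward_maps is a hypothetical helper, not the actual implementation).

```python
import numpy as np

def novelty(counts):
    # Same assumed count-based novelty as in the independent sketch.
    return 1.0 / np.sqrt(counts + 1.0)

def mean_reward_maps(visit_counts):
    """All agents share the mean novelty across agents. In the 2-agent case,
    a cell one agent has visited many times still pays out roughly half the
    maximum reward as long as the other agent has never been there, which is
    the degenerate incentive described above."""
    mean_map = novelty(visit_counts).mean(axis=0)        # (H, W)
    return np.broadcast_to(mean_map, visit_counts.shape)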
rewmap_3explore.mp4

Covering Rewards

  • Agents jump around the map, frequently switching which regions they are exploring
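The exact covering definition is not restated on this page. Purely as an illustration consistent with the region-switching behavior above, the sketch below gates each agent's own (assumed count-based) novelty on whether its visit count is below the average agent's; treat this gating as an assumption, not the paper's formula.

```python
import numpy as np

def novelty(counts):
    # Same assumed count-based novelty as in the independent sketch.
    return 1.0 / np.sqrt(counts + 1.0)

def covering_reward_maps(visit_counts):
    """Illustrative gating only: an agent is rewarded in cells it has visited
    less than the average agent, so it is drawn toward regions the others
    have already explored and moves on once it catches up."""
    own_novelty = novelty(visit_counts)  # (n_agents, H, W)
    below_average = visit_counts < visit_counts.mean(axis=0, keepdims=True)
    return own_novelty * below_average
```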
rewmap_4explore.mp4

Burrowing Rewards

  • Agents generally each commit to one region and explore it fully, with very little switching between regions
  • Reliably solves the task
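Likewise for burrowing, the sketch below is only an illustration consistent with the commit-to-one-region behavior above: each agent's (assumed count-based) novelty is gated on its visit count being at or above the average agent's, which keeps it pushing the frontier of its own region. This gating is an assumption, not the paper's formula.

```python
import numpy as np

def novelty(counts):
    # Same assumed count-based novelty as in the independent sketch.
    return 1.0 / np.sqrt(counts + 1.0)

def burrowing_reward_maps(visit_counts):
    """Illustrative gating only: an agent is rewarded in cells it has visited
    at least as much as the average agent, so it keeps exploring its own
    region and ignores regions the other agent has claimed."""
    own_novelty = novelty(visit_counts)  # (n_agents, H, W)
    at_or_above_average = visit_counts >= visit_counts.mean(axis=0, keepdims=True)
    return own_novelty * at_or_above_average
```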
rewmap_multi01234explore.mp4

Multi-Explore

  • The policy selector's probabilities at each step are plotted below
  • Reward maps are a sum of the maps for all intrinsic reward types, weighted by the selector probabilities
  • Note that Multi-Explore will sometimes use reward types that are not successful on their own (e.g., independent) when they are more effective given what the agents have seen thus far
  • Reliably solves the task
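The weighted combination described above maps directly to a short computation; the array shapes and the multi_explore_reward_maps name are assumptions for illustration.

```python
import numpy as np

def multi_explore_reward_maps(reward_maps_by_type, selector_probs):
    """Combine the per-type reward maps into the displayed Multi-Explore map.

    reward_maps_by_type: (n_types, n_agents, H, W) maps for each intrinsic reward type
    selector_probs:      (n_types,) policy-selector probabilities at the current step
    """
    weights = np.asarray(selector_probs).reshape(-1, 1, 1, 1)
    return (weights * reward_maps_by_type).sum(axis=0)  # (n_agents, H, W)
```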