Coordinated Exploration via Intrinsic Rewards for Multi-Agent Reinforcement Learning

Anonymous NeurIPS 2019 Submission 5028

In order to understand the effects of our various intrinsic reward functions, we visualize each agent's intrinsic reward values over the whole map, alongside its recently visited cell distribution, over the course of training (500,000 steps). Intrinsic rewards (plotted in purple) are darker in regions with higher rewards, and the cell distribution (plotted in the respective agent's color) is darker for cells where the agent has spent more time in the most recent 1500 steps. We use the 2-agent version of task 1 for these visualizations, where agents must cooperatively collect both treasures on the map.
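For concreteness, the sketch below shows one way the recently-visited-cell heatmap could be maintained. The 1500-step window comes from the description above; the grid dimensions and the RecentVisitMap helper are placeholder assumptions for illustration, not the actual visualization code.

```python
import numpy as np
from collections import deque

GRID_H, GRID_W = 16, 16   # hypothetical map dimensions
WINDOW = 1500             # most recent steps shown in each agent's heatmap

class RecentVisitMap:
    """Tracks a sliding window of one agent's visited cells."""

    def __init__(self, grid_shape=(GRID_H, GRID_W), window=WINDOW):
        self.grid_shape = grid_shape
        self.recent = deque(maxlen=window)  # oldest cells drop out automatically

    def record(self, cell):
        """cell is a (row, col) index of the agent's current position."""
        self.recent.append(cell)

    def heatmap(self):
        """Fraction of the last `window` steps spent in each cell (darker = more time)."""
        counts = np.zeros(self.grid_shape)
        for r, c in self.recent:
            counts[r, c] += 1
        return counts / max(len(self.recent), 1)
```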

rewmap_indepexplore.mp4

Independent Rewards

  • Each agent explores the whole map independently, as expected
  • Even after 500,000 steps, agents are still exploring and have not learned to solve the task
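As a rough illustration of this case, the sketch below assumes a count-based novelty of the form 1/sqrt(n + 1) over per-agent visit counts; the novelty function and the independent_reward_maps helper are assumptions for illustration, not the exact definitions used in the paper.

```python
import numpy as np

def novelty(counts):
    # Assumed count-based novelty: high for rarely visited cells.
    return 1.0 / np.sqrt(counts + 1.0)

def independent_reward_maps(visit_counts):
    """visit_counts: (n_agents, H, W) per-agent visit counts.
    Each agent's reward map depends only on its own counts, so nothing
    discourages two agents from exploring the same cells."""
    return novelty(visit_counts)
```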
rewmap_minexplore.mp4

Minimum Rewards

  • Agents avoid redundant exploration and generally explore only areas that no agent has visited yet
  • This particular run is a failure case: one agent happens to explore both treasure regions first, so the other agent never bothers to explore those areas
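Under the same assumed count-based novelty, a minimum-style reward map could be sketched as below; minimum_reward_maps is a hypothetical helper, not the actual implementation.

```python
import numpy as np

def novelty(counts):
    # Same assumed count-based novelty as in the independent sketch.
    return 1.0 / np.sqrt(counts + 1.0)

def minimum_reward_maps(visit_counts):
    """All agents share the minimum novelty across agents: once any agent has
    visited a cell, it stops paying intrinsic reward for everyone. This avoids
    redundant exploration, but also explains the failure case above: the agent
    that reaches the treasure regions first removes them from the other
    agent's reward map."""
    min_map = novelty(visit_counts).min(axis=0)          # (H, W)
    return np.broadcast_to(min_map, visit_counts.shape)  # same map for all agents
```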
rewmap_meanexplore.mp4

Mean Rewards

  • With mean rewards, agents can visit the same cells repeatedly and continue to receive high rewards as long as the other agent never goes there. As a result, this run leads to degenerate behavior where agents exploit the intrinsic rewards immediately available to them rather than exploring
  • Mean rewards may be more effective when paired with a reward that encourages policy similarity, so that agents explore the same areas
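Again under the assumed count-based novelty, a mean-style reward map could look like the sketch below (mean_reward_maps is a hypothetical helper, not the actual implementation).

```python
import numpy as np

def novelty(counts):
    # Same assumed count-based novelty as in the independent sketch.
    return 1.0 / np.sqrt(counts + 1.0)

def mean_reward_maps(visit_counts):
    """All agents share the mean novelty across agents. In the 2-agent case,
    a cell one agent has visited many times still pays out roughly half the
    maximum reward as long as the other agent has never been there, which is
    the degenerate incentive described above."""
    mean_map = novelty(visit_counts).mean(axis=0)        # (H, W)
    return np.broadcast_to(mean_map, visit_counts.shape)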
rewmap_3explore.mp4

Covering Rewards

  • Agents jump around the map, frequently switching which regions they are exploring
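The exact covering definition is not restated on this page. Purely as an illustration consistent with the region-switching behavior above, the sketch below gates each agent's own (assumed count-based) novelty on whether its visit count is below the average agent's; treat this gating as an assumption, not the paper's formula.

```python
import numpy as np

def novelty(counts):
    # Same assumed count-based novelty as in the independent sketch.
    return 1.0 / np.sqrt(counts + 1.0)

def covering_reward_maps(visit_counts):
    """Illustrative gating only: an agent is rewarded in cells it has visited
    less than the average agent, so it is drawn toward regions the others
    have already explored and moves on once it catches up."""
    own_novelty = novelty(visit_counts)  # (n_agents, H, W)
    below_average = visit_counts < visit_counts.mean(axis=0, keepdims=True)
    return own_novelty * below_average
```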
rewmap_4explore.mp4

Burrowing Rewards

  • Agents generally each commit to one region and explore it fully, with very little switching between regions
  • Reliably solves the task
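Likewise for burrowing, the sketch below is only an illustration consistent with the commit-to-one-region behavior above: each agent's (assumed count-based) novelty is gated on its visit count being at or above the average agent's, which keeps it pushing the frontier of its own region. This gating is an assumption, not the paper's formula.

```python
import numpy as np

def novelty(counts):
    # Same assumed count-based novelty as in the independent sketch.
    return 1.0 / np.sqrt(counts + 1.0)

def burrowing_reward_maps(visit_counts):
    """Illustrative gating only: an agent is rewarded in cells it has visited
    at least as much as the average agent, so it keeps exploring its own
    region and ignores regions the other agent has claimed."""
    own_novelty = novelty(visit_counts)  # (n_agents, H, W)
    at_or_above_average = visit_counts >= visit_counts.mean(axis=0, keepdims=True)
    return own_novelty * at_or_above_average
```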
rewmap_multi01234explore.mp4

Multi-Explore

  • The policy selector's probabilities at each step are plotted below
  • Reward maps are a sum of the maps for all intrinsic reward types, weighted by the selector probabilities
  • Note that Multi-Explore will sometimes use reward types that are not successful on their own (e.g., independent) when they are more effective given what the agents have seen thus far
  • Reliably solves the task
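The weighted combination described above maps directly to a short computation; the array shapes and the multi_explore_reward_maps name are assumptions for illustration.

```python
import numpy as np

def multi_explore_reward_maps(reward_maps_by_type, selector_probs):
    """Combine the per-type reward maps into the displayed Multi-Explore map.

    reward_maps_by_type: (n_types, n_agents, H, W) maps for each intrinsic reward type
    selector_probs:      (n_types,) policy-selector probabilities at the current step
    """
    weights = np.asarray(selector_probs).reshape(-1, 1, 1, 1)
    return (weights * reward_maps_by_type).sum(axis=0)  # (n_agents, H, W)
```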