Coordinated Exploration via Intrinsic Rewards for Multi-Agent Reinforcement Learning
Anonymous NeurIPS 2019 Submission 5028
In order to understand the effects of our various intrinsic reward functions, we visualize each agent's intrinsic reward values over the whole map, alongside the distribution of recently visited cells, over the course of training (500,000 steps). Intrinsic rewards (plotted in purple) are darker in regions with higher rewards, and the cell distribution (plotted in each agent's color) is darker in cells where that agent has spent more time over the most recent 1,500 steps. We use the two-agent version of Task 1 for these visualizations, in which the agents must cooperatively collect both treasures on the map.
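The two quantities plotted in each video can be sketched as follows. This is a minimal illustration, not the paper's exact code: the 1,500-step window matches the description above, but the specific novelty function (here assumed to be 1/sqrt(N + 1), a common count-based choice) is our assumption.

```python
import numpy as np

def visit_heatmap(positions, grid_shape, window=1500):
    """Cell-visit distribution over the most recent `window` steps.

    positions: list of (row, col) cells an agent occupied, oldest first.
    Returns a grid normalized to sum to 1 (darker = more time spent).
    """
    counts = np.zeros(grid_shape)
    for r, c in positions[-window:]:
        counts[r, c] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts

def novelty_map(lifetime_counts):
    """Count-based novelty map (assumed form: 1 / sqrt(N + 1)).

    Darker regions in the reward visualizations correspond to
    higher values here.
    """
    return 1.0 / np.sqrt(np.asarray(lifetime_counts, dtype=float) + 1.0)
```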
rewmap_indepexplore.mp4
Independent Rewards
- Each agent explores the whole map independently, as expected
- Even after 500,000 steps, agents are still exploring and have not learned to solve the task
rewmap_minexplore.mp4
Minimum Rewards
- Agents avoid redundant exploration and generally explore only areas that neither agent has explored yet
- This particular run shows a failure case: one agent happens to explore both treasure-containing regions first, so the other agent never bothers to explore those areas
rewmap_meanexplore.mp4
Mean Rewards
- With mean rewards, agents can revisit the same cells repeatedly and continue to receive high rewards as long as the other agent never goes there. This run therefore leads to degenerate behavior in which agents exploit their immediately available intrinsic rewards rather than exploring
- May be more effective when paired with a reward that encourages policy similarity, so that agents explore the same areas
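The independent, minimum, and mean variants differ only in how per-agent novelty maps are aggregated into each agent's reward. A minimal sketch of the three aggregations, assuming count-based novelty maps as input (the exact novelty function in the paper may differ):

```python
import numpy as np

def intrinsic_rewards(novelties, mode):
    """Combine per-agent novelty maps into per-agent intrinsic rewards.

    novelties: array of shape (n_agents, H, W), where novelties[j] is
    the novelty of each cell from agent j's perspective.
    Returns an array of the same shape: one reward map per agent.
    """
    novelties = np.asarray(novelties, dtype=float)
    if mode == "independent":
        # Each agent is rewarded for its own novelty only.
        return novelties.copy()
    if mode == "minimum":
        # Rewarded only where *no* agent has explored, so redundant
        # exploration earns nothing.
        return np.broadcast_to(novelties.min(axis=0), novelties.shape).copy()
    if mode == "mean":
        # Rewarded as long as the cell is novel to the *average* agent,
        # even if this agent has visited it many times.
        return np.broadcast_to(novelties.mean(axis=0), novelties.shape).copy()
    raise ValueError(f"unknown mode: {mode}")
```

Note how the minimum variant zeroes out any cell some agent has already explored, which produces the division of labor (and the failure case) described above, while the mean variant leaves reward available in cells an agent has already exhausted for itself.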
rewmap_3explore.mp4
Covering Rewards
- Agents jump around the map, frequently switching the regions they explore
rewmap_4explore.mp4
Burrowing Rewards
- Agents generally commit to one region and explore it fully, with very little switching
- Reliably solves the task
rewmap_multi01234explore.mp4
Multi-Explore
- Probabilities of the policy selector at each step are plotted below
- Reward maps are a weighted sum of all of the intrinsic reward types, weighted by the selector probabilities
- Note that Multi-Explore will sometimes use reward types that are unsuccessful on their own (e.g., independent) because they are more effective given what the agents have seen thus far
- Reliably solves the task
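The visualized Multi-Explore reward map described above, i.e. a weighted sum of the individual intrinsic reward maps with the policy selector's probabilities as weights, can be sketched as:

```python
import numpy as np

def multi_explore_map(reward_maps, selector_probs):
    """Weighted sum of intrinsic reward maps.

    reward_maps: array of shape (n_types, H, W), one map per intrinsic
    reward type (independent, minimum, mean, covering, burrowing).
    selector_probs: (n_types,) probabilities from the policy selector,
    summing to 1.
    """
    probs = np.asarray(selector_probs, dtype=float)
    maps = np.asarray(reward_maps, dtype=float)
    # Contract over the reward-type axis: result has shape (H, W).
    return np.tensordot(probs, maps, axes=1)
```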