UneVEn: Universal Value Exploration for Multi-Agent Reinforcement Learning

Training with p = 0.012

Example Episode #1: UneVEn-Greedy-GPI with p = 0.012

  • Blue entities represent the agents (predators) and orange entities represent the prey.

  • # Prey Captured (Max=3): 3

  • # of Miscoordinated Capture Attempts: 0

  • Return (Max=3): 3.0

  • The agents learn to minimize the number of miscoordinated capture attempts: each agent waits for the other agents to perfectly surround the prey before taking the capture action. They make minimal mistakes and thereby achieve a higher return.

Example Episode #2: UneVEn-Greedy-GPI with p = 0.012

  • # Prey Captured (Max=3): 3

  • # of Miscoordinated Capture Attempts: 2

  • Return (Max=3): 2.97

Example Episode #3: UneVEn-Uniform-GPI with p = 0.012

  • # Prey Captured (Max=3): 3

  • # of Miscoordinated Capture Attempts: 8

  • Return (Max=3): 2.904
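
The returns above are consistent with a reward of 1 per captured prey minus a penalty of p for each miscoordinated capture attempt, e.g. 3 - 8 * 0.012 = 2.904 for Episode #3 (this decomposition is inferred from the reported numbers, not stated on this page). A minimal sketch under that assumption:

```python
# Hypothetical helper reproducing the reported returns, assuming a reward of
# 1 per captured prey and a penalty of p per miscoordinated capture attempt.
# The function name and exact reward scheme are assumptions, not from the paper.
def episode_return(prey_captured, miscoordinated_attempts, p=0.012):
    return prey_captured * 1.0 - miscoordinated_attempts * p

print(episode_return(3, 0))  # Episode #1: 3.0
print(episode_return(3, 2))  # Episode #2: 2.976 (reported as 2.97)
print(episode_return(3, 8))  # Episode #3: 2.904
```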

Example Episode #4: Other MARL Baselines with p = 0.012

  • # Prey Captured (Max=3): 0

  • # of Miscoordinated Capture Attempts: 0

  • Return (Max=3): 0.0

  • The agents learn to minimize the number of miscoordinated capture attempts by completely avoiding the prey, as learning is stuck in a sub-optimal Nash equilibrium due to relative overgeneralization (illustrated by the sketch after this list).

  • This video shows a learned IQL policy with p = 0.012, but other MARL methods such as QMIX, WQMIX, VDN, QTRAN, QPLEX, and MAVEN also tend to learn a similar strategy and get stuck in sub-optimal solutions.
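
To illustrate relative overgeneralization, the sketch below runs independent Q-learning on a climbing-game style matrix game (the payoff numbers are standard illustrative values from the cooperative multi-agent literature, not taken from this task). Under exploration, the optimal joint action's average payoff looks worse than a safe action's, so both independent learners settle in a sub-optimal equilibrium, mirroring the prey-avoidance behavior described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Climbing-game style shared payoff matrix (illustrative numbers only):
# the joint action (0, 0) is optimal, but a unilateral attempt at it is
# punished hard, much like a miscoordinated capture attempt.
payoff = np.array([[ 11., -30.,   0.],
                   [-30.,   7.,   6.],
                   [  0.,   0.,   5.]])

q = np.zeros((2, 3))  # one independent Q-table per agent (stateless IQL)
alpha = 0.05

for t in range(60000):
    eps = max(0.05, 1.0 - t / 30000)  # anneal exploration from 1.0 to 0.05
    a = [rng.integers(3) if rng.random() < eps else int(np.argmax(q[i]))
         for i in range(2)]
    r = payoff[a[0], a[1]]            # shared reward for the joint action
    for i in range(2):
        q[i, a[i]] += alpha * (r - q[i, a[i]])  # bandit-style Q update

print(np.round(q, 2))
# Under near-uniform exploration, action 0 averages (11 - 30 + 0)/3 = -6.33,
# the worst of the three, so neither agent ever commits to it; both settle
# in a sub-optimal equilibrium (payoff 5 or 7) instead of the optimal 11.
```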

Please note that the playback speed of the videos can be adjusted for a better viewing experience.