Figure A1. GIF illustration of the robotic combat game (Team Circle vs. Team Square, both using a MARL policy trained through online self-play).
Figure A2. Learning curves for CED and OSP under BC initialization (left) and random initialization (right). Each point is the win rate measured against the BC policy.
Figure A3. Win-rate tables for the CED and OSP policies under BC and random initialization at iteration 3000 (left) and iteration 5000 (right). Each win rate is estimated from 1000 game simulations.