Each agent observes only the joint it controls (theta_1 for the red agent, theta_2 for the blue agent) and the target location (in black).
clockwise expert (in dataset)
counter-clockwise expert (in dataset)
ITD3+BC agents fail to agree on a convention
MOMA-PPO agents agree on a convention and can even alternate between the two conventions depending on the target position
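The partial observability in the two-joint task above can be sketched as follows. This is a minimal illustration, not the project's actual environment code: the function name `split_observations`, the agent keys `"red"` and `"blue"`, and the choice to pass raw joint angles (rather than, say, a sine/cosine encoding) are all assumptions made for the example.

```python
from typing import Dict, Tuple

import numpy as np


def split_observations(
    theta_1: float,
    theta_2: float,
    target_xy: Tuple[float, float],
) -> Dict[str, np.ndarray]:
    """Build per-agent observations (hypothetical helper).

    Each agent sees only its own joint angle plus the shared target
    position; neither agent sees the other agent's joint.
    """
    target = np.asarray(target_xy, dtype=np.float64)
    return {
        # Red agent: its own joint angle and the target location.
        "red": np.concatenate(([theta_1], target)),
        # Blue agent: its own joint angle and the target location.
        "blue": np.concatenate(([theta_2], target)),
    }


# Example call with made-up values.
obs = split_observations(theta_1=0.3, theta_2=-1.2, target_xy=(0.5, 0.1))
```

Because neither observation contains the other agent's joint angle, the agents cannot infer from observations alone whether their teammate is following the clockwise or the counter-clockwise convention, which is why agreeing on one is the core difficulty here.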
Each agent controls a different limb and observes only the joints of that limb. The yellow agent (white in the gifs) is the only one that additionally observes the torso (in white).
Teams trained with ITD3+BC fail to coordinate and run in circles because the white agent does not manage to compensate for the other agents.
Teams trained with MOMA-PPO produce satisfactory behaviors, and the white agent learns to steer the ant in the correct direction.