Influence-Based Multi-Agent Exploration

Task: Pass

Two agents start in the upper-left corner and are rewarded only when both of them reach the other room through the door, which opens only while at least one of the switches is occupied by an agent.
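For readers who want a concrete picture of the task, here is a minimal Python sketch of a Pass-like gridworld. The grid size, switch locations, door position, and reward value are assumptions made for illustration, not the environment used in the experiments.

```python
import numpy as np

class PassEnv:
    """Minimal sketch of a Pass-like task (layout, sizes, and reward value are assumptions)."""

    MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]  # up, down, left, right, stay

    def __init__(self, size=30):
        self.size = size                                 # size x size grid, wall in the middle column
        self.wall_col = size // 2
        self.door = (size // 2, self.wall_col)           # door cell inside the wall (assumed position)
        self.switches = [(size - 1, 0), (0, size - 1)]   # one switch per room (assumed positions)
        self.reset()

    def reset(self):
        self.pos = [np.array([0, 0]), np.array([0, 1])]  # both agents start in the upper-left corner
        return self._obs()

    def _obs(self):
        # Each agent observes both positions (a simplifying assumption).
        return np.concatenate(self.pos).copy()

    def step(self, actions):
        # The door is open only while some agent stands on a switch.
        door_open = any(tuple(p) in self.switches for p in self.pos)
        for i, a in enumerate(actions):
            nxt = np.clip(self.pos[i] + self.MOVES[a], 0, self.size - 1)
            blocked = (nxt[1] == self.wall_col) and not (door_open and tuple(nxt) == self.door)
            if not blocked:
                self.pos[i] = nxt
        both_right = all(p[1] > self.wall_col for p in self.pos)
        reward = 10.0 if both_right else 0.0             # team reward only when both agents pass
        return self._obs(), reward, both_right, {}
```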

We run our methods and the baselines for 9000 PPO updates and show the visitation heat maps in the videos below. The evolution of several auxiliary rewards (EITI, EDTI, EDTI-intrinsic, r_influence, and Q-Q) during training is also shown in the top row.
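The visitation heat maps are simply aggregated state-visit counts over recent episodes. A minimal sketch of how such a map could be produced (the function name and the normalization to [0, 1] are assumptions):

```python
import numpy as np

def visitation_heatmap(trajectories, grid_size=30):
    """Aggregate per-episode lists of visited (row, col) cells into a normalized heat map."""
    counts = np.zeros((grid_size, grid_size))
    for episode in trajectories:
        for (r, c) in episode:
            counts[r, c] += 1
    return counts / max(counts.max(), 1.0)  # normalize for plotting
```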

EITI

Top row: EITI reward. Bottom row: visitation heat map during the last 100 PPO updates.
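Roughly speaking, EITI rewards agent i when its state-action changes the transition distribution of agent j. Below is a count-based sketch of one way such an influence reward could be estimated; the discrete state hashing, the smoothing constant, and the class interface are assumptions, not the implementation behind the curves above.

```python
from collections import defaultdict
import numpy as np

class InfluenceRewardEstimator:
    """Count-based sketch of an EITI-style influence reward for agent i acting on agent j.

    The per-step term is log p(s_j' | s_i, a_i, s_j, a_j) - log p(s_j' | s_j, a_j),
    with both conditionals estimated from visitation counts.
    """

    def __init__(self, eps=1e-2):
        self.joint = defaultdict(lambda: defaultdict(float))  # (s_i,a_i,s_j,a_j) -> {s_j': count}
        self.marg = defaultdict(lambda: defaultdict(float))   # (s_j,a_j) -> {s_j': count}
        self.eps = eps                                        # crude smoothing, an assumption

    def update(self, s_i, a_i, s_j, a_j, s_j_next):
        self.joint[(s_i, a_i, s_j, a_j)][s_j_next] += 1.0
        self.marg[(s_j, a_j)][s_j_next] += 1.0

    def reward(self, s_i, a_i, s_j, a_j, s_j_next):
        joint = self.joint[(s_i, a_i, s_j, a_j)]
        marg = self.marg[(s_j, a_j)]
        p_joint = (joint[s_j_next] + self.eps) / (sum(joint.values()) + self.eps)
        p_marg = (marg[s_j_next] + self.eps) / (sum(marg.values()) + self.eps)
        return float(np.log(p_joint) - np.log(p_marg))
```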

EDTI

Top row: EDTI reward. Bottom row: visitation heat map during the last 100 PPO updates.

EDTI-intrinsic

Similar to the EDTI formulation, but external rewards are excluded from the value function.
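A minimal sketch of this distinction, under the assumption that plain discounted returns are used as value targets (the function names are also assumptions): EDTI's targets are computed from extrinsic plus intrinsic rewards, while EDTI-intrinsic drops the extrinsic term.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Plain discounted returns; any other return/advantage estimator would work the same way."""
    returns, g = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

def value_targets(r_ext, r_int, intrinsic_only=False, gamma=0.99):
    # EDTI: extrinsic + intrinsic; EDTI-intrinsic: intrinsic only.
    r = np.asarray(r_int) if intrinsic_only else np.asarray(r_ext) + np.asarray(r_int)
    return discounted_returns(r, gamma)
```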

The EDTI-intrinsic reward diminishes during the learning process.

r_influence

Top row: r_influence reward. Bottom row: visitation heat map during the last 100 PPO updates.

shared_critic

This baseline uses decentralized PPO with a shared centralized critic, augmented with centralized curiosity. The heat maps above show that the agents search all possible configurations without bias, even after the (switch1-door) configuration has been discovered.

Sharing a critic may solve the task given enough time, but searching the state-action space uniformly does little to accelerate exploration.
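For concreteness, here is a sketch of what centralized curiosity could look like as a count-based bonus on the joint state. The bonus form 1/sqrt(N) and the scale beta are assumptions; the point is only that novelty is measured on the joint state, so both agents receive the same exploration signal on top of the team reward.

```python
from collections import defaultdict
import numpy as np

class CentralizedCuriosity:
    """Count-based curiosity over the *joint* state, shared by all agents."""

    def __init__(self, beta=0.1):
        self.counts = defaultdict(int)
        self.beta = beta

    def bonus(self, joint_state):
        key = tuple(joint_state)
        self.counts[key] += 1
        return self.beta / np.sqrt(self.counts[key])

# Each agent's PPO reward would then be r_team + curiosity.bonus(joint_state),
# with a single shared critic trained on the same augmented reward.
```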

Q-Q

Uses a method similar to EDTI, but without the explicit counterfactual formulation.


random

Random exploration cannot even find switch 1, which is about 30 steps away from the starting point.

cen

Decentralized PPO learners guided by centralized curiosity. The agents' behavior is analogous to that in shared_critic, but the learning process is slower.

dec

Decentralized PPO encouraged by decentralized curiosity. Without coordination, no agent has an incentive to occupy the switch, because doing so does not trigger consistent intrinsic rewards for itself. As a result, the agents have little chance of reaching the right room.
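The contrast with the centralized bonus is easy to see in a count-based sketch (again, the bonus form and scale are assumptions): because each agent's counts ignore the other agent, an agent parked on the switch keeps revisiting the same local states, so its own bonus decays toward zero and nothing rewards it for holding the door open.

```python
from collections import defaultdict
import numpy as np

class DecentralizedCuriosity:
    """Per-agent count-based curiosity over each agent's *own* state only."""

    def __init__(self, n_agents, beta=0.1):
        self.counts = [defaultdict(int) for _ in range(n_agents)]
        self.beta = beta

    def bonus(self, agent_id, own_state):
        key = tuple(own_state)
        self.counts[agent_id][key] += 1
        # Decays with how often *this agent* has seen its own state,
        # regardless of what the other agent is doing.
        return self.beta / np.sqrt(self.counts[agent_id][key])
```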

cen_control

This baseline is similar to shared_critic but with a centralized PPO learner augmented by centralized curiosity. The centralized learner takes the joint observation as input and outputs the joint policy.
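A sketch of what such a centralized controller might look like, assuming per-agent categorical action heads on top of a shared trunk (layer sizes and the head structure are assumptions): a single network maps the concatenated observations to one action distribution per agent.

```python
import torch
import torch.nn as nn

class JointPolicy(nn.Module):
    """Centralized controller sketch: joint observation in, one action distribution per agent out."""

    def __init__(self, obs_dim, n_agents, n_actions, hidden=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim * n_agents, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, n_actions) for _ in range(n_agents)]
        )

    def forward(self, joint_obs):
        h = self.body(joint_obs)
        # Every agent's action is conditioned on the full joint observation.
        return [torch.distributions.Categorical(logits=head(h)) for head in self.heads]
```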

As expected, the agents search the whole space uniformly. Although this baseline eventually learns the solution, uniform search hampers exploration efficiency.