Generating Diverse Cooperative Agents by Learning Incompatible Policies
ICLR 2023 (notable-top-25%)

Rujikorn Charakorn
VISTEC, Thailand

Poramate Manoonpong
VISTEC, Thailand
SDU, Denmark

Nat Dilokthanakul
KMITL, Thailand

Summary

In this work, we propose to learn diverse behaviors via policy compatibility. Conceptually, policy compatibility measures whether policies of interest can coordinate effectively. We show theoretically that incompatible policies are not similar. Thus, policy compatibility, which has so far been used exclusively as a measure of robustness, can serve as a proxy for learning diverse behaviors. We incorporate the proposed objective into a population-based training scheme that allows multiple agents to be trained concurrently. Additionally, we use state-action information to induce local variations of each policy. Empirically, the proposed method, LIPO (Learning Incompatible Policies), consistently discovers more solutions than baseline methods across various multi-goal cooperative environments. In multi-recipe Overcooked, we show that our method produces populations of behaviorally diverse agents, which makes generalist agents trained with such a population more robust. Finally, in high-dimensional, complex SMAC environments, LIPO learns diverse winning strategies.
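To make the notion of policy compatibility concrete, the following is a minimal sketch on a toy problem, not the implementation used in the paper. It assumes a hypothetical one-step, two-player cooperative game with two goals, in which both players are rewarded only when they pick the same goal. The names `PAYOFF`, `self_play`, `cross_play`, and `illustrative_objective`, as well as the penalty form and its `weight`, are assumptions introduced here for illustration. The sketch shows that two joint policies that each solve the game through a different goal obtain a low return when their partners are swapped, which is the sense in which they are incompatible.

```python
from itertools import product

# Hypothetical one-step cooperative game (an assumption for illustration):
# both players receive reward 1 when they pick the same goal, 0 otherwise.
PAYOFF = [[1.0, 0.0],
          [0.0, 1.0]]


def expected_return(p1, p2):
    """Expected shared reward when player 1 follows p1 and player 2 follows p2."""
    return sum(p1[a] * p2[b] * PAYOFF[a][b]
               for a, b in product(range(2), range(2)))


def self_play(joint):
    """Return of a joint policy (p1, p2) when both halves play together."""
    p1, p2 = joint
    return expected_return(p1, p2)


def cross_play(joint_a, joint_b):
    """Average return when the two joint policies swap partners."""
    return 0.5 * (expected_return(joint_a[0], joint_b[1])
                  + expected_return(joint_b[0], joint_a[1]))


def illustrative_objective(joint, population, weight=0.5):
    """Self-play return minus a penalty on the highest cross-play return.

    The penalty form and the weight are assumptions, not the paper's objective.
    """
    others = [other for other in population if other is not joint]
    penalty = max((cross_play(joint, other) for other in others), default=0.0)
    return self_play(joint) - weight * penalty


goal_0 = ([1.0, 0.0], [1.0, 0.0])       # both players always pick goal 0
goal_1 = ([0.0, 1.0], [0.0, 1.0])       # both players always pick goal 1
goal_0_copy = ([1.0, 0.0], [1.0, 0.0])  # behaves identically to goal_0

print(cross_play(goal_0, goal_1))       # incompatible pair
print(cross_play(goal_0, goal_0_copy))  # compatible pair
print(illustrative_objective(goal_0, [goal_0, goal_1]))
print(illustrative_objective(goal_0, [goal_0, goal_0_copy]))
```

In this toy setting, a population member is penalized when another member can coordinate with it, so an objective of this kind favors a population whose members reach different goals rather than duplicates of the same solution.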

Visualization of learned joint policies in Point Mass Rendezvous (PMR)

Behaviors of 4 agents produced by a single run of LIPO in PMR-C.

Joint policy 1

Joint policy 2

Joint policy 3

Joint policy 4

Behaviors of 4 agents produced by a single run of LIPO in PMR-L.

Joint policy 1

Joint policy 2

Joint policy 3

Joint policy 4

Visualization of learned joint policies in multi-recipe Overcooked

An overview of multi-recipe Overcooked: a sample initial state (left) and possible cooking recipes (right).

Behaviors of 8 agents produced by a single run of LIPO.

Joint policy preference: Tomato & Carrot Salad

Joint policy preference: Single-ingredient recipes (Lettuce, Onion, Tomato, Carrot)

Joint policy preference: Chopped Lettuce

Joint policy preference: Chopped Lettuce or Chopped Tomato

Joint policy preference: Chopped Onion

Joint policy preference: Tomato & Lettuce Salad

Joint policy preference: Chopped Lettuce

Joint policy preference: Tomato & Carrot Salad

Visualization of learned joint policies in SMAC

See the results in 2-player SMAC (2m-vs-1z) here

See the results in 4-player SMAC (4s-vs-4z) here
