PECAN: Leveraging Policy Ensemble for Context-Aware Zero-Shot Human-AI Coordination

Xingzhou Lou, Jiaxian Guo, Junge Zhang, Jun Wang, Kaiqi Huang, Yali Du

CASIA, The University of Sydney, UCL, KCL

[Paper]  [Human-AI Experiments Code]  [PECAN Code]

Abstract

Zero-shot human-AI coordination holds the promise of collaborating with humans without human data. Prevailing methods train the ego agent with a population of partners via self-play. However, these methods suffer from two problems: 1) the diversity of a population with finitely many partners is limited, which in turn limits the trained ego agent's capacity to collaborate with a novel human; 2) current methods only provide a common best response for every partner in the population, which may result in poor zero-shot coordination performance with a novel partner or human. To address these issues, we first propose a policy ensemble method to increase the diversity of partners in the population, and then develop a context-aware method that enables the ego agent to analyze and identify the partner's potential policy primitives so that it can take different actions accordingly. In this way, the ego agent learns more universal cooperative behaviors for collaborating with diverse partners. We conduct experiments on the Overcooked environment and evaluate the zero-shot human-AI coordination performance of our method with both behavior-cloned human proxies and real humans. The results demonstrate that our method significantly increases the diversity of partners and enables ego agents to learn more diverse behaviors than baselines, achieving state-of-the-art performance in all scenarios.

PECAN

(a) Self-play training (SP): the ego agent is trained with a copy of itself. (b) Population-based training (PBT): the ego agent is trained with a population of partners; a partner is sampled at each iteration to cooperate with the ego agent. (c) The proposed PECAN method: a different policy-ensemble partner is generated at each iteration. The ego agent collects trajectories while collaborating with the partner and recognizes the partner's level-based context at the beginning of each episode with a pretrained context-aware module.
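To make the two components concrete, below is a minimal, illustrative Python sketch of (i) building one policy-ensemble partner by randomly mixing pretrained policy primitives and (ii) a small context encoder that predicts the partner's level-based context from trajectory features. All names and interfaces here (e.g. `action_probs`, `traj_dim`, `n_levels`) are hypothetical assumptions for illustration, not the actual PECAN implementation.

```python
import numpy as np
import torch


def sample_ensemble_partner(policy_primitives, rng):
    """Build one policy-ensemble partner by mixing pretrained primitives.

    Each primitive is assumed to expose a hypothetical `action_probs(obs)`
    method returning a probability vector over actions.
    """
    weights = rng.dirichlet(np.ones(len(policy_primitives)))  # random mixture

    def partner_policy(obs):
        # Weighted mixture of the primitives' action distributions.
        probs = sum(w * p.action_probs(obs)
                    for w, p in zip(weights, policy_primitives))
        return int(np.argmax(probs))  # act greedily w.r.t. the mixture

    return partner_policy, weights


class ContextEncoder(torch.nn.Module):
    """Predicts the partner's level-based context from trajectory features."""

    def __init__(self, traj_dim, n_levels):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(traj_dim, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, n_levels),
        )

    def forward(self, traj_features):
        # Probability of each level-based context given the observed trajectory.
        return torch.softmax(self.net(traj_features), dim=-1)
```

At test time, the predicted context would be fed to the ego agent's policy as an extra input so it can condition its actions on the inferred partner type.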

Experiments

Here we present experimental results on coordination with human proxy models and with real human players.

Task

Five layouts (Cramped Room, Asymmetric Advantages, Coordination Ring, Forced Coordination, and Counter Circuit) in Overcooked are adopted to evaluate the ego agent's ability to coordinate with novel partners.

Procedure

1) With human proxy models

We pair the agents with a human proxy model, a behavior-cloning agent that mimics human behavior, to test their coordination performance. The effect of each proposed component of PECAN is studied in an ablation study. In addition, we design further experiments to support our claims that (a) policy ensemble improves partners' diversity and (b) the PECAN agent learns a context-aware policy.
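As a rough illustration of this pairing procedure, here is a minimal evaluation-loop sketch. The two-player environment interface (`env.reset`/`env.step` returning both players' observations and a shared reward) and the agents' `act` method are assumptions for illustration, not the actual Overcooked or PECAN API.

```python
def evaluate_with_proxy(ego_agent, human_proxy, env, n_episodes=20):
    """Average episode return of the ego agent paired with a behavior-cloned
    human proxy in a two-player environment (hypothetical interfaces)."""
    returns = []
    for _ in range(n_episodes):
        (obs_ego, obs_proxy), done, total = env.reset(), False, 0.0
        while not done:
            joint_action = (ego_agent.act(obs_ego), human_proxy.act(obs_proxy))
            (obs_ego, obs_proxy), reward, done = env.step(joint_action)
            total += reward  # the episode reward is shared by both players
        returns.append(total)
    return sum(returns) / len(returns)
```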

2) With real humans

We recruit human players to evaluate the human-AI coordination ability of PECAN. The human players are asked to give subjective ratings of the agents. Finally, two case studies are conducted to demonstrate the adaptiveness of PECAN in human-AI coordination; the case studies are given in the demo videos.


Results with Human Proxy Models

Main results on five layouts

Overall results with a human proxy model. PECAN outperforms the baselines on all five layouts. Especially in Asymm. Adv., the best score of the PECAN agents exceeds the baselines by a very large margin (+26.8%). In layouts that require less coordination, such as Cramped Rm., PECAN's performance advantage over the baselines is relatively marginal, which indicates that PECAN effectively improves the ego agent's ability to coordinate with its partner rather than its ability to accomplish the task by itself.

Results of ablation study

Results of the ablation study. Performance drops when each component is ablated from PECAN (e denotes the policy ensemble and c the context encoder), and in some layouts the performance is even worse than the baseline method.




Effects of each component in PECAN. (a) The t-SNE results of the partners' policies in each episode; the results show that policy ensemble improves partners' diversity. (b) c = RND means feeding a random context to the ego agent, and c = SP means manually assigning a context indicating that the partner is from the self-play group G4. The results show that only with the correct context can the ego agent perform well, which means the ego agent's policy is context-aware.

Comparison between policy ensembles with and without level-based grouping. The ensembles with level-based grouping show larger diversity (upper row), and their level of coordination skill is controllable based on the levels of their policy primitives (lower row). Colors in the lower row indicate which group the partner's policy primitives belong to.
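A minimal sketch of what level-based grouping could look like, reusing the hypothetical `sample_ensemble_partner` helper from the sketch above; the scoring of primitives (e.g. by self-play return) and all other names are assumptions for illustration.

```python
import numpy as np


def group_primitives_by_level(primitives, scores, n_groups=4):
    """Partition policy primitives into level groups by a precomputed score
    (e.g. self-play return), from lowest to highest."""
    order = sorted(range(len(primitives)), key=lambda i: scores[i])
    size = len(order) // n_groups
    return [[primitives[i] for i in order[g * size:(g + 1) * size]]
            for g in range(n_groups)]


def sample_level_grouped_partner(groups, rng):
    """Pick one level group, then build the ensemble only from primitives of
    that group, so the partner's coordination level follows the chosen group."""
    level = int(rng.integers(len(groups)))
    partner, _ = sample_ensemble_partner(groups[level], rng)
    return partner, level
```

Mixing only within a group is what makes the partner's skill level controllable, whereas mixing across groups would blur the levels together.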

Results with Real Humans

(a) The coordination performance of each method with real humans. PECAN outperforms the baselines on human-AI coordination. 

(b) Human players' subjective ratings of adaptiveness and their preference between the PECAN and MEP agents. Human players give higher ratings to PECAN on both adaptiveness and personal preference.

Demo Videos  

We present demo videos from our human-AI coordination studies to show the adaptiveness of PECAN. Note that in this section, the human player controls the blue agent and the AI agent controls the green agent.

Demo 1: Cramped Room

MEP

PECAN

We (blue chef) intentionally block the agent's (green chef) way to the onions to see how the agents react. The MEP agent gets stuck and stands still until we move aside, while our PECAN agent adjusts immediately and turns around to pick up onions from the other side. This shows that the PECAN agent is more adaptive and capable of adjusting its policy according to the human player's behavior.

Demo 2: Asymmetric Advantages

MEP

PECAN

The trick of this layout is for the agent (green chef) to pick up dishes from both pots and serve them, because it is much closer to the serving counter than the human player (blue chef). The MEP agent has a very strong preference for collecting dishes from Pot 1 and overlooks dishes in Pot 2, while the PECAN agent shows no such preference and serves many more dishes than the MEP agent. This suggests that the PECAN agent's policy is more diverse and adaptive.