Learning Zero-Shot Cooperation with Humans, Assuming Humans Are Biased

Chao Yu*, Jiaxuan Gao*, Weilin Liu, Botian Xu, Hao Tang, Jiaqi Yang, Yu Wang, Yi Wu

*Equal contribution

1. Introduction

There is a recent trend of applying multi-agent reinforcement learning (MARL) to train an agent that can cooperate with humans in a zero-shot fashion without using any human data. The typical workflow is to first repeatedly run self-play (SP) to build a policy pool and then train the final adaptive policy against this pool. A crucial limitation of this framework is that every policy in the pool is optimized w.r.t. the environment reward function, which implicitly assumes that the adaptive policy's testing partners also truthfully optimize the same reward function.

However, humans can be fundamentally biased by their own preferences, which may differ substantially from the environment reward. We propose a more general framework, Hidden-Utility Self-Play (HSP), which explicitly models human biases as hidden reward functions in the self-play objective. By approximating the space of hidden rewards with linear functions, HSP adopts a simple yet effective technique to generate an augmented policy pool with biased policies. We evaluate HSP on the Overcooked benchmark. Empirical results show that HSP produces higher rewards than baselines when cooperating with learned human models, manually scripted policies, and real humans. The HSP policy is also rated the most assistive policy based on human feedback.
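To make the hidden-reward modeling concrete, here is a minimal Python sketch, assuming the biased rewards are linear functions over hand-crafted event features; the feature names, the weight range, and the `self_play_train` helper are illustrative placeholders rather than the exact choices used by HSP.

```python
import numpy as np

# Illustrative event features; the actual feature set may differ.
EVENT_FEATURES = ["soup_delivered", "onion_in_pot", "tomato_in_pot", "item_placed_on_counter"]

def sample_hidden_weights(rng, scale=5.0):
    """Sample one hidden (biased) reward: a random weight per event feature."""
    return {name: rng.uniform(-scale, scale) for name in EVENT_FEATURES}

def biased_reward(env_reward, event_counts, weights):
    """Biased reward = environment reward + a linear bonus over event counts."""
    bonus = sum(weights[name] * event_counts.get(name, 0) for name in EVENT_FEATURES)
    return env_reward + bonus

# Schematic pool construction: each sampled weight vector defines one biased
# partner obtained by self-play under its hidden reward, and the adaptive
# policy is then trained against this augmented pool. `self_play_train` is a
# hypothetical training routine, not a real API.
# rng = np.random.default_rng(0)
# pool = [self_play_train(lambda r, c, w=sample_hidden_weights(rng): biased_reward(r, c, w))
#         for _ in range(16)]
```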

2. Overcooked

The Overcooked environment is based on the popular video game Overcooked, where multiple players cooperate to finish as many orders as possible within a time limit. Fig. 1 shows the five layouts we consider; the first three are onion-only layouts. In this simplified version of the original game, two chefs, each controlled by a player (either human or AI), work in grid-like layouts. Chefs can move between non-table tiles and interact with table tiles by picking up or placing objects. Ingredients (e.g., onions and tomatoes) and empty dishes can be picked up from the corresponding dispenser tiles and placed on empty table tiles or into the pots. The typical pipeline for completing an order is to place the required ingredients into a pot, wait for the soup to cook, pick up the cooked soup with a dish, and deliver it to a serving counter.
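As a schematic of this two-player interaction, below is a minimal episode loop under a hypothetical `OvercookedEnv`-style wrapper; the class, method names, and the reward comment are placeholders, not the benchmark's actual API.

```python
# Hypothetical two-player interaction loop; names are placeholders.
# The order pipeline (fetch ingredients -> fill the pot -> cook -> plate -> deliver)
# emerges from sequences of low-level moves and "interact" actions.
def play_episode(env, ai_policy, partner_policy, horizon=400):
    obs_ai, obs_partner = env.reset()
    total_reward = 0
    for _ in range(horizon):
        joint_action = (ai_policy.act(obs_ai), partner_policy.act(obs_partner))
        (obs_ai, obs_partner), reward, done = env.step(joint_action)
        total_reward += reward  # shared team reward, e.g., a bonus per delivered soup
        if done:
            break
    return total_reward
```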

Figure 1: Layouts in Overcooked. From left to right: Asymmetric Advantages, Coordination Ring, Counter Circuit, Distant Tomato, and Many Orders, with orders shown below.

3. Experiment Results

3.1 Cooperation with Learned Human Models

For evaluation with learned human models, we adopted the models provided by Carroll et al., which only support the onion-only layouts, i.e., Asymmetric Advantages, Coordination Ring, and Counter Circuit. Results are shown in Tab. 1. HSP outperforms other methods in Asymm. Adv. and is comparable with the best baseline in the rest. We emphasize that the improvement is marginal because the learned human models have limited representation power to imitate natural human behaviors, which typically cover many behavior modalities.

Table 1: Average reward and standard deviation with learned human models.

3.2 Ablation Studies

We investigate the impact of our design choices, including the construction of the final policy pool and the batch size for training the adaptive policy. The results are shown in Fig. 2 and Fig. 3.

Figure 2: Performance of different pool construction strategies.

Figure 3: Average game reward using different numbers of parallel rollout threads in MAPPO. More parallel threads indicate a larger batch size.
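For reference, a minimal sketch of the quantity behind this ablation, assuming a standard MAPPO-style on-policy setup where every parallel rollout thread contributes one episode of per-agent transitions to each update (numbers and variable names are illustrative):

```python
def training_batch_size(n_rollout_threads: int, episode_length: int, n_agents: int = 2) -> int:
    """On-policy batch size: one episode per thread, counted per agent."""
    return n_rollout_threads * episode_length * n_agents

# Example: 100 parallel threads with 400-step episodes and 2 chefs
# give 80,000 transitions per update.
print(training_batch_size(100, 400))  # 80000
```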

3.3 Cooperation with Scripted Policies

We empirically observe that human models learned by imitating entire human trajectories cannot capture a wide range of behavior modalities well. We therefore manually designed a set of scripted policies that encode particular human preferences, such as Dish Everywhere, Onion Everywhere, Onion to Middle Counter, and Tomato Placement (see Sec. 4.1).
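For illustration, below is a minimal sketch of one such scripted preference (an "Onion Everywhere"-style behavior that keeps moving onions onto counters), assuming a hypothetical high-level observation interface; the actual scripted policies operate on low-level Overcooked actions.

```python
class OnionEverywherePolicy:
    """Scripted preference sketch: repeatedly fetch onions and leave them on counters."""

    def act(self, obs):
        # `obs` is a hypothetical high-level view of the chef's state.
        if not obs.holding_onion:
            return obs.step_towards_onion_dispenser()  # go pick up an onion
        return obs.step_towards_empty_counter()        # drop it on the nearest empty counter
```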

Tab. 2 shows the average game reward of all methods when paired with scripted policies; HSP significantly outperforms all baselines. In particular, in Distant Tomato, when cooperating with a policy that strongly prefers tomatoes (Tomato Placement), HSP achieves a 10× higher score than all baselines, suggesting that tomato-preferring behavior is well captured by HSP.

Table 2: Average reward and standard deviation with scripted policies.

3.4 Cooperation with Human Participants

We invited 60 volunteers (28.6% female, 71.4% male; median age in the 18–30 range) and divided them into 5 groups, one per layout. The experiment has two stages. The first, warm-up stage allows participants to play the game freely to explore possible AI behaviors. In the second, exploitation stage, participants are instructed to achieve as high a score as possible.

Figure 4: Human preference for the row partner over the column partner.



The warm-up stage is designed to test the performance of AI policies in the face of diverse human preferences. We visualize the human feedback for different AI policies in Fig. 4, where the HSP policy is consistently preferred by humans by a clear margin. Since humans can freely explore any possible behavior, the results in Fig. 4 imply a strong generalization capability of HSP.

The exploitation stage is designed to test the scoring capability of different AIs. Note that a human player may simply adapt to the AI strategy when instructed to achieve high scores. So, in addition to final rewards, we also examine the emergent human-AI behaviors to measure the level of human-AI cooperation.

The experiment layouts can be classified into two categories according to whether the layout allows diverse behavior modes. 

Figure 5: (a) Average episode reward of different methods in onion-only layouts when paired with humans in the exploitation stage. (b) The onion-passing frequency in the Counter Circuit layout.

Table 3: Average onion-preferred episode reward and frequency of different emergent behaviors in Distant Tomato during the exploitation stage. 

Table 4: Average episode reward and average number of picked-up soups from the middle pot by different AI players in Many Orders during the exploitation stage.

4. Visualization

4.1 Cooperation with Scripted Policies

Due to the space limit, we only show some representative results here; more visualizations can be found at the link. The files are named "{blue}-{green}-{reward}.gif" to indicate the roles (the two chefs in blue/green) and the final reward.

Coordination Ring

Dish Everywhere: FCP 100, MEP 120, TrajDiv 120, HSP 140 (Blue: AI, Green: Script)

Counter Circuit

Onion Everywhere: FCP 80, MEP 120, TrajDiv 100, HSP 120 (Blue: AI, Green: Script)

Onion to Middle Counter: FCP 80, MEP 100, TrajDiv 120, HSP 140 (Blue: AI, Green: Script)

Distant Tomato

Tomato Placement: FCP 0, MEP 0, TrajDiv 20, HSP 300 (Blue: AI, Green: Script)

4.2 Cooperation with Human Participants

Coordination Ring

In Coordination Ring, the most frequently reported annoyance is players blocking each other during movement. To maneuver effectively in the ring-like layout, players must reach a temporary agreement on going either clockwise or counterclockwise. HSP is the only AI able to make way for the other player, while the others could not recover by themselves once stuck. For example, both FCP and TrajDiv players tend to take a plate and wait next to the pot immediately after one pot is filled, but they can neither take a detour when blocked on their way to the dish dispenser nor yield their position to the human player trying to pass through.

(Clips: FCP, MEP, TrajDiv, HSP; Blue: AI, Green: Human)

Counter Circuit

In Counter Circuit, one efficient strategy is passing onions via the middle counters: a player at the bottom fetches onions and places them on the counter, while the other player at the top picks up the onions and puts them into the pots. We find HSP to be the only AI player capable of this strategy in both the top and the bottom positions. Although HSP tends to move to the top after passing three onions when it starts at the bottom, this still shows that HSP agents adapt to different roles better than the baselines.

(Clips: FCP, MEP, TrajDiv, HSP; Blue: AI, Green: Human)

Distant Tomato

In Distant Tomato, a critical rule is that mixed (onion-tomato) soups give no reward, so the two players need to agree on which soup to cook. All methods perform well when the other player has no preference for tomatoes and focuses on onion soups, but all except HSP fail to deal with tomato-preferring partners. FCP, MEP, and TrajDiv agents never actively choose to cook tomatoes and may keep adding onions even when a pot already contains tomatoes, resulting in invalid orders. In contrast, HSP can recognize the other player's intention and chooses to cooperate. Almost all participants agree that the HSP agent is the best partner to play with in this layout.

(Clips: FCP, MEP, TrajDiv, HSP; Blue: AI, Green: Human)

Many Orders

In Many Orders, most participants report that HSP is able to pick up soups from all three pots, while the other AIs only concentrate on the pot in front of them and ignore the middle pot even if the human player attempts to use it.

(Clips: FCP, MEP, TrajDiv, HSP; Blue: AI, Green: Human)