An Efficient End-to-End Training Approach for Zero-Shot Human-AI Coordination

Abstract 

The goal of zero-shot human-AI coordination is to develop an agent that can collaborate with humans without relying on human data. Prevailing two-stage population-based methods require a diverse population of mutually distinct policies to simulate diverse human behaviors. The necessity of such populations severely limits their computational efficiency. To address this issue, we propose E3T, an Efficient End-to-End Training approach for zero-shot human-AI coordination. E3T employs a mixture of the ego policy and a random policy to construct the partner policy, making the partner both skilled at coordination and diverse. In this way, the ego agent is trained end-to-end with this mixture policy without the need for a pre-trained population, which significantly improves training efficiency. In addition, a partner modeling module is proposed to predict the partner's action from historical context. Given the predicted partner action, the ego agent can adapt its policy and act accordingly when collaborating with humans of different behavior patterns. Empirical results on the Overcooked environment show that our method significantly improves training efficiency while achieving performance comparable or superior to population-based baselines.
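To make the mixture partner policy concrete, below is a minimal sketch assuming a discrete action space and an `ego_policy` callable that returns an action distribution; the names, signature, and sampling details are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def sample_partner_action(partner_obs, ego_policy, num_actions, epsilon, rng=None):
    """Sample a partner action from an epsilon-mixture of a uniform random
    policy and a frozen copy of the ego policy (illustrative sketch only)."""
    rng = rng or np.random.default_rng()
    ego_probs = ego_policy(partner_obs)                      # distribution from the copied ego policy
    random_probs = np.full(num_actions, 1.0 / num_actions)   # uniform random policy adds diversity
    mixed_probs = epsilon * random_probs + (1.0 - epsilon) * ego_probs
    return rng.choice(num_actions, p=mixed_probs)            # coordination-skilled yet diverse partner
```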

The code for this project is available at https://github.com/yanxue7/E3T-Overcooked.

Figure 1. (a) Illustration of the self-play training framework, which trains the ego policy by pairing it with a copy of itself as the partner policy. (b) Illustration of our E3T. E3T follows the self-play training framework: during training, the parameters of the ego policy are learned, while the parameters of the partner policy are copied from those of the ego. The green box shows the decision process of the ego agent, which depends on both the current observation and the predicted partner action distribution. The blue box shows that of the partner agent, whose actions are sampled from a mixture of the random policy and the copied ego policy.
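As a rough sketch of the two boxes in Figure 1(b), the code below shows one way the partner modeling module could predict a partner action distribution from historical context and how the ego policy could condition on it; the architecture, dimensions, and names are hypothetical and not the released implementation.

```python
import torch
import torch.nn as nn

class PartnerModel(nn.Module):
    """Predict a distribution over the partner's next action from the recent
    history of context features (hypothetical architecture for illustration)."""
    def __init__(self, context_dim, hidden_dim, num_actions):
        super().__init__()
        self.encoder = nn.GRU(context_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_actions)

    def forward(self, context):       # context: (batch, time, context_dim)
        _, h = self.encoder(context)  # h: (num_layers, batch, hidden_dim)
        return torch.softmax(self.head(h[-1]), dim=-1)

class EgoPolicy(nn.Module):
    """Ego policy conditioned on the current observation and the predicted
    partner action distribution (green box in Figure 1(b))."""
    def __init__(self, obs_dim, num_actions, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + num_actions, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
        )

    def forward(self, obs, predicted_partner_probs):
        x = torch.cat([obs, predicted_partner_probs], dim=-1)
        return torch.softmax(self.net(x), dim=-1)
```

During self-play training, only the ego parameters are updated; the partner side reuses a copied, frozen version of them, mixed with the random policy as sketched above.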

Experiments

We evaluate E3T on zero-shot collaboration with behavior-cloned human proxies, real humans, and AI baselines. Experiments are conducted on the Overcooked environment, a two-player cooperative game. When collaborating with AI baselines or real humans, E3T achieves superior performance to existing baselines. In addition, E3T significantly improves training efficiency compared to population-based methods.

Experiments on Overcooked

Illustration of training time and zero-shot coordination performance of baselines. The right y-axis shows the number of hours required to train one model. The left y-axis shows the average reward obtained when collaborating with AI baselines across 5 layouts. E3T achieves superior average reward while significantly improving training efficiency compared to state-of-the-art population-based methods.

Results on coordinating with human proxy models. We plot the mean and standard error of coordination rewards over 5 random seeds. E3T outperforms all other baselines in terms of average reward across the 5 layouts. The mixture parameter epsilon is set to 0.5 for all layouts except Forced Coord., where it is set to 0.0; in this layout the ego agent needs little exploration because its range of activity is very narrow.
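For reference, the per-layout mixture coefficient described in this caption can be written as a simple configuration mapping (a sketch; the keys simply mirror the layout names used on this page).

```python
# Mixture coefficient epsilon for the random/ego partner policy, per layout.
# Forced Coordination uses 0.0 because the ego agent needs little exploration
# in its very narrow range of activity.
EPSILON_PER_LAYOUT = {
    "Cramped Room": 0.5,
    "Coordination Ring": 0.5,
    "Forced Coordination": 0.0,
    "Asymmetric Advantages": 0.5,
    "Counter Circuit": 0.5,
}
```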

Results on collaborating with real humans. We plot the mean and standard deviation of the rewards obtained when real humans collaborate with each baseline across all layouts. These results show that E3T outperforms all other baselines when collaborating with real humans.

Results on coordinating with other baselines. These results show the normalized cross-play rewards between baselines. Each entry corresponds to the normalized reward averaged over 5 layouts. On average, E3T achieves superior performance when zero-shot collaborating with AI baselines.

Demo of E3T collaborating with real humans

We compare E3T with MEP, a state-of-the-art population-based method, on zero-shot collaboration with real humans. The green chef is controlled by the trained AI models, and the blue chef is controlled by a real human.

Analysis of the ego agents' adaptive behavior

MEP vs. E3T on Cramped Room

When the blue chef controlled by the real human stands in the upper-right corner, the green chef controlled by the MEP model cannot serve soups, while the green chef controlled by the E3T model has learned to turn past the blue chef and then serve soups. This difference arises because E3T takes into account the human chef's historical behavior and can adjust its own behavior accordingly.

MEP vs. E3T on Forced Coordination

When the blue chef (controlled by the real human) places two plates on empty worktops, the MEP model (controlling the green chef on the left) does not know how to respond to this situation, while the E3T model (controlling the green chef on the right) continues to hand plates to the blue chef. We therefore consider the E3T model to be more adaptable than the MEP model.

More demo videos of E3T on 5 layouts

Cramped Room

Coordination Ring

Forced Coordination

Asymmetric Advantages

Counter Circuit