TLDR: We use generative models in an online adversarial training loop to train the learning agent against difficult coordination scenarios while maintaining realistic behavior.
Being able to cooperate with new people is an important component of many economically valuable AI tasks, from household robotics to autonomous driving. However, generalizing to novel humans requires training on data that captures the diversity of human behaviors. Adversarial training is a promising method that allows dynamic data generation and ensures that agents are robust: it creates a feedback loop in which the agent's performance influences the generation of new adversarial data, which can immediately be used to train the agent. However, adversarial training is difficult to apply to a cooperative task; how can we train an adversarial cooperator? We propose a novel strategy that combines a pre-trained generative model, which simulates valid cooperative agent policies, with adversarial training that maximizes regret. We call our method GOAT: Generative Online Adversarial Training. In this framework, GOAT dynamically searches the latent space of the generative model for coordination strategies where the learning policy---the Cooperator agent---underperforms. GOAT enables better generalization by exposing the Cooperator to a range of challenging interaction scenarios. We maintain realistic coordination strategies by keeping the generative model frozen, thus avoiding adversarial exploitation. We evaluate GOAT with real human partners, and the results demonstrate state-of-the-art performance on the Overcooked benchmark, highlighting its effectiveness in generalizing to diverse human behaviors.
Interact with our agent (GOAT) in the embedded live demo below. Try out our baselines or methods from prior work by selecting an option from the dropdown menu for Player 2. Use the following keyboard controls to play: the arrow keys move your agent, and the space bar lets you interact by picking up or setting down ingredients.
How does GOAT enable robust training?
The Cooperator is adversarially trained by sampling challenging scenarios from the latent space of a generative model. Standard normal sampling (red) is restricted to one region, whereas GOAT moves throughout the continuous latent space.
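To make the loop concrete, here is a minimal Python sketch of one GOAT iteration. It assumes hypothetical helpers that are not part of the released code: decode_partner (the frozen generative model, mapping a latent vector to a partner policy), evaluate (mean episode return of a pair), and cooperator.train_against (the RL update); the latent search is shown as a simple candidate search rather than the paper's learned adversary.

```python
import numpy as np

def goat_iteration(cooperator, decode_partner, evaluate, latent_dim,
                   n_candidates=64, rng=None):
    """One GOAT iteration (simplified sketch, not the released implementation).

    decode_partner(z): frozen generative model mapping a latent vector z to a
        cooperative partner policy.
    evaluate(a, b): mean episode return when policies a and b play together.
    """
    rng = rng or np.random.default_rng()

    # Search the continuous latent space for a partner the current Cooperator
    # struggles with. A plain candidate search stands in here for the learned
    # adversary; the actual objective (minimax vs. regret) is discussed below.
    candidates = rng.normal(size=(n_candidates, latent_dim))
    scores = [evaluate(cooperator, decode_partner(z)) for z in candidates]
    z_adv = candidates[int(np.argmin(scores))]

    # The generative model stays frozen, so the decoded partner remains a
    # realistic coordination strategy; only the Cooperator is updated.
    hard_partner = decode_partner(z_adv)
    cooperator.train_against(hard_partner)  # placeholder for the RL update
    return z_adv
```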
Training progression:
0-500: Starts with only one pot
500-1000: Starts using both pots
1000-1500: Starts delivering soup
1500-1750: Tomato for both pots
Previous Overcooked layouts like Counter Circuit are saturated, such that several methods provide statistically equivalent performance. We therefore test on a much harder layout, Multi Strategy Counter, where GOAT is 38% better than prior work when evaluated in real time with novel human users.
Adversary
Minimax Adversary: In the adversarial training loop involving the generative model, the adversary can be formulated as pure minimax, where it attempts to minimize the cooperation score. However, our experiments show that this leads the adversary to exploit a single mode of the latent space (see the plot above), corresponding to partners that simply make the task harder to cooperate on (see the GIF below, where the agent does not participate in the task).
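As a sketch, using the same hypothetical decode_partner and evaluate helpers as above, the pure minimax objective scores a latent vector by the cooperation return alone, and the adversary picks the latent that minimizes it:

```python
def minimax_objective(z, cooperator, decode_partner, evaluate):
    """Pure minimax adversary (sketch): score a latent z by how poorly the
    Cooperator performs with the decoded partner. Minimizing this tends to
    collapse onto one low-scoring mode, e.g. a partner that never helps."""
    partner = decode_partner(z)            # frozen generative model
    return evaluate(cooperator, partner)   # adversary picks z minimizing this
```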
Regret Adversary: Instead, we use a regret-based adversary in the adversarial training loop to choose cooperation partners that have a high self-play score (and are thus meaningful cooperative policies) but still challenge the Cooperator agent. As the Cooperator learns, the regret objective incentivizes the adversary to search for different modes of the latent space (see the plot above).
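The regret objective, sketched with the same hypothetical helpers, additionally credits the partner's self-play competence, so incompetent partners stop being attractive to the adversary:

```python
def regret_objective(z, cooperator, decode_partner, evaluate):
    """Regret-based adversary (sketch): favor partners that cooperate well with
    themselves (self-play) but poorly with the current Cooperator (cross-play).
    The adversary picks z maximizing this regret."""
    partner = decode_partner(z)
    self_play = evaluate(partner, partner)      # competence of the partner itself
    cross_play = evaluate(cooperator, partner)  # how well the Cooperator adapts
    return self_play - cross_play               # high regret: valid but challenging
```

As the Cooperator improves against one mode, its cross-play score there rises and the regret shrinks, pushing the adversary toward other modes of the latent space.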
Minimax Adversary
Regret Adversary
@article{chaudhary2025improving,
title={Improving Human-AI Coordination through Adversarial Training and Generative Models},
author={Chaudhary, Paresh and Liang, Yancheng and Chen, Daphne and Du, Simon S and Jaques, Natasha},
journal={arXiv preprint arXiv:2504.15457},
year={2025}
}