Language Instructed RL for Human-AI Coordination

Abstract

One of the fundamental quests of AI is to produce agents that coordinate well with humans. This problem is challenging, especially in domains that lack high-quality human behavioral data, because multi-agent reinforcement learning (RL) often converges to equilibria that differ from the ones humans prefer. We propose a novel framework, instructRL, that enables humans to specify what kind of strategies they expect from their AI partners through natural language instructions. We use pretrained large language models to generate a prior policy conditioned on the human instruction and use the prior to regularize the RL objective. This leads to the RL agent converging to equilibria that are aligned with human preferences. We show that instructRL converges to human-like policies that satisfy the given instructions in a proof-of-concept environment as well as in the challenging Hanabi benchmark. Finally, we show that knowing the language instruction significantly boosts human-AI coordination performance in human evaluations in Hanabi.
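To make the core idea concrete, here is a minimal sketch of a prior-regularized Q-learning update in the spirit of instructQ. It assumes a tabular setting with a Q-table `Q[s, a]` and an instruction-conditioned prior `prior[s, a]`; the regularization weight `lam` and the exact form of the log-prior bonus are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def regularized_q_update(Q, prior, s, a, r, s_next,
                         lam=0.5, gamma=0.99, alpha=0.1):
    """One Q-learning update whose bootstrap target adds a log-prior bonus.

    Q:     array of shape (n_states, n_actions), learned action values
    prior: array of shape (n_states, n_actions), LLM prior policy probabilities
    """
    # Value of the next state under the regularized objective:
    # max_a' [ Q(s', a') + lam * log prior(s', a') ]
    next_vals = Q[s_next] + lam * np.log(prior[s_next] + 1e-8)
    target = r + gamma * np.max(next_vals)
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

def act(Q, prior, s, lam=0.5):
    """Greedy action with respect to the prior-regularized value."""
    return int(np.argmax(Q[s] + lam * np.log(prior[s] + 1e-8)))
```

Acting and bootstrapping with the log-prior bonus biases the learned equilibrium toward strategies that are consistent with the instruction, while the reward term still drives the agent toward high-scoring play.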

Demo of InstructQ(Rank)

This bot was trained with the following instruction:

If my partner tells me the ‘rank’ of some of my cards, I should ‘play’ those specific cards. If my partner does something else, e.g. discards their card or tells me the ‘color’ of my cards, then I may ‘hint rank’ to my partner.

Demo of InstructQ(Color)

This bot was trained with the following instruction:

If my partner tells me the ‘color’ of some of my cards, I should ‘play’ those specific cards. If my partner does something else, e.g. discards their card or tells me the ‘rank’ of my cards, then I may ‘hint color’ to my partner.
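Instructions like the two above are converted into a prior policy by scoring candidate actions with a pretrained language model. The sketch below is illustrative only: `llm_loglikelihood` is a hypothetical scoring function (any model that returns the log-likelihood of a completion given a prompt would work), and the prompt format is an assumption rather than the paper's exact template.

```python
import numpy as np

def llm_prior(instruction, obs_description, action_descriptions, llm_loglikelihood):
    """Softmax over LLM log-likelihoods of each candidate action description."""
    prompt = f"{instruction}\nCurrent situation: {obs_description}\nI should"
    scores = np.array([llm_loglikelihood(prompt, " " + a)
                       for a in action_descriptions])
    # Normalize into a probability distribution over the candidate actions.
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()
```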

Robustness analysis with noisy LLM priors

We add noise by randomly flipping n% of the logits in the prior policy from L to 1-L. The plot on the left shows the cross-play scores between instructRL trained with noisy prior policies and the original policies (i.e., noise = 0). For each noise level, we train policies with 2 seeds and evaluate each of them against the 3 seeds of the original policy (i.e., 6 cross-play pairs per data point in the plot). The shaded area shows the standard error.
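A minimal sketch of this noise injection and of the cross-play aggregation, assuming the prior is stored as an array of per-action values in [0, 1]; the function names and the use of NumPy are our own illustrative choices.

```python
import numpy as np

def add_flip_noise(prior, noise_frac, rng=None):
    """Flip a random noise_frac fraction of prior entries from L to 1 - L."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = np.array(prior, dtype=float, copy=True)
    flat = noisy.ravel()  # view into `noisy`, so in-place edits propagate
    n_flip = int(round(noise_frac * flat.size))
    idx = rng.choice(flat.size, size=n_flip, replace=False)
    flat[idx] = 1.0 - flat[idx]
    return noisy

def crossplay_stats(scores):
    """Mean and standard error over cross-play pairs (6 pairs per data point here)."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std(ddof=1) / np.sqrt(len(scores))
```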


From the plot we can see that the instructQ (rank) policy gradually diverges from the original policy beyond 10% noise, and the divergence accelerates after 40% noise. For instructQ (color), the policy diverges at a slower pace for noise <= 60%, and the process accelerates drastically after that point. In both cases, our method can withstand a decent amount of noise (10-15%).