(*equal contribution, ^corresponding author)
Adapting quickly to unknown peers (partners or opponents) with different strategies is a key challenge in multi-agent games. To do so, it is crucial for the agent to efficiently probe and identify the peer's strategy, as this is the prerequisite for carrying out the best response in adaptation. However, it is difficult to explore the strategies of unknown peers, especially when the games are partially observable and have a long horizon. In this paper, we propose a peer identification reward, which rewards the learning agent based on how well it can identify the behavior pattern of the peer from the historical context, such as observations over multiple episodes. This reward motivates the agent to learn a context-aware policy for effective exploration and fast adaptation, i.e., to actively seek and collect informative feedback from peers when uncertain about their policies, and to exploit the context to perform the best response when confident. We evaluate our method on diverse testbeds that involve competitive (Kuhn Poker), cooperative (PO-Overcooked), or mixed (Predator-Prey-W) games with peer agents. We demonstrate that our method induces more active exploration behavior, achieving faster adaptation and better outcomes than existing methods.
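As an illustrative sketch (not the paper's exact formulation), a peer identification reward can be instantiated with a classifier that predicts which peer from a known training pool produced the observed context, rewarding the agent with the log-probability assigned to the true peer. The linear classifier, the mean-pooling of the context, and all names below are assumptions made for illustration:

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def peer_identification_reward(W, b, context, peer_id):
    """Reward = log-probability a classifier assigns to the true peer.

    context: (T, obs_dim) history of observations (e.g., across episodes),
             mean-pooled here for simplicity.
    W, b:    (obs_dim, num_peers) and (num_peers,) classifier parameters,
             assumed to be trained jointly with the policy.
    """
    feat = context.mean(axis=0)
    probs = softmax(feat @ W + b)
    return float(np.log(probs[peer_id]))

# Toy example: 10-step context of 8-dim observations, pool of 5 peers
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 5))
b = np.zeros(5)
ctx = rng.normal(size=(10, 8))
r = peer_identification_reward(W, b, ctx, peer_id=3)
print(r)  # a non-positive log-probability
```

The intuition is that contexts produced by informative probing make the peer easy to identify, so this reward is higher when the agent has actively gathered distinguishing evidence.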
Here we show some visualizations of the PACE agent cooperating with different peer agents in the Overcooked environment.
As an example, the first peer agent here has a preference for potato onion salad. In episode 1, the PACE agent makes a lettuce broccoli salad, but the peer agent refuses to deliver it; the PACE agent then goes down to the lower room and discovers that the peer agent prefers onion. This active exploration behavior is unnecessary for making a dish but provides valuable information about the peer's preferences. Based on this observation, in episode 2, the PACE agent decides to make a dish containing an onion (tomato onion salad) but still gets rejected. The PACE agent tries again in episode 3 with the potato onion salad and finally succeeds.
Makes lettuce broccoli salad (failure);
discovers onion preference
Makes tomato onion salad (failure)
Makes potato onion salad (success)
Makes potato broccoli salad (failure);
confirms broccoli preference
Makes tomato broccoli salad (failure)
Makes lettuce broccoli salad (success)
Discovers onion preference first;
makes lettuce onion salad (failure)
Makes tomato onion salad (success)
Repeats tomato onion salad (success)
For comparison, we also show some visualizations of the Generalist agent, one of our baselines, cooperating with different peer agents in the Overcooked environment.
In the example of potato onion salad, we can see two major differences in behaviors:
The Generalist agent never goes down to the lower room, and so can only guess the peer's preferences at random. This is likely because it lacks an exploration reward.
The Generalist agent fails to repeat the successful recipe it discovered in earlier episodes. This exploitation failure is likely due to the limited capability of RNNs to summarize long-horizon contexts.
Makes lettuce carrot salad (failure)
Makes potato onion salad (success)
Makes tomato onion salad (failure)
Makes lettuce onion salad (failure)
Makes lettuce carrot salad (failure)
Makes potato onion salad (failure)
The game tree (left) shows the parameterized strategies of both players in the Kuhn Poker environment, where each node indicates a possible state during the game. Numbers at the leaf nodes are the payoffs to P1. Dominated strategies are removed. The right figure shows the parameterized policy space of P2, partitioned by P1's best responses.
Figures of game tree and policy space (P2) in Kuhn Poker are from Hoehn, Bret, Finnegan Southey, Robert C. Holte, and Valeriy Bulitko. "Effective short-term opponent exploitation in simplified poker." In AAAI, vol. 5, pp. 783-788. 2005.
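To make the best-response partition concrete, here is a self-contained sketch that evaluates Kuhn Poker by enumerating its six deals and finds P1's best pure response to a given P2 strategy. It uses full behavioral strategies rather than the reduced parameterization from the figure (dominated strategies are not removed), and the example P2 strategy is invented for illustration:

```python
import itertools

CARDS = ("J", "Q", "K")
RANK = {"J": 0, "Q": 1, "K": 2}

def ev_p1(p1_bet, p1_call, p2_call, p2_bet):
    """Expected payoff to P1 in Kuhn Poker, given behavioral strategies.

    p1_bet[c]:  P1's probability of betting card c as the first action.
    p1_call[c]: P1's probability of calling after check-bet.
    p2_call[c]: P2's probability of calling P1's bet.
    p2_bet[c]:  P2's probability of betting after P1 checks.
    """
    total = 0.0
    for c1, c2 in itertools.permutations(CARDS, 2):  # 6 equiprobable deals
        s = 1.0 if RANK[c1] > RANK[c2] else -1.0
        # P1 bets: P2 calls (showdown for 2) or folds (P1 wins the ante)
        bet_line = p2_call[c2] * 2 * s + (1 - p2_call[c2]) * 1
        # P1 checks: P2 bets (P1 calls for 2 or folds) or checks (showdown for 1)
        check_line = (p2_bet[c2] * (p1_call[c1] * 2 * s + (1 - p1_call[c1]) * -1)
                      + (1 - p2_bet[c2]) * s)
        total += (p1_bet[c1] * bet_line + (1 - p1_bet[c1]) * check_line) / 6
    return total

def best_response_p1(p2_call, p2_bet):
    """Enumerate P1's 64 pure strategies and return the best value and strategy."""
    best = None
    for bets in itertools.product((0, 1), repeat=3):
        for calls in itertools.product((0, 1), repeat=3):
            p1_bet = dict(zip(CARDS, bets))
            p1_call = dict(zip(CARDS, calls))
            v = ev_p1(p1_bet, p1_call, p2_call, p2_bet)
            if best is None or v > best[0]:
                best = (v, p1_bet, p1_call)
    return best

# Example: a passive P2 that never bluffs and calls with K, half the time with Q
p2_call = {"J": 0.0, "Q": 0.5, "K": 1.0}
p2_bet = {"J": 0.0, "Q": 0.0, "K": 1.0}
v, bet, call = best_response_p1(p2_call, p2_bet)
print(round(v, 3))  # prints 0.083 (= 1/12): bet K, check Q and J, fold to bets
```

Sweeping `p2_call` and `p2_bet` over a grid and recording which P1 strategy wins at each point reproduces a best-response partition of P2's policy space, which is what the figure visualizes for the reduced two-parameter case.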
@InProceedings{pmlr-v235-ma24n,
  title     = {Fast Peer Adaptation with Context-aware Exploration},
  author    = {Ma, Long and Wang, Yuanfei and Zhong, Fangwei and Zhu, Song-Chun and Wang, Yizhou},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {33963--33982},
  year      = {2024},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  url       = {https://proceedings.mlr.press/v235/ma24n.html},
}