LLM-Powered Hierarchical Language Agent for Real-time Human-AI Coordination
Jijia Liu*, Chao Yu*, Jiaxuan Gao*, Yuqing Xie, Qingmin Liao, Yi Wu, Yu Wang
Tsinghua University
* Equal Contribution
LLM-powered agents typically require invoking LLM APIs and employing manually designed complex prompts, which results in high inference latency. Traditional gaming AI often employs small models or reactive policies, enabling fast inference but offering limited interaction abilities.
Fig. 1 shows a case that illustrates the three challenges.
Real-time Responsiveness. Chop immediately after receiving the command.
Command Reasoning. Infer that "One more!" refers to "tomato".
Bilateral Communication. Generate a response "I've chopped 3".
Figure 1: A Case Demonstrating Challenges
Figure 2: Framework of HLA
In this work, we propose a Hierarchical Language Agent (HLA) for human-AI coordination that provides strong reasoning abilities while maintaining real-time execution. HLA consists of three parts, as depicted in Fig. 2.
Slow Mind: A proficient LLM for intention reasoning and language interaction
Fast Mind: A lightweight LLM for generating macro actions
Executor: A script policy for transforming macro actions into atomic actions
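The three-level split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `HierarchicalAgent`, the stub functions, and the macro and atomic action names are all hypothetical, and the stubs stand in for the actual LLM calls and scripted policy.

```python
from dataclasses import dataclass
from typing import Callable, List

# All names here are hypothetical; they only illustrate the three-level split.


@dataclass
class HierarchicalAgent:
    slow_mind: Callable[[str], str]        # human command -> inferred intention
    fast_mind: Callable[[str, str], str]   # (intention, game state) -> macro action
    executor: Callable[[str], List[str]]   # macro action -> atomic actions
    intention: str = "maximize score"      # default goal when no command is given

    def on_command(self, command: str) -> None:
        # Assumption: a real system would run this slow call off the control
        # loop so that acting is never blocked by intention reasoning.
        self.intention = self.slow_mind(command)

    def step(self, state: str) -> List[str]:
        # The fast path: choose a macro action, then expand it with a script.
        macro = self.fast_mind(self.intention, state)
        return self.executor(macro)


def stub_slow_mind(command: str) -> str:
    # Stand-in for a proficient LLM doing intention reasoning.
    return "chop tomato" if "tomato" in command.lower() else "maximize score"


def stub_fast_mind(intention: str, state: str) -> str:
    # Stand-in for a lightweight LLM producing macro actions.
    return "Chop Tomato" if intention == "chop tomato" else "Serve Soup"


def stub_executor(macro: str) -> List[str]:
    # Stand-in for the scripted policy; the action names are made up.
    scripts = {
        "Chop Tomato": ["move_left", "pick_up", "move_right", "interact"],
        "Serve Soup": ["move_up", "interact"],
    }
    return scripts.get(macro, ["noop"])


agent = HierarchicalAgent(stub_slow_mind, stub_fast_mind, stub_executor)
before = agent.step("idle")
agent.on_command("3 chopped tomatoes please.")
after = agent.step("idle")
```

In this sketch, `step` never waits on the Slow Mind; it only reads the most recent intention, which is one plausible way to keep acting responsive while reasoning proceeds separately.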
Overcooked is a cooperative cooking game where participants must work together to prepare, cook, and promptly serve a variety of dishes.
The cooking process is depicted in Fig. 3. We also design 4 distinct maps as shown in Fig. 4.
Figure 3: Game Process
Figure 4: Maps: Ring, Bottleneck, Partition, Quick
We introduce three baseline agents, each lacking a certain component of the original HLA.
Slow-Mind-Only Agent (SMOA). We remove the Fast Mind and let the Slow Mind produce macro actions.
Fast-Mind-Only Agent (FMOA). We remove the Slow Mind and let the Fast Mind generate chat messages.
No-Executor Agent (NEA). We remove the Executor and let the Fast Mind choose atomic actions to control the agent directly.
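The baselines above can be read as ablations of the three components. The sketch below encodes that reading as a small configuration table; the names `AgentConfig`, `VARIANTS`, `macro_action_source`, and `chat_source` are hypothetical, and the assumption that NEA keeps its Slow Mind is inferred from the description rather than stated outright.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentConfig:
    slow_mind: bool
    fast_mind: bool
    executor: bool


# Hypothetical encoding of the four agents as component switches.
VARIANTS = {
    "HLA": AgentConfig(slow_mind=True, fast_mind=True, executor=True),
    "SMOA": AgentConfig(slow_mind=True, fast_mind=False, executor=True),
    "FMOA": AgentConfig(slow_mind=False, fast_mind=True, executor=True),
    # Assumption: NEA keeps the Slow Mind and drops only the Executor.
    "NEA": AgentConfig(slow_mind=True, fast_mind=True, executor=False),
}


def macro_action_source(cfg: AgentConfig) -> Optional[str]:
    # Without an Executor there are no macro actions: the Fast Mind
    # emits atomic actions directly.
    if not cfg.executor:
        return None
    return "fast_mind" if cfg.fast_mind else "slow_mind"


def chat_source(cfg: AgentConfig) -> str:
    # Chat comes from the Slow Mind when present, else from the Fast Mind.
    return "slow_mind" if cfg.slow_mind else "fast_mind"
```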
We visualize some of the experiments (full experiments are shown in the paper), including simple commands, complex commands, and human studies.
In the visualization results:
The game score can be found in the bottom-left corner of the screenshot.
The blue text within the screenshot shows the human player's chat messages and the AI agent's chat responses.
The character with the pink beard is controlled by the human, and the blue character is controlled by the AI.
Simple commands are designed to test the AI's ability to achieve a high game score, either in the absence of explicit human commands or after a human command is fulfilled. We design 2 test cases here: No Command and One Command.
In the No Command test case, the human player only chops ingredients and does not issue any command throughout the game.
As shown in the visual results, NEA is stuck at the top-left corner of the map. This is because NEA continuously issues the atomic action of moving "left" and remains stuck next to the tomato tile. SMOA successfully mixes the ingredients and cooks soups. However, near the end of the game, the AI player fails to plate the soup in time, so the soup overcooks and the pot catches fire. FMOA produces a hallucinated chat response: about two-thirds of the way through the game, it talks about "the burning pot", but no pot is on fire at that time. HLA acts the most swiftly of all, demonstrating its real-time action ability.
(Human: pink beard character, AI: blue character)
No-Executor Agent
Slow-Mind-Only Agent
Fast-Mind-Only Agent
HLA (ours)
In the One Command test case, the human player only chops ingredients and asks the AI agent to prepare Bob Soup at the start of the game. The order for Bob Soup appears only once, at the start of the game.
As shown in the visual results, NEA is still stuck at the top-left corner of the map. SMOA overcooks the soup again. FMOA again produces a hallucinated chat response: about two-thirds of the way through the game, it still says "Bob Soup is almost ready", but the Bob Soup has already been served and there is no Bob Soup order at that moment. HLA follows the human command accurately and focuses on maximizing the game score once the command is satisfied, showing its superior cooperative capability.
(Script tester: pink beard character, AI: blue character)
No-Executor Agent
Slow-Mind-Only Agent
Fast-Mind-Only Agent
HLA (ours)
Complex commands are designed to test the AI's ability to comprehend and respond to commands of varying complexity. There are 3 types of challenges for complex commands: Quantity Specification, Semantic Analysis, and Ambiguous Reference. We demonstrate 2 test cases for each challenge.
NEA is stuck and cannot perform any task. SMOA can finish most of the commands, but takes much longer than HLA. FMOA struggles to finish most of the commands. HLA can follow human commands accurately and swiftly, showing its strong command reasoning ability and real-time acting capability.
(Script tester: pink beard character, AI: blue character)
Quantity Specification Case 1: The human player says "3 chopped tomatoes please." at the beginning of the game.
No-Executor Agent
Failed
Slow-Mind-Only Agent
Failed
Fast-Mind-Only Agent
Failed
HLA (ours)
Success in 42.9s
Quantity Specification Case 2: The human player says "2 lettuces chop." at the beginning of the game.
No-Executor Agent
Failed
Slow-Mind-Only Agent
Success in 39.3s
Fast-Mind-Only Agent
Failed
HLA (ours)
Success in 19.7s
Semantic Analysis Case 1: The human player says "Chop but except tomato and lettuce." at the beginning of the game.
No-Executor Agent
Failed
Slow-Mind-Only Agent
Success in 12.7s
Fast-Mind-Only Agent
Success in 17.5s
HLA (ours)
Success in 10.1s
Semantic Analysis Case 2: The human player says "Oh got, I forget the alice soup order." at the beginning of the game.
No-Executor Agent
Failed
Slow-Mind-Only Agent
Success in 40.7s
Fast-Mind-Only Agent
Failed
HLA (ours)
Success in 25.3s
Ambiguous Reference Case 1: The human player says "Chop 2 Onions" at the beginning of the game. 10 seconds later, the human player says "Chop it again."
No-Executor Agent
Failed
Slow-Mind-Only Agent
Success in 31.3s
Fast-Mind-Only Agent
Success in 24.3s
HLA (ours)
Success in 17.3s
Ambiguous Reference Case 2: The human player says "The third soup order should be cooked" at the beginning of the game.
No-Executor Agent
Failed
Slow-Mind-Only Agent
Success in 51.3s
Fast-Mind-Only Agent
Failed
HLA (ours)
Success in 37.5s
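The completion times listed for the six complex-command cases can be tallied with a short script. The sketch below is only a convenience for reading the results: the numbers are copied from the cases above, `None` marks a failed run, and the helper names `success_counts` and `speedups` are hypothetical.

```python
from typing import Dict, Optional

# Completion times in seconds as listed above; None marks a failed run.
RESULTS: Dict[str, Dict[str, Optional[float]]] = {
    "Quantity Specification 1": {"NEA": None, "SMOA": None, "FMOA": None, "HLA": 42.9},
    "Quantity Specification 2": {"NEA": None, "SMOA": 39.3, "FMOA": None, "HLA": 19.7},
    "Semantic Analysis 1": {"NEA": None, "SMOA": 12.7, "FMOA": 17.5, "HLA": 10.1},
    "Semantic Analysis 2": {"NEA": None, "SMOA": 40.7, "FMOA": None, "HLA": 25.3},
    "Ambiguous Reference 1": {"NEA": None, "SMOA": 31.3, "FMOA": 24.3, "HLA": 17.3},
    "Ambiguous Reference 2": {"NEA": None, "SMOA": 51.3, "FMOA": None, "HLA": 37.5},
}


def success_counts(results: Dict[str, Dict[str, Optional[float]]]) -> Dict[str, int]:
    # Count, per agent, the cases that finished with a recorded time.
    counts: Dict[str, int] = {}
    for runs in results.values():
        for agent, seconds in runs.items():
            counts[agent] = counts.get(agent, 0) + (1 if seconds is not None else 0)
    return counts


def speedups(results, baseline: str = "SMOA", agent: str = "HLA") -> Dict[str, float]:
    # Ratio baseline_time / agent_time on cases where both agents succeeded.
    return {
        case: runs[baseline] / runs[agent]
        for case, runs in results.items()
        if runs[baseline] is not None and runs[agent] is not None
    }


counts = success_counts(RESULTS)
ratios = speedups(RESULTS)
```

Here `counts` maps each agent to the number of cases it completed, and `ratios` compares SMOA and HLA only on the cases both completed, so failed runs do not distort the comparison.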
Here, we show replays of 4 players from the competition phase. Player-128 remains silent throughout the game. Player-221 and Player-421 instruct the AI player via text messages, while Player-322 instructs the AI player by speaking. When cooperating with a silent player, the AI agent should mainly help the human player achieve the highest possible score. Moreover, the pace of the game is faster with spoken communication, whereas texting gives the human player more time to think.
HLA acts swiftly, adheres to human commands, and achieves the highest game score on all maps, showing cooperative capability superior to that of SMOA and FMOA.
Player-128 remains silent throughout the game.
(Human: pink beard character, AI: blue character)
Slow-Mind-Only Agent
Fast-Mind-Only Agent
HLA (ours)
Player-322 instructs the AI player via speaking.
Because recording environments vary and a simple speech recognition system is used, the recognized speech contains some inherent inaccuracies.
(Human: pink beard character, AI: blue character)
Slow-Mind-Only Agent
Fast-Mind-Only Agent
HLA (ours)
Player-221 instructs the AI player via text messages.
(Human: pink beard character, AI: blue character)
Slow-Mind-Only Agent
Fast-Mind-Only Agent
HLA (ours)
Player-421 instructs the AI player via text messages.
(Human: pink beard character, AI: blue character)
Slow-Mind-Only Agent
Fast-Mind-Only Agent
HLA (ours)