LLM-powered agents typically rely on LLM API calls and complex hand-crafted prompts, which results in high inference latency. Traditional game AI often employs small models or reactive policies, enabling fast inference but offering limited interaction abilities.

Fig. 1 shows a case that illustrates the three challenges.

Real-time Responsiveness. Chop immediately after receiving the command.
Command Reasoning. Infer that "One more!" refers to "tomato".
Bilateral Communication. Generate a response "I've chopped 3".

Figure 1: A Case Demonstrating Challenges

Figure 2: Framework of HLA

In this work, we propose a Hierarchical Language Agent (HLA) for human-AI coordination that provides strong reasoning abilities while maintaining real-time execution. HLA consists of three parts, as depicted in Fig. 2.

Slow Mind: A proficient LLM for intention reasoning and language interaction
Fast Mind: A lightweight LLM for generating macro actions
Executor: A script policy for transforming macro actions into atomic actions
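The three-level flow listed above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the class and function names, the ingredient vocabulary, and the keyword rules that stand in for the two LLMs are hypothetical and are not the authors' implementation. The sketch also runs the stages sequentially, so it does not model latency or any concurrency between the Slow Mind and the Fast Mind.

```python
from typing import List, Optional, Tuple

INGREDIENTS = ("tomato", "lettuce", "onion")  # hypothetical vocabulary


class SlowMind:
    """Stand-in for the proficient LLM: resolves intention and drafts a reply."""

    def __init__(self) -> None:
        # Remembered so that follow-ups such as "One more!" can be resolved.
        self.last_target: Optional[str] = None

    def reason(self, command: str) -> Tuple[Optional[str], str]:
        text = command.lower()
        for name in INGREDIENTS:
            if name in text:
                self.last_target = name
        if self.last_target is None:
            return None, "Which ingredient do you mean?"
        return self.last_target, f"OK, chopping {self.last_target}."


class FastMind:
    """Stand-in for the lightweight LLM: maps an intention to a macro action."""

    def macro_action(self, intention: Optional[str]) -> str:
        return f"chop:{intention}" if intention else "wait"


class Executor:
    """Scripted policy: expands a macro action into atomic actions."""

    def atomic_actions(self, macro: str) -> List[str]:
        if macro == "wait":
            return ["stay"]
        _, ingredient = macro.split(":", 1)
        return [f"move_to:{ingredient}", "pick_up",
                "move_to:cutting_board", "interact"]


def hla_step(slow: SlowMind, fast: FastMind, executor: Executor,
             command: str) -> Tuple[str, str, List[str]]:
    """Run one command through all three levels (sequentially, for clarity)."""
    intention, reply = slow.reason(command)
    macro = fast.macro_action(intention)
    return reply, macro, executor.atomic_actions(macro)


if __name__ == "__main__":
    slow, fast, executor = SlowMind(), FastMind(), Executor()
    for command in ("Chop a tomato", "One more!"):
        print(command, "->", hla_step(slow, fast, executor, command))
```

In this sketch, the second command carries no ingredient name, so the stored target from the first command is reused, which mirrors the "One more!" case in Fig. 1.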

2. Testbed: the Overcooked Game

Overcooked is a cooperative cooking game where participants must work together to prepare, cook, and promptly serve a variety of dishes.

The cooking process is depicted in Fig. 3. We also design 4 distinct maps as shown in Fig. 4.

Figure 3: Game Process

Figure 4: Maps: Ring, Bottleneck, Partition, Quick

3. Experiment Setup

We introduce three baseline agents, each lacking a certain component of the original HLA.

Slow-Mind-Only Agent (SMOA). We remove the Fast Mind and let the Slow Mind produce macro actions.
Fast-Mind-Only Agent (FMOA). We remove the Slow Mind and let the Fast Mind generate chat messages.
No-Executor Agent (NEA). We remove the Executor and let the Fast Mind choose atomic actions to control the agent directly.
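One way to read the three ablations is as a table of which component handles each responsibility. The sketch below encodes that reading; the `AgentConfig` type, the field names, and the `responsibilities` helper are hypothetical, and the assumption that NEA keeps the Slow Mind for chat and skips macro actions entirely is an interpretation of the descriptions above rather than something stated explicitly.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class AgentConfig:
    """Which HLA components an agent variant keeps (hypothetical structure)."""
    slow_mind: bool
    fast_mind: bool
    executor: bool


AGENTS: Dict[str, AgentConfig] = {
    "HLA": AgentConfig(slow_mind=True, fast_mind=True, executor=True),
    "SMOA": AgentConfig(slow_mind=True, fast_mind=False, executor=True),
    "FMOA": AgentConfig(slow_mind=False, fast_mind=True, executor=True),
    "NEA": AgentConfig(slow_mind=True, fast_mind=True, executor=False),
}


def responsibilities(cfg: AgentConfig) -> Dict[str, Optional[str]]:
    """Map each responsibility to the component that takes it over."""
    chat = "Slow Mind" if cfg.slow_mind else "Fast Mind"
    if cfg.executor:
        macro: Optional[str] = "Fast Mind" if cfg.fast_mind else "Slow Mind"
        atomic = "Executor"
    else:
        # Assumed: without the Executor there is no macro-action layer,
        # and the Fast Mind emits atomic actions directly.
        macro = None
        atomic = "Fast Mind"
    return {"chat": chat, "macro_actions": macro, "atomic_actions": atomic}


if __name__ == "__main__":
    for name, cfg in AGENTS.items():
        print(name, responsibilities(cfg))
```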

We visualize some of the experiments (the full experiments are reported in the paper), covering simple commands, complex commands, and human studies.

In the visualization results:

The game score can be found in the bottom-left corner of the screenshot.
The blue text within the screenshot shows the human player's chat messages and the AI agent's chat responses.
The pink-bearded character is controlled by the human, and the blue character is controlled by the AI.

4. Visualization Results of Simple Commands

Simple commands are designed to test the AI's ability to achieve a high game score, either in the absence of explicit human commands or after the human's commands have been fulfilled. We design 2 test cases here, i.e., No Command and One Command.

4.1 No Command

In the No Command test case, the human player only chops ingredients and does not issue any command throughout the game.

As shown in the visual results, NEA is stuck at the top-left corner of the map because it repeatedly outputs the atomic action "left" and remains next to the tomato tile. SMOA successfully mixes the ingredients and cooks soups, but towards the end of the game it fails to plate the soup in time, so the soup overcooks and the pot catches fire. FMOA produces hallucinated chat responses: about two-thirds of the way through the game, it talks about "the burning pot" although no pot is on fire at that time. HLA acts the most swiftly of all agents, exhibiting its real-time action ability.

(Human: pink beard character, AI: blue character)

No-Executor Agent

Slow-Mind-Only Agent

Fast-Mind-Only Agent

HLA (ours)

4.2 One Command

In the One Command test case, the human player only chops ingredients and asks the AI agent to prepare Bob Soup at the start of the game. The order for Bob Soup appears only once, at the start of the game.

As shown in the visual results, NEA is still stuck at the top-left corner of the map. SMOA overcooks the soup again. FMOA again produces hallucinated chat responses: about two-thirds of the way through the game, it still says "Bob Soup is almost ready", although the Bob Soup has already been served and there is no Bob Soup order at that moment. HLA follows the human command accurately and focuses on maximizing the game score once the command is satisfied, showing its superior cooperative capability.

(Script tester: pink beard character, AI: blue character)

No-Executor Agent

Slow-Mind-Only Agent

Fast-Mind-Only Agent

HLA (ours)

5. Visualization Results of Complex Commands

Complex commands are designed to test the AI's ability to comprehend and respond to commands of varying complexity. There are 3 types of challenges for complex commands, i.e., Quantity Specification, Semantic Analysis, and Ambiguous Reference. We demonstrate 2 test cases for each challenge.

NEA is stuck and cannot perform any task. SMOA can finish most of the commands but takes much longer than HLA. FMOA fails to finish most of the commands. HLA follows human commands accurately and swiftly, showing its strong command reasoning ability and real-time acting capability.

(Script tester: pink beard character, AI: blue character)

5.1 Quantity Specification

Quantity Specification Case 1: The human player says "3 chopped tomatoes please." at the beginning of the game.

No-Executor Agent

Failed

Slow-Mind-Only Agent

Failed

Fast-Mind-Only Agent

Failed

HLA (ours)

Success in 42.9s

Quantity Specification Case 2: The human player says "2 lettuces chop." at the beginning of the game.

No-Executor Agent

Failed

Slow-Mind-Only Agent

Success in 39.3s

Fast-Mind-Only Agent

Failed

HLA (ours)

Success in 19.7s

5.2 Semantic Analysis

Semantic Analysis Case 1: The human player says "Chop but except tomato and lettuce." at the beginning of the game.

No-Executor Agent

Failed

Slow-Mind-Only Agent

Success in 12.7s

Fast-Mind-Only Agent

Success in 17.5s

HLA (ours)

Success in 10.1s

Semantic Analysis Case 2: The human player says "Oh got, I forget the alice soup order." at the beginning of the game.

No-Executor Agent

Failed

Slow-Mind-Only Agent

Success in 40.7s

Fast-Mind-Only Agent

Failed

HLA (ours)

Success in 25.3s

5.3 Ambiguous Reference

Ambiguous Reference Case 1: The human player says "Chop 2 Onions" at the beginning of the game. 10 seconds later, the human player says "Chop it again."

No-Executor Agent

Failed

Slow-Mind-Only Agent

Success in 31.3s

Fast-Mind-Only Agent

Success in 24.3s

HLA (ours)

Success in 17.3s

Ambiguous Reference Case 2: The human player says "The third soup order should be cooked" at the beginning of the game.

No-Executor Agent

Failed

Slow-Mind-Only Agent

Success in 51.3s

Fast-Mind-Only Agent

Failed

HLA (ours)

Success in 37.5s
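The per-case outcomes in Sections 5.1 to 5.3 can be tallied with a short script. The completion times below are copied from the results listed above, with `None` marking a failed case; the variable and function names are hypothetical and the script is only a convenience for reading the results, not part of the original evaluation.

```python
from typing import Dict, List, Optional

# Order: Quantity 1, Quantity 2, Semantic 1, Semantic 2, Ambiguous 1, Ambiguous 2.
# Times are in seconds; None marks a failed case.
TIMES: Dict[str, List[Optional[float]]] = {
    "NEA": [None, None, None, None, None, None],
    "SMOA": [None, 39.3, 12.7, 40.7, 31.3, 51.3],
    "FMOA": [None, None, 17.5, None, 24.3, None],
    "HLA": [42.9, 19.7, 10.1, 25.3, 17.3, 37.5],
}


def success_count(row: List[Optional[float]]) -> int:
    """Number of cases the agent completed."""
    return sum(t is not None for t in row)


def hla_is_fastest(times: Dict[str, List[Optional[float]]]) -> bool:
    """True if HLA succeeds in every case and beats every other success."""
    for i, hla_time in enumerate(times["HLA"]):
        if hla_time is None:
            return False
        for agent, row in times.items():
            other = row[i]
            if agent != "HLA" and other is not None and other <= hla_time:
                return False
    return True


if __name__ == "__main__":
    for agent, row in TIMES.items():
        print(f"{agent}: {success_count(row)}/{len(row)} cases completed")
    print("HLA fastest in every case:", hla_is_fastest(TIMES))
```

On these six cases the tally is 0 for NEA, 5 for SMOA, 2 for FMOA, and 6 for HLA, and HLA is faster than every other agent in each case where that agent also succeeds.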

6. Visualization Results of Human Studies

Here, we show replays of 4 players from the competition phase. Player-128 remains silent throughout the game. Player-221 and Player-421 instruct the AI player via text messages, while Player-322 instructs the AI player by speaking. When cooperating with a silent player, the AI agent should mainly help the human player achieve the highest possible score. Moreover, the pace of the game is faster when speech is used to communicate, whereas texting gives the human player more time to think.

HLA acts swiftly, adheres to human commands, and achieves the highest game score on all maps, showing cooperative capability superior to that of SMOA and FMOA.

6.1 Player 128: Map Ring

Player-128 remains silent throughout the game.

(Human: pink beard character, AI: blue character)

Slow-Mind-Only Agent

Fast-Mind-Only Agent

HLA (ours)

6.2 Player 322: Map Bottleneck

Player-322 instructs the AI player via speaking.

Because recording environments vary and only a simple speech recognition system is used, the recognized speech contains some inherent inaccuracies.

(Human: pink beard character, AI: blue character)