MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control
Enshen Zhou1 2*, Yiran Qin1 3*
Zhenfei Yin1 4, Yuzhou Huang3, Ruimao Zhang3✉, Lu Sheng2✉, Yu Qiao1, Jing Shao1 †
1Shanghai Artificial Intelligence Laboratory; 2Beihang University; 3The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen); 4The University of Sydney
* Equal Contribution ✉ Corresponding author † Project Leader
🧐Imagination is an intriguing cognitive ability. 🤔Could we endow embodied agents with this ability so that they follow natural language instructions more steadily in open-world environments? If so, what exactly might they be imagining?
Demo: Trajectory Rollouts with Imagination in Minecraft
🥳In the videos below, we demonstrate the trajectory rollouts (Left) and imagination (Right) in Minecraft.
"Chop a tree" in Minecraft
"Go Explore" in Minecraft
Abstract
Designing a generalist embodied agent that can follow diverse instructions in human-like ways is a long-standing goal. However, existing approaches often fail to follow instructions steadily because they struggle to understand abstract and sequential natural language instructions. To this end, we introduce MineDreamer, an open-ended embodied agent built upon the challenging Minecraft simulator with an innovative paradigm that enhances instruction-following ability in low-level control signal generation. Specifically, MineDreamer is developed on top of recent advances in Multimodal Large Language Models (MLLMs) and diffusion models. It employs a Chain-of-Imagination (CoI) mechanism to envision the step-by-step process of executing instructions and to translate imaginations into more precise visual prompts tailored to the current state; the agent then generates keyboard-and-mouse actions to efficiently achieve these imaginations, steadily following the instructions at each step. Extensive experiments demonstrate that MineDreamer follows both single-step and multi-step instructions steadily, significantly outperforming the best generalist agent baseline and nearly doubling its performance. Moreover, qualitative analysis of the agent's imaginative ability reveals its generalization and comprehension of the open world.
An innovative paradigm that enhances the instruction-following ability of agents
Comparison between MineDreamer and previous studies. In the “Chop a tree” task, MineDreamer employs a Chain-of-Imagination mechanism: at each step, it imagines what to do next tailored to the current state. These imaginations capture environmental understanding and physical rules (e.g., perspective-based size changes) and serve as more precise visual prompts that steadily guide the agent in generating actions to achieve them as effectively as possible. Previous approaches see the tree but miss the opportunity to chop it down.
The Overview of Chain-of-Imagination within MineDreamer
The Overview of Chain-of-Imagination. The Imaginator imagines a goal imagination based on the instruction and the current observation. The Prompt Generator transforms it into a precise visual prompt, considering both the instruction and the observed image. The Visual Encoder encodes the current observation, integrates it with this prompt, and feeds the result into VPT. VPT then determines the agent's next action, which leads to a new observation, and the cycle continues.
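The imagine-prompt-act cycle described above can be sketched as a simple control loop. This is a minimal illustrative sketch only: all class names, method signatures, and the toy environment are stand-ins, not the authors' actual interfaces (the real system uses LLaVA, a diffusion model, and the pretrained VPT policy).

```python
# Hedged sketch of the Chain-of-Imagination (CoI) loop.
# Every component below is a hypothetical stand-in for illustration.

class Imaginator:
    """Stand-in: imagines a goal image from instruction + observation."""
    def imagine(self, instruction, observation):
        return f"goal({instruction}|{observation})"

class PromptGenerator:
    """Stand-in: turns a goal imagination into a precise visual prompt."""
    def generate(self, imagination, instruction, observation):
        return f"prompt({imagination})"

class VisualEncoder:
    """Stand-in: encodes the observation and fuses it with the prompt."""
    def encode(self, observation, visual_prompt):
        return (observation, visual_prompt)

class VPTPolicy:
    """Stand-in for the VPT low-level controller."""
    def act(self, features):
        return "keyboard_and_mouse_action"

class ToyEnv:
    """Toy environment: returns a fresh observation after each action."""
    def __init__(self):
        self.t = 0
    def step(self, action):
        self.t += 1
        return f"obs_{self.t}"

def chain_of_imagination(instruction, env, steps=3):
    imaginator, prompter = Imaginator(), PromptGenerator()
    encoder, policy = VisualEncoder(), VPTPolicy()
    obs, trajectory = "obs_0", []
    for _ in range(steps):
        goal = imaginator.imagine(instruction, obs)         # imagine next sub-goal
        prompt = prompter.generate(goal, instruction, obs)  # precise visual prompt
        features = encoder.encode(obs, prompt)              # fuse obs + prompt
        action = policy.act(features)                       # low-level control
        obs = env.step(action)                              # new observation; loop
        trajectory.append((goal, action, obs))
    return trajectory

traj = chain_of_imagination("Chop a tree", ToyEnv())
```

The key design point is that the imagination is refreshed every step from the latest observation, so the visual prompt always reflects the current state rather than a stale goal.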
The Overview Framework of Imaginator within MineDreamer
The Overall Framework of Imaginator. For goal understanding, we append k [GOAL] tokens to the end of the instruction y and feed them, together with the current observation O_t, into LLaVA. LLaVA then generates hidden states for the [GOAL] tokens, which the Q-Former processes to produce the feature f*. Subsequently, the output of the image encoder E_v is combined with f* in the diffusion model to generate instruction-based future goal imaginations.
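The data flow through the Imaginator can be traced in a short sketch. Everything here is illustrative string-plumbing, not the real implementation: the value of k, the tokenizer, and the function names are assumptions, and the actual system runs LLaVA, a Q-Former, and a latent diffusion model rather than these placeholders.

```python
# Hedged sketch of the Imaginator's forward pass (illustrative only).

K = 4  # number of appended [GOAL] tokens; the actual k is a model hyperparameter

def tokenize(instruction):
    # Placeholder tokenizer: whitespace split.
    return instruction.split()

def imaginator_forward(instruction, observation):
    # 1. Append k [GOAL] tokens to the end of the instruction y.
    tokens = tokenize(instruction) + ["[GOAL]"] * K
    # 2. A multimodal LLM (LLaVA in the paper) produces hidden states
    #    conditioned on the observation; keep only the [GOAL] positions.
    hidden = [f"h({tok}|{observation})" for tok in tokens]
    goal_hidden = hidden[-K:]
    # 3. A Q-Former compresses the [GOAL] hidden states into feature f*.
    f_star = "qformer(" + ",".join(goal_hidden) + ")"
    # 4. The diffusion model conditions on f* together with the image
    #    encoder output E_v(O_t) to generate the goal imagination.
    ev = f"E_v({observation})"
    return f"diffusion({ev}, {f_star})"

img = imaginator_forward("chop a tree", "O_t")
```

The sketch makes the conditioning explicit: the generated imagination depends on both the instruction (through f*) and the current observation (through E_v), which is what ties each imagination to the present state.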
Demo: Manipulation Evaluation in CALVIN
🥰 In the videos below, we demonstrate the trajectory rollouts (Left) and imagination (Right) in Manipulation Evaluation.
pull the handle to open the drawer
press the button to turn on the led light
lift the red block from the sliding cabinet
push the sliding door to the right side
store the grasped block in the drawer
take the red block from the drawer
Demo: Command-Switching Evaluation in Minecraft
🥰 In the videos below, we demonstrate the performance of MineDreamer in Command-Switching Evaluation, controlled through multi-step text instructions.
Chop a tree -> Craft wooden planks
Maximum game duration: 3000 steps (2.5 minutes, FPS=20)
Switching time: at the 1500th step (1 minute and 15 seconds)
Gather dirt -> Build a tower
Maximum game duration: 3000 steps (2.5 minutes, FPS=20)
Switching time: at the 2000th step (1 minute and 40 seconds)
Obtaining Diamond: Dig down -> Mine horizontally
Maximum game duration: 12000 steps (10 minutes, FPS=20)
Switching time: when the agent reaches Y-level 13 (the second number in the Pos display)
Demo: Programmatic Evaluation in Minecraft
🥳 In the videos below, we demonstrate the performance of MineDreamer in Programmatic Evaluation, controlled through single-step text instructions.
Go explore
Maximum game duration: 3000 steps (1 minute and 40 seconds, FPS=30)
Maximum Travel Distance (Blocks): 640.27
Collect seeds
Maximum game duration: 3000 steps (1 minute and 40 seconds, FPS=30)
Maximum Inventory Count of Seeds: 36
Chop a tree
Maximum game duration: 3000 steps (1 minute and 40 seconds, FPS=30)
Maximum Inventory Count of Logs: 29
Collect dirt
Maximum game duration: 3000 steps (1 minute and 40 seconds, FPS=30)
Maximum Inventory Count of Dirt: 88
Imagination Visual Results on Evaluation Set Compared to the Baseline
🤩The images below demonstrate the generative quality of goal imagination compared to the baseline.
Imagination Visual Results During Agent Solving Open-Ended Tasks
😋The images below demonstrate the generative quality of imagination when the agent solves open-ended tasks in the simulated world.
Conclusion
In this paper, we introduce an innovative paradigm for enhancing the instruction-following ability of agents in simulated-world control. We demonstrate that employing a Chain-of-Imagination mechanism to envision the step-by-step process of executing instructions, and translating imaginations into precise visual prompts tailored to the current state and instruction, can significantly help the foundation model follow instructions steadily during action generation. Our agent, MineDreamer in Minecraft, showcases strong instruction-following ability. Furthermore, we show its potential as a downstream controller for a high-level planner in the challenging "Obtain diamond" task. We believe this novel paradigm will inspire future research and generalize to other domains and open-world environments.
Citation
@article{zhou2024minedreamer,
title={MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control},
author={Zhou, Enshen and Qin, Yiran and Yin, Zhenfei and Huang, Yuzhou and Zhang, Ruimao and Sheng, Lu and Qiao, Yu and Shao, Jing},
journal={arXiv preprint arXiv:2403.12037},
year={2024}
}