MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control
Enshen Zhou1 2*, Yiran Qin1 3*
Zhenfei Yin1 4, Yuzhou Huang3, Ruimao Zhang3✉, Lu Sheng2✉, Yu Qiao1, Jing Shao1 †
1Shanghai Artificial Intelligence Laboratory; 2Beihang University; 3The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen); 4The University of Sydney
* Equal Contribution ✉ Corresponding author † Project Leader
🧐Imagination is an intriguing cognitive ability. 🤔Could we endow embodied agents with this ability so that they follow natural language instructions more steadily in open-world environments? If so, what exactly might they be imagining?
Demo: Trajectory Rollouts with Imagination in Minecraft
🥳In the videos below, we demonstrate the trajectory rollouts (Left) and imagination (Right) in Minecraft.
"Chop a tree" in Minecraft
"Go Explore" in Minecraft
Abstract
Designing a generalist embodied agent that can follow diverse instructions in human-like ways is a long-standing goal. However, existing approaches often fail to follow instructions steadily because they struggle to understand abstract and sequential natural language instructions. To this end, we introduce MineDreamer, an open-ended embodied agent built upon the challenging Minecraft simulator with an innovative paradigm that enhances instruction-following ability in low-level control signal generation. Specifically, MineDreamer is developed on top of recent advances in Multimodal Large Language Models (MLLMs) and diffusion models. It employs a Chain-of-Imagination (CoI) mechanism to envision the step-by-step process of executing instructions and to translate imaginations into more precise visual prompts tailored to the current state; the agent then generates keyboard-and-mouse actions to efficiently achieve these imaginations, steadily following the instructions at each step. Extensive experiments demonstrate that MineDreamer follows both single-step and multi-step instructions steadily, significantly outperforming the best generalist agent baseline and nearly doubling its performance. Moreover, qualitative analysis of the agent's imaginative ability reveals its generalization and comprehension of the open world.
An innovative paradigm that enhances the instruction-following ability of agents
Comparison between MineDreamer and previous studies. In the “Chop a tree” task, MineDreamer employs a Chain-of-Imagination mechanism: at each step, it imagines what to do next tailored to the current state. These imaginations capture environmental understanding and physical rules (e.g., perspective-based size changes) and serve as more precise visual prompts that steadily guide the agent in generating actions to achieve them as effectively as possible. Previous approaches see the tree but miss the opportunity to chop it down.
The Overview of Chain-of-Imagination within MineDreamer
The Overview of Chain-of-Imagination. The Imaginator imagines a goal imagination based on the instruction and the current observation. The Prompt Generator transforms it into a precise visual prompt, considering both the instruction and the observed image. The Visual Encoder encodes the current observation, integrates it with this prompt, and feeds the result into VPT. VPT then determines the agent's next action, which leads to a new observation, and the cycle continues.
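The imagine-prompt-act cycle described above can be sketched as a simple control loop. This is a minimal illustrative sketch only: all class names, method signatures, and the toy environment are stand-ins, not the authors' actual interfaces (the real system uses LLaVA, a diffusion model, and the pretrained VPT policy).

```python
# Hedged sketch of the Chain-of-Imagination (CoI) loop.
# Every component below is a hypothetical stand-in for illustration.

class Imaginator:
    """Stand-in: imagines a goal image from instruction + observation."""
    def imagine(self, instruction, observation):
        return f"goal({instruction}|{observation})"

class PromptGenerator:
    """Stand-in: turns a goal imagination into a precise visual prompt."""
    def generate(self, imagination, instruction, observation):
        return f"prompt({imagination})"

class VisualEncoder:
    """Stand-in: encodes the observation and fuses it with the prompt."""
    def encode(self, observation, visual_prompt):
        return (observation, visual_prompt)

class VPTPolicy:
    """Stand-in for the VPT low-level controller."""
    def act(self, features):
        return "keyboard_and_mouse_action"

class ToyEnv:
    """Toy environment: returns a fresh observation after each action."""
    def __init__(self):
        self.t = 0
    def step(self, action):
        self.t += 1
        return f"obs_{self.t}"

def chain_of_imagination(instruction, env, steps=3):
    imaginator, prompter = Imaginator(), PromptGenerator()
    encoder, policy = VisualEncoder(), VPTPolicy()
    obs, trajectory = "obs_0", []
    for _ in range(steps):
        goal = imaginator.imagine(instruction, obs)         # imagine next sub-goal
        prompt = prompter.generate(goal, instruction, obs)  # precise visual prompt
        features = encoder.encode(obs, prompt)              # fuse obs + prompt
        action = policy.act(features)                       # low-level control
        obs = env.step(action)                              # new observation; loop
        trajectory.append((goal, action, obs))
    return trajectory

traj = chain_of_imagination("Chop a tree", ToyEnv())
```

The key design point is that the imagination is refreshed every step from the latest observation, so the visual prompt always reflects the current state rather than a stale goal.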
The Overview Framework of Imaginator within MineDreamer
The Overall Framework of Imaginator. For goal understanding, we append k [GOAL] tokens to the end of the instruction y and feed them, together with the current observation O_t, into LLaVA. LLaVA then generates hidden states for the [GOAL] tokens, which the Q-Former processes to produce the feature f*. Subsequently, the output of the image encoder E_v is combined with f* in the diffusion model to generate instruction-based future goal imaginations.
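The data flow through the Imaginator can be traced in a short sketch. Everything here is illustrative string-plumbing, not the real implementation: the value of k, the tokenizer, and the function names are assumptions, and the actual system runs LLaVA, a Q-Former, and a latent diffusion model rather than these placeholders.

```python
# Hedged sketch of the Imaginator's forward pass (illustrative only).

K = 4  # number of appended [GOAL] tokens; the actual k is a model hyperparameter

def tokenize(instruction):
    # Placeholder tokenizer: whitespace split.
    return instruction.split()

def imaginator_forward(instruction, observation):
    # 1. Append k [GOAL] tokens to the end of the instruction y.
    tokens = tokenize(instruction) + ["[GOAL]"] * K
    # 2. A multimodal LLM (LLaVA in the paper) produces hidden states
    #    conditioned on the observation; keep only the [GOAL] positions.
    hidden = [f"h({tok}|{observation})" for tok in tokens]
    goal_hidden = hidden[-K:]
    # 3. A Q-Former compresses the [GOAL] hidden states into feature f*.
    f_star = "qformer(" + ",".join(goal_hidden) + ")"
    # 4. The diffusion model conditions on f* together with the image
    #    encoder output E_v(O_t) to generate the goal imagination.
    ev = f"E_v({observation})"
    return f"diffusion({ev}, {f_star})"

img = imaginator_forward("chop a tree", "O_t")
```

The sketch makes the conditioning explicit: the generated imagination depends on both the instruction (through f*) and the current observation (through E_v), which is what ties each imagination to the present state.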
Demo: Manipulation Evaluation in CALVIN
🥰 In the videos below, we demonstrate the trajectory rollouts (Left) and imagination (Right) in Manipulation Evaluation.
pull the handle to open the drawer
press the button to turn on the led light
lift the red block from the sliding cabinet
push the sliding door to the right side
store the grasped block in the drawer
take the red block from the drawer
Demo: Command-Switching Evaluation in Minecraft
🥰 In the videos below, we demonstrate the performance of MineDreamer in Command-Switching Evaluation, controlled through multi-step text instructions.
Chop a tree -> Craft wooden planks
Maximum game duration: 3000 steps (2.5 minutes, FPS=20)
Switching time: at the 1500th step (1 minute and 15 seconds)
Gather dirt -> Build a tower
Maximum game duration: 3000 steps (2.5 minutes, FPS=20)
Switching time: at the 2000th step (1 minute and 40 seconds)
Obtaining Diamond: Dig down -> Mine horizontally
Maximum game duration: 12000 steps (10 minutes, FPS=20)
Switching time: when the agent reaches Y-level 13 (the second number in the Pos display)
Demo: Programmatic Evaluation in Minecraft
🥳 In the videos below, we demonstrate the performance of MineDreamer in Programmatic Evaluation, controlled through single-step text instructions.
Go explore
Maximum game duration: 3000 steps (1 minute and 40 seconds, FPS=30)
Maximum Travel Distance (Blocks): 640.27
Collect seeds
Maximum game duration: 3000 steps (1 minute and 40 seconds, FPS=30)
Maximum Inventory Count of Seeds: 36
Chop a tree
Maximum game duration: 3000 steps (1 minute and 40 seconds, FPS=30)
Maximum Inventory Count of Logs: 29
Collect dirt
Maximum game duration: 3000 steps (1 minute and 40 seconds, FPS=30)
Maximum Inventory Count of Dirt: 88
Imagination Visual Results on Evaluation Set Compared to the Baseline
🤩The images below demonstrate the generative quality of goal imagination compared to the baseline.
Imagination Visual Results During Agent Solving Open-Ended Tasks
😋The images below demonstrate the generative quality of imagination when the agent solves open-ended tasks in the simulated world.
Conclusion
In this paper, we introduce an innovative paradigm for enhancing the instruction-following ability of agents in simulated-world control. We demonstrate that employing a Chain-of-Imagination mechanism to envision the step-by-step process of executing instructions, and translating imaginations into precise visual prompts tailored to the current state and instruction, can significantly help the foundation model follow instructions steadily during action generation. Our agent, MineDreamer in Minecraft, showcases strong instruction-following ability. Furthermore, we show its potential as a downstream controller for a high-level planner in the challenging "Obtain diamond" task. We believe this novel paradigm will inspire future research and generalize to other domains and open-world environments.
Citation
@article{zhou2024minedreamer,
title={MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control},
author={Zhou, Enshen and Qin, Yiran and Yin, Zhenfei and Huang, Yuzhou and Zhang, Ruimao and Sheng, Lu and Qiao, Yu and Shao, Jing},
journal={arXiv preprint arXiv:2403.12037},
year={2024}
}