MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control 

Enshen Zhou1 2*, Yiran Qin1 3* 

Zhenfei Yin1 4, Yuzhou Huang3, Ruimao Zhang3, Lu Sheng2, Yu Qiao1, Jing Shao1  

1Shanghai Artificial Intelligence Laboratory; 2Beihang University; 3The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen); 4The University of Sydney

* Equal Contribution   Corresponding author   Project Leader  

Arxiv | PDF | Code | Dataset  

All Code, Datasets, and Checkpoints are released! Come on and enjoy it!

🧐Imagination is an intriguing cognitive ability. 🤔Could we endow embodied agents with this ability so that they follow natural language instructions more steadily in open-world environments? If so, what exactly might they be imagining?

Abstract 

It is a long-standing goal to design a generalist embodied agent that can follow diverse instructions in human-like ways. However, existing approaches often fail to follow instructions steadily because abstract and sequential natural language instructions are difficult to understand. To this end, we introduce MineDreamer, an open-ended embodied agent built upon the challenging Minecraft simulator with an innovative paradigm that enhances instruction-following ability in low-level control signal generation. Specifically, MineDreamer builds on recent advances in Multimodal Large Language Models (MLLMs) and diffusion models: a Chain-of-Imagination (CoI) mechanism envisions the step-by-step process of executing an instruction and translates each imagination into a more precise visual prompt tailored to the current state; the agent then generates keyboard-and-mouse actions to efficiently achieve these imaginations, steadily following the instruction at each step. Extensive experiments demonstrate that MineDreamer follows single and multi-step instructions steadily, significantly outperforming the best generalist agent baseline and nearly doubling its performance. Moreover, qualitative analysis of the agent's imaginative ability reveals its generalization and comprehension of the open world.

An innovative paradigm that enhances instruction-following ability of agents


Comparison between MineDreamer and previous studies. In the “Chop a tree” task, MineDreamer employs a Chain-of-Imagination mechanism: it imagines, step by step, what to do next tailored to the current state. These imaginations encode environmental understanding and physical rules (e.g., perspective-based size changes), so they serve as more precise visual prompts that steadily guide the agent to generate actions achieving each imagination as effectively as possible. Previous approaches saw the tree but missed the opportunity to chop it down.

The Overview of Chain-of-Imagination within MineDreamer


The Overview of Chain-of-Imagination. The Imaginator envisions a goal imagination based on the instruction and the current observation. The Prompt Generator transforms this imagination into a precise visual prompt, considering both the instruction and the observed image. The Visual Encoder encodes the current observation, integrates it with this prompt, and feeds the result into VPT. VPT then determines the agent's next action, which yields a new observation, and the cycle continues.
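The cycle above (imagine → prompt → encode → act → observe → repeat) can be sketched as a simple control loop. This is a minimal, hypothetical sketch: the `Imaginator`, `PromptGenerator`, and `VPTPolicy` classes below are trivial stand-ins for the actual diffusion model, prompt module, and VPT policy, and none of the names come from the released code.

```python
# Hypothetical sketch of the Chain-of-Imagination control loop.
# All components are stand-ins, not the authors' implementations.

class Imaginator:
    def imagine(self, instruction, observation):
        # Real system: a diffusion model renders a goal image
        # conditioned on the instruction and current observation.
        return ("goal-image", instruction, observation)

class PromptGenerator:
    def to_prompt(self, imagination, instruction, observation):
        # Real system: turns the imagined goal into a precise visual prompt.
        return ("visual-prompt", imagination)

class VPTPolicy:
    def act(self, visual_features, prompt):
        # Real system: VPT predicts keyboard-and-mouse actions.
        return {"action": "attack", "prompt": prompt}

def encode(observation):
    # Stand-in for the Visual Encoder.
    return ("features", observation)

def chain_of_imagination(instruction, initial_obs, max_steps=5):
    imaginator, prompter, policy = Imaginator(), PromptGenerator(), VPTPolicy()
    obs, actions = initial_obs, []
    for step in range(max_steps):
        goal = imaginator.imagine(instruction, obs)          # imagine next goal
        prompt = prompter.to_prompt(goal, instruction, obs)  # precise visual prompt
        action = policy.act(encode(obs), prompt)             # VPT selects an action
        actions.append(action)
        obs = step + 1  # stand-in environment step -> new observation; cycle continues
    return actions

actions = chain_of_imagination("Chop a tree", initial_obs=0, max_steps=3)
print(len(actions))  # 3
```

The key design point the loop illustrates is that the imagination is regenerated at every step, so the visual prompt always reflects the current state rather than a fixed goal image.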

The Overview Framework of Imaginator within MineDreamer


The Overall Framework of Imaginator. For goal understanding, we append k [GOAL] tokens to the end of the instruction y and input them, together with the current observation Ot, into LLaVA. LLaVA then produces hidden states for the [GOAL] tokens, which the Q-Former processes into the feature f*. Finally, the output of the image encoder Ev is combined with f* in the diffusion model to generate the instruction-conditioned future goal imagination.
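The goal-understanding path described above can be sketched with toy stand-ins. This is an illustrative assumption-laden sketch: the tokenizer, LLM, and Q-Former below are trivial placeholders (a real hidden state is a vector, not a float), and names like `NUM_GOAL_TOKENS` are hypothetical.

```python
# Toy sketch of the Imaginator's goal-understanding path: append k [GOAL]
# tokens, take their LLM hidden states, and pool them into the feature f*.
# All components are placeholders, not the paper's actual modules.

NUM_GOAL_TOKENS = 4  # the "k" [GOAL] tokens appended to the instruction

def tokenize(instruction):
    # Append k [GOAL] tokens to the end of the instruction y.
    return instruction.split() + ["[GOAL]"] * NUM_GOAL_TOKENS

def llm_hidden_states(tokens):
    # Stand-in for LLaVA: one scalar "hidden state" per token
    # (a real model would emit a vector per token).
    return [float(len(tok)) for tok in tokens]

def goal_hidden_states(tokens, hidden):
    # Select the hidden states at the [GOAL] token positions.
    return [h for tok, h in zip(tokens, hidden) if tok == "[GOAL]"]

def q_former(goal_states):
    # Stand-in for the Q-Former: pool the k goal states into one feature f*.
    return sum(goal_states) / len(goal_states)

tokens = tokenize("chop a tree")
hidden = llm_hidden_states(tokens)
f_star = q_former(goal_hidden_states(tokens, hidden))
# f* would then be combined with the image encoder's output to condition
# the diffusion model that renders the goal imagination.
print(f_star)
```

The point of the extra [GOAL] tokens is to give the LLM dedicated positions whose hidden states summarize the goal, so the diffusion model is conditioned on a learned goal feature rather than on raw text.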

Demo: Command-Switching Evaluation for Long-Horizon Tasks Following Text Instructions

🥰 In the videos below, we demonstrate the performance of MineDreamer in Command-Switching Evaluation, controlled through multi-step text instructions.

Chop a tree -> Craft wooden planks 

Maximum game duration: 3000 steps (2.5 minutes, FPS=20)

Switching time: at the 1500th step (1 minute and 15 seconds)

Gather dirt -> Build a tower 

Maximum game duration: 3000 steps (2.5 minutes, FPS=20)

Switching time: at the 2000th step (1 minute and 40 seconds)

Obtaining Diamond: Dig down -> Mine horizontally 

Maximum game duration: 12000 steps (10 minutes, FPS=20)

Switching time: agent reaches the 13th level (the second number in the Pos)

Demo: Programmatic Evaluation Following Text Instructions

🥳 In the videos below, we demonstrate the performance of MineDreamer in Programmatic Evaluation, controlled through single-step text instructions.

Go explore

Maximum game duration: 3000 steps (1 minute and 40 seconds, FPS=30)

Maximum Travel Distance (Blocks): 640.27

Collect seeds

Maximum game duration: 3000 steps (1 minute and 40 seconds, FPS=30)

Maximum Inventory Count of Seeds: 36

Chop a tree

Maximum game duration: 3000 steps (1 minute and 40 seconds, FPS=30)

Maximum Inventory Count of Logs: 29

Collect dirt

Maximum game duration: 3000 steps (1 minute and 40 seconds, FPS=30)

Maximum Inventory Count of Dirt: 88

Imagination Visual Results on Evaluation Set Compared to the Baseline

🤩The images below demonstrate the generative quality of goal imagination compared to the baseline.

Imagination Visual Results During Agent Solving Open-Ended Tasks

😋The images below demonstrate the generative quality of imagination when the agent solves open-ended tasks in the simulated world.

😲 These imaginations contain the simulated world's physical rules and environmental understanding.

Conclusion

In this paper, we introduce an innovative paradigm for enhancing the instruction-following ability of agents in simulated-world control. We show that employing a Chain-of-Imagination mechanism, which envisions the step-by-step process of executing instructions and translates imaginations into precise visual prompts tailored to the current state and instruction, significantly helps the foundation model follow instructions steadily during action generation. Our agent, MineDreamer in Minecraft, showcases strong instruction-following ability. Furthermore, we show its potential as a downstream controller for a high-level planner in the challenging "Obtain diamond" task. We believe this novel paradigm will inspire future research and generalize to other domains and open-world environments.

Citation

@article{zhou2024minedreamer,
  title={MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control},
  author={Zhou, Enshen and Qin, Yiran and Yin, Zhenfei and Huang, Yuzhou and Zhang, Ruimao and Sheng, Lu and Qiao, Yu and Shao, Jing},
  journal={arXiv preprint arXiv:2403.12037},
  year={2024}
}