MrSteve: Instruction-Following Agents in Minecraft
with What-Where-When Memory
Junyeong Park*¹ Junmo Cho*¹ Sungjin Ahn¹ ²
¹KAIST ²New York University
Abstract
Significant advances have been made in developing general-purpose embodied AI in environments like Minecraft through the adoption of LLM-augmented hierarchical approaches. While these approaches, which combine high-level planners with low-level controllers, show promise, low-level controllers frequently become performance bottlenecks due to repeated failures. In this paper, we argue that the primary cause of failure in many low-level controllers is the absence of an episodic memory system. To address this, we introduce MrSteve (Memory Recall Steve), a novel low-level controller equipped with Place Event Memory (PEM), a form of episodic memory that captures what, where, and when information from episodes. This directly addresses the main limitation of the popular low-level controller, Steve-1. Unlike previous models that rely on short-term memory, PEM organizes spatial and event-based data, enabling efficient recall and navigation in long-horizon tasks. Additionally, we propose an Exploration Strategy and a Memory-Augmented Task Solving Framework, allowing agents to alternate between exploration and task-solving based on recalled events. Our approach significantly improves task-solving and exploration efficiency compared to existing methods, and we are releasing our code to support further research.
Method
Figure 1: MrSteve and Place Event Memory. (a) MrSteve takes the agent's position, first-person view, and text instruction, and uses its Memory Module and Solver Module to follow the instruction. (b) MrSteve leverages Place Event Memory for exploration and task execution; it stores novel events from visited places.
Figure 2: Mode Selector and VPT-Nav in MrSteve. (a) Mode Selector with Place Event Memory. It decides the agent's mode (Explore or Execute) based on whether a task-relevant resource is in memory, using a hierarchical read operation. (b) Architecture of the Goal-Conditioned VPT Navigator.
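The hierarchical read used by the Mode Selector can be illustrated with a short sketch. This is a simplified illustration rather than the released implementation: the entry fields, the place/event structure, and the nearest-first ordering are assumptions made for the example.

from dataclasses import dataclass, field
import math

@dataclass
class Event:
    what: str      # e.g., "water", "log", "cow"
    where: tuple   # (x, z) position where the event was observed
    when: int      # environment timestep of the observation

@dataclass
class PlaceCluster:
    center: tuple                                 # representative (x, z) of the place
    events: list = field(default_factory=list)    # novel events seen at this place

class ModeSelector:
    """Decides Explore vs. Execute with a hierarchical (place -> event) read."""

    def __init__(self, memory):
        self.memory = memory   # list of PlaceCluster

    def read(self, resource, agent_pos):
        # 1) Place-level read: rank stored places by distance to the agent.
        places = sorted(self.memory, key=lambda p: math.dist(p.center, agent_pos))
        # 2) Event-level read: within each place, look for a matching event.
        for place in places:
            for ev in place.events:
                if ev.what == resource:
                    return ev   # closest recalled occurrence of the resource
        return None

    def select_mode(self, resource, agent_pos):
        ev = self.read(resource, agent_pos)
        # Execute (navigate to the recalled location) if the resource is in
        # memory; otherwise keep exploring to find it.
        return ("execute", ev.where) if ev else ("explore", None)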
To address the limitations of Steve-1, we devise two modules: Place Event Memory and VPT-Nav. Based on these two modules, MrSteve can efficiently explore the environment and solve tasks with memory recall. Let's see how MrSteve solves long-horizon tasks in the following Results section.
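Before moving to the results, here is a rough picture of how these pieces fit together: the agent alternates between exploration and task execution depending on what the Mode Selector recalls. In this sketch, parse_target_resource, explore_step, and execute_step are placeholders standing in for the instruction parsing, the exploration policy (count-based goals plus VPT-Nav), and Steve-1-style skill execution; they are not functions from the codebase.

def run_instruction(instruction, env, memory, mode_selector, max_steps=12_000):
    """Minimal sketch of MrSteve's explore/execute alternation."""
    target = parse_target_resource(instruction)   # e.g., "water" for a bucket task
    obs = env.reset()
    for t in range(max_steps):
        mode, goal_pos = mode_selector.select_mode(target, obs["position"])
        if mode == "explore":
            # No matching event in memory yet: explore, writing novel events
            # from newly visited places into Place Event Memory.
            obs = explore_step(env, obs, memory)
        else:
            # A relevant event was recalled: navigate to its location with
            # VPT-Nav, then let the low-level controller finish the task.
            obs, done = execute_step(env, obs, goal_pos, instruction)
            if done:
                return True
    return False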
Results
Baselines
To verify the benefits of efficient exploration and the memory system, we compare the following agents:
Steve-1
MrSteve-FM: MrSteve variant with FIFO Memory
MrSteve-PM: MrSteve variant with Place Memory
MrSteve-EM: MrSteve variant with Event Memory
PMC-MrSteve-1: MrSteve with the exploration method from Plan4MC and FIFO memory
Exploration & Navigation for Sparse Sequential Task
Video 1: First-person View and Trajectory for each Agent. MrSteve (Ours) explores a wider area, focusing on places it has not visited yet, compared to the other two methods in the same amount of time.
Table 1: Map Coverage and Revisit Count of different exploration policies. Our exploration method (High-Level: Count-Based, Low-Level: VPT-Nav) performs the best.
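A count-based high-level goal picker of this kind can be sketched as follows. The 16-block cell size, the candidate radius, and the tie-breaking rule are illustrative choices for the example, not values from the paper; low-level navigation to the returned goal is handled by VPT-Nav.

import math
from collections import defaultdict

class CountBasedExplorer:
    """Sketch of a count-based high-level exploration goal picker."""

    def __init__(self, cell_size=16, radius=4):
        self.cell_size = cell_size
        self.radius = radius
        self.visits = defaultdict(int)   # (cx, cz) -> visit count

    def _cell(self, pos):
        return (int(pos[0]) // self.cell_size, int(pos[1]) // self.cell_size)

    def update(self, pos):
        self.visits[self._cell(pos)] += 1

    def next_goal(self, pos):
        cx, cz = self._cell(pos)
        # Candidate cells around the agent; prefer the least-visited one,
        # breaking ties by distance so VPT-Nav gets a reachable goal.
        candidates = [(cx + dx, cz + dz)
                      for dx in range(-self.radius, self.radius + 1)
                      for dz in range(-self.radius, self.radius + 1)
                      if (dx, dz) != (0, 0)]
        best = min(candidates, key=lambda c: (self.visits[c], math.dist(c, (cx, cz))))
        # Return the center of the chosen cell as the navigation goal.
        return ((best[0] + 0.5) * self.cell_size, (best[1] + 0.5) * self.cell_size)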
Sequential Task Solving with Memory in Sparse Conditions
Figure 3: Topdown View of the Two ABA-Sparse Task Maps. The first map was used for ABA-Sparse tasks that include the water bucket task. Trees are distributed on the left side of the map, and water exists only in the upper right corner. The second map was used for tasks that include the beef or wool tasks. Trees are located on the left side of the map, and on the right side there are cows and sheep, with a mountain separating them.
Video 2: Both agents observed the water pond, at t=1461 and t=2099 respectively. However, Steve-1 cannot recall the water pond, so it explores to find the water again. In contrast, MrSteve recalls the water pond and goes there directly. In this way, MrSteve solves the three tasks notably faster than Steve-1.
Figure 4: Success Rate and Task Duration of different agents in ABA-Sparse tasks. Task A refers to the first A task in the A-B-A task sequence, while Task A' refers to the final A task in the sequence. Each agent is given a time limit of 12K steps to complete the three tasks. We note that MrSteve, as well as the other memory-augmented agents, outperforms Steve-1, which lacks memory. Additionally, while Steve-1 takes a similar amount of time to solve task A and task A', MrSteve solves task A' much faster.
Memory-Constrained Task Solving with Memory
Figure 5: Overview of the Memory Tasks, and the Success Rate of each agent on each Memory Task. Memory Tasks are navigation tasks that require reaching the location of a previously seen experience frame. (Find Water) The agent should go to the water. (Find Zombies' Death Spot) The agent should go to the place where the zombies were burning. (Find First-Visited House) The agent should go to the first-visited house. We observe that MrSteve, which uses Place Event Memory, shows high success rates on all tasks.
Video 3: The red triangle is the agent, and the colored lines mark places stored in memory. Only Place Event Memory (Ours) preserves diverse places and events. (Find Water Task) FIFO Memory removes the memory about the water. (Find Zombies' Death Spot) Place Memory removes the memory about the burning-zombies event. (Find First-Visited House) Event Memory removes the memory about the first-visited house because both houses look similar (i.e., Event Memory considers both houses as the same event).
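These failure modes come down to what each write rule discards. The sketch below contrasts the write operations; same_place and same_event are hypothetical similarity predicates (e.g., a position-distance threshold and a frame-embedding similarity), and Place Memory corresponds to keeping only one anchor frame per place without its events.

def write_fifo(memory, frame, capacity):
    # FIFO Memory: keep only the most recent frames; older observations
    # (e.g., the water pond) are evicted first.
    memory.append(frame)
    if len(memory) > capacity:
        memory.pop(0)

def write_event(memory, frame, same_event):
    # Event Memory: store only visually novel frames; two similar-looking
    # houses collapse into one entry, so "first-visited" is lost.
    if not any(same_event(frame, m) for m in memory):
        memory.append(frame)

def write_place_event(memory, frame, same_place, same_event):
    # Place Event Memory: cluster frames by place, and within each place
    # keep its novel events, preserving what, where, and when.
    cluster = next((c for c in memory if same_place(frame, c["anchor"])), None)
    if cluster is None:
        cluster = {"anchor": frame, "events": []}
        memory.append(cluster)
    if not any(same_event(frame, e) for e in cluster["events"]):
        cluster["events"].append(frame)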
Long-Horizon Sparse Sequential Task Solving with Memory
Figure 6: Topdown View of the Two Long-Horizon Task Maps. Both maps are 200 ⨉ 200 blocks. (Left) This map is used for the Long-Instruction task. Trees are located in the lower left and bottom center of the map, sheep inhabit the upper left area, and cows are in the lower right. A water pond can be found in the upper right area of the map. (Right) This map is used for the Long-Navigation task. Agents traverse between six scenes on the map. Three of them are dynamic scenes: burning zombies, popping sugarcanes, and spawning spiders, positioned at the first, third, and fourth places on the map, respectively. The other three are static scenes: a water pond, trees, and a house, located at the second, fifth, and sixth places on the map, respectively.
Figure 7: (Left) Performance on the Long-Instruction task. The agent is required to complete a series of tasks sequentially. The order of tasks within the sequence is randomized, with each task being one of six possible types: water bucket, beef, wool, wood, dirt, and seeds. (Right) Performance on the Long-Navigation task. Before the task phase, the agent undergoes a 16K-step exploration phase in which it observes six landmarks: zombie burning, water, sugarcane explosion, spider spawn, tree, and house. In the task phase, the agent is given a random start image of one of these landmarks and must navigate to it. MrSteve performs well on both tasks.
MrSteve in Randomly Generated Maps
Figure 8: The performance of different agents in ABA-Random tasks and a sequential task. At the beginning of each episode, the map is randomly generated with the plains biome. For the three ABA-Random tasks, task A is given as the first and last task, and task B is given as the second task. Each agent is given a time limit of 12K steps. For the sequential task, SEQ(4), the agent must solve four consecutive tasks: log, water, wool, and beef, with a time limit of 16K steps. MrSteve consistently outperforms Steve-1 in randomly generated maps.
Navigation Policy Comparison
Table 2: Performance of different Low-Level Navigators. The top-2 performances are bolded. Our model, VPT-Nav with a KL coefficient of 1e-4, is robust to tasks with difficult terrain such as Mountain and River. Also, using a large KL coefficient (e.g., 1e-2) harmed overall performance, while using a small KL coefficient (e.g., 0) harmed the navigator's robustness on complex tasks.
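One way to read the KL coefficient is as the weight of a regularizer that keeps the fine-tuned navigator close to the pretrained VPT prior. The snippet below is a sketch of such a loss under assumed inputs (per-step action logits, a frozen VPT prior's logits, sampled actions, and advantages); the exact training objective and KL direction in the paper may differ.

import torch.nn.functional as F

def vpt_nav_loss(logits, vpt_logits, actions, advantages, kl_coef=1e-4):
    """Sketch of a KL-regularized policy loss for goal-conditioned VPT-Nav.

    The policy-gradient term trains navigation toward the goal, while the KL
    term toward the frozen pretrained VPT prior preserves robust low-level
    behavior on hard terrain. Per Table 2, kl_coef=1e-4 works well: a large
    coefficient (1e-2) over-constrains the policy, and kl_coef=0 loses
    robustness on the Mountain and River tasks.
    """
    logp = F.log_softmax(logits, dim=-1)
    pg = -(advantages * logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)).mean()
    # KL(pi_theta || pi_VPT), computed from log-probabilities of both policies.
    kl = F.kl_div(F.log_softmax(vpt_logits, dim=-1), logp,
                  log_target=True, reduction="batchmean")
    return pg + kl_coef * kl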
Figure 9: First-person View of each environment used for the navigation policy comparison. For the flat and plains maps, the agent must navigate to a random position 5 to 20 blocks away from its initial position within 200 steps (10 seconds). For the mountain map, the agent must navigate to a place beyond the mountain, about 100 blocks away, within 1,000 steps (50 seconds). For the river map, the agent must navigate to a place across the river, about 40 blocks away, within 1,000 steps (50 seconds).
Video 4: Two demo videos show how each navigator handles two difficult navigation tasks: mountain and river. (Mountain Climbing Test) The VPT-Nav and Heuristic navigators climb the mountain quickly, but Plan4MC has trouble climbing it. (River Test) Only VPT-Nav succeeds; the other two navigators drown in the water.
BibTeX
@inproceedings{park2025mrsteve,
title={{M}r{S}teve: Instruction-Following Agents in Minecraft with What-Where-When Memory},
author={Junyeong Park and Junmo Cho and Sungjin Ahn},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=CjXaMI2kUH}
}