Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds

Sipeng Zheng, Jiazheng Liu, Yicheng Feng, Zongqing Lu

BAAI, PKU

Abstract

Recent studies have presented compelling evidence that large language models (LLMs) can equip embodied agents with the self-driven capability to interact with the world, which marks an initial step toward versatile robotics. However, these efforts tend to overlook the visual richness of open worlds, rendering the entire interactive process akin to "a blindfolded text-based game." Consequently, LLM-based agents frequently struggle to intuitively comprehend their surroundings and to produce responses that are easy to understand. In this paper, we propose Steve-Eye, an end-to-end trained large multimodal model that addresses this limitation. Steve-Eye integrates an LLM with a visual encoder to process visual-text inputs and generate multimodal feedback. We adopt a semi-automatic strategy to collect an extensive dataset of 850K open-world instruction pairs, enabling our model to cover three essential functions of an agent: multimodal perception, foundational knowledge base, and skill prediction and planning. Finally, we develop three open-world evaluation benchmarks and carry out experiments from a wide range of perspectives to validate our model's capability to strategically act and plan.

Contribution

Open-World Instruction Dataset: We construct an extensive instruction dataset to train Steve-Eye to acquire the three functions mentioned above. The instruction data contain not only the agent's per-step status and environmental features but also the essential knowledge agents need in order to act and plan.


Large Multimodal Model and Training: Steve-Eye combines a visual encoder, which converts visual inputs into a sequence of embeddings, with a pre-trained LLM, which empowers embodied agents to engage in skill or task reasoning in an open world.


Open-World Benchmarks: We carry out extensive experiments to demonstrate that Steve-Eye outperforms LLM-based agents in open-world setups. Specifically, we develop the following benchmarks to evaluate agent performance from a broad range of perspectives: (1) environmental visual captioning (ENV-VC); (2) foundational knowledge question answering (FK-QA); (3) skill prediction and planning (SPP).


Model Overview

Figure 1: Steve-Eye is a large multimodal model designed to seamlessly process both visual and language inputs. The model excels in acquiring fundamental knowledge of the world it lives in, understanding the nuances of its surroundings, and generating executable plans to complete a wide array of open-ended tasks. Furthermore, Steve-Eye responds to user instructions through either visual or text-based cues, enhancing the convenience and flexibility of human-AI interaction.
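
To make the architecture described above concrete, below is a minimal sketch of how a visual encoder can be coupled to a pre-trained LLM through a learned projection layer (a soft visual prefix prepended to the text tokens). The module names, dimensions, and the HuggingFace-style LLM interface are our assumptions for exposition, not the exact Steve-Eye implementation.

# Minimal sketch (not the official implementation): a frozen visual encoder
# produces patch features, a linear projector maps them into the LLM's
# token-embedding space, and the projected tokens are prepended to the text.
import torch
import torch.nn as nn

class VisionLanguageAgent(nn.Module):
    def __init__(self, visual_encoder, llm, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.visual_encoder = visual_encoder   # e.g., a frozen ViT (assumed)
        self.projector = nn.Linear(vis_dim, llm_dim)
        self.llm = llm                         # pre-trained causal LLM (assumed HF-style)

    def forward(self, image, input_ids, labels=None):
        # 1) Encode the visual observation into a sequence of patch features.
        with torch.no_grad():
            vis_feats = self.visual_encoder(image)            # (B, N, vis_dim)
        # 2) Project the visual features into the LLM's embedding space.
        vis_tokens = self.projector(vis_feats)                # (B, N, llm_dim)
        # 3) Embed the text tokens and prepend the visual tokens as a prefix.
        txt_tokens = self.llm.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([vis_tokens, txt_tokens], dim=1)
        # 4) During instruction tuning, mask the loss on the visual prefix.
        if labels is not None:
            prefix = torch.full((labels.size(0), vis_tokens.size(1)), -100,
                                dtype=labels.dtype, device=labels.device)
            labels = torch.cat([prefix, labels], dim=1)
        return self.llm(inputs_embeds=inputs_embeds, labels=labels)

In this style of interface, the projected visual tokens act as a soft prompt that precedes the text instruction, so the LLM conditions its generated description, answer, or plan on the current observation.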


Open-world Instruction-following Dataset

Figure 2: Multimodal Perception Instructions

Figure 3: Foundational Knowledge Instructions: item icons

Figure 4: Foundational Knowledge Instructions: recipes

Multimodal Perception Instructions: Human players act in Minecraft mainly by relying on their visual perception, without any prior hints or imposed game judgments. To endow Steve-Eye with the same ability, we collect 200K instruction pairs for multimodal perception learning, as shown in Figure 2.
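
For illustration only, a perception instruction pair might look like the following; the field names, paths, and wording are hypothetical and do not reflect the released data schema.

# A hypothetical multimodal perception instruction pair; field names,
# paths, and wording are illustrative assumptions.
perception_example = {
    "image": "frames/episode_0001/step_012.png",   # first-person RGB observation
    "instruction": "Describe what you can see in the current observation.",
    "response": "An oak tree stands directly ahead and a sheep grazes to the left; "
                "the agent is holding a wooden axe in a forest biome at daytime.",
}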


Foundational Knowledge Instructions: Embodied agents require a foundation of essential knowledge to facilitate action-taking and skill planning. In Minecraft, such knowledge includes item recipes, item attributes and their associated numerical values, and so on. We obtain this vital information from the Minecraft Wiki. In total, we collect a high-quality dataset of 250K foundational knowledge instructions, as illustrated in Figures 3 and 4.
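
A recipe-style instruction pair of this kind could be organized as follows; the field names are our own illustration, while the recipe itself is standard Minecraft knowledge.

# A hypothetical foundational-knowledge instruction pair about a crafting
# recipe; field names are illustrative, the recipe is standard Minecraft.
recipe_example = {
    "instruction": "How do I craft an iron pickaxe?",
    "response": "Place 3 iron ingots across the top row of a crafting table "
                "and 2 sticks in the middle column below them.",
}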

Skill-related Interaction Instructions: The environmental description and foundational knowledge serve as prerequisites for an agent's interaction within the open world. However, successful interaction requires more than these elements alone: it relies on mastery of basic skills, such as collecting logs, harvesting, and food preparation, as well as high-level skill planning to tackle complex, long-horizon tasks, such as crafting an iron pickaxe. To facilitate this, we gather 200K instruction pairs for skill prediction and planning.
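
A skill-planning instruction pair might look like the sketch below; the task decomposition follows standard Minecraft progression, while the field names and step granularity are our assumptions rather than the released format.

# A hypothetical skill-planning instruction pair; the decomposition follows
# standard Minecraft progression, but field names and step granularity are
# illustrative assumptions.
planning_example = {
    "task": "Craft an iron pickaxe starting from an empty inventory.",
    "plan": [
        "collect logs from nearby trees",
        "craft planks and sticks",
        "craft a crafting table and a wooden pickaxe",
        "mine cobblestone and craft a stone pickaxe",
        "mine iron ore",
        "craft a furnace and smelt the iron ore into iron ingots",
        "craft the iron pickaxe at the crafting table",
    ],
}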





Experiments on ENV-VC and FK-QA

Table 1: Comparisons of different model settings on the environmental visual captioning (ENV-VC) benchmark.


Table 2: Comparisons on the foundational knowledge question answering (FK-QA) benchmark.



Experiments on Skill Planning

Table 3: Comparisons on the skill planning benchmark. We report the mean success rate across all tasks, where each task is executed for 30 episodes.
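
For clarity, the metric in Table 3 can be read as a per-task success rate averaged over tasks; a plain sketch of that aggregation, assuming boolean episode outcomes, is given below.

def mean_success_rate(results):
    # results maps each task name to a list of boolean outcomes,
    # one per evaluation episode (30 episodes per task in Table 3).
    per_task = [sum(episodes) / len(episodes) for episodes in results.values()]
    return sum(per_task) / len(per_task)

# Example with toy numbers (not real results): two tasks, three episodes each.
print(mean_success_rate({"craft_planks": [True, True, False],
                         "mine_iron_ore": [False, True, False]}))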


Snapshots of a Planning Example

Citation

@article{zheng2023steve,
  title={Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds},
  author={Zheng, Sipeng and Liu, Jiazheng and Feng, Yicheng and Lu, Zongqing},
  journal={arXiv preprint arXiv:2310.13255},
  year={2023}
}