Official Website for
"Atari-GPT: Investigating the Capabilities of Multimodal Large Language Models as Low-Level Policies for Atari Games"
Abstract
Recent advancements in large language models (LLMs) have expanded their capabilities beyond traditional text-based tasks to multimodal domains, integrating visual, auditory, and textual data. While multimodal LLMs have been extensively explored for high-level planning in domains like robotics and games, their potential as low-level controllers remains largely untapped. This paper explores the application of multimodal LLMs as low-level controllers in the domain of Atari video games, introducing Atari game performance as a new benchmark for evaluating the ability of multimodal LLMs to perform low-level control tasks. Unlike traditional reinforcement learning (RL) and imitation learning (IL) methods that require extensive computational resources as well as reward function specification, these LLMs utilize pre-existing multimodal knowledge to directly engage with game environments. Our study assesses the performance of multiple multimodal LLMs against traditional RL agents, human players, and random agents, focusing on their ability to understand and interact with complex visual scenes and formulate strategic responses. Additionally, we examine the impact of In-Context Learning (ICL) by incorporating human-demonstrated gameplay trajectories to enhance the models' contextual understanding. Through this investigation, we aim to determine the extent to which multimodal LLMs can leverage their extensive training to effectively function as low-level controllers, thereby redefining potential applications in dynamic and visually complex environments.
Figure 1: Atari-GPT system diagram, illustrating the integration of a multimodal large language model (LLM) as a low-level agent within the Atari gaming environment. It highlights the flow of inputs from the game to the LLM and back, demonstrating how the model processes game observations and generates corresponding actions. Additionally, the diagram includes the framework for human evaluation, which assesses the LLM’s capabilities in visual understanding, spatial reasoning, strategic intuition, and environment recognition through a structured Q&A process.
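To make the loop in Figure 1 concrete, here is a minimal sketch (our own illustration, not the released code) of an LLM acting as a low-level controller in a Gymnasium ALE environment. The `query_llm` helper is a placeholder for the multimodal LLM call and simply acts randomly here so the snippet runs end to end.

```python
import random
import gymnasium as gym  # requires: pip install "gymnasium[atari]"

def query_llm(frame, action_meanings):
    """Placeholder for the multimodal LLM call: encode `frame` as an image,
    prompt the model with the legal actions, and parse its reply into an
    action index. Random here so the sketch runs without an API key."""
    return random.randrange(len(action_meanings))

env = gym.make("ALE/Breakout-v5", render_mode="rgb_array")
obs, info = env.reset(seed=0)
total_reward = 0.0

for _ in range(1000):                      # 1000-step roll-out, as in the paper
    frame = env.render()                   # RGB frame shown to the LLM
    action = query_llm(frame, env.unwrapped.get_action_meanings())
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:
        obs, info = env.reset()

env.close()
print(f"Cumulative reward over 1000 steps: {total_reward}")
```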
Understanding and Reasoning
Figure 2: Images used in Understanding tasks
To test the LLM’s understanding and reasoning capabilities, we conduct experiments on eight environments, as shown
in Figure 2. We created a set of prompts to investigate the models’ visual reasoning, spatial reasoning, strategic intuition, and ability to identify the environment (an example query sketch follows the list of prompts):
Visual Understanding: Identify all the key elements in this image. Be specific. Use at most 100 words.
Spatial Reasoning: Where are the key elements located relative to each other? Be specific with respect to their position in the image. Use at most 100 words.
Strategy: The given image is a screenshot of a game. Describe the ideal next move if you were playing this game. Be specific. Use at most 100 words.
Identification: Knowing that the image came from an Atari game, identify its name. Be specific.
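As an example of how these prompts might be issued, below is a minimal sketch using the OpenAI Python SDK; the model name gpt-4o, the frame path, and the ask() helper are illustrative assumptions rather than the authors' evaluation code.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPTS = {
    "visual": "Identify all the key elements in this image. Be specific. Use at most 100 words.",
    "spatial": "Where are the key elements located relative to each other? Be specific with respect to their position in the image. Use at most 100 words.",
    "strategy": "The given image is a screenshot of a game. Describe the ideal next move if you were playing this game. Be specific. Use at most 100 words.",
    "identification": "Knowing that the image came from an Atari game, identify its name. Be specific.",
}

def ask(image_path: str, task: str, model: str = "gpt-4o") -> str:
    """Send one game frame plus one of the four prompts to a multimodal LLM."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPTS[task]},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Hypothetical frame path, for illustration only.
print(ask("frames/breakout.png", "spatial"))
```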
Figure 3 shows the percentage of correct outputs for each of the four tasks: visual, spatial, acceptable strategy, and identification, across two runs for each model. GPT-4o consistently performs well across all tasks, with high accuracy in the visual, acceptable strategy, and identification tasks, but shows a noticeable drop in spatial reasoning accuracy. The spatial reasoning task appears to be the most challenging across all models, indicating a common area for improvement for multimodal large language models.
Figure 3: Visual, spatial, strategic, and identification results; average percent correct over two runs.
Gameplay
We collected four roll-outs for each model in each environment and recorded the cumulative reward over 1000 steps. We then normalized each model's performance relative to human performance and calculated the mean performance across the seven Atari environments, both with and without In-Context Learning (ICL), as shown in Figure 4. On average, each LLM achieved between 10% and 25% of human performance. Notably, the two GPT-4 models significantly outperformed Gemini 1.5 Flash, with GPT-4o demonstrating the highest overall Atari game-playing performance. Additionally, the inclusion of demonstration examples for in-context learning had little to no impact on the average game-playing performance of these LLMs.
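For concreteness, the aggregation described above might look like the following sketch. It assumes a plain score-to-human-score ratio as the normalization; the exact formula and all numbers here are illustrative placeholders, not results from the paper.

```python
from statistics import mean

def human_normalized_score(llm_scores: dict, human_scores: dict) -> float:
    """Mean over environments of (LLM cumulative reward / human cumulative reward)."""
    return mean(llm_scores[env] / human_scores[env] for env in llm_scores)

# Illustrative placeholder values: cumulative reward over 1000 steps,
# averaged over four roll-outs per environment.
human = {"Breakout": 30.0, "MsPacman": 800.0, "Seaquest": 400.0}
llm   = {"Breakout": 4.0,  "MsPacman": 150.0, "Seaquest": 60.0}

print(f"Human-normalized performance: {human_normalized_score(llm, human):.2%}")
```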
All videos of gameplay can be found on the Gameplay Performance Videos page.
Figure 4: Human-normalized reward for GPT-4V Turbo, GPT-4o, and Gemini 1.5 Flash, with and without In-Context Learning.