Foundation agents are one of the most promising paths towards Artificial General Intelligence. Recent studies have demonstrated their success in specific tasks and scenarios. However, existing foundation agents cannot generalize across different scenarios, mainly due to diverse observation and action spaces and the semantic gaps between them. In this work, we propose the General Computer Control (GCC) setting: building foundation agents that can master any computer task by taking the computer's screen (and possibly audio) as input and producing keyboard and mouse operations as output, just as in human-computer interaction. To target GCC, we propose CRADLE, which has strong reasoning abilities, including self-reflection, task inference, and skill curation, to ensure its generalizability and self-improvement across various computer tasks. To demonstrate the capabilities of CRADLE, we deploy it in the famous AAA game Red Dead Redemption II as a preliminary attempt towards GCC. Our agent can follow the main storyline and finish real missions in this complex AAA game, with minimal reliance on prior knowledge.
The CRADLE framework empowers nascent foundation models to perform complex computer tasks via the same general interface humans use: screen as input and keyboard & mouse operations as output.
In this work, we introduce CRADLE, a novel framework targeting GCC, and make a preliminary attempt towards it. Unlike previous methods, which focus on controlling web browsers or software with easy access to internal APIs and states, CRADLE takes only the screen as input and outputs keyboard and mouse operations through a comprehensive reasoning process, enabling it to effectively understand the current situation and make reasonable decisions. Furthermore, we deploy CRADLE in the highly acclaimed AAA game Red Dead Redemption II (RDR2). We select RDR2 for our case study due to its complex black-box control system, which epitomizes the most demanding computer tasks and enables us to evaluate the performance boundaries of our framework in such virtual environments. RDR2 is characterized by rich and diverse information, encompassing elements such as dialogues, unique icons, in-game prompts, and instructions, and thus requires capturing and interpreting varied forms of information. Additionally, the game demands a broader range of keyboard and mouse interactions than typical software, such as using the mouse for navigation and combining mouse buttons with keyboard keys to realize actions in the game world (which necessitates not only precise key selection but also determining how long keys must be held). We argue that by demonstrating the feasibility of playing this complex game, CRADLE sheds light on its potential for GCC.
Our major contributions are summarized as follows:
• We propose the novel setting of General Computer Control (GCC), serving as a milestone towards AGI in the digital world, where agents take multimodal observations as inputs and output keyboard and mouse operations, similar to human-computer interactions.
• We propose a novel foundation agent framework (CRADLE) for the GCC setting, which has strong reasoning abilities, including self-reflection, task inference, and skill curation, to ensure its generalizability and self-improvement across various computer tasks.
• To demonstrate the capabilities of CRADLE, we incorporate the powerful Large Multimodal Model (LMM) GPT-4V into our framework and deploy it in the famous AAA game RDR2, serving as a preliminary attempt towards GCC. To the best of our knowledge, our work is the first to enable LMM-based agents to follow the main storyline and finish real missions in complex AAA games, with minimal reliance on prior knowledge.
An overview of the CRADLE framework. CRADLE takes video of the computer screen as input and outputs keyboard and mouse controls determined through its inner reasoning process.
General Computer Control (GCC) is defined as a setting where the agent is designed to control a computer through human-like interactions. It takes video of the screen, text (e.g., from the command line or shown on the screen), and audio (e.g., instructions, sound effects, and music) from the computer as input and performs basic keyboard and mouse operations as output. These operations include pressing, holding, and releasing keyboard keys, moving the mouse, and clicking mouse buttons. Moreover, combining keys, mouse movement, clicks, and timing introduces additional complexity to controlling a computer.
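For concreteness, these primitives can be sketched with an off-the-shelf automation library; the choice of pyautogui here is an illustrative assumption, not part of the GCC definition:

```python
import time
import pyautogui  # assumed automation backend; GCC itself does not prescribe one

pyautogui.keyDown('w')                     # hold a keyboard key down...
time.sleep(1.5)                            # ...for a chosen duration (timing is part of the action space)
pyautogui.keyUp('w')                       # release the key
pyautogui.press('e')                       # a single press (down + up)
pyautogui.moveTo(960, 540, duration=0.3)   # move the mouse to screen coordinates
pyautogui.click(button='left')             # click a mouse button
```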
The uniqueness of GCC lies in its generality to interact with any software via the universal observation/action spaces, irrespective of access to its source code, automation tools, or inner API availability. Thus, one of the main challenges in GCC is to enable agents to understand and adapt to different software environments and interfaces, which often have varying layouts, functionalities, and icons. This requires the development of sophisticated perception, allowing agents to interpret visual and auditory cues accurately.
To master various computer tasks, agents need to have the ability to explore unseen environments in a structured manner to discover new strategies and solutions autonomously. Agents must remember past actions and their outcomes, understand complex sequences of tasks and their effects, and apply this knowledge to solve future problems. Agents also need to continuously improve over time through the acquisition of new knowledge and skills.
The GCC setting can serve as a standardized testbed for evaluating the generalization capabilities of foundation agents across diverse environments. Moreover, GCC can potentially revolutionize how AI is applied in various fields. For instance, it can aid in automating tasks across different software platforms, assisting users with disabilities, and enhancing human-computer interaction research. Finally, achieving GCC can be seen as a milestone of AGI in the digital world.
ENVIRONMENT IO
Information Gathering. To capture all information relevant to understanding the recent situation and to support further reasoning, CRADLE takes as input a video recording of everything since the last executed action and must properly extract both textual and visual information from it. Textual information usually includes content (headings, paragraphs), navigation labels (menus, links), notifications, and instructions that convey messages and guide users; extracting it typically relies on the OCR ability of the LMM. Visual information, on the other hand, includes layout, imagery (the visual contents of the screenshot itself, icons), animations, and UI elements that enhance user experience and interface design, which places high demands on the spatial perception and visual understanding of the LMM.
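As a minimal sketch of this gathering step (the sampling interval and the use of OpenCV are our assumptions; the actual OCR and visual parsing are delegated to the LMM):

```python
import cv2  # assumed: OpenCV for decoding the screen recording

def extract_key_frames(video_path: str, interval_s: float = 0.5) -> list:
    """Sample frames from the recording since the last executed action.

    The sampled frames are sent to the LMM, whose OCR and visual
    understanding extract the textual and visual information described
    above; the fixed sampling interval is an illustrative choice.
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, int(fps * interval_s))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```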
Action Execution. Under the GCC setting, the only way to interact with the environment is through keyboard and mouse operations. To bridge the gap between actions output by the framework and operating-system-level executable actions, CRADLE uses the LMM to generate code as semantic-level skills that encapsulate lower-level keyboard and mouse control, e.g., key_hold('E', duration) or move_mouse(x, y). We then let the LMM instantiate these skill functions into executable form by specifying any necessary parametric aspects (e.g., duration, position, speed). An Executor is then triggered to map these semantic actions to the final OS-level keyboard and mouse commands that interact with the environment.
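A minimal sketch of this pipeline, assuming pyautogui as the low-level backend (the paper leaves the backend unspecified), might look as follows:

```python
import time
import pyautogui  # assumed low-level backend; the paper leaves this unspecified

# Semantic-level skills encapsulating lower-level control, as named above.
def key_hold(key: str, duration: float) -> None:
    """Hold `key` down for `duration` seconds, then release it."""
    pyautogui.keyDown(key)
    time.sleep(duration)
    pyautogui.keyUp(key)

def move_mouse(x: int, y: int) -> None:
    """Move the mouse pointer to absolute screen coordinates (x, y)."""
    pyautogui.moveTo(x, y)

# A minimal Executor: the LMM instantiates skills with concrete parameters,
# and the Executor resolves skill names to callables and runs them in order.
SKILLS = {"key_hold": key_hold, "move_mouse": move_mouse}

def execute(actions: list[tuple[str, dict]]) -> None:
    for name, params in actions:
        SKILLS[name](**params)

execute([("move_mouse", {"x": 960, "y": 540}),
         ("key_hold", {"key": "e", "duration": 1.0})])
```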
REASONING
Based on the information extracted from the video observations and relevant information retrieved from its memory, CRADLE must reason over incomplete information and semantic gaps before making its next decision. This process is analogous to "reflect on the past, summarize the present, and plan for the future", and is broken down into the following modules.
Self-Reflection. The reflection module first evaluates whether the last executed action was carried out successfully and whether the task was completed. Sequential key screenshots from the last video observation, along with the previous context for action planning and task inference, are fed to the LMM for reasoning. Additionally, we request that the LMM provide an analysis of any failure. This valuable information enables CRADLE to remedy inappropriate decisions or less-than-ideal actions. Furthermore, reflection can also inform re-planning to bring the agent closer to target task completion, help it better understand the factors that led to previous successes, or suggest how to update and improve specific skills.
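Schematically, the reflection step amounts to a single multimodal query; the prompt wording below and the call_lmm wrapper are illustrative, not the paper's exact prompt:

```python
# Schematic self-reflection query. `call_lmm` is a hypothetical wrapper
# around a multimodal model API (e.g., GPT-4V); the prompt is illustrative.
def self_reflect(key_frames, last_action, last_task, call_lmm):
    prompt = (
        "You are reflecting on the last executed action.\n"
        f"Task: {last_task}\nLast action: {last_action}\n"
        "From the attached screenshots, answer:\n"
        "1. Was the action carried out successfully?\n"
        "2. Is the task now complete?\n"
        "3. If the action failed, analyze why and suggest a remedy."
    )
    return call_lmm(prompt, images=key_frames)
```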
Task Inference. After reflecting on the outcome of the last step, CRADLE needs to analyze the current situation to infer the most suitable task for the current moment. We let the LMM estimate the highest priority task to perform and when to stop an ongoing task and start a new one, prompting it with the key screenshots, the long-term summary of past experiences, and the latest reflection results.
Skill Curation. Once a new task is selected, CRADLE needs to prepare the tactics to accomplish it by retrieving useful skills from procedural memory, updating skills, or generating new ones. Skills in CRADLE are represented as code functions, a form that is flexible, interpretable, and easy for LMMs to understand and maintain. Atomic skills usually consist of simple calls to keyboard and mouse control, e.g., key_press() to press a given key, and can be extended and rewritten into more complex composite skills.
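For illustration, an atomic skill and a composite skill built on top of it might look as follows (the key binding and the pyautogui backend are assumptions):

```python
import pyautogui  # assumed backend; the framework's own IO API is abstracted away

# An atomic skill: a simple call to keyboard control.
def key_press(key: str) -> None:
    """Press and release a single key."""
    pyautogui.press(key)

# A composite skill extending the atomic one (the binding is illustrative).
def open_satchel() -> None:
    """Open the satchel by tapping its hotkey."""
    key_press("b")
```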
Action Planning. To complete a given task, the LMM needs to select appropriate skills from the curated skill set and instantiate them into a sequence of executable actions by specifying any necessary parametric aspects (e.g., duration, position, and target) according to the inferred task, the last action, and the long-term summary. The action sequence is then fed to the Executor for interaction with the environment. It is important to note that in complex digital games, the effective action space comprises not only the key/mouse function calls themselves, but also timing, cross-action interactions, and the semantic mapping of on-screen terms to action-specific details, among other factors. Moreover, performing actions in the environment is non-trivial because the mapping from code execution to its effects is not always explicit: an action can execute without any error from the code or the game and still be incorrect or ineffective.
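The output of action planning can be pictured as a parameter-bound skill sequence; the skill names and values below are hypothetical, and execution reuses the Executor sketched under Action Execution:

```python
# Hypothetical plan: skills chosen from the curated set with bound parameters.
plan = [
    ("move_mouse", {"x": 960, "y": 500}),           # face the target
    ("key_hold",   {"key": "w", "duration": 2.0}),  # walk toward it
    ("key_hold",   {"key": "e", "duration": 0.5}),  # interact with it
]
execute(plan)  # the Executor sketched under Action Execution above
```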
MEMORY
CRADLE stores and maintains all the useful information provided by the environment and the LMM’s output through a memory module, consisting of episodic memory and procedural memory.
Episodic Memory. Episodic memory is used to maintain current and past experiences, including key screenshots from each video observation and all LMM outputs, e.g., textual and visual information, actions, tasks, and reasoning from each module. To facilitate retrieval and storage, periodic summarization abstracts recently added multimodal information into long-term summaries. Episodic memory enables CRADLE to effectively retain crucial information over extended periods, thereby enhancing its decision-making capabilities.
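A minimal sketch of such a memory with periodic summarization (the interval and the call_lmm summarizer are illustrative assumptions):

```python
class EpisodicMemory:
    """Per-step experience store with periodic summarization. The interval
    and the `call_lmm` summarizer are illustrative assumptions."""

    def __init__(self, call_lmm, summarize_every: int = 10):
        self.entries = []            # screenshots, actions, tasks, reasoning
        self.long_term_summary = ""
        self.call_lmm = call_lmm
        self.summarize_every = summarize_every

    def add(self, entry: dict) -> None:
        self.entries.append(entry)
        if len(self.entries) % self.summarize_every == 0:
            recent = self.entries[-self.summarize_every:]
            self.long_term_summary = self.call_lmm(
                "Merge the previous summary and the recent events into an "
                "updated long-term summary.\n"
                f"Previous: {self.long_term_summary}\nRecent: {recent}")
```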
Procedural Memory. This memory is dedicated to storing and retrieving skills in code form, which can be learned from scratch or pre-defined. During skill curation, skills can be added, updated, or composed in this memory. The skills most relevant to a given task and situation are retrieved to support action planning; therefore, as CRADLE continuously acquires new skills through interaction, it is critical that this memory can effectively compute skill relevance.
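One natural way to compute skill relevance, consistent with the comment-based retrieval described later for skill curation, is embedding similarity over each skill's generated comment; the embed function below is a hypothetical text-embedding wrapper:

```python
import numpy as np

class ProceduralMemory:
    """Skill store with embedding-based relevance. `embed` is a hypothetical
    text-embedding function returning a vector; relevance is computed against
    each skill's generated comment, matching the retrieval described later."""

    def __init__(self, embed):
        self.embed = embed
        self.skills = {}  # name -> (code string, embedding of its comment)

    def add(self, name: str, code: str, comment: str) -> None:
        self.skills[name] = (code, self.embed(comment))

    def retrieve(self, task: str, top_k: int = 5) -> list:
        """Return the names of the top-k skills most relevant to `task`."""
        q = self.embed(task)
        scored = [
            (float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e))), n)
            for n, (_, e) in self.skills.items()
        ]
        return [n for _, n in sorted(scored, reverse=True)[:top_k]]
```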
The detailed illustration of how CRADLE is instantiated as a game agent to play RDR2.
We deploy CRADLE as a game-playing agent in the renowned AAA game RDR2, showcasing our framework for GCC with a complex target. To the best of our knowledge, this is the first work to explore such rich games in a general challenge setting, without access to any internal game state or API (i.e., the agent has to interact with the game in a human-like manner). RDR2 is a typical 3D RPG-like game with classical keyboard and mouse controls and movie-like realistic graphics. The game’s progression unveils a mix of story content, dialogues and instructions, and helpful tips for interactive game mechanisms, which are leveraged by our agent to independently expand its skill library from scratch. Demonstrating our agent’s capability to navigate the world and complete tasks following the main storyline in RDR2 underscores the significant potential of our framework in advancing towards GCC.
Objective. To emulate a new human player learning to play the game, including in-game tutorials and hints, we mainly focus on the first half of Chapter I. This section of the game consists of two concrete tasks, Explore Shelter and Rescue John, with a typical gameplay duration of around 40 minutes for a human player. It is an exceptionally long-horizon task, and little prior work has attempted a challenge of this kind.
Observations and Action Space. Strictly following the GCC setting, our agent takes video of the screen as input and outputs keyboard and mouse operations to play the game. To lower the frequency of interaction with the backbone model, the video recorder takes a game screenshot every 0.5 seconds, which proves sufficient for information gathering without missing any important information. For the action space, we categorize keyboard and mouse actions into four key categories: press(key, duration), hold(key, duration), release(key), and pointer_move(x, y), which can be combined in different ways to form combos, use keys in fast sequence, or coordinate timings. The agent must generate skill code that uses these functions and affordances so that executed actions take effect.
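As an illustration of how the primitives combine into a combo, consider a weapon wheel: a key must stay held while the mouse moves. We assume here that hold() can start an open-ended, non-blocking hold (omitting the duration argument) and that pyautogui backs the primitives; the binding and coordinates are illustrative:

```python
import pyautogui  # assumed backend for the primitives above

def hold(key: str) -> None:
    pyautogui.keyDown(key)      # begin an open-ended hold (non-blocking)

def release(key: str) -> None:
    pyautogui.keyUp(key)        # end the hold

def pointer_move(x: int, y: int) -> None:
    pyautogui.moveTo(x, y)      # move the pointer to screen coordinates

def select_weapon_from_wheel(x: int, y: int) -> None:
    """Combo: keep the wheel key held while the mouse selects a slot."""
    hold("tab")                 # the weapon wheel stays open while held
    pointer_move(x, y)          # point at the desired weapon slot
    release("tab")              # releasing the key equips the selection
```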
Skill code generation based on in-game instructions. As the storyline progresses, the game will continually provide prompts on how to use a new skill via keystrokes or utilizing the mouse.
Skill Curation. For skill curation, we first provide GPT-4V with examples of general mouse and keyboard control APIs, e.g., io_env.key_press and io_env.mouse_click. Figure 5 shows that GPT-4V can capture and understand the prompts appearing in screenshots, i.e., icons and text, and strictly follow the provided skill examples using our IO API to generate correct skill code. Moreover, GPT-4V also generates comments in the code that describe the functionality of each skill, which are essential for computing similarity and relevance with a given task during skill retrieval. The quality of the generated comments directly determines the results of skill retrieval, and in turn impacts reasoning during action planning. Curation can also re-generate code for a given skill, which is useful if GPT-4V wrongly recognized a key or mouse button in a previous iteration.
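The following is illustrative of the kind of skill code generated from an on-screen instruction (this exact prompt and output are hypothetical, not a transcript); note the docstring comment, which is what skill retrieval later embeds and matches against tasks:

```python
# Hypothetical example of generated skill code. io_env is the IO control API
# provided in our examples; the instruction, binding, and docstring are
# illustrative of GPT-4V's output rather than an actual generation.
def view_map():
    """Open the map to survey the surroundings, following the on-screen
    instruction to press the map key."""
    io_env.key_press("m")
```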
Self-Reflection. Self-reflection is an essential component of CRADLE, as it allows our framework to correct previous mistakes and address ineffective actions taken in-game. Figure 6 provides an example of the self-reflection module. The task requires the agent to select a weapon to equip, in the context of the "Protect Dutch" task. Initially, the agent selects a knife as its weapon by chance; since the game requires a gun, this is incorrect, and the game continues to prompt the player to re-open the weapon wheel. The self-reflection module determines that the previous action was incorrect, and in a subsequent iteration the agent successfully opts for the gun, correctly fulfilling the task requirement and advancing to the next stage of the story.
Case study of self-reflection on re-trying a failed task. Task instruction and context require the agent to equip the gun. A wrong weapon (knife) is first selected, but the agent equips the gun after self-reflection. Only relevant modules are shown for better readability, though all modules are executed per iteration.
Quantitative Evaluation. To illustrate the effectiveness and importance of the different modules in CRADLE for its overall performance, we evaluate the framework on six representative sub-tasks from the first chapter of the main storyline, comparing against two ablation-like baselines: CRADLE without Self-Reflection and CRADLE without Task Inference. Except for the Protect Dutch sub-task, which involves a fast-paced gun battle, and Search House, which requires the agent to explore a complex indoor environment, CRADLE completes all sub-tasks consistently. Moreover, even on these two complex tasks, CRADLE achieves a significantly higher success rate than when either Self-Reflection or Task Inference is disabled. Without task inference, the agent can still leverage the extracted instructional text and the long-term summary to infer both task and goal during action planning, which leads to some successful completions. Over time, however, the instructional text disappears from the screen and the long-term summary becomes diluted as relevant information is gradually displaced by newly added entries, which explains why this ablation fails on the Go to shed, Protect Dutch, and Lead horse sub-tasks while still achieving a reasonable success rate on Equip shotgun, a shorter-horizon task whose instructional text stays at the bottom of the screen. On the other hand, without self-reflection, CRADLE struggles with movement, especially when the agent is blocked by obstacles, which is difficult for action planning to notice and address. Without an independent critic to reflect on the current situation, GPT-4V tends to trust its previous reasoning and believe that its actions were carried out successfully. As Search House is a complex indoor task with various pieces of furniture as obstacles, CRADLE without Self-Reflection exhibits extremely poor performance on it, as expected.
CRADLE performance on six representative sub-tasks in RDR2. Every sub-task is tested five times with a maximum of ten minutes in-game time.
In this work, we introduce GCC, a general and challenging setting aimed at paving the way towards more general foundation agents across computer tasks. We also propose a novel framework, CRADLE, that enables LMM-based agents to operate in this impactful setting, and we showcase its effectiveness in the famous AAA game RDR2. CRADLE exhibits strong performance in learning skills, following the storyline, and finishing real missions in the game, serving as pioneering work towards more powerful LMM-based general agents for computer control tasks.