Dogyu Ko*, Chanyoung Yeo*, Daeho Kim, Jaeho Kim, and Hyoseok Hwang
{kodogyu, ducksdud08, kdh2769, leokim51, hyoseok}@khu.ac.kr
IEEE Robotics and Automation Letters (RA-L) 2025
Abstract
Enabling robots to interact effectively with the real world requires extensive learning from physical interaction data, making simulation crucial for generating such data safely and cost-effectively. Despite the advantages of simulation, manual environment creation remains a laborious process, motivating the development of automated generation approaches. However, current automatic virtual scene generation approaches struggle to bridge the sim-to-real gap and to achieve task readiness, highlighting the need for automatically generated, realistic, and task-ready virtual scenes. In this paper, we propose GAIA, a novel methodology that automatically generates interactive, task-ready simulation environments grounded in real contexts from only a single RGB image and a task instruction. GAIA utilizes a pre-trained Vision-Language Model (VLM) without requiring explicit training, jointly understanding the visual context and the user's instruction. Based on this understanding, it infers and places the necessary task-aware objects, including unseen ones, to construct an interactive virtual environment that maintains real-scene fidelity and reflects task requirements without additional manual setup. We present qualitative experiments showing that GAIA generates scenes consistent with user instructions, and quantitative results showing that policies learned within GAIA-generated environments successfully transfer to target environments.
Overview
Overview of the GAIA framework. GAIA uses a VLM to interpret spatial context from an image and semantic intent from a task instruction. Based on this understanding, it automatically retrieves the necessary 3D assets to build an interactive simulation ready for embodied AI.
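As a rough, non-authoritative sketch of this flow, the Python snippet below assumes hypothetical interfaces (`vlm.infer_objects`, `asset_db.retrieve`, `sim.new_scene`) rather than GAIA's actual implementation; it only mirrors the described steps: interpret the image and instruction with a VLM, retrieve 3D assets for the inferred objects, and place them in an interactive scene.

```python
# Minimal sketch of a GAIA-style generation loop.
# `vlm`, `asset_db`, and `sim` are hypothetical placeholders, not the
# paper's implementation or any specific library API.
from dataclasses import dataclass

@dataclass
class ObjectSpec:
    name: str            # e.g. "water bottle"
    pose: tuple          # (x, y, z, yaw) proposed by the VLM
    task_required: bool  # whether the instruction needs this object

def generate_scene(rgb_image, task_instruction, vlm, asset_db, sim):
    """Build an interactive, task-ready scene from one RGB image and an instruction."""
    # 1. Jointly interpret spatial context (image) and semantic intent (instruction).
    specs = vlm.infer_objects(rgb_image, task_instruction)  # -> list of ObjectSpec

    # 2. Retrieve a 3D asset for each inferred object, including unseen but
    #    task-required ones, and place it at the proposed pose.
    scene = sim.new_scene()
    for spec in specs:
        asset = asset_db.retrieve(spec.name)
        scene.add_object(asset, pose=spec.pose)
    return scene
```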
Results of Virtual-Scene Generation
We show qualitative results of the task-ready virtual scenes generated by GAIA from input images and various task instructions.
GAIA
Heuristic baseline
The Heuristic baseline was manually constructed by a human operator who, given a target image and task instruction, built a similar scene where the task was executable. To ensure a fair comparison, the operator was restricted to the same 3D asset dataset used by GAIA.
Given Scenes and Tasks
To evaluate the model’s ability to generate feasible simulation scenes, we curated a set of 20 tasks for our Text-to-Image evaluation. These tasks are derived from ManiSkill-HAB, a comprehensive benchmark designed for low-level manipulation and object rearrangement in realistic home environments. ManiSkill-HAB deconstructs abstract, long-horizon objectives (e.g., TidyHouse, SetTable) into clear, actionable subtasks. Our task set leverages core skills from this benchmark, such as ‘Pick’ and ‘Place’, and additionally a ‘Move’ skill to assess the model’s ability to generate scenes from more general instructions.
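As a purely illustrative example of how such a skill-based task set can be organized (the entries below are stand-ins, not the paper's 20 curated tasks):

```python
# Illustrative task-set structure; entries are examples, not the actual
# 20 tasks curated from ManiSkill-HAB.
TASKS = [
    {"skill": "Pick",  "instruction": "Pick up the bowl on the counter"},
    {"skill": "Place", "instruction": "Place the apple in the bowl"},
    {"skill": "Move",  "instruction": "Move the kettle next to the sink"},
]
```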
Scenes generated by GAIA
Scenes generated by RoboGen
We evaluated the task success rate of the learned policies over 500 trials per task in the test space.
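Concretely, a success-rate evaluation of this kind reduces to the rollout loop sketched below; the `policy` and `env` interfaces are assumptions for illustration, not a specific simulator API.

```python
# Generic success-rate evaluation: N rollouts, count task successes.
# The env.step() return signature here is an assumption.
def evaluate_success_rate(policy, env, num_trials=500, max_steps=200):
    successes = 0
    for _ in range(num_trials):
        obs = env.reset()
        for _ in range(max_steps):
            obs, done, success = env.step(policy(obs))
            if done:
                successes += int(success)
                break
    return successes / num_trials
```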
Real-world experiment setup
Success rate of the transfer learning on the task "Pick up the water bottle on the desk"
To evaluate the effectiveness of sim-to-real transfer, we compared two distinct policies: one trained in virtual environments generated by RoboGen and the other trained in scenes generated by GAIA. Both policies, designed for the task "Pick up the water bottle on the desk", were then deployed on a Franka Research 3 manipulator equipped with a 2-finger gripper.
Input real-world scene
GAIA-generated virtual scene
Real-world experiment setup
To evaluate the transferability of the GAIA framework to a long-horizon task, we extended our evaluation to "Bring me the pencil case from the cabinet", which involves opening a cabinet and then taking out the pencil case. Following the real-to-sim-to-real pipeline, we trained a policy in a GAIA-generated virtual scene. This policy, which executed the opening and retrieving actions sequentially, achieved a 40% success rate over 10 trials.
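A sequential execution of the two sub-policies can be sketched as follows; the sub-policy and environment interfaces are assumed for illustration and do not reflect the paper's code.

```python
# Sketch of chaining two sub-policies for the long-horizon task
# "Bring me the pencil case from the cabinet".
# `open_policy`, `retrieve_policy`, and `env` are assumed interfaces.
def run_long_horizon(env, open_policy, retrieve_policy, max_steps=200):
    obs = env.reset()
    for policy in (open_policy, retrieve_policy):  # open cabinet, then retrieve
        success = False
        for _ in range(max_steps):
            obs, done, success = env.step(policy(obs))
            if done:
                break
        if not success:  # stop early if a sub-task fails
            return False
    return True
```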
Please use the following BibTeX entry to cite this work:
@article{koandyeo2025GAIA,
title={GAIA: Generating Task Instruction Aware Simulation Grounded in Real Contexts using Vision-Language Models},
author={Ko, Dogyu and Yeo, Chanyoung and Kim, Daeho and Kim, Jaeho and Hwang, Hyoseok},
journal={IEEE Robotics and Automation Letters},
year={2025},
publisher={IEEE}
}