Dogyu Ko*, Chanyoung Yeo*, Daeho Kim, Jaeho Kim, and Hyoseok Hwang
{kodogyu, ducksdud08, kdh2769, leokim51, hyoseok}@khu.ac.kr
IEEE Robotics and Automation Letters (RA-L) 2025
Abstract
Enabling robots to interact effectively with the real world requires extensive learning from physical interaction data, making simulation crucial for generating such data safely and cost-effectively. Despite the advantages of simulation, manual environment creation remains a laborious process, motivating the development of automated generation approaches. However, current automatic virtual scene generation approaches struggle to bridge the sim-to-real gap and to achieve task readiness, highlighting the need for automatically generated, realistic, and task-ready virtual scenes. In this paper, we propose GAIA, a novel methodology that automatically generates interactive, task-ready simulation environments grounded in real contexts from only a single RGB image and a task instruction. GAIA utilizes a pre-trained Vision-Language Model (VLM) without requiring explicit training, jointly understanding the visual context and the user's instruction. Based on this understanding, it infers and places the necessary task-aware objects, including unseen ones, to construct an interactive virtual environment that maintains real-scene fidelity and reflects task requirements without additional manual setup. We present qualitative experiments showing that GAIA generates scenes consistent with user instructions, and quantitative results showing that policies learned within GAIA-generated environments successfully transfer to target environments.
Overview
Overview of the GAIA framework. GAIA uses a VLM to interpret spatial context from an image and semantic intent from a task instruction. Based on this understanding, it automatically retrieves the necessary 3D assets to build an interactive simulation ready for embodied AI.
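As a rough, non-authoritative sketch of this flow, the Python snippet below assumes hypothetical interfaces (`vlm.infer_objects`, `asset_db.retrieve`, `sim.new_scene`) rather than GAIA's actual implementation; it only mirrors the described steps: interpret the image and instruction with a VLM, retrieve 3D assets for the inferred objects, and place them in an interactive scene.

```python
# Minimal sketch of a GAIA-style generation loop.
# `vlm`, `asset_db`, and `sim` are hypothetical placeholders, not the
# paper's implementation or any specific library API.
from dataclasses import dataclass

@dataclass
class ObjectSpec:
    name: str            # e.g. "water bottle"
    pose: tuple          # (x, y, z, yaw) proposed by the VLM
    task_required: bool  # whether the instruction needs this object

def generate_scene(rgb_image, task_instruction, vlm, asset_db, sim):
    """Build an interactive, task-ready scene from one RGB image and an instruction."""
    # 1. Jointly interpret spatial context (image) and semantic intent (instruction).
    specs = vlm.infer_objects(rgb_image, task_instruction)  # -> list of ObjectSpec

    # 2. Retrieve a 3D asset for each inferred object, including unseen but
    #    task-required ones, and place it at the proposed pose.
    scene = sim.new_scene()
    for spec in specs:
        asset = asset_db.retrieve(spec.name)
        scene.add_object(asset, pose=spec.pose)
    return scene
```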
Results of Virtual-Scene Generation
We show qualitative results of the task-ready virtual scenes generated by GAIA from input images and various task instructions.
GAIA
Heuristic baseline
The Heuristic baseline was manually constructed by a human operator who, given a target image and task instruction, built a similar scene where the task was executable. To ensure a fair comparison, the operator was restricted to the same 3D asset dataset used by GAIA.
Given Scenes and Tasks
To evaluate the model’s ability to generate feasible simulation scenes, we curated a set of 20 tasks for our Text-to-Image evaluation. These tasks are derived from ManiSkill-HAB, a comprehensive benchmark designed for low-level manipulation and object rearrangement in realistic home environments. ManiSkill-HAB deconstructs abstract, long-horizon objectives (e.g., TidyHouse, SetTable) into clear, actionable subtasks. Our task set leverages core skills from this benchmark, such as ‘Pick’ and ‘Place’, and additionally a ‘Move’ skill to assess the model’s ability to generate scenes from more general instructions.
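As a purely illustrative example of how such a skill-based task set can be organized (the entries below are stand-ins, not the paper's 20 curated tasks):

```python
# Illustrative task-set structure; entries are examples, not the actual
# 20 tasks curated from ManiSkill-HAB.
TASKS = [
    {"skill": "Pick",  "instruction": "Pick up the bowl on the counter"},
    {"skill": "Place", "instruction": "Place the apple in the bowl"},
    {"skill": "Move",  "instruction": "Move the kettle next to the sink"},
]
```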
Scenes generated by GAIA
Scenes generated by RoboGen
We evaluated the task success rate of the learned policies over 500 trials per task in the test space.
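Concretely, a success-rate evaluation of this kind reduces to the rollout loop sketched below; the `policy` and `env` interfaces are assumptions for illustration, not a specific simulator API.

```python
# Generic success-rate evaluation: N rollouts, count task successes.
# The env.step() return signature here is an assumption.
def evaluate_success_rate(policy, env, num_trials=500, max_steps=200):
    successes = 0
    for _ in range(num_trials):
        obs = env.reset()
        for _ in range(max_steps):
            obs, done, success = env.step(policy(obs))
            if done:
                successes += int(success)
                break
    return successes / num_trials
```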
Real-world experiment setup
Success rate of the transfer learning on the task "Pick up the water bottle on the desk"
To evaluate the effectiveness of sim-to-real transfer, we compared two distinct policies: one trained in virtual environments generated by RoboGen and the other trained in scenes generated by GAIA. Both policies, designed for the task "Pick up the water bottle on the desk", were then deployed on a Franka Research 3 manipulator equipped with a 2-finger gripper.
Input real-world scene
GAIA-generated virtual scene
Real-world experiment setup
To evaluate the transferability of the GAIA framework to a long-horizon task, we extended our evaluation to "Bring me the pencil case from the cabinet", which involves opening a cabinet and then taking out the pencil case. Following the real-to-sim-to-real pipeline, we trained a policy in a GAIA-generated virtual scene. This policy, which executed the opening and retrieving actions sequentially, achieved a 40% success rate over 10 trials.
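A sequential execution of the two sub-policies can be sketched as follows; the sub-policy and environment interfaces are assumed for illustration and do not reflect the paper's code.

```python
# Sketch of chaining two sub-policies for the long-horizon task
# "Bring me the pencil case from the cabinet".
# `open_policy`, `retrieve_policy`, and `env` are assumed interfaces.
def run_long_horizon(env, open_policy, retrieve_policy, max_steps=200):
    obs = env.reset()
    for policy in (open_policy, retrieve_policy):  # open cabinet, then retrieve
        success = False
        for _ in range(max_steps):
            obs, done, success = env.step(policy(obs))
            if done:
                break
        if not success:  # stop early if a sub-task fails
            return False
    return True
```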
Please use the following BibTeX entry to cite this work:
@article{koandyeo2025GAIA,
title={GAIA: Generating Task Instruction Aware Simulation Grounded in Real Contexts using Vision-Language Models},
author={Ko, Dogyu and Yeo, Chanyoung and Kim, Daeho and Kim, Jaeho and Hwang, Hyoseok},
journal={IEEE Robotics and Automation Letters},
year={2025},
publisher={IEEE}
}