Anonymous Authors
Due to the difficulty of acquiring extensive real-world data, robot simulation has become crucial for parallel training and sim-to-real transfer, highlighting the importance of scalable simulated robotic tasks. Foundation models have demonstrated impressive capabilities in generative simulation, autonomously generating feasible robotic tasks. Nevertheless, this new paradigm underscores the challenge of adequately evaluating these autonomously generated tasks. To address this, we propose a comprehensive evaluation framework tailored to generative simulations.
Our framework segments evaluation into three core aspects: quality, diversity, and generalization. For task quality, we evaluate the realism of the generated tasks as well as the completeness of the generated trajectories with large language models (LLMs) and vision-language models (VLMs). For diversity, we measure both task and data diversity through the text similarity of task descriptions and the loss of a world model trained on collected task trajectories, respectively. For task-level generalization, we assess the zero-shot generalization ability on unseen tasks of a policy trained on multiple generated tasks.
Experiments conducted on three representative task generation pipelines demonstrate that the results from our framework are highly consistent with human evaluations, confirming the feasibility and validity of our approach. The findings reveal that while certain methods achieve strong quality or diversity, no single approach excels across all metrics, suggesting a need for greater focus on balancing these objectives. Additionally, our analysis highlights the common challenge of limited generalization capability faced by current works.
We leverage LLMs and VLMs to derive a scene alignment score, which measures the consistency between the rendered scene and the corresponding real-world task, and a task completion score, which indicates whether a task has been completed, overcoming the limits of traditional hard-coded success criteria.
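For concreteness, below is a minimal sketch of how such VLM-based scoring could be queried through a GPT-4V-style chat API. The prompt wording, the model name, and the two-number reply format are illustrative assumptions, not the exact pipeline used in this work.

```python
import base64
from openai import OpenAI

client = OpenAI()

SCORING_PROMPT = (
    "You are evaluating a simulated robotic manipulation task.\n"
    "Task description: {description}\n"
    "1. Scene alignment (0-5): how consistent is the rendered scene with the "
    "corresponding real-world task setting?\n"
    "2. Task completion (0-10): judging from the final frame, to what extent "
    "has the task been completed?\n"
    "Reply with exactly two numbers separated by a comma."
)

def score_task(image_path: str, description: str) -> tuple[float, float]:
    """Query a VLM for scene alignment and task completion scores."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed model name; substitute the VLM actually used
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": SCORING_PROMPT.format(description=description)},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    alignment, completion = response.choices[0].message.content.split(",")
    return float(alignment), float(completion)
```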
To validate the efficacy of our method, we examine the consistency of our results with human evaluations on ten released tasks from RoboGen and GenSim. The consistency score is calculated by dividing the Pearson correlation coefficient by the mean absolute error (MAE), where higher values indicate greater agreement with human ratings.
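A small sketch of this consistency score, directly following the Pearson-correlation-over-MAE definition above (the example inputs are placeholders, not reported numbers):

```python
import numpy as np
from scipy.stats import pearsonr

def consistency_score(auto_scores, human_scores):
    """Pearson correlation divided by MAE between automatic and human ratings.

    Higher values indicate that the automatic scores track human judgments
    more closely. Note that MAE = 0 (perfect agreement) would need special
    handling, since the ratio is undefined there.
    """
    auto = np.asarray(auto_scores, dtype=float)
    human = np.asarray(human_scores, dtype=float)
    r, _ = pearsonr(auto, human)
    mae = np.mean(np.abs(auto - human))
    return r / mae

# Usage (placeholder ratings for ten tasks):
# consistency_score(auto_scores=[3.9, 3.6, ...], human_scores=[4.0, 3.5, ...])
```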
Scene Alignment: 3.96/5
Task Completion: 7.96/10
In this task, the laptop is correctly placed on the table, and there are some objects such as a lamp and a pen placed on the desk. Meanwhile, the task is basically completed.
Scene Alignment: 3.6/5
Task Completion: 5.8/10
In this task, a refrigerator, trash bin, and oven appeared in the same scene, aligning with real-world kitchens. However, the robot did not accurately grasp the oven and instead used other parts to attempt to complete the task.
Scene Alignment: 3.36/5 (GPT-4V), 1.4/5 (BLIP-2 + GPT-4V)
Task Completion: 7.64/10
In this task, the faucet was positioned on a metal table, which greatly deviates from its typical real-world location. The robot accurately located and operated the faucet, though.
Scene Alignment: 3.8/5
Task Completion: 7.8/10
In this task, the scene contains various blocks and a container, which aligns with the task settings, and the blocks are all stacked in the right place.
Scene Alignment: 4.0/5
Task Completion: 1.2/10
In this task, various small blocks and colored zones are placed on the table, meeting the task requirements. Nevertheless, only a portion of the blocks are sorted and some of them are wrongly placed in terms of color.
Scene Alignment: 2.8/5
Task Completion: 6.0/10
In this task, although the red balls can be abstracted as a rope, there is still a gap between the scene and the real world. Nevertheless, once the red balls are straightened into a line, our pipeline correctly recognizes that the task has been solved.
Scene Alignment: 4.6/5
Task Completion: 8.0/10
In this task, a red block is placed on the table, along with a green block and two stickers. The task is well completed, as the red block ends up in a different place.
Scene Alignment: 4.0/5
Task Completion: 2.4/10
In this task, the gripper grasped the stick and placed it on top of the green bin rather than into it, failing to complete the task. The scene nevertheless contains the necessary objects, as well as some relevant ones, on the table.
Scene Alignment: 3.4/5
Task Completion: 8.0/10
In this task, the scene contains no objects other than the red blocks required by the task, which lowers its alignment score. The task is completed, as all the red blocks are stacked together.
In this work, we are concerned with diversity from the following perspectives: (1) task diversity, a high-level diversity as identified by LLMs; and (2) trajectory diversity, a low-level diversity in terms of the dynamics of the collected data.
As for the former, we calculate the similarity between embeddings of task descriptions. For the latter, we train a latent dynamics model following DreamerV3 (Hafner et al., 2024) on trajectories collected from the generated tasks and compute the prediction errors. Higher prediction errors imply that more unfamiliar dynamics are being experienced, indicating more diverse generated trajectories.
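A minimal sketch of the task-diversity side, computing mean pairwise cosine similarity over task-description embeddings. The embedding model and the mean-pairwise aggregation are illustrative assumptions; the trajectory-diversity side (DreamerV3 prediction error) is omitted here for brevity.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def task_similarity(descriptions, model_name="all-MiniLM-L6-v2"):
    """Mean pairwise cosine similarity of task-description embeddings.

    Lower mean similarity implies a more diverse set of generated tasks.
    """
    model = SentenceTransformer(model_name)          # assumed embedding model
    emb = model.encode(descriptions, normalize_embeddings=True)  # (N, d), unit-norm rows
    sim = emb @ emb.T                                # cosine similarity matrix
    iu = np.triu_indices(len(descriptions), k=1)     # unique unordered pairs only
    return float(sim[iu].mean())
```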
(Figure: observed frames, world-model predictions, and their difference.)
We define generalization as the capability to solve tasks within the same distribution, specifically whether an agent trained on the generated tasks can address similar scenarios and objectives, albeit with varying initial states and minor low-level variations. To quantitatively examine this capability, we train an imitation learning policy on oracle or learned trajectories and test it on new task scenarios with varied settings. We use Diffusion Policy (Chi et al., 2023), a state-of-the-art imitation learning algorithm, as the policy backbone in this test.
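For illustration, a minimal sketch of the zero-shot evaluation loop: roll out a trained policy on held-out task variants with randomized initial states and report the success rate. The `policy` and `make_env` interfaces are assumptions for this sketch and do not reproduce the actual Diffusion Policy training or evaluation code.

```python
import numpy as np

def zero_shot_success_rate(policy, make_env, task_ids, episodes_per_task=20, seed=0):
    """Evaluate a trained policy on unseen task variants.

    Assumed interfaces: `policy(obs)` returns an action; `make_env(task_id, seed)`
    builds a task instance with a randomized initial state and exposes a
    gym-style step/reset API with an `info["success"]` flag.
    """
    rng = np.random.default_rng(seed)
    successes, total = 0, 0
    for task_id in task_ids:
        for _ in range(episodes_per_task):
            env = make_env(task_id, seed=int(rng.integers(1 << 31)))
            obs, done, info = env.reset(), False, {}
            while not done:
                obs, reward, done, info = env.step(policy(obs))
            successes += int(info.get("success", False))
            total += 1
    return successes / total
```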