LLM Trainer: Automated Robotic Data Generating via Demonstration Augmentation using LLMs
Abraham George¹ Amir Barati Farimani¹
¹Carnegie Mellon University Mechanical Engineering
[Paper] [Code]
LLM Trainer is a fully automated pipeline that leverages the world knowledge of Large Language Models (LLMs) to transform a small number of human demonstrations (as few as one) into a large robot dataset for imitation learning. Our approach decomposes demonstration generation into two steps: (1) offline demonstration annotation, which extracts keyframes, salient objects, and pose-object relations; and (2) online keypose retargeting, which adapts those keyframes to a new scene given an initial observation. Using the retargeted keyframes, our system warps the original demonstration to generate a new trajectory, executes it, and saves the rollout as a new demonstration if it succeeds. Because the annotation is reusable across scenes, we use Thompson sampling to optimize it, significantly improving the generation success rate. We evaluate our method on a range of tasks and find that our data annotation method consistently outperforms expert-engineered baselines. We further present an ensemble policy that combines the optimized LLM feed-forward plan with a learned feedback imitation learning controller. Finally, we demonstrate hardware feasibility on a Franka Emika Panda robot.
Our LLM-based data generation method has two main steps. First, the LLM annotates the human demonstration, identifying keyframes (timesteps that are important inflection points for the task), listing the relevant objects at each keyframe, and explaining the relationship between the robot and those objects. To provide visual observations of the demonstration without overloading the model's context, we first query the LLM for "visually important frames". These frames, along with a short (one-sentence) description of the task and the recorded robot and object poses, are then used by the LLM to annotate the human demonstration. Second, the LLM uses this annotation, along with an initial observation of a newly initialized scene, to determine how the robot's pose should be adjusted at each keyframe. The adjusted keyframes are then used to warp the original demonstration trajectory, and the warped trajectory is rolled out in the new scene. The rollout is recorded and, if successful, saved as a novel demonstration. Because the first step does not require any information from the new scene, annotations can be reused, saving compute cost and opening the door to optimization. By employing a multi-armed bandit method, we optimize the demonstration annotation step, improving the data generation success rate by a factor of 2-3.
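To make the annotation-optimization loop concrete, below is a minimal sketch of a Beta-Bernoulli Thompson sampling bandit over candidate annotations. The names (AnnotationBandit, the toy rollout simulation) are illustrative rather than taken from our code; in the real system each LLM-generated annotation is an arm and each attempted rollout in a new scene is a pull.

```python
import random

class AnnotationBandit:
    """Beta-Bernoulli Thompson sampling over candidate demo annotations.

    Each arm is one LLM-produced annotation; a "pull" attempts one
    data-generation rollout with that annotation and observes success (1)
    or failure (0).
    """

    def __init__(self, num_annotations: int):
        # Uniform Beta(1, 1) prior on every annotation's success rate.
        self.alpha = [1.0] * num_annotations
        self.beta = [1.0] * num_annotations

    def select(self) -> int:
        # Sample a plausible success rate per annotation; pick the best sample.
        samples = [random.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm: int, success: bool) -> None:
        # Posterior update from the observed rollout outcome.
        if success:
            self.alpha[arm] += 1.0
        else:
            self.beta[arm] += 1.0


if __name__ == "__main__":
    # Toy stand-in for real rollouts: four hypothetical annotations with
    # unknown success rates; in practice the outcome comes from executing
    # the warped trajectory in a newly initialized scene.
    true_rates = [0.2, 0.5, 0.35, 0.7]
    bandit = AnnotationBandit(len(true_rates))
    successes = 0
    for _ in range(200):
        arm = bandit.select()
        ok = random.random() < true_rates[arm]
        bandit.update(arm, ok)
        successes += ok
    print(f"collected {successes} successful demos in 200 attempts")
```

Because the sampler concentrates pulls on annotations whose posterior looks promising, most rollouts end up using the best annotation found so far while still occasionally exploring the others.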
Thanks to the multi-armed bandit optimization, the LLM feed-forward control scheme is itself a capable agent after data collection. Despite their strong performance, these LLM-based policies have a fatal flaw: as pure feed-forward methods, they cannot respond to disturbances in the environment. The trained IL agent solves this problem, but struggles with longer-horizon tasks and with generalizing to out-of-distribution observations, two areas where the LLM-based feed-forward policy excels. We take advantage of the complementary strengths of these two policies by ensembling them. Since the feed-forward policy excels at high-level planning, the ensembled agent begins by executing the LLM feed-forward trajectory. This continues until the agent detects that the feed-forward policy has made an error, at which point it switches to the IL policy to correct the error. Once the error is fixed, the ensemble policy "reattaches" to the feed-forward trajectory at an appropriate point and continues executing the pre-planned trajectory until another error is detected.
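The switching logic can be summarized with the sketch below. It is a simplified illustration under assumed interfaces: `plan` is the retargeted feed-forward waypoint list paired with the observation expected at each waypoint, `env.step_to`, `env.step`, and `env.is_success` are placeholder environment methods, and the distance-threshold error detector and nearest-waypoint reattachment rule are stand-ins for the actual detection criterion.

```python
import numpy as np

def run_ensemble(env, plan, il_policy, err_thresh=0.05, max_steps=400):
    """Illustrative ensemble rollout: follow the LLM feed-forward plan,
    fall back to the IL policy when the observation drifts from what the
    plan expects, then reattach to the closest remaining waypoint."""
    obs = env.reset()
    i = 0                                   # index of the next planned waypoint
    for _ in range(max_steps):
        if i < len(plan):
            target_pose, expected_obs = plan[i]
            if np.linalg.norm(obs - expected_obs) < err_thresh:
                obs, done = env.step_to(target_pose)   # feed-forward segment
                i += 1
            else:
                # Error detected: hand control to the learned feedback policy.
                obs, done = env.step(il_policy(obs))
                # Reattach once the scene again matches a remaining waypoint.
                dists = [np.linalg.norm(obs - exp) for _, exp in plan[i:]]
                if min(dists) < err_thresh:
                    i += int(np.argmin(dists))
        else:
            obs, done = env.step(il_policy(obs))        # finish with the IL policy
        if done:
            break
    return env.is_success()
```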
To evaluate our proposed method, we use three of the RoboMimic simulation environments from MimicGen, and the block pick-and-place and block stacking Panda-Gym tasks from OneACTPlay. Additionally, we include two variations of the block stacking task: block stack flipped and block stack walking. In the "flipped" version of the stack task, the order in which the blocks must be stacked is flipped for 50% of runs. The only way this change is conveyed to the agent is by switching the colors of the transparent goal-location blocks in the agent's visual observation; there is no explicit text or flag of any kind informing the agent of the change. This task was included to illustrate the reasoning capabilities of the LLM. In the walking version of the stack task, the blocks undergo a random walk, moving 0.4 mm in a random direction each timestep while on the ground; once picked up by the robot, they stop moving. This task was included to test how well the trained agents handle time-varying tasks, and was used to evaluate agents trained on the regular stack task. For these tasks, we used the same human demonstrations as the baselines: 10 human demonstrations for the RoboMimic tasks, and a single human demonstration for the OneACTPlay block manipulation tasks.
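For reference, the per-step perturbation in the walking variant is simple enough to state in code. The snippet below is an illustrative reimplementation (the function name and array layout are our own, not the benchmark's source).

```python
import numpy as np

def random_walk_blocks(block_xy, grasped, step_m=0.0004):
    """Move each ungrasped block 0.4 mm in a random planar direction
    (stack-walking variant); blocks stop moving once picked up."""
    for i in range(len(block_xy)):
        if grasped[i]:
            continue
        theta = np.random.uniform(0.0, 2.0 * np.pi)
        block_xy[i] += step_m * np.array([np.cos(theta), np.sin(theta)])
    return block_xy
```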
We evaluated the generation success rate of our method against the baselines, reporting three statistics: the success rate of the best annotation found during the multi-armed bandit exploration, the average generation success rate when using a new annotation (the success rate our method would achieve without RL optimization), and the overall success rate of the full data generation process.
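As a reference for how these three statistics relate to a log of generation attempts, here is one way to compute them. The function is illustrative; in particular, whether the "new annotation" average pools attempts or averages per-annotation rates is an assumption on our part.

```python
from collections import defaultdict

def generation_metrics(attempts):
    """Compute the three reported statistics from a log of
    (annotation_id, success) generation attempts.

    Returns (best-annotation rate, mean per-annotation rate, overall rate).
    """
    by_annotation = defaultdict(lambda: [0, 0])        # id -> [successes, tries]
    for ann_id, success in attempts:
        by_annotation[ann_id][0] += int(success)
        by_annotation[ann_id][1] += 1

    rates = [s / n for s, n in by_annotation.values()]
    best_rate = max(rates)                              # best annotation found
    new_annotation_rate = sum(rates) / len(rates)       # expected rate w/o optimization
    overall_rate = sum(int(s) for _, s in attempts) / len(attempts)  # incl. exploration
    return best_rate, new_annotation_rate, overall_rate


# Example: two annotations, five attempts each.
log = [(0, True), (0, False), (0, True), (0, True), (0, True),
       (1, False), (1, False), (1, True), (1, False), (1, False)]
print(generation_metrics(log))    # (0.8, 0.5, 0.5)
```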
The results show that our LLM-based data generation method significantly outperformed the MimicGen and OneACTPlay baselines, with the optimized LLM annotation achieving a higher success rate on every task. The total success rate, which includes the exploration phase of the optimization method, also beat the baselines on all tasks except Stack. Importantly, our method outperformed benchmarks that relied on a human expert to manually annotate the trajectories, while requiring no human input other than a short (one-sentence) description of the task.
We used the generated demonstrations to train imitation learning agents and report the performance of the trained agents on each task for various numbers of demonstrations. Examining the results, we see that our method performs roughly on par with MimicGen and moderately outperforms OneACTPlay. These results show that our method generates high-quality data that is useful for training imitation learning policies.
Finally, to validate that our data generation method works on hardware, we evaluated it on a mug-cleanup task, in which a robot must open a drawer, pick up the mug from the scene, place it in the drawer, and then close the drawer.
We used our method to generate 100 successful demonstrations, producing 32 failures along the way, for a total success rate of 75.8%. During this process, the best annotation (the annotation with the highest success rate) was used 95 times and generated 78 successful demonstrations, an average success rate of 82%. In a separate experiment evaluating our method without optimization, we found a success rate of 45%. We then trained an IL agent on the 100 successful demonstrations and evaluated the IL agent, the feed-forward LLM agent (using the best annotation), and the ensembled agent, running 20 trials for each. The IL agent had a success rate of 60%, the LLM feed-forward agent 80%, and the ensembled agent 85%. These results align with our findings from simulation: the optimization method significantly improves the data generation success rate, and the ensemble agent performs slightly better than the LLM feed-forward agent.
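The reported hardware generation rates follow directly from the counts above; the short check below simply restates that arithmetic.

```python
# Numbers taken from the text above.
successes, failures = 100, 32
overall = successes / (successes + failures)               # 100 / 132
print(f"overall generation success rate: {overall:.1%}")   # -> 75.8%

best_uses, best_successes = 95, 78
best_rate = best_successes / best_uses                     # 78 / 95
print(f"best-annotation success rate: {best_rate:.1%}")    # -> 82.1%
```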
Below are the single-sentence prompts used for each task. Note: the Mug Cleanup description was also used for the hardware experiments.
@article{george2025llm,
  title={LLM Trainer: Automated Robotic Data Generating via Demonstration Augmentation using LLMs},
  author={George, Abraham and Barati Farimani, Amir},
  journal={arXiv preprint arXiv:2509.20070},
  year={2025}
}