Since our proposed dense shaping rewards for improving online policy fine-tuning are integrated into RoboFuME [A1], we performed preliminary experiments on tasks featured in [A1] to verify that we had successfully reproduced the pipeline. Achieving success rates on the cloth tasks comparable to those reported in RoboFuME [A1] was a positive indicator that the online RL pipeline was reproduced correctly. However, as shown in Table IV, our failure to achieve similar results on Spatula Pick-Place with an offline RL policy pretrained on Bridge data and high-quality in-domain demonstrations suggests that generalizing the online RL pipeline to new tasks is challenging.
Our ablation experiments in Table I showed that without in-domain data, all policy variants struggled to obtain any meaningful learning signal on every task. Furthermore, a preliminary experiment fine-tuning the offline RL policy pre-trained only on Bridge data showed that learning from sparse rewards alone is difficult, yielding barely any increase in success rate (column 4 of Table I). We therefore hypothesized that even with dense shaping rewards, policies pre-trained on both dense and sparse rewards would struggle to learn during online fine-tuning and were unlikely to achieve meaningful success rates on Spatula Pick-Place without in-domain demonstrations. While RoboFuME has demonstrated promising results in reducing human effort for reward engineering and resets, collecting a large quantity of in-domain demonstrations for each new task incurs a notable cost.
In-domain demonstrations are crucial for both components of the RoboFuME pipeline: pre-training policies with language-conditioned BC or offline RL, and fine-tuning the task classifier (MiniGPT-4) that provides sparse rewards. We see this in Table II, which shows that the sparse-reward-only formulation (i.e., RoboFuME) struggles to improve during online fine-tuning with 5x fewer demonstrations. Collecting a large quantity of in-domain demonstrations for every new task of interest is not scalable. Furthermore, the pipeline requires the in-domain demonstrations to satisfy several constraints, such as minimizing multimodality, and if the environment changes during fine-tuning or evaluation (e.g., lighting changes, changes in object position, changes in background), the system is very likely to fail.
To avoid confounding generalization issues and to verify successful reproduction of the pipeline, we picked two tasks from the RoboFuME task suite on which the method performed well: Cloth Folding and Cube Covering. On both tasks, language-conditioned policies pretrained with behavior cloning (BC) and offline RL achieved decent success rates, comparable to the results in [A1], and improved with online fine-tuning. To test how necessary in-domain demonstrations are to the success of the pipeline, for each task we pretrained four policies: language-conditioned BC on Bridge data and in-domain demonstrations, offline RL on Bridge data and in-domain demonstrations, language-conditioned BC on Bridge data only, and offline RL on Bridge data only. We used the same Bridge data subsets as RoboFuME for each task, with in-domain demonstrations newly collected using our setup.
We evaluated the four policies on the forward and backward tasks in each task category, and the results are shown in Table III. Following [A1], we report success rates for the forward tasks only. We successfully reproduced the results of RoboFuME for the BC and RL policies trained on both Bridge and in-domain data (see the first two columns of Table I in [A1]). However, removing in-domain demonstration data from the pretraining dataset was catastrophic for policy learning, resulting in zero successes for both task categories (columns 1 and 2 of Table III). This confirms the heavy reliance of the RoboFuME pipeline on in-domain demonstrations.
Overall, these experiments demonstrate that the heavy reliance on high-quality, low-multimodality in-domain demonstrations for transfer to new environments makes the system brittle: in our ablations, policies pretrained only on Bridge data without in-domain demonstrations achieved zero success on all tasks. This reliance also makes the pipeline harder to generalize to new tasks, objects, and environments. Therefore, while RoboFuME reduces human effort in reward specification and resets, collecting demonstrations remains a bottleneck for this method, both in human effort and in the fragility of the resulting system. A broader goal for subsequent research in this area is to develop online RL methods that are far less reliant on, or completely eliminate the need for, in-domain demonstrations, as well as pre-training methods that more effectively extract useful priors from offline datasets such as Bridge.
To summarize, we perform experiments verifying the successful reproduction of RoboFuME [A1], along with additional experiments assessing the extent of RoboFuME's reliance on high-quality, low-multimodality in-domain demonstrations for both pre-training RL policies and fine-tuning the sparse-reward classifier. Finally, we explore RoboFuME's selection of Bridge data subsets for pre-training offline RL policies for tabletop manipulation, which we also use in our experiments. We find that pipelines relying on a large number of high-quality in-domain demonstrations face robustness challenges, and that incorporating task-relevant Bridge data facilitates transfer to the desired task.
Pretraining on the entire Bridge dataset would be computationally and practically infeasible, so choosing which subsets of the prior dataset to train on is crucial for downstream performance. We test this with a novel Spatula Pick-Place task that is not part of the RoboFuME task suite. We define the forward task as "put spatula on plate" and the backward task as "move spatula to the left of the plate", formatting the language descriptions similarly to those in the Bridge dataset. We consider three different subsets of Bridge data: tabletop granular comprises 796 trajectories performing tasks on a tabletop similar to the one used in our experiments; tabletop granular + toy kitchen comprises 823 trajectories, where the toy kitchen trajectories focus on pick-and-place tasks with a variety of objects; and tabletop granular + toy kitchen + dark wood comprises 1764 trajectories, where the dark wood trajectories include some spatula pick-and-place tasks similar to ours, among many other tasks. We pre-train three separate policies using offline RL on each of these Bridge data combinations, together with 120 in-domain demonstrations (50 forward-task rollouts, 50 backward-task rollouts, and 20 failures), as done in RoboFuME [A1]. The results of evaluating the three policies on the forward and backward tasks are shown in Table IV.
From these results, we observe that increasing the size and diversity of the pretraining dataset does not lead to better results on the task; in fact, the opposite holds. Qualitative analysis of the pick-and-place behavior shows a similar decline as more Bridge data is added. Even when the dark wood subset, which contains the same task, is included, performance is the worst among the three combinations. This demonstrates the importance of pre-training data selection for the performance of the eventual policy.
We noted that the success rates for the novel Spatula Pick-Place task were much lower than those reported in [A1], suggesting issues with the generalization capabilities of the pipeline. We also observed that the RoboFuME codebase pre-trains not only on Bridge data but also on in-domain demonstration data that is upsampled by 8x to be roughly proportional in size to the Bridge data. This could explain why increasing the size of the pre-training Bridge dataset worsens performance: it dilutes the upsampled in-domain data. This further indicates that the RoboFuME pipeline depends heavily on in-domain demonstrations to succeed, consistent with the tests in the previous section.
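To make the dilution effect concrete, the following back-of-the-envelope calculation is a sketch that uses the trajectory counts listed above and treats the 8x upsampling as a fixed multiplier on trajectory counts (an approximation of the codebase's behavior); it shows how the in-domain share of the pretraining mixture shrinks as more Bridge data is added.

```python
# Back-of-the-envelope check of how added Bridge data dilutes the upsampled
# in-domain demonstrations (trajectory counts taken from the text above).
IN_DOMAIN_DEMOS = 120   # 50 forward + 50 backward + 20 failures
UPSAMPLE_FACTOR = 8     # upsampling observed in the RoboFuME codebase

bridge_subsets = {
    "tabletop granular": 796,
    "+ toy kitchen": 823,
    "+ toy kitchen + dark wood": 1764,
}

upsampled = IN_DOMAIN_DEMOS * UPSAMPLE_FACTOR  # 960 upsampled in-domain trajectories
for name, n_bridge in bridge_subsets.items():
    frac = upsampled / (upsampled + n_bridge)
    print(f"{name}: in-domain fraction of mixture = {frac:.2f}")
# tabletop granular: 0.55, + toy kitchen: 0.54, + toy kitchen + dark wood: 0.35
```

Under this approximation, the in-domain share drops from roughly half of the mixture to about a third once the dark wood subset is added, which is consistent with the degradation observed in Table IV.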
Our experiments with different pretraining datasets for the novel Spatula Pick-Place task show that the selection of prior offline data is critically important for the downstream success of the pretrained policy. The finding that smaller prior datasets are better, because they avoid diluting the in-domain demonstration data, may be specific to the RoboFuME pipeline, and a better selection of prior data could have substantially improved both the pre-trained and fine-tuned policies. However, it is still surprising that including demonstrations of tasks similar to the one being evaluated (albeit in a different environment) does not help with task completion. More research into extracting relevant data from prior datasets could alleviate this issue.
We conduct preliminary experiments in simulation to investigate the effects of fine-tuning with a dense reward. In simulation, the dense reward is naively the negative L2 distance between the robot and the target location. We also investigate whether a dense reward formulation can reduce the policy's reliance on in-domain demonstrations during pretraining. The results for policies pretrained with the standard number of in-domain demonstrations, including our reproduction of the RoboFuME pipeline, are shown in Figure 8. Adding the dense reward performs comparably overall, verifying that our dense reward formulation is sensible.
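For reference, the simulated dense reward amounts to a few lines of code; the sketch below is a minimal illustration, where `ee_pos` and `target_pos` are placeholders standing in for the simulator's robot and target positions rather than names from the actual codebase.

```python
import numpy as np

def dense_reward(ee_pos, target_pos):
    """Naive dense shaping reward used in simulation: the negative Euclidean
    (L2) distance between the robot and the target location."""
    return -float(np.linalg.norm(np.asarray(ee_pos) - np.asarray(target_pos)))

# Example: reward grows (toward zero) as the robot approaches the target.
print(dense_reward(ee_pos=(0.30, 0.10, 0.05), target_pos=(0.30, 0.20, 0.05)))  # -0.1
```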
To determine whether either reward formulation, sparse only (as in [A1]) or dense plus sparse, is more adversely affected by reducing the number of in-domain demonstrations during pre-training, we conduct two simulation experiments with fewer in-domain demonstrations, specifically half the standard quantity of pre-training data; the results are shown in Figure 9. These runs also perform comparably with each other and with the previous experiments that used the standard number of in-domain demonstrations, and the dense-plus-sparse formulation reaches higher success rates slightly faster despite fewer in-domain demonstrations. Because the dense reward in simulation is only approximated with a single waypoint at the target location, we believe a denser waypoint trajectory should further help learning on the real robot; the simulation experiments verify that adding dense rewards at least does not hurt or inhibit learning and suggest potential benefits on the real robot setup.
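On the real robot, the shaping term would track a VLM-generated waypoint trajectory rather than a single target. The sketch below illustrates one simple way to do this; the reach threshold, the example waypoint values, and the class interface are illustrative assumptions rather than the exact formulation used in our experiments.

```python
import numpy as np

class WaypointShaping:
    """Dense shaping over a sequence of 3D waypoints: reward the negative
    distance to the current waypoint and advance once it is reached."""

    def __init__(self, waypoints, reach_threshold=0.03):
        self.waypoints = [np.asarray(w, dtype=float) for w in waypoints]
        self.reach_threshold = reach_threshold  # meters; illustrative value
        self.idx = 0

    def reward(self, ee_pos):
        target = self.waypoints[self.idx]
        dist = float(np.linalg.norm(np.asarray(ee_pos) - target))
        # Advance to the next waypoint once the current one is reached.
        if dist < self.reach_threshold and self.idx < len(self.waypoints) - 1:
            self.idx += 1
        return -dist

# Usage: the waypoints would come from the VLM's grid-tile trajectory lifted to 3D.
shaping = WaypointShaping(waypoints=[(0.30, 0.00, 0.10), (0.30, 0.15, 0.10)])
r = shaping.reward(ee_pos=(0.28, 0.02, 0.11))
```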
The performance of the online RL fine-tuning system depends on the accuracy of the VLM's keypoint and waypoint predictions for task completion. While [A2] verifies that keypoint-based reasoning by VLMs is reliable and generalizable, we make some additional observations about modifying these prompts to accurately generate waypoint trajectories in 3D space for the dense rewards used during online RL fine-tuning.
In addition to GPT-4o, we test our prompts as inputs to GPT-4V (used by [A2]; larger but slower than GPT-4o) and Gemini Pro to demonstrate the robustness of our approach. We tested the outputs of these models on the three tasks described above, Cloth Folding, Cube Covering, and Spatula Pick-Place, as well as on additional tasks from RoboFuME, such as Candy Sweeping, Drawer Opening, and pick-and-place tasks with other objects in a toy kitchen setup, to further verify the robustness of the system.
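For illustration, the sketch below shows how such a query might be issued to GPT-4o with the OpenAI Python client, sending the meta-prompt, the task description, and the two annotated camera views described later in this section; the prompt text, file names, and response handling are placeholders rather than our exact implementation.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def image_content(path):
    # Encode a local image as a base64 data URL for the chat completions API.
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

META_PROMPT = "..."  # placeholder for the grid/waypoint meta-prompt adapted from [A2]
task = "put spatula on plate"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": f"{META_PROMPT}\nTask: {task}"},
            image_content("top_down_annotated.png"),   # keypoints + grid tiles
            image_content("side_view_annotated.png"),  # depth reference lines
        ],
    }],
)
waypoint_text = response.choices[0].message.content  # parsed downstream into grid tiles
```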
First, we observe that the natural language description of the task is important for generating correct 3D waypoints, and this holds for all the VLMs tested. For example, for Spatula Pick-Place, phrasing the backward task as "Place the spatula on the left of the plate" sometimes led to waypoints that moved the spatula to the left side of the yellow plate (but still on the plate), rather than onto the table to the left of the plate. We therefore had to tune the natural language descriptions. Tuning typically occurred at the level of the specific task description and, once set, remained fixed for all experiments. After tuning, manual inspection of the VLM outputs for the tasks reported above indicates that they are generally reliable and consistent across trials.
Another observation is the importance of providing dual-angle inputs to the VLMs for 3D spatial reasoning: a side view of the environment is provided in addition to the top-down view to convey depth information. We modified the meta-prompt from [A2] accordingly to accommodate this additional visual input and generate the desired 3D waypoint outputs. Here we see some qualitative differences between models, revealing their relative strengths and weaknesses in 3D spatial reasoning. Gemini tends to perform comparably to GPT-4o and slightly better than GPT-4V. GPT-4V in particular sometimes struggles to use the side view to generate sensible depth movements that facilitate successful task completion. The most common failure mode is occasionally generating grid points that are not immediately reachable from the previous grid block in the sequence; this is more common for models with shorter context windows, which forget the meta-prompt's instructions on grid sequence generation. These observations are important for extensions of this work investigating 3D waypoint generation, as depth information is critical for pick and place, tasks performed on uneven surfaces, and tasks with very specific grasp points such as drawer opening.
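A lightweight way to catch this failure mode is to validate that consecutive tiles in the generated sequence are adjacent on the grid before converting them into waypoints. The check below is a minimal sketch under the assumption that tiles are indexed by (row, column); it is illustrative and not part of the RoboFuME or [A2] codebases.

```python
def is_valid_tile_sequence(tiles):
    """Check that each grid tile is reachable from the previous one, i.e. the
    two tiles are identical or neighbors (including diagonals).

    tiles: list of (row, col) integer indices proposed by the VLM.
    """
    for (r0, c0), (r1, c1) in zip(tiles, tiles[1:]):
        if max(abs(r1 - r0), abs(c1 - c0)) > 1:
            return False
    return True

# Example: the jump from (2, 3) to (4, 5) is not immediately reachable.
assert is_valid_tile_sequence([(2, 2), (2, 3), (3, 3)])
assert not is_valid_tile_sequence([(2, 2), (2, 3), (4, 5)])
```

If such a check fails, one option is simply to re-query the VLM with the same prompt before starting fine-tuning.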
Our experiments and empirical observations support the hypotheses in [A2] that state-of-the-art VLMs are capable of spatial reasoning with sufficient accuracy to facilitate tabletop manipulation tasks of varying complexity, where a top-down view and a side view of the environment provide sufficient information to perform the tasks. We anticipate that the spatial reasoning of VLMs will only continue to improve, and further studies could examine mobile or open-world manipulation to see whether these capabilities extend to more complex and dynamic environments beyond tabletop manipulation.
Example of annotated inputs to the VLM: the top-down view is annotated with keypoints and grid tiles, and the side view is annotated with line labels for depth information.
*Annotations bolded for illustration purposes only.
Example of a VLM-generated trajectory: tiles marked with green borders and red arrows indicating the direction of motion. The red point indicates the current robot position predicted via RANSAC. Our dense reward formulation encourages the robot to move to the next block along the indicated direction.
*Annotations bolded for illustration purposes only.
[A1] Jingyun Yang, Max Sobol Mark, Brandon Vu, Archit Sharma, Jeannette Bohg, and Chelsea Finn. Robot fine-tuning made easy: Pre-training rewards and policies for autonomous real-world reinforcement learning. In 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024.
[A2] Fangchen Liu, Kuan Fang, Pieter Abbeel, and Sergey Levine. MOKA: Open-vocabulary robotic manipulation through mark-based visual prompting. In Robotics: Science and Systems (RSS), 2024.