Jump-Start RL (JSRL) [3] is the best-performing baseline among those we tested. It achieves similar performance to our Policy Decorator in a few scenarios (5 out of 18) but performs significantly worse in the rest.
Additionally, JSRL does not actually "improve" the base policy but instead learns an entirely new policy. This means that even if it achieves a high success rate, it does not preserve the desired properties of the original base policy, such as smooth and natural motion. On this page, we compare our refined policy against the JSRL policy to show that ours exhibits significantly smoother and more natural behavior.
[3] Ikechukwu Uchendu, Ted Xiao, Yao Lu, Banghua Zhu, Mengyuan Yan, Joséphine Simon, Matthew Bennice, Chuyuan Fu, Cong Ma, Jiantao Jiao, et al. Jump-start reinforcement learning. In International Conference on Machine Learning, pages 34556–34583. PMLR, 2023.
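For context, JSRL lets the base (guide) policy roll out the first portion of each episode and then hands control to a newly learned exploration policy; the guide horizon shrinks over training so the new policy eventually acts from the very first step. The sketch below illustrates this hand-off. It is a simplified illustration, not the authors' implementation: it assumes a Gymnasium-style environment interface, callable policies, and placeholder names (`guide_policy`, `explore_policy`, `guide_steps`, `max_steps`).

```python
def jsrl_rollout(env, guide_policy, explore_policy, guide_steps, max_steps=200):
    """Collect one episode in the JSRL style: the frozen guide (base) policy
    acts for the first `guide_steps` steps, then the learned exploration
    policy takes over. The guide's actions are only used to reach favorable
    states -- they are never imitated, so nothing constrains the exploration
    policy to move smoothly."""
    obs, _ = env.reset()
    transitions = []
    for t in range(max_steps):
        acting = guide_policy if t < guide_steps else explore_policy
        action = acting(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        transitions.append((obs, action, reward, next_obs, terminated))
        obs = next_obs
        if terminated or truncated:
            break
    return transitions

# Over training, `guide_steps` is annealed toward 0 (e.g., whenever the
# exploration policy's success rate passes a threshold), so the curriculum
# starts from easy, late-episode states and gradually requires solving the
# task from scratch.
```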
Key observations:
As shown in the videos below, JSRL policies exhibit noisy and jerky motions. This is because JSRL uses the base policy only to build the learning curriculum (as sketched above) and never imitates its motion. Consequently, even when it achieves a high success rate, it does not retain the desirable properties of the original base policy, such as smooth and natural motion. As studied in [2], such jerky actions often fail to transfer to the real world.
In contrast, our refined policies exhibit smooth and natural behavior by staying close to the base policy, thanks to the bounded residual action strategy (see the sketch below these observations). Since the base policies are trained on demonstrations with smooth and natural behaviors (usually collected via human teleoperation or motion planning), our refined policies inherit these favorable attributes of the original base policy.
[2] Yuzhe Qin, Hao Su, and Xiaolong Wang. From one hand to multiple hands: Imitation learning for dexterous manipulation from single-camera teleoperation. IEEE Robotics and Automation Letters, 7(4):10873–10881, 2022.
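To make the second observation concrete, here is a minimal sketch of the bounded residual idea: the refined action is the frozen base policy's action plus a small learned correction whose magnitude is capped by a bound. The names (`base_policy`, `residual_policy`, `alpha`) and the exact squashing/clipping scheme are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def refined_action(base_policy, residual_policy, obs, alpha=0.1):
    """Refined action = base action + bounded residual correction.
    Because the correction is squashed into [-alpha, alpha] per dimension,
    the refined policy can never stray far from the smooth base policy,
    which is how the favorable motion properties are preserved."""
    a_base = base_policy(obs)          # smooth action from the frozen imitation-learned base policy
    raw = residual_policy(obs)         # unconstrained output of the RL-trained residual policy
    correction = alpha * np.tanh(raw)  # bound the correction to [-alpha, alpha]
    return np.clip(a_base + correction, -1.0, 1.0)  # keep the sum inside a normalized action space
```

With a small bound, the executed action stays within `alpha` of the base action at every step, so exploration noise cannot dominate the motion the way it can when an entirely new policy is learned.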
The Peg Insertion task requires highly precise manipulation: the hole provides only 3 mm of clearance, and at least half of the peg must be pushed sideways into the hole, making it more challenging than similar tasks [1]. The JSRL policy exhibits notably shaky motions, particularly when the robot arm attempts to pick up the peg and insert it into the hole. Such excessive shaking poses a significant risk of damaging the robot arm or colliding with other objects when transferred from simulation to the real world. In contrast, our refined policy from Policy Decorator moves smoothly throughout the entire trajectory, eliminating the risk of damaging the peg or the box.
[1] Jing Xu, Zhimin Hou, Zhi Liu, and Hong Qiao. Compare contact model-based control and contact model-free learning: A survey of robotic peg-in-hole assembly strategies. arXiv preprint arXiv:1904.05240, 2019.
In the Turn Faucet task, the JSRL policy behaves erratically as the gripper approaches the faucet, repeatedly colliding with it. Such collisions pose significant risks and challenges for sim-to-real transfer. In contrast, our refined policy from Policy Decorator moves remarkably smoothly, avoiding unnecessary collisions and potential damage.