Our refined policy, learned through Policy Decorator, achieves remarkably high success rates while retaining the favorable attributes of the original base policy. On this page, we compare our refined policy against a policy trained directly by RL to show that the refined policy exhibits significantly smoother and more natural behavior. Since SAC with sparse rewards cannot solve most of the tasks we tested, we instead employ RLPD (a state-of-the-art RL method that uses demonstrations to boost learning) to train the RL policies and produce the visualizations.
Key Observations:
As shown in the videos below, RL policies exhibit noisy and jerky motions. This is because they are trained solely to achieve the goal (sparse reward) without explicit constraints on the motions. As studied in [2], the jerky actions produced by an RL policy often fail to transfer to the real world.
In contrast, our refined policies exhibit smooth and natural behavior by staying close to the base policy. This is achieved through the bounded residual action strategy. Since the base policies are trained on demonstrations with smooth and natural behaviors (usually from human teleoperation or motion planning), our refined policies inherit the favorable attributes of the original base policy.
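To make the bounded residual action strategy concrete, below is a minimal sketch in Python/NumPy of how a small, bounded residual can be combined with a base policy's action. The names (`base_policy`, `residual_policy`, `residual_scale`) are illustrative placeholders rather than the actual Policy Decorator implementation; the point is only that the residual is squashed and scaled so the final action can deviate from the base action by at most a small bound.

```python
import numpy as np

def refined_action(obs, base_policy, residual_policy, residual_scale=0.05):
    """Illustrative sketch: combine a frozen base policy's action with a bounded residual.

    The residual is squashed with tanh and scaled by `residual_scale`, so the
    refined action can never move more than `residual_scale` away (per dimension)
    from the smooth, imitation-learned base action.
    """
    a_base = base_policy(obs)                             # smooth action from the base policy
    raw_residual = residual_policy(obs, a_base)           # correction proposed by the learned residual
    a_residual = residual_scale * np.tanh(raw_residual)   # bound: |a_residual| <= residual_scale
    return np.clip(a_base + a_residual, -1.0, 1.0)        # keep the result in the normalized action range
```

Because the deviation from the base action is capped, the refined policy cannot stray far from the demonstration-like behavior, which is why the jerky exploratory motions typical of pure RL do not appear in the final trajectories.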
[2] Yuzhe Qin, Hao Su, and Xiaolong Wang. From one hand to multiple hands: Imitation learning for dexterous manipulation from single-camera teleoperation. IEEE Robotics and Automation Letters, 7(4):10873–10881, 2022.
The Peg Insertion task requires highly precise manipulation, as the hole has only 3 mm of clearance. The task also requires at least half of the peg to be pushed sideways into the hole, making it more challenging than similar tasks [1]. The RL policy exhibits jerky motions when attempting to insert the peg into the hole, posing a high risk of damaging objects if the policy were transferred from simulation to the real world. In contrast, our refined policy from Policy Decorator moves smoothly throughout the entire trajectory, eliminating the risk of damaging the peg or the box.
[1] Jing Xu, Zhimin Hou, Zhi Liu, and Hong Qiao. Compare contact model-based control and contact model-free learning: A survey of robotic peg-in-hole assembly strategies. arXiv preprint arXiv:1904.05240, 2019.
In the Turn Faucet task, the RL policy behaves erratically as the gripper approaches the faucet, colliding with both the ground and the faucet itself. These collisions pose significant risks and challenges for sim-to-real transfer. In contrast, our refined policy from Policy Decorator moves remarkably smoothly, avoiding unnecessary collisions with both the ground and the faucet.