Learning Reward for Robot Skills Using Large Language Models via Self-Alignment


Abstract


Learning reward functions remains a bottleneck in equipping robots with a broad repertoire of skills. Large Language Models (LLMs) contain valuable task-related knowledge that can aid in learning reward functions. However, the proposed reward function can be imprecise and therefore ineffective, requiring further grounding with environment information.

We propose a method to learn rewards more efficiently in the absence of humans. Our approach consists of two components: we first use the LLM to propose features and a parameterization of the reward, then update the parameters through an iterative self-alignment process. In particular, the process minimizes the ranking inconsistency between the LLM and the learned reward function based on execution feedback. The method was validated on 9 tasks across 2 simulation environments, demonstrating a consistent improvement in training efficacy and efficiency while consuming significantly fewer GPT tokens than the alternative mutation-based method.
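As a rough illustration of the self-alignment update, the sketch below fits the weights of an LLM-proposed, feature-based reward to pairwise rollout rankings using a Bradley-Terry-style logistic loss. This is a minimal sketch under our own assumptions (the reward is a weighted sum of features, and the LLM's feedback is reduced to pairwise preferences supplied through the hypothetical `llm_prefers` callable); it is not the paper's implementation, and the surrounding policy-training loop is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def self_align_weights(w, feature_fn, rollouts, llm_prefers, lr=0.1, epochs=50):
    """Fit reward weights to the LLM's pairwise rollout rankings (sketch).

    w           -- current weights of the LLM-proposed reward features
    feature_fn  -- callable: trajectory -> feature vector (same length as w)
    rollouts    -- trajectories executed with the current policy
    llm_prefers -- hypothetical callable: (traj_a, traj_b) -> +1 if the LLM
                   ranks traj_a above traj_b, else -1
    Uses a pairwise logistic (Bradley-Terry-style) loss that penalizes pairs
    where the learned reward's ranking disagrees with the LLM's ranking.
    """
    w = np.asarray(w, dtype=float).copy()
    phis = [np.asarray(feature_fn(tau), dtype=float) for tau in rollouts]
    pairs = [(a, b, llm_prefers(rollouts[a], rollouts[b]))
             for a in range(len(rollouts)) for b in range(a + 1, len(rollouts))]
    for _ in range(epochs):
        grad = np.zeros_like(w)
        for a, b, pref in pairs:
            diff = phis[a] - phis[b]        # feature difference of the pair
            margin = pref * (w @ diff)      # > 0 when the reward agrees with the LLM
            grad += -pref * (1.0 - sigmoid(margin)) * diff
        w -= lr * grad / max(len(pairs), 1)
    return w
```

In the full loop, one would alternate: train a policy on the current weighted reward, execute rollouts, query the LLM for rankings, and apply this update, repeating until the weights stabilize.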

Final Policy Videos

The reward function generated directly by the LLM often lacks the numerical precision needed to produce an optimal policy. We observe that the self-alignment update scheme effectively improves performance, reflected in:

(1) faster convergence; 

(2) higher success rate at convergence.

[Videos] Policy rollouts and success-rate curves under three reward schemes (Oracle Reward, Fixed LLM Reward, and Self-Aligned) for each task: Pick Cube, Pick YCB Mug, Peg Insertion, Open Cabinet Drawer, Open Cabinet Door, and Push Chair.

A Case Study of the Reward Weight Update

[Figure] Weight update for Pick Cube.

Policy trained with the self-aligned reward scheme vs. the final learned reward: the policy performance and convergence speed may not be fully recovered by training with only the final learned reward function.

Iter 5: The policy gradually starts to reach toward the object, and the approaching weight begins to stabilize.

Iter 12: The policy starts to grasp the cube and reaches random locations. The grasp weight increases more slowly, while the goal-reaching and maintaining weights start to increase more rapidly.

Iter 23: The goal-reaching and maintaining weights slow down, and the grasping and approaching weights start to increase once again. In the rollouts, the policy learns to pick up the cube and moves toward the goal quickly, but the cube becomes unstable and slips out of the gripper, after which re-grasping occurs.

Iter 35: The goal-reaching and maintaining weights keep increasing while the other parameters stay fixed. The robot can pick up the cube and move it close to the goal position, though not exactly onto it. The two weight terms stabilize within a few more iterations.
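To make the weight terms in the case study concrete, below is a minimal sketch of a weighted-sum reward for Pick Cube with one term per stage discussed above (approaching, grasping, goal-reaching, maintaining). The feature definitions, observation keys, and constants are our own illustrative assumptions, not the reward actually generated by the LLM in the paper; self-alignment would tune only the `weights` vector.

```python
import numpy as np

def pick_cube_reward(obs, weights):
    """Weighted-sum reward r = w . phi(obs) over stage features (sketch).

    `obs` is assumed to expose gripper, cube, and goal positions plus a binary
    grasp flag; `weights` holds the four stage weights tuned by self-alignment.
    """
    gripper = np.asarray(obs["gripper_pos"])
    cube = np.asarray(obs["cube_pos"])
    goal = np.asarray(obs["goal_pos"])
    grasped = float(obs["is_grasped"])

    phi = np.array([
        1.0 - np.tanh(5.0 * np.linalg.norm(gripper - cube)),           # approaching
        grasped,                                                        # grasping
        grasped * (1.0 - np.tanh(5.0 * np.linalg.norm(cube - goal))),   # goal-reaching
        grasped * float(np.linalg.norm(cube - goal) < 0.02),            # maintaining at goal
    ])
    return float(np.asarray(weights, dtype=float) @ phi)
```

Under this view, the case study traces how the relative magnitudes of the four weights shift as training progresses.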