Yuwei Zeng1, Yao Mu2,1, Lin Shao1
1 National University of Singapore, 2 The University of Hong Kong
Abstract
Learning reward functions remains a bottleneck in equipping robots with a broad repertoire of skills. Large Language Models (LLMs) contain valuable task-related knowledge that can aid the learning of reward functions. However, the reward functions they propose can be imprecise and therefore ineffective, and need to be further grounded with environment information.
We propose a method to learn rewards more efficiently in the absence of human supervision. Our approach consists of two components: we first use the LLM to propose features and a parameterization of the reward, then update the parameters through an iterative self-alignment process. In particular, the process minimizes ranking inconsistency between the LLM and the learned reward function based on execution feedback. The method was validated on 9 tasks across 2 simulation environments. It demonstrates consistent improvements in training efficacy and efficiency while consuming significantly fewer GPT tokens than an alternative mutation-based method.
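To make the self-alignment update concrete, below is a minimal illustrative sketch, assuming a linear reward over LLM-proposed rollout features and a pairwise logistic ranking loss. The function name, the feature matrix, and the learning rate are hypothetical; the paper's actual reward parameterization and update rule may differ.

import itertools
import torch

def self_alignment_step(features, llm_ranking, weights, lr=1e-2):
    """One illustrative update nudging the learned reward toward the LLM ranking.

    features:    (N, D) tensor of LLM-proposed reward features, one row per rollout
    llm_ranking: rollout indices ordered best-to-worst according to the LLM
    weights:     (D,) tensor of reward parameters with requires_grad=True
    """
    rewards = features @ weights  # scalar learned reward per rollout
    loss = rewards.new_zeros(())
    # Penalize every pair whose learned-reward ordering contradicts the LLM ordering.
    for better, worse in itertools.combinations(llm_ranking, 2):
        loss = loss + torch.nn.functional.softplus(rewards[worse] - rewards[better])
    loss.backward()
    with torch.no_grad():
        weights -= lr * weights.grad
        weights.grad.zero_()
    return loss.item()

# Toy usage with random rollout features and a made-up LLM ranking.
w = torch.zeros(4, requires_grad=True)
rollout_features = torch.randn(5, 4)
self_alignment_step(rollout_features, llm_ranking=[3, 0, 4, 1, 2], weights=w)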
Final Policy Videos
Reward functions generated directly by the LLM often lack the numerical precision needed to train an optimal policy. The self-alignment update scheme is observed to effectively improve performance, reflected as:
(1) faster convergence;
(2) a higher success rate at convergence.
[Success-rate training curves for Pick Cube, Pick YCB Mug, Peg Insertion, Open Cabinet Drawer, Open Cabinet Door, and Push Chair, comparing the Oracle Reward, the Fixed LLM Reward, and the Self-Aligned reward.]
A Case Study of the Reward Update Pattern
Iter 5: The policy gradually learns to reach toward the object, and the approaching weight is observed to begin stabilizing.
Iter 12: The policy starts to grasp the cube but still reaches toward random locations. Correspondingly, the grasp weight increases more slowly, while the goal-reaching and maintaining weights start to increase more rapidly.
Iter 23: The growth of the goal-reaching and maintaining weights slows down, and the grasping and approaching weights start to increase once again.
In the policy rollouts, the policy picks up the cube and moves quickly toward the goal, but the cube becomes unstable and slips out of the gripper, after which re-grasping happens.
Iter 35: The goal-reaching and maintaining weights keep increasing while the other parameters stay fixed. The robot can pick up the cube and carry it near the goal position, although not exactly onto it. These two weight terms stabilize within a few more iterations (an illustrative sketch of such a weighted reward is shown below).
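For illustration, the sketch below shows what such a weighted Pick Cube reward could look like, with the four terms discussed above (approaching, grasping, goal-reaching, maintaining) combined by the weights w that self-alignment tunes. The feature definitions, scaling constants, and function name are hypothetical placeholders rather than the paper's exact reward.

import numpy as np

def pick_cube_reward(tcp_to_cube_dist, is_grasped, cube_to_goal_dist, cube_speed, w):
    """w = [w_approach, w_grasp, w_reach_goal, w_maintain], the parameters tuned by self-alignment."""
    approach = 1.0 - np.tanh(5.0 * tcp_to_cube_dist)               # shrink gripper-to-cube distance
    grasp = float(is_grasped)                                      # bonus once the cube is grasped
    reach_goal = grasp * (1.0 - np.tanh(5.0 * cube_to_goal_dist))  # carry the cube toward the goal
    maintain = grasp * (1.0 - np.tanh(cube_speed))                 # keep the grasp stable near the goal
    return w[0] * approach + w[1] * grasp + w[2] * reach_goal + w[3] * maintain

# Example call with made-up state values and uniform weights.
print(pick_cube_reward(0.05, True, 0.2, 0.1, w=[1.0, 1.0, 1.0, 1.0]))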
BibTeX
@article{zeng2024learning,
title={Learning Reward for Robot Skills Using Large Language Models via Self-Alignment},
author={Zeng, Yuwei and Mu, Yao and Shao, Lin},
journal={arXiv preprint arXiv:2405.07162},
year={2024}
}