Learning a High-quality Robotic Wiping Policy Using Systematic
Reward Analysis and Visual-Language Model Based Curriculum
https://arxiv.org/abs/2502.12599
Yihong Liu, Dongyeop Kang, Sehoon Ha
Robotic surface wiping is an important manipulation task with a wide range of application domains. We use deep reinforcement learning (deep RL) to learn high-level policies in simulation, without prior demonstrations, for dynamic adaptation to complex environmental variables.
Baseline: Naive RL Formulation
To train a high-quality navigational wiping policy, the naive formulation combines a per-step quality reward W_q with a per-episode task completion reward W_T. Balancing W_T and W_q turns out to be challenging: the RL agent either learns to finish episodes as early as possible without maintaining wiping quality (left), or keeps collecting quality rewards indefinitely without ever completing the task (right).
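A minimal sketch of the naive per-step reward, using the notation from the text (the exact terms and scales in the paper may differ):

```python
# Naive formulation: a per-step quality reward W_q plus a one-off completion
# reward W_T. Both failure modes above fall directly out of this shape.
def naive_step_reward(wiping_quality_ok: bool, task_completed: bool,
                      W_q: float = 1.0, W_T: float = 1000.0) -> float:
    reward = 0.0
    if wiping_quality_ok:   # e.g., contact force within the target range
        reward += W_q       # accrues every step -> incentive to wipe forever
    if task_completed:      # navigation between waypoints finished
        reward += W_T       # one-off bonus -> incentive to terminate early
    return reward
```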
To address this parameter-sensitive multi-task RL training problem, we first demonstrated the infeasibility of the naive formulation and then developed two techniques that we believe generalize to other tasks that must balance procedural quality against rapid task completion.
Method1: Bounded Reward Design
We introduced a bounded reward design with concentric circular checkpoints, which redefines the formulation. The design is theoretically grounded: we proved that the desired behaviors inherently lead to maximal rewards.
Animation of the bounded rewards. Quality rewards can only be gained once per checkpoint region.
The navigation completion success rate rises from 58% to 92%, as none of the 5 seeds converges to the forever-wiping behavior.
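A minimal sketch of the once-per-checkpoint quality reward, assuming concentric circular regions around the current waypoint (radii and scales here are illustrative, not the paper's values):

```python
import numpy as np

class BoundedQualityReward:
    """Grant the quality reward at most once per concentric checkpoint region."""

    def __init__(self, waypoint_xy, radii=(0.4, 0.3, 0.2, 0.1), W_q=1.0):
        self.waypoint_xy = np.asarray(waypoint_xy, dtype=float)
        self.radii = sorted(radii, reverse=True)  # outermost ring first
        self.W_q = W_q
        self.claimed = set()                      # regions already rewarded

    def __call__(self, ee_xy, wiping_quality_ok: bool) -> float:
        if not wiping_quality_ok:
            return 0.0
        dist = np.linalg.norm(np.asarray(ee_xy, dtype=float) - self.waypoint_xy)
        reward = 0.0
        for i, r in enumerate(self.radii):
            if dist <= r and i not in self.claimed:
                self.claimed.add(i)   # bound: each region pays out only once
                reward += self.W_q
        return reward
```

Because each region pays out at most once, the total quality reward per waypoint is bounded by W_q times the number of checkpoint regions, so endless wiping can no longer dominate task completion.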
Method2: VLM-based Curriculum
While the new formulation makes the quality-critical problem feasible, RL training remains hyperparameter-sensitive because it depends on sampling. To ensure that successful trajectories exist and can subsequently be learned, we propose a novel Vision-Language Model (VLM) based curriculum learning system, which automatically monitors training metrics and adjusts the relative weights of reward terms during the learning process. This simulates the parameter-tuning process of human experts.
The VLM-based curriculum system consists of an LLM agent and a VLM agent.
1) The LLM agent takes in the metrics and identifies whether extra information is needed.
2) The VLM agent summarizes the open-ended failure reason (optional).
3) The LLM agent takes in the performance metrics and extra information, and proposes new reward scales for the next training phase (see the sketch below).
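A minimal sketch of this three-step loop; the agent interfaces (`select_tools`, `describe`, `propose_scales`) and metric keys are hypothetical names, not the paper's code:

```python
def curriculum_step(metrics, current_scales, llm_agent, vlm_agent, last_snapshot):
    # 1) LLM agent inspects the metrics and decides which tools (if any) to call.
    tools = llm_agent.select_tools(metrics)          # e.g., ['receive_observation']

    extra = {}
    if 'receive_observation' in tools:
        # 2) VLM agent summarizes the open-ended failure reason from the last frame.
        extra['failure_summary'] = vlm_agent.describe(last_snapshot,
                                                      metrics['episode_length'])
    if 'describe_statistics' in tools:
        extra['force_percentiles'] = metrics['force_percentiles']

    # 3) LLM agent proposes new reward scales for the next training phase.
    return llm_agent.propose_scales(metrics, extra, current_scales)
```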
Below are the prompts (green text on the right) for each step.
2.1 The LLM agent takes in evaluation metrics and determines whether extra information is needed.
Depending on the training progress, the LLM can request different extra information before updating. If the completion rate is low, visual feedback of the ending scene, summarized by a separate VLM, is provided to describe the failure reason (e.g., no contact, or close to the endpoint without finishing the wipe). If the force metrics require improvement, detailed force percentiles are provided.
This step is designed to put only the necessary details into the prompt, avoiding the LLM's catastrophic forgetting of important information.
You are a reinforcement learning researcher trying to teach a robot to wipe a table with multiple subtasks:
1) complete navigation between waypoints.
2) the wiping and landing should learn to gradually exert a target force of 60N.
Based on the metrics provided, please choose the tools for further understanding the performance.
`receive_observation`: Obtain visual observation to understand the causes of navigation failure.
`describe_statistics`: View pressure distribution percentiles, including minimum, 25%, mean, 75%, and maximum.
### Example Response1, e.g., When the navigation completion rate is low, and you want to understand the behavior:
['receive_observation']
### Example Response2, e.g., When there's an enhancement in the completion rate of navigation, and you aim to decrease pressure differences:
['describe_statistics']
### Example Response3, e.g., When you have sufficient information already, then returns empty:
[]
Feel free to choose a combination of tools too. Now let's begin!
{metrics}
### Response:
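Since the agent is prompted to reply with a Python-style list, the reply can be parsed defensively. A sketch (the allowed-tool set mirrors the prompt above; the helper name is ours):

```python
import ast

ALLOWED_TOOLS = {'receive_observation', 'describe_statistics'}

def parse_tool_response(response_text: str) -> list:
    """Extract requested tools from the LLM reply, ignoring anything not allowed."""
    try:
        requested = ast.literal_eval(response_text.strip())
    except (ValueError, SyntaxError):
        return []                         # malformed reply -> request no tools
    if not isinstance(requested, list):
        return []
    return [tool for tool in requested if tool in ALLOWED_TOOLS]
```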
2.2 The VLM agent summarizes the open-ended failure reason.
Navigation failures can arise from various scenarios. Leveraging the VLM's semantic capabilities allows us to understand the causes of failures, reducing the need for labor-intensive monitoring and iterative metric development.
This hierarchical approach enhances the system's extensibility: separating the LLM and VLM lets each specialize in reasoning and visual data interpretation, respectively.
You are a reinforcement learning expert training an agent to wipe a table
surface. With a well-trained policy, the end effector (EE, a black wiper) should do the following within 100 steps:
(1) EE touches the table top.
(2) EE navigates towards red waypoint (EE touches the table and red waypoint visible).
(3) EE wipes the red waypoint (red waypoint invisible).
(4) EE navigates towards green waypoint (red waypoint invisible and green waypoint visible).
(5) EE wipes the green waypoint (green waypoint invisible).
(6) Done.
You will be provided with the episode length and the last snapshot of a policy replay; report your observations in one sentence.
Here are a few examples.
## EXAMPLE RESPONSE1:
EE did not make contact with the tabletop and instead ended up facing upwards in the air.
## EXAMPLE RESPONSE2:
EE is positioned close to the green waypoint, reaching the maximum allowed horizon of 500 steps, but it didn't learn to complete wiping it.
## EXAMPLE RESPONSE3:
EE completes navigation as no waypoint is visible.
Okay, now let's begin!
Episode length: {episode_length} (maximum allowed: {horizon})
### RESPONSE:
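One way to implement this step, assuming an OpenAI-compatible vision model (the paper may use a different VLM backend; the model name and helper function are assumptions):

```python
import base64
from openai import OpenAI

client = OpenAI()

def summarize_failure(snapshot_path: str, episode_length: int, horizon: int,
                      prompt_template: str) -> str:
    """Send the last snapshot plus the filled prompt to the VLM; return its one-sentence summary."""
    with open(snapshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    prompt = prompt_template.format(episode_length=episode_length, horizon=horizon)
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any vision-capable chat model works here
        messages=[{"role": "user", "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ]}],
    )
    return response.choices[0].message.content.strip()
```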
2.3 The LLM agent takes in performance metrics and extra information, and proposes new reward scales for the next training phase.
When the navigation completion rate is low, the LLM agent increases the navigation completion reward, strengthening the gradient signal for this metric at the expense of higher landing forces, since it potentially encourages successful landings at any cost. In later training stages, the LLM agent can adjust the penalty multiplier for landing forces, reducing them significantly without adversely affecting other metrics. Combining the two adjustments leads to better results.
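For reference, the Trainable object injected into the prompt below as {training_class} might look like the following dataclass; this is a reconstruction from the fields in the example responses, and the actual class in the authors' code may differ.

```python
from dataclasses import dataclass

@dataclass
class Trainable:
    """Scales of the individual reward terms (field names taken from the example responses)."""
    unit_wiped_reward: float = 30
    arm_limit_collision_penalty: float = -10
    pressure_threshold_max: float = 110     # upper pressure threshold
    pressure_threshold_min: float = 10      # lower pressure threshold
    contact_reward: float = 0.8
    landing_pressure_penalty_multiplier: float = 0.5
    navigation_pressure_penalty_multiplier: float = 0.5
    navigation_complete_reward_clean: float = 1000
    ee_accel_penalty: float = 1
    fr_sigma: float = 20                    # pressure reward range
    force_lambda: float = 1e-3              # pressure variance penalty
```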
1. You are a reinforcement learning researcher trying to teach a robot to wipe a table with multiple subtasks:
1) complete navigation between waypoints.
2) the wiping and landing should learn to gradually exert a stable target force of 60N, with little variance.
2. Below is the Trainable object that defines the scales of each reward term.
{training_class}
While learning all subtasks together is ideal, use curriculum learning if needed, starting with subtask (1) navigation, then (2) force control. Focus should shift to (2) once (1) is mastered.
Please include a one-sentence observation and the Python code for implementing the Trainable.
### EXAMPLE RESPONSE1:
The navigation subtask is learned, but the pressure variance is too large. Next, focus on the pressure learning task, increasing the pressure variance penalties (force_lambda) and decreasing the pressure reward range (fr_sigma).
```python
Trainable(
unit_wiped_reward = 30,
arm_limit_collision_penalty = -10,
pressure_threshold_max = 110,
pressure_threshold_min = 10,
contact_reward = 0.8,
landing_pressure_penalty_multiplier = 0.5,
navigation_pressure_penalty_multiplier = 0.5,
navigation_complete_reward_clean = 1000,
ee_accel_penalty = 1,
fr_sigma = 18, ## decrease by 2
force_lambda = 2e-3 ## increase by 1e-3
)
```
### EXAMPLE RESPONSE2:
The pressure shows a long-tailed distribution. Let's try to reduce variance by narrowing down the pressure thresholds (pressure_threshold_max, pressure_threshold_min).
```python
Trainable(
unit_wiped_reward = 30,
arm_limit_collision_penalty = -10,
pressure_threshold_max = 100, ## decreased by 10
pressure_threshold_min = 20, ## increased by 10
contact_reward = 0.8,
landing_pressure_penalty_multiplier = 0.5,
navigation_pressure_penalty_multiplier = 0.5,
navigation_complete_reward_clean = 1000,
ee_accel_penalty = 1,
fr_sigma = 20,
force_lambda = 1e-3
)
```
### EXAMPLE RESPONSE3:
The completion rate is not high enough yet (target: over 90%), and the landing force is on average too high. Next stage, focus on learning navigation completion (navigation_complete_reward_clean) and reducing landing forces (pressure_threshold_max, landing_pressure_penalty_multiplier).
```python
Trainable(
unit_wiped_reward = 30,
arm_limit_collision_penalty = -10,
pressure_threshold_max = 110,
pressure_threshold_min = 10,
contact_reward = 0.8,
landing_pressure_penalty_multiplier = 1, ## increase by 0.5
navigation_pressure_penalty_multiplier = 0.5,
navigation_complete_reward_clean = 1100, ## increase by 100
ee_accel_penalty = 1,
fr_sigma = 15,
force_lambda = 1e-3
)
```
Okay, now let's begin!
### latest metrics:
{metrics}
### Weights used for training from the latest fine-tuning cycle:
{initial_scales}
### RESPONSE:
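A sketch of how the proposed scales could be applied, assuming the response contains a fenced Python block that instantiates the (hypothetical) Trainable sketched above:

```python
import re

def apply_proposal(response_text: str, current):
    """Extract the ```python ...``` block from the LLM reply and evaluate it into a Trainable."""
    match = re.search(r"```python\s*(.*?)```", response_text, re.DOTALL)
    if match is None:
        return current                    # no code block -> keep the current scales
    # Evaluate in a restricted namespace so only a Trainable(...) call is possible.
    # Trainable is the dataclass sketched in Section 2.3 above (an assumption).
    return eval(match.group(1), {"__builtins__": {}}, {"Trainable": Trainable})
```

The returned Trainable then provides the reward scales for the next training phase, closing the curriculum loop.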