Standing Still
Moving forward
More Complex Tasks with Larger Dynamics Gaps
Standing Still with a Larger Dynamics Gap
(the mass of the robot in simulation is changed from 12.1 kg to 4.6 kg, while the mass of the real robot is 12.0 kg)
Moving Forward with a Larger Dynamics Gap
Control Performance Comparisons
Standing Still
Moving Forward
Only the H2O+ and IQL policies successfully maintain the balance of the robot for over 30 seconds.
H2O+ keeps the robot's displacement within 0.2 m, whereas IQL only barely maintains balance, swinging over a large range that reaches 1.6 m.
H2O+ achieves the desired forward movement with precise velocity control, smooth speed changes, and a steady pitch angle.
In contrast, the IQL policy manages to maintain balance but drifts backward and expends considerable effort in doing so, producing a shaky period lasting over 7 seconds.
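The comparisons above rest on simple trajectory statistics: peak displacement from the start, mean forward speed, and speed jitter. A minimal sketch of how such metrics could be computed from a logged position trace is shown below; the function name, the 1-D position log, and the 50 Hz toy rollout are illustrative assumptions, not the paper's actual evaluation code.

```python
import numpy as np

def trajectory_metrics(positions, dt):
    """Summarize a 1-D base-position log from a rollout.

    `positions` is a hypothetical array of the robot's forward
    displacement (m) sampled every `dt` seconds.
    """
    positions = np.asarray(positions, dtype=float)
    velocities = np.diff(positions) / dt
    return {
        # peak deviation from the starting point (m)
        "max_displacement": float(np.max(np.abs(positions - positions[0]))),
        # mean forward speed over the rollout (m/s)
        "mean_velocity": float(np.mean(velocities)),
        # speed jitter: standard deviation of the velocity trace
        "velocity_std": float(np.std(velocities)),
    }

# Toy rollout: a policy holding roughly 0.2 m/s forward speed at 50 Hz,
# with a small oscillation on top.
dt = 0.02
t = np.arange(0, 10, dt)
pos = 0.2 * t + 0.005 * np.sin(5 * t)
m = trajectory_metrics(pos, dt)
print(m["mean_velocity"])  # close to 0.2
```

A displacement bound like the 0.2 m figure for "standing still" would then be a threshold on `max_displacement`, and the 0.2 m/s forward speed a target for `mean_velocity`.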
More Complex Tasks with Larger Dynamics Gaps
Standing Still with a Larger Dynamics Gap
H2O+ outperforms IQL in the robot's capacity to maintain a stationary stance. H2O+ effectively confines the robot's displacement to within approximately 1 m (drifting backward), whereas IQL causes the robot to oscillate between its original position and a broader range of approximately 1.5 m.
Moving Forward with a Larger Dynamics Gap
None of the methods except H2O+ are able to control the robot to move forward; among the baselines, only IQL maintains equilibrium for a long period of time. H2O+, despite moving backward at first, moves forward at a steady speed close to 0.2 m/s for an extended period.
Online Simulation Data Quality
Standing Still
Comparison of H2O+ and H2O simulated data quality on the real-world robot "standing still" task. We visualize the coverage and the normalized values of reward, displacement, velocity, angle, angular velocity, and action. In the "standing still" task, we observe that H2O explores a more focused high-value area, whereas H2O+ spans a broader high-value area, demonstrating superior diversity in its simulated data, which benefits overall performance.
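The coverage claims above can be made concrete with a simple occupancy measure: min-max normalize a 2-D projection of the state-action samples, bin it, and count the fraction of occupied bins. The sketch below is a generic illustration under these assumptions (uniform binning, min-max normalization); it is not the paper's actual visualization code.

```python
import numpy as np

def occupancy_coverage(samples, bins=20):
    """Fraction of occupied bins in a normalized 2-D projection.

    `samples` is an (N, 2) array, e.g. one state dimension paired with
    one action dimension. Each axis is min-max normalized to [0, 1],
    then discretized into a `bins` x `bins` grid; coverage is the share
    of cells containing at least one sample.
    """
    x = np.asarray(samples, dtype=float)
    lo, hi = x.min(axis=0), x.max(axis=0)
    x = (x - lo) / np.where(hi - lo > 0, hi - lo, 1.0)  # min-max normalize
    hist, _, _ = np.histogram2d(x[:, 0], x[:, 1],
                                bins=bins, range=[[0, 1], [0, 1]])
    return float(np.count_nonzero(hist)) / hist.size

rng = np.random.default_rng(0)
broad = rng.uniform(size=(5000, 2))                     # diverse samples
focused = 0.5 + 0.02 * rng.standard_normal((5000, 2))   # concentrated samples
cb, cf = occupancy_coverage(broad), occupancy_coverage(focused)
```

Here the broadly distributed data yields a higher coverage score than the concentrated data, mirroring the qualitative difference between H2O+'s wider exploration and H2O's more focused sampling.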
Moving Forward
Comparison of H2O+ and H2O simulated data quality on the real-world robot "moving forward" task. We visualize the coverage and the normalized values of reward, displacement, velocity, angle, angular velocity, and action. In the "moving forward" task, H2O+ provides wider coverage across the state-action space and displays better diversity in its data, reflecting a more robust and thorough exploration.
Distribution Analysis of Offline Dataset
We visualize the state, action, and reward distributions of the two real-world robot tasks. For the standing-still task, we collect 16588 transitions from the real robot; for the offline dataset of moving forward, we collect 16588 transitions from the moving process of the real robot.
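Such distribution plots typically reduce to per-dimension histograms over the recorded transitions. A minimal sketch under assumed shapes and field names (the actual dataset format is not specified here) could look like this:

```python
import numpy as np

# Hypothetical offline dataset of transitions; the dimensionalities and
# dict keys are illustrative assumptions, not the paper's real format.
rng = np.random.default_rng(1)
dataset = {
    "states": rng.standard_normal((16588, 4)),
    "actions": rng.uniform(-1.0, 1.0, size=(16588, 2)),
    "rewards": rng.uniform(0.0, 1.0, size=16588),
}

def per_dimension_histograms(dataset, bins=30):
    """Histogram every state and action dimension plus the reward,
    returning {name: (counts, bin_edges)} ready for plotting."""
    hists = {}
    for key in ("states", "actions"):
        arr = np.atleast_2d(dataset[key])
        for d in range(arr.shape[1]):
            hists[f"{key}[{d}]"] = np.histogram(arr[:, d], bins=bins)
    hists["rewards"] = np.histogram(dataset["rewards"], bins=bins)
    return hists

hists = per_dimension_histograms(dataset)
# Each histogram's counts sum to the dataset size (16588 transitions).
```

Plotting each `(counts, bin_edges)` pair (e.g. with matplotlib's `plt.stairs`) yields the kind of per-dimension distribution figures referenced below.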
State, action and reward distribution of the standing still dataset
State, action and reward distribution of the moving forward dataset