Robust Reinforcement Learning for Continuous Control with Model Misspecification

Site overview

The videos on this site compare the performance of the Entropy-regularized MPO (E-MPO) agent with that of the Robust Entropy-regularized MPO (RE-MPO) agent on four tasks:

1. Cartpole Balance

2. Walker Walk

3. Cheetah

4. Shadowhand: Dexterous robotic hand

* Similar performance was observed for the non-entropy-regularized versions (i.e., MPO and R-MPO respectively), so those videos were omitted.

** In addition, the Soft-Robust versions performed similarly to the robust versions, so they were omitted as well.

policy_7913660_cam_0_video_53000_evaluator_2.mp4

Entropy-regularized MPO (E-MPO)

Cartpole Balance Task Evaluation Environment Env2

The non-robust agent is unable to balance the pole.

policy_3164934_cam_0_video_25000_evaluator_2.mp4

Robust Entropy-regularized MPO (RE-MPO)

Cartpole Balance Task Evaluation Environment Env2

The robust agent successfully balances a pole of a length it has never seen before.

policy_2114570_cam_0_video_15000_evaluator_1.mp4

Entropy-regularized MPO (E-MPO)

Walker Walk Task Evaluation Environment Env2

This agent is unstable: although it achieves a walking gait, it very quickly falls to the ground.

policy_1607518_cam_0_video_14200_evaluator_1.mp4

Robust Entropy-regularized MPO (RE-MPO)

Walker Walk Task Evaluation Environment Env2

Note that the agent learns to drag its leg in response to the change in quadriceps length. This gait is very stable and prevents the agent from falling, which results in improved performance compared to the non-robust agent.

policy_3535368_cam_0_video_19800_evaluator_2.mp4

Domain Randomization E-MPO

Here, the agent struggles to stand up and displays behaviour similar to that of E-MPO, albeit slightly more robust. However, it learns a significantly different policy from that of RE-MPO.



policy_5283800_cam_0_video_41000_evaluator_2.mp4

Entropy-regularized MPO (E-MPO)

Cheetah Task Evaluation Environment Env2

The Cheetah learns an aggressive and unstable running policy, which causes it to fall to the ground.

policy_4266942_cam_0_video_39800_evaluator_2.mp4

Robust Entropy-regularized MPO (RE-MPO)

Cheetah Task Evaluation Environment Env2

The Cheetah learns a running policy that prevents it from falling over.

policy_3554182_cam_basket_front_right_video_14200_evaluator_2.mp4

Entropy-regularized MPO (E-MPO)

Shadowhand Orientation Task Evaluation Environment Env2

The Shadowhand attempts to orient the cube, but is unable to manipulate a cube that is smaller than the one on which it was trained.

policy_2755577_cam_basket_front_right_video_15000_evaluator_2.mp4

Robust Entropy-regularized MPO (RE-MPO)

Shadowhand Orientation Task Evaluation Environment Env2

The Shadowhand manages to orient the cube into the correct position.