DEP-RL: Embodied Exploration for Reinforcement Learning in Overactuated and Musculoskeletal Systems

Overview of the considered musculoskeletal models: 50 muscles, 120 muscles, and 18 muscles.

Reaching

Uniform random noise, pure DEP exploration, and DEP-RL applied to the arm26 task.

arm26_noise_cut-15851.mp4

uniform noise

arm26_dep_cut-15852.mp4

pure DEP

arm26_trained_cut-16042.mp4

DEP-RL

Uniform random noise, pure DEP exploration, and DEP-RL applied to the arm750 task.

arm750_noise_cut-15899.mp4

uniform noise

arm750_dep_cut-15900.mp4

pure DEP

arm750_trained_cut-16038.mp4

DEP-RL

Uniform random noise, pure DEP exploration, and DEP-RL applied to the ostrich-foraging task.

ostrich_foraging_uniform-2022-04-12_15.25.28-16134.mp4

uniform noise

ostrich_foraging_dep-2022-04-12_15.38.11-16133.mp4

pure DEP

ostrich_foraging-2022-04-12_11.27.32-16161.mp4

DEP-RL

Running

Uniform random noise, pure DEP exploration, and DEP-RL applied to the ostrich-run task.

ostrich_noise.mp4

uniform noise

ostrich_dep_expl.mp4

pure DEP

derl_running_new.mp4

DEP-RL

Running gaits for the different algorithms considered.

best_deprl_running_sideview.mp4

DEP-RL

best_mpo_running_sideview.mp4

MPO

best_td4_running_sideview-2022-05-25_15.00.58.mp4

TD4

Obstacles

Robustness of DEP-RL

deprl_slopetrotter-cut.mp4

ostrich-slopetrotter

derl_stepdown_new_cut.mp4

ostrich-stepdown

A policy from the end of training (1.5e8 iterations) was used for the videos above, as we observed robustness to increase with training time. The resulting running gait is less natural than the faster gaits encountered earlier in training (see the DEP-RL running video).

2D Humanoid

running_2d.mp4

DEP-MPO in human-run

running_2d_robust.mp4

Transfer of DEP-MPO to human-stepdown

DEP-MPO is applied to a dense-reward 2D running task. It learns a fast, symmetric gait that uses the full capabilities of the model. It is also robust against out-of-distribution (OOD) perturbations.

suboptimal_mpo.mp4

MPO in human-run

suboptimal_mpo_transfer.mp4

Transfer of MPO to human-stepdown

MPO is applied to the same dense-reward 2D running task. We show the most commonly obtained gait and its transfer to an unseen environment.

depmpo_jump.mp4

DEP-MPO in human-hop

jumping_perturbations.mp4

Transfer of DEP-MPO to human-hopstacle

mpo_jump.mp4

MPO in human-hop

In the sparse-reward hopping task, DEP-MPO solves the task and learns robust gaits, while MPO never achieves a single non-zero reward.

DEP-RL Alternation

iclr_arm26_deprl.mp4

arm26

iclr_humanreacher_deprl.mp4

humanreacher

The alternation between DEP and MPO is shown at different training stages. For these visualizations, the DEP duration was slightly increased compared to the values reported in the paper. Muscle activity is shown in color (orange for high activity, blue for low activity).
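As a rough illustration of this interaction, the sketch below alternates between policy actions and short DEP bursts while collecting a rollout. The `policy.step`/`dep.step` interface, the old-style gym `env.step` signature, and the switching periods are assumptions for illustration, not the implementation or hyperparameters used in the paper.

```python
# Minimal sketch of DEP/policy alternation during a training rollout.
# `policy` and `dep` are assumed to expose step(obs) -> action; the
# switching periods below are placeholders, not the paper's values.

def collect_episode(env, policy, dep, rl_steps=200, dep_steps=8):
    """Alternate between RL policy actions and short DEP exploration bursts."""
    obs = env.reset()
    done, t, transitions = False, 0, []
    while not done:
        # After every `rl_steps` policy steps, hand control to DEP
        # for `dep_steps` steps to inject correlated exploration.
        if t % (rl_steps + dep_steps) < rl_steps:
            action = policy.step(obs)   # learned behavior
        else:
            action = dep.step(obs)      # self-organized DEP exploration
        next_obs, reward, done, info = env.step(action)
        transitions.append((obs, action, reward, next_obs, done))
        obs, t = next_obs, t + 1
    return transitions
```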

Inverse Model Influence

ostrich_not_shuffled.mp4

Identity matrix as inverse model

ostrich_shuffled.mp4

Randomly shuffled identity matrix

We assume a known 1:1 correspondence between sensors and actuators (left), i.e., we know which muscle-length sensor belongs to which muscle. This allows DEP to drive exploration by inducing correlated motions. If we randomly shuffle the connections while still keeping 1:1 pairs, exploration deteriorates (right).
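To make the ablation concrete, here is a toy, Hebbian-style caricature of a DEP update in which the inverse model is either the identity or a randomly shuffled permutation. The update rule and all names are illustrative assumptions, not the exact equations from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10  # number of muscles = number of length sensors (1:1 correspondence)

# Inverse model M maps observed sensor changes back to the motor commands
# that presumably caused them. With known wiring, M is the identity; shuffling
# the sensor-to-muscle assignment turns it into a random permutation matrix.
M_identity = np.eye(n)
M_shuffled = np.eye(n)[rng.permutation(n)]

def dep_action(x_hist, C, M, kappa=1.0):
    """Toy DEP-style step (illustrative, not the paper's exact rule):
    correlate the inverse-model reconstruction of the latest sensor change
    with a delayed sensor change to update the controller C, then act."""
    dx_now = x_hist[-1] - x_hist[-2]        # current sensor velocity
    dx_past = x_hist[-2] - x_hist[-3]       # time-delayed sensor velocity
    C += np.outer(M @ dx_now, dx_past)      # Hebbian-like correlation update
    return np.tanh(kappa * C @ x_hist[-1])  # correlated exploratory action
```

With `M_identity`, the correlation update reinforces consistent sensor-actuator loops; with `M_shuffled`, reconstructed commands are attributed to the wrong muscles, so the emerging correlations no longer match the body's structure, mirroring the deteriorated exploration in the right-hand video.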