DEP-RL: Embodied Exploration for Reinforcement Learning in Overactuated and Musculoskeletal Systems
50 muscles
120 muscles
18 muscles
Reaching
Uniform random noise, pure DEP exploration, and DEP-RL applied to the arm26 task.
uniform noise
pure DEP
DEP-RL
Uniform random noise, pure DEP exploration, and DEP-RL applied to the arm750 task.
uniform noise
pure DEP
DEP-RL
Uniform random noise, pure DEP exploration, and DEP-RL applied to the ostrich-foraging task.
uniform noise
pure DEP
DEP-RL
Running
Uniform random noise, pure DEP exploration, and DEP-RL applied to the ostrich-run task.
uniform noise
pure DEP
DEP-RL
Running gaits for the different algorithms considered.
DEP-RL
MPO
TD4
Obstacles
Robustness of DEP-RL
ostrich-slopetrotter
ostrich-stepdown
A policy from the end of training (1.5e8 iterations) was used for the videos above, as we observed its robustness to increase with training time. The resulting running gait is less natural than the faster gaits encountered earlier in training (see the DEP-RL running video).
2D Humanoid
DEP-MPO in human-run
Transfer of DEP-MPO to human-stepdown
DEP-MPO is applied to a dense-reward 2D running task. It learns a fast, symmetric gait that uses the full capabilities of the model. It is also robust against out-of-distribution (OOD) perturbations.
MPO in human-run
Transfer of MPO to human-stepdown
MPO is applied to the dense-reward 2D running task. We show the most commonly obtained gait and its transfer to an unseen environment.
DEP-MPO in human-hop
Transfer of DEP-MPO to human-hopstacle
MPO in human-hop
In the sparse-reward hopping task, DEP-MPO solves the task and learns robust gaits, while MPO never achieves a single non-zero reward.
DEP-RL Alternation
arm26
humanreacher
The interaction between DEP and MPO is shown at different training stages. For visualization, the DEP duration was slightly increased compared to the reported values. Muscle activity is shown in color (orange for high activity, blue for low activity).
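Schematically, the alternation can be pictured as the rollout loop below. This is a minimal sketch, not the actual deprl implementation: the `env`, `policy`, and `dep` objects and the switching parameters (`p_switch`, `dep_steps`) are illustrative placeholders, and the real switching probabilities and durations are tuned per task.

```python
# Minimal sketch of DEP-RL alternation during data collection,
# assuming a Gym-style env, a policy callable, and a DEP controller.
import numpy as np

rng = np.random.default_rng(0)

def collect_episode(env, policy, dep, p_switch=0.01, dep_steps=8):
    """Roll out one episode, occasionally handing control to DEP."""
    obs = env.reset()
    dep_countdown = 0          # > 0 while DEP is in control
    done = False
    while not done:
        if dep_countdown > 0:
            # DEP computes actions from proprioceptive feedback
            # (muscle lengths), producing correlated whole-body motion.
            action = dep.step(obs)
            dep_countdown -= 1
        else:
            action = policy(obs)
            # Occasionally switch to DEP for a short burst of steps.
            if rng.random() < p_switch:
                dep_countdown = dep_steps
        obs, reward, done, info = env.step(action)
        # (transitions would be written to the replay buffer here)
```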
Inverse Model Influence
Identity matrix as inverse model
Randomly shuffled identity matrix
We assume a known 1:1 correspondence between sensors and actuators (left): we know which muscle-length sensor belongs to which muscle. This allows DEP to drive exploration by inducing correlated motions. If we randomly shuffle the connections while still keeping 1:1 pairs, exploration deteriorates (right).
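For intuition, the sketch below shows a simplified form of how the inverse model enters the DEP rule, assuming muscle-length observations and a linear controller u = tanh(Cx). The normalization, smoothing, and exact time lags used in practice are omitted, and all names are illustrative.

```python
# Simplified DEP update with an identity vs. shuffled inverse model.
import numpy as np

n = 10                                  # number of muscles = number of sensors
M_identity = np.eye(n)                  # known 1:1 sensor-to-actuator map
perm = np.random.default_rng(0).permutation(n)
M_shuffled = np.eye(n)[perm]            # shuffled, but still 1:1 pairs

def dep_controller_matrix(M, x_prev, x_now, x_next, kappa=1.0):
    """Simplified DEP step: correlate back-projected current velocities
    with lagged velocities to obtain the controller matrix C."""
    xdot_now = x_next - x_now           # newest sensor velocity
    xdot_prev = x_now - x_prev          # lagged sensor velocity
    # M maps sensor velocities back to the actuators that plausibly
    # caused them; with M = I this credit assignment is exact, while a
    # shuffled M credits the wrong muscles and breaks the sensorimotor
    # feedback loop that generates correlated exploration.
    return kappa * np.outer(M @ xdot_now, xdot_prev)

def dep_action(C, x):
    return np.tanh(C @ x)               # controller u = tanh(C x)
```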