Learning and Deploying Robust Locomotion Policies with Minimal Dynamics Randomization


Luigi Campanaro, Siddhant Gangapurwala, Wolfgang Merkt, and Ioannis Havoutis


Dynamic Robot Systems Group (DRS), University of Oxford


Abstract

Training of deep reinforcement learning (DRL) locomotion policies often requires massive amounts of data to converge to a desired behavior. In this regard, simulators provide a cheap and abundant source. For successful sim-to-real transfer, exhaustively engineered approaches such as system identification, dynamics randomization, and domain adaptation are generally employed. As an alternative, we investigate a simple strategy of random force injection (RFI) to perturb system dynamics during training. We show that the application of random forces enables us to emulate dynamics randomization. This allows us to obtain locomotion policies that are robust to variations in system dynamics. We further extend RFI, referred to as extended random force injection (ERFI), by introducing an episodic actuation offset. We demonstrate that ERFI provides additional robustness to inertial shifts, offering on average a 62.5% performance improvement over RFI under variations in system mass. We also show that ERFI is sufficient to perform a successful sim-to-real transfer on two different quadrupedal platforms, ANYmal C and Unitree A1, even for perceptive locomotion over uneven terrain in outdoor environments.

In the following sections we provide videos of hardware experiments with ANYmal C and Unitree A1; the policies were trained in simulation with ERFI-50.

Why does ERFI work?

When transferring controllers from simulation to real systems, we need to account for several variations in their dynamics. In the following, we explain how ERFI, which combines RFI with a random actuation offset (RAO), addresses some of the discrepancies that most affect controller performance in dynamic environments: delays, kinematic variations, and mass variations.

How does RFI model delays?

In the figure on the left we show the effects of adding RFI as a feed-forward term of the PD controller (Kp=15, Kd=1) when commanding a position of +0.17 [rad] (~10 [deg]) to the hind right knee.

As can be seen from the plot, the yellow line reaches the desired position faster than the green line, although the green line settles earlier.

This implies that RFI adds stochasticity to the rise and settling times: depending on the direction of the perturbation, it either increases or reduces them. This allows us to implicitly randomise the actuation dynamics, especially parameters related to delays, friction, and inertia.
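The sketch below illustrates this mechanism on a single joint: a random feed-forward torque is added to the PD law at every control step. The PD gains (Kp=15, Kd=1) and the 0.17 rad step come from the text; the perturbation bound and the simple first-order joint model are assumptions made purely for illustration, not the paper's simulator.

```python
import numpy as np

KP, KD = 15.0, 1.0     # PD gains from the text
DT = 0.002             # 500 Hz control rate
Q_DES = 0.17           # commanded step [rad]
RFI_BOUND = 0.5        # assumed bound on the injected torque [Nm]


def pd_with_rfi(q, qd, rng=None):
    """PD torque with an optional random feed-forward term (RFI)."""
    tau = KP * (Q_DES - q) + KD * (0.0 - qd)
    if rng is not None:
        # RFI: a torque sampled anew at every control step.
        tau += rng.uniform(-RFI_BOUND, RFI_BOUND)
    return tau


def simulate(steps=500, rng=None, inertia=0.05, damping=0.1):
    """Integrate a crude 1-DoF joint model to expose rise/settling times."""
    q, qd = 0.0, 0.0
    trajectory = []
    for _ in range(steps):
        tau = pd_with_rfi(q, qd, rng)
        qdd = (tau - damping * qd) / inertia
        qd += qdd * DT
        q += qd * DT
        trajectory.append(q)
    return np.array(trajectory)


nominal = simulate()                                # no RFI
perturbed = simulate(rng=np.random.default_rng(0))  # with RFI
print("final error without RFI:", abs(nominal[-1] - Q_DES))
print("final error with RFI:   ", abs(perturbed[-1] - Q_DES))
```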

How does RAO model mass and kinematics variations?

The figure on the left demonstrates the effects of adding RAO as a feed-forward term of the PD controller (Kp=15, Kd=1) when commanding a position of +0.17 [rad] (~10 [deg]) to the hind right knee.

In this case, the additional torque shifts the desired position of the joint and implicitly models offsets in the joint position (kinematic variations) or in the payload supported by the robot.

Evidence of these effects can be found in Figs. 5 and 6 of the paper, where changing the position of the knee joint affects the success rate of the controller.

Regarding mass, Fig. 3 of the paper and videos 3, 4, and 10 demonstrate the robustness of the controllers even when the unmodelled payload reaches 42% of the total weight of the robot.
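As a rough illustration of how RAO differs from RFI, the sketch below combines the two into an ERFI-style perturbation for one joint: the offset is drawn once per episode and held constant, while the per-step force is re-sampled at every control cycle. The bounds and class name are placeholder choices for illustration, not the values used in the paper.

```python
import numpy as np

RAO_BOUND = 0.8   # assumed bound on the per-episode offset [Nm]
RFI_BOUND = 0.5   # assumed bound on the per-step perturbation [Nm]


class ERFIPerturbation:
    """Sketch of extended random force injection (ERFI) for one joint.

    RAO: an actuation offset sampled once per episode and held constant,
         implicitly modelling joint-position offsets and payload changes.
    RFI: a perturbation re-sampled at every control step, implicitly
         modelling delays, friction and inertia variations.
    """

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.offset = 0.0

    def reset(self):
        # Called once at the start of each training episode.
        self.offset = self.rng.uniform(-RAO_BOUND, RAO_BOUND)

    def apply(self, tau_pd):
        # Applied to the PD torque at every control step.
        return tau_pd + self.offset + self.rng.uniform(-RFI_BOUND, RFI_BOUND)


erfi = ERFIPerturbation()
erfi.reset()
print(erfi.apply(2.0))  # perturbed torque for a nominal 2.0 Nm command
```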


ERFI-based Policy on ANYmal C

ERFI_flat_grounANYmalC.mp4

1) ERFI-50 Flat Ground

20220608_180531.mp4

2) ERFI-50 Uneven Ground Indoor

ERFI_arm_1.mp4

3) ERFI-50 with Kinova Manipulator (A)

ERFI_arm_2.mp4

4) ERFI-50 with Kinova Manipulator (B)

Ascent - Math Institute - Edited.mov

5) ERFI-50 Outdoor

ERFI-based Policy on Unitree A1 Exhibiting Dynamic Locomotion

slippery_surface.mp4

6) Reactivity on Slippery Surface

slippery_soft_surface.mp4

7) Transitioning from Slippery Surface to Foam

impulsive_forces_1.mp4

8) Impulsive Forces with Payload ~3.5 Kg (A)

external_pushes_1.mp4

9) External Forces with Payload ~3.5 Kg

walking_5_kg.mp4

10) Payload ~5 Kg

walking_3.5_kg.mp4

11) Payload ~3.5 Kg

Wooden Block: 3.598 Kg

Wooden Block: 1.481 Kg

ramp_2.mp4

12) Blind Locomotion over Ramp with Payload ~3.5 Kg (A)

walking_on_cylinders.mp4

13) A1 Walking on Wooden Cylinders

Wooden Cylinders

limping_a1.mp4

14) Weak Actuation Test Demonstrating Adaptive Behaviour

Adaptive Behaviour

To test the controller's ability to adapt to variations in system dynamics not explicitly observed during training, we reduced the position tracking gain (Kp) of the right-hind knee to 33% of its original value.

We observed that the policy was still able to track the desired velocity commands.
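A minimal sketch of this gain modification is shown below; the joint ordering, the nominal gains, and the joint index are assumed for illustration, not taken from the robot's configuration.

```python
import numpy as np

NUM_JOINTS = 12                       # 12 actuated joints on the A1
KP_NOMINAL = np.full(NUM_JOINTS, 15.0)
KD_NOMINAL = np.full(NUM_JOINTS, 1.0)
RH_KNEE = 11                          # hypothetical index of the right-hind knee

kp = KP_NOMINAL.copy()
kp[RH_KNEE] *= 0.33                   # weaken a single actuator to 33% of nominal


def pd_torques(q_target, q, qd):
    """Joint torques from the policy's position targets with the weakened gain."""
    return kp * (q_target - q) + KD_NOMINAL * (0.0 - qd)


tau = pd_torques(np.zeros(NUM_JOINTS), np.full(NUM_JOINTS, 0.1), np.zeros(NUM_JOINTS))
print(tau[RH_KNEE], tau[0])           # the weakened knee produces ~1/3 of the torque
```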

Comparing Policies trained with ERFI-50, Domain Randomization, and No-Randomization

faster_gait.mp4

15) ERFI-50: A1 Dynamic Gait

ERFI-50

The policy was trained using ERFI-50. We demonstrate that the ERFI-50 strategy is able to produce dynamic and robust locomotion behaviour.

RMA_dom_rand.mp4

16) Domain Randomization Trained with RMA Randomization Settings

Domain Randomization - Aggressive

The controller on the left was trained using Domain Randomization.

The parameters and intervals of the Domain Randomization are taken from: "RMA: Rapid Motor Adaptation for Legged Robots", Kumar et al.

We observed that the resulting policy converged to a conservative behaviour in the absence of the Adaptation Module proposed in the RMA paper.

This is consistent with the observations of Xie et al. in "Dynamics Randomization Revisited: A Case Study for Quadrupedal Locomotion".


soft_domain_randomisation.mp4

17) Domain Randomization Trained with Smaller Distributions than RMA

Domain Randomization - Soft

The controller on the left was trained using Domain Randomization.

A smaller randomization range was used compared to the distributions adopted in the RMA paper.

With the smaller randomization ranges, the policy was able to track higher velocity commands and exhibited more dynamic behaviours.

A1fail2.mp4

18) No Randomization of any kind (A)

No-Randomization

The controller was trained without utilising any Domain Randomization strategy.

We were not able to achieve a successful sim-to-real transfer.

ERFI robustness to delays

Injecting delays

In the figure on the left we show the effects of delays on the PD controller tracking (Kp=15, Kd=1), when commanding a step of +0.17 [rad] (~10 [deg]) to the hind right knee.

A1 policy robustness to delay

During forward locomotion at 0.5 m/s, we injected delays (as above) into the actuation dynamics of each motor, and the policy demonstrated a 100% success rate.

We didn't consider injecting more than 10 steps of delay because the delay measured on the robot is ~10 [ms].

In simulation the PD control is executed at 500 [Hz]: 10 [steps] * 0.002 [s/step] = 0.02 [s], which is already double the delay measured on the robot.
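The sketch below shows one way such a delay can be injected: the policy's joint targets pass through a FIFO buffer, so the 500 Hz PD loop acts on commands that are a fixed number of control cycles old. The buffer-based mechanism is an assumption for illustration; only the gains and the control rate come from the text.

```python
from collections import deque

KP, KD = 15.0, 1.0   # PD gains used in the text
DT = 0.002           # 500 Hz PD loop, so 10 steps correspond to 0.02 s


class DelayedPD:
    """PD controller whose position target is delayed by `delay_steps` cycles."""

    def __init__(self, delay_steps, q_init=0.0):
        # Pre-fill the buffer so the first cycles act on the initial command.
        self.buffer = deque([q_init] * (delay_steps + 1), maxlen=delay_steps + 1)

    def torque(self, q_target, q, qd):
        self.buffer.append(q_target)   # newest command enters the queue
        q_delayed = self.buffer[0]     # the controller acts on the oldest one
        return KP * (q_delayed - q) + KD * (0.0 - qd)


pd = DelayedPD(delay_steps=10)
print(pd.torque(0.17, q=0.0, qd=0.0))  # still tracking the stale initial command
```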

Additional Experiments

Additional runs of some of the experiments above are provided below.

external_pushes_2.mp4

19) External Forces

impulsive_forces_2.mp4

20) Impulsive Forces with Payload ~3.5 Kg (B)

ramp_1.mp4

21) Blind Locomotion over Ramp with Payload ~3.5 Kg (B)

A1_walking_1.mp4

22) ERFI-50 Flat Ground Outdoor (A)

A1_walking_2.mp4

23) ERFI-50 Flat Ground Outdoor (B)

A1_walking_3.mp4

24) ERFI-50 Flat Ground Outdoor (C)

A1fail1.mp4

25) No Randomization of any kind (B)