Learning and Deploying Robust Locomotion Policies with Minimal Dynamics Randomization
Luigi Campanaro, Siddhant Gangapurwala, Wolfgang Merkt, and Ioannis Havoutis
Dynamic Robot Systems Group (DRS), University of Oxford
Abstract
Training of deep reinforcement learning (DRL) locomotion policies often requires massive amounts of data to converge to a desired behavior. In this regard, simulators provide a cheap and abundant source. For successful sim-to-real transfer, exhaustively engineered approaches such as system identification, dynamics randomization, and domain adaptation are generally employed. As an alternative, we investigate a simple strategy of random force injection (RFI) to perturb system dynamics during training. We show that application of random forces enables us to emulate dynamics randomization. This allows us to obtain locomotion policies that are robust to variations in system dynamics. We further extend RFI, referred to as extended random force injection (ERFI), by introducing an episodic actuation offset. We demonstrate that ERFI provides additional robustness to inertial shifts offering on average a 62.5% improved performance over RFI for variations in system mass. We also show that ERFI is sufficient to perform a successful sim-to-real transfer on two different quadrupedal platforms, ANYmal C and Unitree A1, even for perceptive locomotion over uneven terrain in outdoor environments.
In the following sections we provide videos of hardware experiments with ANYmal C and Unitree A1; all policies were trained in simulation with ERFI-50.
Why does ERFI work?
When transferring controllers from simulation to real systems we need to account for several variations in their dynamics. In the following we explain how ERFI, which is composed of RFI and RAO (random actuation offset), mitigates some of the discrepancies that affect controller performance the most in dynamic environments: delays, kinematics variations, and mass variations.
How does RFI model delays?
In the figure on the left we show the effects of adding RFI as a feed-forward term of the PD controller (Kp=15, Kd=1) when commanding a position of +0.17 [rad] (~10 [deg]) to the hind right knee.
As can be seen from the plot, the yellow line reaches the desired position faster than the green line, although the green line settles earlier.
This implies that RFI adds stochasticity to the rise and settling times: depending on the direction of the perturbation, they either increase or decrease. This allows us to implicitly randomise the actuation dynamics, especially parameters related to delays, friction, and inertia, as sketched below.
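The snippet below is a minimal sketch of how such a perturbation can be added as a feed-forward term of the PD controller. The function name and the perturbation magnitude `rfi_scale` are illustrative placeholders, not the values used for training.

```python
import numpy as np

kp, kd = 15.0, 1.0      # PD gains used in the tracking experiment
rfi_scale = 0.5         # [Nm] perturbation range (illustrative, not the trained value)

def pd_with_rfi(q_des, q, qd, rng):
    """PD torque with a random feed-forward perturbation resampled at every step."""
    tau_rfi = rng.uniform(-rfi_scale, rfi_scale)   # RFI feed-forward term
    return kp * (q_des - q) - kd * qd + tau_rfi

rng = np.random.default_rng(0)
tau = pd_with_rfi(q_des=0.17, q=0.0, qd=0.0, rng=rng)  # step command of +0.17 rad
```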
How does RAO model mass and kinematics variations?
The figure on the left demonstrates the effects of adding RAO as a feed-forward term of the PD controller (Kp=15, Kd=1) when commanding a position of +0.17 [rad] (~10 [deg]) to the hind right knee.
In this case, the additional torque shifts the desired position of the joint and implicitly models offsets in the joint position (kinematics variations) or in the payload supported by the robot.
Evidence of these effects can be found in Fig. 5 and 6 of the paper, where changing the position of the knee joint affects the success rate of the controller.
Regarding the mass, Fig. 3 of the paper and videos 3, 4, and 10 demonstrate the robustness of the controllers even when the unmodelled payload reaches 42% of the total weight of the robot.
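Below is a minimal sketch of how RFI and RAO can be combined into ERFI as feed-forward torques: the episodic offset (RAO) is sampled once at episode reset and held constant, while the per-step noise (RFI) is resampled at every control step. The class name and the numeric scales are illustrative placeholders; see the paper for the actual ERFI-50 settings.

```python
import numpy as np

class ERFIPerturbation:
    """Random feed-forward torques: RFI resampled per step, RAO per episode."""

    def __init__(self, n_joints, rfi_scale, rao_scale, seed=0):
        self.n_joints = n_joints
        self.rfi_scale = rfi_scale   # per-step random force injection range [Nm]
        self.rao_scale = rao_scale   # per-episode random actuation offset range [Nm]
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        # RAO: sampled once at episode reset and held constant, emulating
        # payload changes and joint-position (kinematic) offsets.
        self.offset = self.rng.uniform(-self.rao_scale, self.rao_scale, self.n_joints)

    def torque(self):
        # RFI: resampled at every control step, emulating delays, friction and inertia.
        noise = self.rng.uniform(-self.rfi_scale, self.rfi_scale, self.n_joints)
        return self.offset + noise

# Inside a training loop: tau_applied = tau_pd + perturb.torque(),
# with perturb.reset() called at every episode reset.
perturb = ERFIPerturbation(n_joints=12, rfi_scale=0.5, rao_scale=1.0)
```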
ERFI-based Policy on ANYmal C
1) ERFI-50 Flat Ground
2) ERFI-50 Uneven Ground Indoor
3) ERFI-50 with Kinova Manipulator (A)
4) ERFI-50 with Kinova Manipulator (B)
5) ERFI-50 Outdoor
ERFI-based Policy on Unitree A1 Exhibiting Dynamic Locomotion
6) Reactivity on Slippery Surface
7) Transitioning from Slippery Surface to Foam
8) Impulsive Forces with Payload ~3.5 Kg (A)
9) External Forces with Payload ~3.5 Kg
10) Payload ~5 Kg
11) Payload ~3.5 Kg
Wooden Block: 3.598 Kg
Wooden Block: 1.481 Kg
12) Blind Locomotion over Ramp with Payload ~3.5 Kg (A)
13) A1 Walking on Wooden Cylinders
Wooden Cylinders
14) Weak Actuation Test Demonstrating Adaptive Behaviour
Adaptive Behaviour
To test the controller's ability to adapt to variations in system dynamics not explicitly observed during training, we reduced the position tracking gain (Kp) of the hind right knee to 33% of its original value.
We observed that the policy was still able to track the desired velocity commands.
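A minimal sketch of this weak-actuation test is given below, assuming a 12-joint gain vector and an illustrative joint index: only the position gain of the hind right knee is scaled, everything else is left at its nominal value.

```python
import numpy as np

N_JOINTS = 12
RH_KNEE_IDX = 11               # index of the hind right knee (assumed joint ordering)

kp = np.full(N_JOINTS, 15.0)   # nominal position gains
kd = np.full(N_JOINTS, 1.0)    # nominal velocity gains
kp[RH_KNEE_IDX] *= 0.33        # weaken a single actuator at deployment time

def pd_torque(q_des, q, qd):
    return kp * (q_des - q) - kd * qd
```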
Comparing Policies trained with ERFI-50, Domain Randomization, and No-Randomization
15) ERFI-50: A1 Dynamic Gait
ERFI-50
The policy was trained using ERFI-50. We demonstrate that the ERFI-50 strategy yields a policy able to exhibit dynamic and robust locomotion behaviour.
16) Domain Randomization Trained with RMA Randomization Settings
Domain Randomization - Aggressive
The controller on the left was trained using Domain Randomization.
The parameters and intervals of the Domain Randomization are taken from "RMA: Rapid Motor Adaptation for Legged Robots" by Kumar et al.
We observed that the resulting policy converged to a conservative behaviour in the absence of the Adaptation Module proposed in the RMA paper.
This is consistent with the observations of Xie et al. in "Dynamics Randomization Revisited: A Case Study for Quadrupedal Locomotion".
17) Domain Randomization Trained with Smaller Distributions than RMA
Domain Randomization - Soft
The controller on the left was trained using Domain Randomization.
A smaller randomization range was used compared to the distributions adopted in the RMA paper.
With these narrower ranges, the policy was able to track higher velocity commands and exhibited more dynamic behaviour.
18) No Randomization of any kind (A)
No-Randomization
The controller was trained without utilising any Domain Randomization strategy.
We were not able to achieve a successful sim-to-real transfer.
ERFI robustness to delays
Injecting delays
In the figure on the left we show the effects of delays on the PD controller tracking (Kp=15, Kd=1), when commanding a step of +0.17 [rad] (~10 [deg]) to the hind right knee.
A1 policy robustness to delay
During forward locomotion at 0.5 m/s, we injected delays (as above) in the actuation dynamics of each motor, and the policy demonstrated a 100% success rate.
We did not inject more than 10 steps of delay because the delay measured on the robot is ~10 [ms].
In simulation the PD control is executed at 500 [Hz], so 10 [steps] * 0.002 [s/step] = 0.02 [s], which is already double the delay measured on the robot.
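The snippet below sketches one way such a delay can be injected in simulation: the desired joint positions are routed through a FIFO buffer, so the PD controller tracks a command issued a fixed number of control steps earlier. Names and default values are illustrative.

```python
from collections import deque
import numpy as np

CONTROL_DT = 0.002    # 500 Hz PD loop in simulation
DELAY_STEPS = 10      # 10 steps * 0.002 s/step = 0.02 s, about twice the ~10 ms measured on hardware

class DelayedCommand:
    """FIFO buffer that returns the desired joint positions issued DELAY_STEPS earlier."""

    def __init__(self, q_init, delay_steps=DELAY_STEPS):
        self.buffer = deque([np.asarray(q_init, dtype=float)] * (delay_steps + 1),
                            maxlen=delay_steps + 1)

    def step(self, q_des):
        self.buffer.append(np.asarray(q_des, dtype=float))
        return self.buffer[0]   # oldest command in the buffer, delay_steps steps old

# Example: the PD controller tracks delayed_cmd.step(q_des) instead of q_des.
delayed_cmd = DelayedCommand(q_init=np.zeros(12))
```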
Additional Experiments
Additional runs of some of the experiments above are provided below.
19) External Forces
20) Impulsive Forces with Payload ~3.5 Kg (B)
21) Blind Locomotion over Ramp with Payload ~3.5 Kg (B)
22) ERFI-50 Flat Ground Outdoor (A)
23) ERFI-50 Flat Ground Outdoor (B)
24) ERFI-50 Flat Ground Outdoor (C)
25) No Randomization of any kind (B)