Offline Meta-Reinforcement Learning for Industrial Insertion


Tony Z. Zhao*, Jianlan Luo*, Oleg Sushkov, Rugile Pevceviciute, Nicolas Heess, Jon Scholz, Stefan Schaal, Sergey Levine

X, The Moonshot Factory, Intrinsic, DeepMind, Google Brain, UC Berkeley

In this paper, we introduce Offline meta-RL with Demonstration Adaptation (ODA) and apply it to industrial insertion. We address two specific challenges:

First, conventional meta-RL algorithms require lengthy online meta-training. We show that this can be replaced with appropriately chosen offline data, resulting in an offline meta-RL method that only requires demonstrations and trials from each of the prior tasks.

Second, meta-RL methods can fail to generalize to new tasks that are too different from those seen at meta-training time, which poses a particular challenge in industrial applications, where high success rates are critical. We address this by combining contextual meta-learning with direct online finetuning: if the new task is similar to those seen in the prior data, then the contextual meta-learner adapts immediately, and if it is too different, it gradually adapts through finetuning.

We show that ODA is able to quickly adapt to a variety of different insertion tasks with a success rate of 100% using only a fraction of the samples needed for learning from scratch.

Experiments

Robot Setup

We focus our experiments on industrial insertion and run them on a KUKA iiwa7 robot. The agent controls the TCP twist of the robot at 10 Hz, which is tracked and interpolated by a downstream impedance controller running at 1000 Hz. The observation provided to the agent consists of the robot's TCP pose, velocity, and the wrench measured at the tool tip.
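To make this interface concrete, here is a minimal sketch of the observation and action handling. The class name, field layout, and twist limits are our own illustrative assumptions, not the exact interface used in our setup.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class InsertionObservation:
    """What the agent observes at 10 Hz. Field layout is an illustrative assumption."""
    tcp_pose: np.ndarray      # (7,) position [m] + quaternion orientation
    tcp_velocity: np.ndarray  # (6,) linear [m/s] + angular [rad/s] velocity
    tcp_wrench: np.ndarray    # (6,) force [N] + torque [Nm] measured at the tool tip

    def to_vector(self) -> np.ndarray:
        """Flatten into a single vector for the policy."""
        return np.concatenate([self.tcp_pose, self.tcp_velocity, self.tcp_wrench])

def clip_twist(action: np.ndarray, max_lin=0.05, max_ang=0.2) -> np.ndarray:
    """Clip the commanded TCP twist (6-DoF velocity) before sending it to the
    1 kHz impedance controller. The limits here are placeholders."""
    twist = action.copy()
    twist[:3] = np.clip(twist[:3], -max_lin, max_lin)   # linear velocity [m/s]
    twist[3:] = np.clip(twist[3:], -max_ang, max_ang)   # angular velocity [rad/s]
    return twist
```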

For both training and test tasks, we add perturbation noise, sampled from Uniform[-1 mm, 1 mm], to the starting pose at the beginning of each episode. The policy does not have access to this noise and therefore must be robust to it when inserting into the socket.
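A minimal sketch of this perturbation, assuming the noise is applied independently per translational axis (the exact axes perturbed are not specified above):

```python
import numpy as np

def perturb_start_pose(nominal_pose: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Add Uniform[-1 mm, 1 mm] translational noise to the nominal starting pose.

    `nominal_pose` is assumed to start with [x, y, z] in meters; the policy
    never observes the sampled offset.
    """
    perturbed = nominal_pose.copy()
    perturbed[:3] += rng.uniform(-1e-3, 1e-3, size=3)  # +/- 1 mm per axis
    return perturbed
```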

Task Setup

We perform offline meta-training on 11 insertion tasks, as shown below. For each task, we run DDPG until convergence and save the replay buffer as the offline dataset. We also roll out the converged policy with injected noise to increase the dataset size. On average, each task has 500 trials from the DDPG replay buffer, 200 trials from the noisy policy, and 20 trials of human demonstrations.
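The per-task data composition can be summarized with a small container like the one below; the class and field names are our own, and the trial counts are the averages quoted above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

Trial = List[Dict[str, Any]]  # one episode, stored as a list of transitions

@dataclass
class TaskDataset:
    """Offline data for a single training task (names and format are our own)."""
    ddpg_replay: List[Trial] = field(default_factory=list)     # ~500 trials
    noisy_rollouts: List[Trial] = field(default_factory=list)  # ~200 trials
    demonstrations: List[Trial] = field(default_factory=list)  # ~20 trials

    def all_trials(self) -> List[Trial]:
        """All trials used for offline meta-training on this task."""
        return self.ddpg_replay + self.noisy_rollouts + self.demonstrations

# One such dataset per training task (11 in total).
offline_data = {task_id: TaskDataset() for task_id in range(11)}
```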

Meta-Adaptation

We meta-adapt our pretrained policy using demonstrations on 12 test tasks. Our policy reaches a 100% success rate on 6 of the 12 tasks, outperforming the standard offline RL policy by a large margin.

Online Finetuning

For tasks where meta-adaptation fails to reach 100%, we perform online finetuning initialized with the adapted policy. As shown in the training curves below, our method reaches a 100% success rate with less than 7 minutes of training on average, while behavior cloning on the demonstrations fails to reach 100%. This is because simply cloning the demonstrations does not produce a policy that is robust to the perturbation noise in the starting pose.

Challenge Tasks

We additionally challenge our method with three out-of-distribution tasks: (1) RAM insertion, (2) network card insertion, and (3) complex shape insertion.

(1) and (2) require handling delicate electronics and applying significant force. Training from scratch would be impractical, since an untrained policy could easily damage the circuit board. (3) is challenging because of its small clearance and a contact geometry very different from the training tasks. For these three tasks, we do not inject uniform noise into the starting pose, to keep them tractable.

Our method solves all three tasks with 100/100 successes within 30 minutes. For network card insertion, we succeed without any finetuning. We posit that inserting the RAM is significantly harder than the network card because it is much wider (288 vs. 18 pins) and requires high precision along the yaw axis.

Scaling with More Data



We plot the adaptation performance on three tasks as the number of training tasks increases. The success rate increases steadily as the algorithm gains access to more data.

Hyperparameters and Reproducibility


Network Architectures:

Actor network

  • fully connected with hidden dimensions [600,400,32].

  • outputs action directly.

Critic network

  • fully connected with hidden dimensions [800,600,32].

  • follows Bellemare et al. to output a return distribution over 100 atoms.

Encoder network

  • fully connected with hidden dimensions [600,400,32].

  • outputs the mean and standard deviation of the latent variable.


We use the same architectures when running the baselines, e.g., the BC baseline uses the same actor network described above.
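For reference, below is a minimal PyTorch sketch of these three networks. Input dimensions, activations, the tanh action squashing, and the value-support range of the distributional critic are assumptions on our part; only the hidden dimensions, the 100 atoms, and the 12-dimensional latent come from the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(in_dim, hidden_dims, out_dim):
    """Fully connected network with ReLU activations between layers."""
    layers, d = [], in_dim
    for h in hidden_dims:
        layers += [nn.Linear(d, h), nn.ReLU()]
        d = h
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

class Actor(nn.Module):
    """Actor with hidden dims [600, 400, 32]; outputs the action directly.
    The tanh squashing assumes actions are normalized to [-1, 1]."""
    def __init__(self, obs_dim, latent_dim, act_dim):
        super().__init__()
        self.net = mlp(obs_dim + latent_dim, [600, 400, 32], act_dim)

    def forward(self, obs, z):
        return torch.tanh(self.net(torch.cat([obs, z], dim=-1)))

class DistributionalCritic(nn.Module):
    """Critic with hidden dims [800, 600, 32]; outputs a categorical return
    distribution over 100 atoms (following Bellemare et al.). The support
    range [vmin, vmax] is an assumption."""
    def __init__(self, obs_dim, latent_dim, act_dim, num_atoms=100, vmin=0.0, vmax=1.0):
        super().__init__()
        self.net = mlp(obs_dim + latent_dim + act_dim, [800, 600, 32], num_atoms)
        self.register_buffer("support", torch.linspace(vmin, vmax, num_atoms))

    def forward(self, obs, z, action):
        logits = self.net(torch.cat([obs, z, action], dim=-1))
        probs = F.softmax(logits, dim=-1)
        q_value = (probs * self.support).sum(dim=-1)  # expected return
        return probs, q_value

class Encoder(nn.Module):
    """Context encoder with hidden dims [600, 400, 32]; outputs the mean and
    standard deviation of the 12-dimensional latent task variable."""
    def __init__(self, context_dim, latent_dim=12):
        super().__init__()
        self.net = mlp(context_dim, [600, 400, 32], 2 * latent_dim)

    def forward(self, context):
        mean, raw_std = self.net(context).chunk(2, dim=-1)
        std = F.softplus(raw_std) + 1e-6  # keep the standard deviation positive
        return mean, std
```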


Hyperparameters - ODA Pretraining

actor lr: 3e-4

critic lr: 3e-4

encoder lr: 3e-4

adam beta1: 0.88

adam beta2: 0.92

sample offline data batch size: 1024

sample demo data batch size: 64

latent dimension: 12

beta: 0.01

action noise: 0.05

discount factor: 0.99

The action and observation spaces are normalized to [-1, 1].

We perform data collection and training asynchronously and cap the learn/act ratio at 3.
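For convenience, the pretraining hyperparameters above collected into a single Python config. The key names are our own convention, and we assume "beta" is the weight on the encoder's KL term.

```python
# ODA pretraining hyperparameters, copied from the list above.
ODA_PRETRAIN_CONFIG = dict(
    actor_lr=3e-4,
    critic_lr=3e-4,
    encoder_lr=3e-4,
    adam_beta1=0.88,
    adam_beta2=0.92,
    offline_batch_size=1024,   # transitions sampled from the offline data per step
    demo_batch_size=64,        # transitions sampled from the demonstrations per step
    latent_dim=12,
    kl_beta=0.01,              # "beta": assumed to weight the encoder's KL term
    action_noise=0.05,         # exploration noise, in normalized action units
    discount=0.99,
    max_learn_act_ratio=3,     # cap on gradient steps per environment step
)
```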


Hyperparameters - ODA Finetuning

actor lr: 3e-5

critic lr: 3e-5

encoder lr: 3e-5

We use learning-rate warmup for the first 10k gradient steps.

The rest of the parameters are the same as in pretraining.
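A minimal sketch of the warmup schedule, assuming a linear ramp from 0 to the target learning rate over the first 10k gradient steps (the exact warmup shape is not specified above):

```python
import torch

def warmup_lambda(step: int, warmup_steps: int = 10_000) -> float:
    """Linearly scale the learning rate from ~0 to its target over the first 10k steps."""
    return min(1.0, (step + 1) / warmup_steps)

# Example: warmup applied to the finetuning actor optimizer
# (lr = 3e-5, Adam betas carried over from pretraining).
actor = torch.nn.Linear(16, 6)  # placeholder module, for illustration only
optimizer = torch.optim.Adam(actor.parameters(), lr=3e-5, betas=(0.88, 0.92))
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_lambda)

# Inside the training loop, after each gradient step:
#     optimizer.step()
#     scheduler.step()
```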


Hyperparameters - AWAC Pretraining/Finetuning

Same as ODA Pretraining/Finetuning, without the encoder-related parameters.


Hyperparameters - BC

Same as ODA Pretraining, without the encoder- and critic-related parameters.


Stopping Criteria

We use one training task, "E-model-10p", to select the training length for each method, and use that length for all evaluations on the test tasks.

For finetuning, we stop training when 50 consecutive successes are reached, then evaluate the last checkpoint with 100 trials.
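A minimal sketch of this stopping and evaluation rule; `run_episode` is a hypothetical helper that runs one trial and returns whether it succeeded.

```python
def should_stop(recent_successes, required_consecutive=50):
    """Stop finetuning once the last 50 episodes were all successful."""
    if len(recent_successes) < required_consecutive:
        return False
    return all(recent_successes[-required_consecutive:])

def evaluate_final_checkpoint(policy, run_episode, num_trials=100):
    """Evaluate the last checkpoint with 100 trials and return the success rate."""
    successes = sum(bool(run_episode(policy)) for _ in range(num_trials))
    return successes / num_trials
```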