Transferring Tactile-based Continuous Force Control Policies from Simulation to Robot

Luca Lach, Robert Haschke, Davide Tateo, Jan Peters, Helge Ritter, Júlia Borràs, Carme Torras

Abstract

The advent of tactile sensors in robotics has sparked many ideas on how robots can leverage direct contact measurements of their environment interactions to improve manipulation tasks. An important line of research in this regard is grasp force control, which aims to manipulate objects safely by limiting the force exerted on them. While prior works have either hand-modeled their force controllers, relied on model-based approaches, or not demonstrated sim-to-real transfer, we propose a model-free deep reinforcement learning approach that is trained in simulation and then transferred to the robot without further fine-tuning.

To this end, we present a simulation environment that produces realistic normal forces, which we use to train continuous force control policies. An evaluation against a baseline, together with an ablation study, shows that our approach outperforms the hand-modeled baseline, and that our proposed inductive bias and domain randomization facilitate sim-to-real transfer.

Code | Paper* | Models | CAD

* Points to the version published at the NeurIPS 2023 Workshop on Touch Processing. A conference preprint will be made available upon acceptance.

Simulation Environment

A schematic overview of the grasping scenario we consider is shown on the left (upper); it details all parameters needed to define the scenario. The gripper is depicted in its fully open state, with an object located between the fingers somewhere along the grasping axis.

W and O refer to the world and object frames, where W is considered fixed w.r.t. the gripper base and centered between the fingertips. If the offset oy is non-zero, the symmetric object is displaced w.r.t. this center, causing one fingertip to touch it earlier than the other. The object width is defined by wo, and the maximum penetration depth (or object deformation) is given by dp. Softer objects can be deformed more and thus have larger values of dp.

As the controller should learn to maintain the object's position during grasping, we sample oy at episode initialization, exposing the learner to different object-gripper alignments. wo is also varied so that the policy does not implicitly assume all objects to be of equal width.
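For illustration, a minimal sketch of how such per-episode randomization could look in a reset function (the parameter ranges and the inclusion of kappa here are placeholders, not the exact values used in our environment):

```python
import numpy as np

# Hypothetical randomization ranges; the actual values used in our
# environment differ and are documented in the paper and code.
OY_RANGE = (-0.015, 0.015)   # object offset along the grasping axis [m]
WO_RANGE = (0.02, 0.06)      # object width [m]
KAPPA_RANGE = (0.1, 1.0)     # object softness (scales max. deformation dp)

def sample_episode_params(rng: np.random.Generator) -> dict:
    """Draw a fresh object configuration at the start of each episode."""
    return {
        "oy": rng.uniform(*OY_RANGE),        # displaces the object towards one finger
        "wo": rng.uniform(*WO_RANGE),        # keeps the policy from assuming a fixed width
        "kappa": rng.uniform(*KAPPA_RANGE),  # varies how much the object can deform
    }

# Example: re-sample on every episode reset
rng = np.random.default_rng(seed=0)
params = sample_episode_params(rng)
```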

We then performed a series of robot experiments to calibrate the MuJoCo simulation, depicted on the left (lower), so that it reproduces realistic joint and sensor behavior. The results can be seen in the figures below.

Variation of actuator parameter b2

Variation of solver impedance

Variation of force scaling and impedance
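As a rough illustration of how such parameters can be perturbed via the MuJoCo Python bindings (the model path, geom and actuator names, and the mapping of b2 to an actuator bias term are assumptions for this sketch, not our exact setup):

```python
import mujoco

# Hypothetical scene file; our actual gripper model is provided via the Models/CAD links.
model = mujoco.MjModel.from_xml_path("gripper_scene.xml")
data = mujoco.MjData(model)

# Vary the contact solver impedance of a fingertip geom (solimp[0:2] = dmin, dmax).
geom_id = mujoco.mj_name2id(model, mujoco.mjtObj.mjOBJ_GEOM, "fingertip_left")
model.geom_solimp[geom_id, 0] = 0.9   # dmin
model.geom_solimp[geom_id, 1] = 0.95  # dmax

# Vary an actuator bias parameter (assumed here to correspond to b2,
# i.e. the velocity/damping term of a position actuator).
act_id = mujoco.mj_name2id(model, mujoco.mjtObj.mjOBJ_ACTUATOR, "finger_left_joint")
model.actuator_biasprm[act_id, 2] = -50.0

mujoco.mj_step(model, data)
print(data.sensordata)  # raw sensor readings, e.g. contact forces before any rescaling
```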

Experimental Evaluation

We use Proximal Policy Optimization (PPO) to train the grasping policies. We first evaluate our proposed method in simulation, and then apply the policies to the real robot and compare them in terms of force reward and object movements.

We run training for a total of 4M steps with an episode length of 150 steps. At the beginning of each episode, all randomization parameters are sampled anew. Our network consists of two fully connected layers with 50 neurons each and ReLU activations. The output layer has two neurons, one for each finger's desired position delta.
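A minimal sketch of this training setup using Stable-Baselines3 (the environment ID is a placeholder; the two-neuron output layer follows from the environment's 2-D continuous action space, and the full hyperparameters are in the linked code):

```python
import torch as th
from stable_baselines3 import PPO

# Placeholder environment ID; our actual MuJoCo grasping environment is in the linked code.
env_id = "TactileGrasping-v0"

policy_kwargs = dict(
    net_arch=[50, 50],         # two fully connected hidden layers, 50 neurons each
    activation_fn=th.nn.ReLU,  # ReLU activations
)

model = PPO(
    "MlpPolicy",
    env_id,
    policy_kwargs=policy_kwargs,
    n_steps=150,               # matches the episode length of 150 steps
    verbose=1,
)
model.learn(total_timesteps=4_000_000)  # 4M environment steps in total
```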

In a first evaluation, we compare our policy against a hand-crafted baseline model and perform an ablation study over two components of our method. Results are shown on the left (upper), where all models are evaluated on different object softness values kappa. All models perform well except for the one trained without domain randomization.

To assess whether our policies are general enough to be transferred to the real robot, we evaluate them on TIAGo using six test objects of varying stiffness. We perform 20 grasping trials per object and method, yielding 6 × 4 × 20 = 480 trials in total. In each trial, the object is offset towards one finger; in half of the trials it is placed closer to the left finger, in the other half closer to the right. The object is placed on millimeter paper, allowing us to measure its movement during the grasp. The policy is then commanded to perform a grasp, and after 6 seconds (150 steps at 25 Hz) the gripper is automatically opened again. After the reward is computed and the traveled distance measured, the process repeats. Note that the reward is not directly comparable to the simulation results, as it only includes the force reward. The object movement penalty would require measuring the object velocity at each time step; since this information is not available to us, we instead measure the total object displacement after each trial.
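For clarity, the per-trial procedure can be summarized as in the following sketch (the robot and policy interface functions are hypothetical placeholders; only the timing reflects our protocol):

```python
import time

CONTROL_HZ = 25
EPISODE_STEPS = 150  # 150 steps at 25 Hz = 6 s per grasp

def run_trial(policy, robot):
    """One real-robot grasping trial; all helper methods are hypothetical."""
    robot.open_gripper()
    obs = robot.get_observation()
    force_reward = 0.0
    for _ in range(EPISODE_STEPS):
        action = policy.predict(obs)             # desired position delta per finger
        robot.apply_position_delta(action)
        time.sleep(1.0 / CONTROL_HZ)
        obs = robot.get_observation()
        force_reward += robot.force_reward(obs)  # force term only; no movement penalty
    robot.open_gripper()
    displacement = robot.measure_displacement()  # read off the millimeter paper
    return force_reward, displacement
```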

Table I shows the results of the real-world evaluation. πIB shows the strongest overall performance with an average reward of 109, while the baseline achieves a slightly lower average reward of 107. Our experiments show that the inductive bias indeed facilitates sim-to-real performance, as evidenced by the lower average reward and larger object movements of πNO-IB. Furthermore, the (mostly) poor performance of πNO-RAND shows that domain randomization is a vital ingredient for both generalization over different objects and sim-to-real transfer. πNO-RAND was nevertheless able to perform well on rather soft objects, likely because they behave similarly to the simulated object configuration it was trained on with kappa = 0.5, and because our default choice for b2 closely mirrors the robot's behavior. Note that all models exhibit slightly worse performance in terms of object movement for the Sponge compared to the other objects. This is due to the Sponge being so light that the sensors sometimes fail to detect first contact, a phenomenon also reported in other works.

The evaluation clearly shows that domain randomization is crucial for successful zero-shot policy transfer, and that domain knowledge in the form of inductive biases further facilitates the transfer. Without domain randomization, policies overfit to their narrow training distribution and fail to generalize. Our proposed simulation environment has been shown to generate realistic forces, making the transfer of continuous control policies possible.

Video: force_ctrl_rl_icra2024_small.mp4

If you have any questions, feel free to contact us!