Achieving Gentle Manipulation with Deep Reinforcement Learning

Abstract

Robots must know how to be gentle when interacting with fragile objects, or when the robot itself is prone to wear and tear. We propose an approach that enables deep reinforcement learning to train policies that are gentle, both during exploration and task execution. Our approach involves augmenting the (task) reward with a penalty for non-gentleness. However, augmenting with this penalty alone impairs learning: policies get stuck in a local optimum of avoiding all contact with the environment. Introducing surprise-based intrinsic rewards solves this problem, as long as the right kind of surprise is chosen---penalty-based surprise is more effective than the typical dynamics-based surprise.

Manipulation Task

In the visualized rollouts below, the task is to touch the block with more than 5 N of force. When the robot achieves this, it receives a (task) reward of +1 and the block turns green. Fingertip color for non-zero impact forces ranges from yellow to red: purely yellow for a near-zero impact, and purely red for a high impact of 10 N. The two rollouts shown for each condition come from policies trained on the same augmented reward, from different random initializations. Each rollout is taken after training a policy with D4PG for 500k training steps.
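For concreteness, here is a minimal Python sketch of the sparse task reward and the fingertip color mapping described above; the constant names and the linear color interpolation are our own assumptions, not details from the experiments.

```python
import numpy as np

# Constant names and the linear color interpolation are our own
# assumptions, chosen to match the description above.
TASK_FORCE_THRESHOLD = 5.0     # N: contact force required to earn the task reward
COLOR_SATURATION_FORCE = 10.0  # N: force at which the fingertip renders purely red

def task_reward(block_contact_force: float) -> float:
    """Sparse task reward: +1 once the block is touched with more than 5 N."""
    return 1.0 if block_contact_force > TASK_FORCE_THRESHOLD else 0.0

def fingertip_color(impact_force: float) -> np.ndarray:
    """Interpolate fingertip RGB from yellow (near-zero impact) to red (10 N)."""
    t = np.clip(impact_force / COLOR_SATURATION_FORCE, 0.0, 1.0)
    yellow = np.array([1.0, 1.0, 0.0])
    red = np.array([1.0, 0.0, 0.0])
    return (1.0 - t) * yellow + t * red
```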

Task Reward Only

Agents trained on the task reward alone learn how to do the task, but do it in a high-impact way (note the red fingertips).

Task Reward + Impact Force Penalty

When we augment the reward with an impact force penalty, agents get stuck in a local optimum of avoiding contact with everything in the environment---they learn to be afraid of contact, because they encounter the impact penalty before the sparse task reward, which hinders exploration.
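As a sketch, the augmented reward can be as simple as the task reward minus a scaled impact penalty; the linear penalty shape and the coefficient below are illustrative assumptions, not the exact form used in the experiments.

```python
def augmented_reward(task_reward: float,
                     impact_force: float,
                     penalty_scale: float = 0.1) -> float:
    """Task reward minus an impact force penalty.

    The linear penalty and the value of `penalty_scale` are illustrative
    assumptions; the exact penalty shape and scale may differ.
    """
    return task_reward - penalty_scale * impact_force
```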

Task Reward + Impact Force Penalty + Dynamics-Based Surprise

Adding a dynamics-based surprise intrinsic reward, in addition to the task reward and impact force penalty, does not help agents learn the task. This is because agents can earn the dynamics-based intrinsic reward by exploring interesting configurations of the hand while still avoiding contact.
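Dynamics-based surprise typically rewards the prediction error of a learned forward model of the environment. The sketch below uses a linear model with online updates purely for brevity; a real implementation would train a neural network alongside the policy.

```python
import numpy as np

class DynamicsSurprise:
    """Dynamics-based surprise: the intrinsic reward is the prediction error
    of a forward model f(s, a) -> s'. A linear model with online updates
    stands in here (an assumption for brevity) for the neural network a
    real implementation would use."""

    def __init__(self, state_dim: int, action_dim: int, lr: float = 1e-3):
        self.W = np.zeros((state_dim, state_dim + action_dim))
        self.lr = lr

    def intrinsic_reward(self, s: np.ndarray, a: np.ndarray,
                         s_next: np.ndarray) -> float:
        x = np.concatenate([s, a])
        error = s_next - self.W @ x              # prediction error of the forward model
        self.W += self.lr * np.outer(error, x)   # one SGD step on the squared error
        return float(error @ error)              # surprise = squared prediction error
```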

Task Reward + Impact Force Penalty + Penalty-Based Surprise

In contrast, policies trained on the combination of task reward, an impact penalty, and penalty-based surprise intrinsic reward learn to achieve the task in a gentle (i.e., low-impact) way, by gradually increasing force applied to the block.
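Penalty-based surprise instead rewards the prediction error of a model that predicts the impact penalty itself, so surprise is concentrated exactly where contact outcomes are uncertain. Again, the linear predictor below is a stand-in assumption for a learned network; during training, this intrinsic term would simply be added to the penalized task reward at each step.

```python
import numpy as np

class PenaltySurprise:
    """Penalty-based surprise: the intrinsic reward is the prediction error
    of a model that predicts the impact penalty rather than the next state.
    The linear predictor is a stand-in assumption for a learned network."""

    def __init__(self, state_dim: int, action_dim: int, lr: float = 1e-3):
        self.w = np.zeros(state_dim + action_dim)
        self.lr = lr

    def intrinsic_reward(self, s: np.ndarray, a: np.ndarray,
                         impact_penalty: float) -> float:
        x = np.concatenate([s, a])
        error = impact_penalty - float(self.w @ x)  # penalty prediction error
        self.w += self.lr * error * x               # one SGD step on the squared error
        return error ** 2                           # surprise = squared error
```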

Gentle Manipulation on the Real Shadow Hand

Experiments on the real Shadow Dexterous Hand show results consistent with the findings obtained in simulation.

A force/torque sensor attached to a foam block measures the force on the block, and BioTac® sensors provide a complex array of tactile signals. To compute the force exerted by each finger, we read the pressure channel of each tactile sensor and normalize the readings to match the range of the simulated tactile sensors.
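A minimal sketch of that normalization step follows; the calibration ranges below are placeholder assumptions, since the true values depend on the hardware and on the simulated sensor model.

```python
import numpy as np

# Placeholder calibration ranges; the true values depend on the BioTac
# hardware and on the simulated sensor model.
RAW_PRESSURE_RANGE = (0.0, 4095.0)  # raw pressure-channel units (assumed)
SIM_TACTILE_RANGE = (0.0, 1.0)      # simulated tactile sensor units (assumed)

def normalize_pressure(raw: np.ndarray) -> np.ndarray:
    """Linearly rescale raw BioTac pressure readings onto the range of the
    simulated tactile sensors."""
    raw_lo, raw_hi = RAW_PRESSURE_RANGE
    sim_lo, sim_hi = SIM_TACTILE_RANGE
    t = np.clip((raw - raw_lo) / (raw_hi - raw_lo), 0.0, 1.0)
    return sim_lo + t * (sim_hi - sim_lo)
```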