The emerging utility of grasping robots in science and industry motivates the need for effective, seamless human-to-robot interaction. Recent research in deep learning has made safe, adaptive grasping of an object handed over by a human operator increasingly practical.
For this project, we aimed to recreate this kind of adaptive grasp with a single RGB camera mounted on the xArm's gripper tool. Using Deep Reinforcement Learning, we train the arm to identify an object placed in front of it and to repeatedly choose small joint movements that bring the gripper closer to a grasping position, until it is close enough to grasp the object from a user's hand.
State space:
64x64 RGB Image
Action space:
4 discrete axes, each taking a delta in {-1, 0, 1}:
Base Rotation (Yaw)
Vertical movement
Forward movement
Gripper Rotation
So, 3^4=81 possible actions
Reward:
Dense: negative distance to the object
Sparse: reward given only within an error margin of a graspable position (see the sketch after this list)
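A minimal sketch of how this MDP could be encoded, assuming a flat action index in [0, 81) and an illustrative 5 cm success margin (the helper names and the margin are assumptions, not values from our implementation):

    import numpy as np

    # 4 axes: base yaw, vertical, forward, gripper rotation; each delta is in {-1, 0, 1}.
    AXIS_DELTAS = (-1, 0, 1)
    NUM_AXES = 4

    def index_to_action(index):
        """Decode a flat action index in [0, 81) into 4 per-axis deltas."""
        deltas = []
        for _ in range(NUM_AXES):
            deltas.append(AXIS_DELTAS[index % 3])
            index //= 3
        return deltas  # e.g. [d_yaw, d_vertical, d_forward, d_grip]

    def reward(gripper_pos, object_pos, margin=0.05, sparse=True):
        """Sparse: 1 inside the graspable margin; dense: negative distance to the object."""
        dist = np.linalg.norm(np.asarray(gripper_pos) - np.asarray(object_pos))
        if sparse:
            return 1.0 if dist < margin else 0.0
        return -dist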
We trained the xArm in simulation first, which significantly sped up training because we could teleport the gripper between poses while learning the policy.
The algorithm used was DQN: we learned Q-values for our factored action space. The network outputs a separate set of three Q-values for each of the four action components; as described in the MDP, each component takes a value in {-1, 0, 1}, representing a move backwards, no movement, or a move forwards, respectively. The Q-value of an action is the sum of its components' per-axis Q-values, so the greedy action takes the argmax along each axis independently and its value is the sum of the per-axis maxima.
For example, if the per-axis maxima are 0.3, 0.4, 0.1, and 0.1, the greedy action's Q-value is 0.3 + 0.4 + 0.1 + 0.1 = 0.9.
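As a small sketch of that decomposition, using the maxima from the example above (the remaining per-axis values are made up for illustration):

    import numpy as np

    # One row of 3 Q-values per axis, ordered as Q(-1), Q(0), Q(+1).
    q_per_axis = np.array([
        [0.10, 0.20, 0.30],  # base rotation (yaw)
        [0.40, 0.00, 0.10],  # vertical movement
        [0.00, 0.10, 0.05],  # forward movement
        [0.10, 0.00, 0.00],  # gripper rotation
    ])

    greedy_deltas = q_per_axis.argmax(axis=1) - 1  # maps column {0, 1, 2} -> delta {-1, 0, +1}
    greedy_value = q_per_axis.max(axis=1).sum()    # 0.3 + 0.4 + 0.1 + 0.1 = 0.9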
Hyperparameters: γ (discount factor) = 0.98, ε (exploration rate) annealed from 1.0 to 0.02, α (learning rate) = 1e-3, batch size = 64, replay buffer size = 20000, target network update frequency = 1000 steps
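In a training config these settings might look like the sketch below; the linear anneal length for ε is an assumption, since only its endpoints are listed above:

    # DQN hyperparameters as listed above; EPS_ANNEAL_STEPS is an assumed value.
    GAMMA = 0.98                 # discount factor
    LEARNING_RATE = 1e-3
    BATCH_SIZE = 64
    BUFFER_SIZE = 20_000
    TARGET_UPDATE_EVERY = 1_000  # steps between target-network syncs
    EPS_START, EPS_END = 1.0, 0.02
    EPS_ANNEAL_STEPS = 100_000   # assumption: anneal length not specified above

    def epsilon(step):
        """Linearly anneal the exploration rate from EPS_START to EPS_END."""
        frac = min(step / EPS_ANNEAL_STEPS, 1.0)
        return EPS_START + frac * (EPS_END - EPS_START)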
Input Layer: Batch of 64x64 RGB images
2D Convolutional layer x4, 16 channels
ReLU Nonlinearity
Max-Pooling
Avg-Pooling
MLP
3 Dense Linear Layers
ReLU Non-Linearity
Output Layer: 12-vector of Q-values (4 axes × 3 deltas each)
max_a Q(s, a) = sum(max(q_axis) for q_axis in per_axis_q_values)
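A minimal PyTorch sketch of this architecture; kernel sizes, strides, pooling placement, and hidden widths are assumptions, since the list above only fixes four 16-channel conv layers, ReLU, max- and average-pooling, three dense layers, and a 12-dimensional output:

    import torch
    import torch.nn as nn

    class FactoredQNet(nn.Module):
        """CNN encoder + MLP head producing 4 groups of 3 Q-values (one group per action axis)."""

        def __init__(self, num_axes=4, deltas_per_axis=3, hidden=128):
            super().__init__()
            blocks, in_ch = [], 3
            for _ in range(4):  # 4 conv blocks, 16 channels each
                blocks += [nn.Conv2d(in_ch, 16, kernel_size=3, padding=1),
                           nn.ReLU(), nn.MaxPool2d(2)]
                in_ch = 16
            self.encoder = nn.Sequential(*blocks, nn.AdaptiveAvgPool2d(1), nn.Flatten())
            self.head = nn.Sequential(  # 3 dense layers
                nn.Linear(16, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, num_axes * deltas_per_axis),
            )
            self.num_axes, self.deltas = num_axes, deltas_per_axis

        def forward(self, obs):  # obs: (B, 3, 64, 64) batch of RGB images
            q = self.head(self.encoder(obs))               # (B, 12)
            return q.view(-1, self.num_axes, self.deltas)  # (B, 4, 3)

    net = FactoredQNet()
    q = net(torch.zeros(64, 3, 64, 64))            # batch of 64 images
    greedy_value = q.max(dim=2).values.sum(dim=1)  # max_a Q(s, a) per sample
    greedy_deltas = q.argmax(dim=2) - 1            # per-axis delta in {-1, 0, +1}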
We added the following training augmentations to assist in the Sim2Real transition (the background replacement is sketched after this list).
Background Replace
Gripper rotation correction
Random Arm start + Object start
Random wobbling movement on Object
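As a sketch of the background replacement, assuming the simulator can render a segmentation mask of the arm and object (the function and variable names here are illustrative):

    import numpy as np

    def replace_background(rgb, mask, background_images, rng=np.random):
        """Paste the rendered arm/object pixels onto a randomly chosen real-world background.

        rgb:  (H, W, 3) uint8 render from the simulator
        mask: (H, W) boolean array, True where the arm or object is visible
        background_images: list of (H, W, 3) uint8 photos to sample from
        """
        background = background_images[rng.randint(len(background_images))]
        out = background.copy()
        out[mask] = rgb[mask]
        return out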
[Figure: Q-network output]
We had trouble getting the arm in simulation to compute inverse kinematics correctly for the positions our Q-network was requesting. Because we were also using teleportation, we had to detect and resolve collisions with the robot itself and with the ground. It was also difficult to make the red object move exactly the way we wanted during training, which required continual adjustment throughout the project.
After resolving these issues and switching from the dense to the sparse reward, we reached roughly 90% grasping success in simulation. We saved the weights from the point of highest average reward during training and show the learned policy below.
[Figure: Learning curves]
This is a tricky learning problem because of the high dimensionality of the search space and the noisy information conveyed by the observations. That said, with the right setup the problem can be learned fairly easily in simulation, as shown in the video above. We ran into some problems with the Sim2Real transfer, but we hope that additional augmentations and possibly longer training times will enable grasping in the real world.