Deep Dynamics Models for Learning Dexterous Manipulation

Anusha Nagabandi, Kurt Konolige, Sergey Levine, Vikash Kumar

Dexterous multi-fingered hands can provide robots with the ability to flexibly perform a wide range of manipulation skills. However, many of the more complex behaviors are also notoriously difficult to control: performing in-hand object manipulation, executing finger gaits to move objects, and exhibiting precise fine motor skills such as writing all require finely balancing contact forces, breaking and reestablishing contacts repeatedly, and maintaining control of unactuated objects. Learning-based techniques offer the appealing possibility of acquiring these skills directly from data. However, current learning approaches either require large amounts of data and produce task-specific policies, or they have not yet been shown to scale up to more complex and realistic tasks requiring fine motor skills. In this work, we demonstrate that our method of online planning with deep dynamics models (PDDM) addresses both of these limitations; we show that improvements in learned dynamics models, together with improvements in online model-predictive control, can indeed enable efficient and effective learning of flexible, contact-rich dexterous manipulation skills, on a 24-DoF anthropomorphic hand in the real world, using just 2-4 hours of purely real-world data to learn to simultaneously coordinate multiple free-floating objects.

METHOD OVERVIEW:

At a high level, this method of online planning with deep dynamics models is an iterative procedure of (a) running a model-predictive controller that selects actions using predictions from the current learned dynamics model, and (b) retraining that dynamics model to fit the newly collected data. With recent improvements in both the modeling procedure and the control scheme that uses these high-capacity learned models, we are able to demonstrate efficient and autonomous learning of complex dexterous manipulation tasks.
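To make this loop concrete, here is a minimal Python sketch of the procedure. The `env`, `model`, and `mpc_plan` interfaces are hypothetical stand-ins for illustration (the planner itself is sketched further below in the design-decisions section), not the authors' released implementation.

```python
# Minimal sketch of the PDDM outer loop; `env`, `model`, `reward_fn`, and
# `mpc_plan` are hypothetical interfaces used for illustration.
def pddm_training_loop(env, model, reward_fn, n_iters=50,
                       rollouts_per_iter=10, horizon=100):
    dataset = []  # (state, action, next_state) transitions
    for _ in range(n_iters):
        # (a) collect data by planning through the current learned model
        for _ in range(rollouts_per_iter):
            state = env.reset()
            for _ in range(horizon):
                action = mpc_plan(model, state, reward_fn)  # sampling-based MPC
                next_state, done = env.step(action)
                dataset.append((state, action, next_state))
                state = next_state
                if done:  # e.g., a ball was dropped
                    break
        # (b) refit the dynamics model to all data collected so far
        model.train(dataset)
    return model
```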

BAODING BALLS:

The Baoding balls task, performed with two Chinese relaxation balls, involves simultaneously maneuvering the two spheres around each other in the hand. This requires both dexterity and coordination, which is why the exercise is commonly used for improving finger coordination, relaxing muscular tension, and recovering muscle strength and motor skills after surgery. In this work, we put our PDDM algorithm to the test by learning this Baoding balls task with no simulation, using roughly 2 hours' worth of real-world data.

Autonomously learned Baoding balls task.

TRAINING SETUP:

In our experiments, we use the ShadowHand: a 24-DoF, 5-fingered anthropomorphic hand. In addition to its built-in proprioceptive sensing at each joint, we separately trained and integrated a dilated-CNN RGB tracker that produces 3D position estimates for the external objects in this task (the Baoding balls), using a 280x180 RGB stereo image pair from a calibrated camera rig.
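As a rough illustration of what such a tracker might look like, below is a minimal PyTorch sketch of a dilated CNN that maps a stacked stereo pair to the two balls' 3D positions. The layer widths, dilation rates, and output head are assumptions for illustration; the paper's actual tracker architecture is not reproduced here.

```python
# Illustrative dilated-CNN tracker sketch; layer sizes and dilation rates
# are assumptions, not the architecture used in the paper.
import torch
import torch.nn as nn

class BallTracker(nn.Module):
    def __init__(self):
        super().__init__()
        # Stereo pair stacked channel-wise: 2 RGB images -> 6 input channels.
        self.features = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1, dilation=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=4, dilation=4), nn.ReLU(),  # dilation grows the receptive field
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 6)  # (x, y, z) for each of the two balls

    def forward(self, stereo_pair):          # (B, 6, 180, 280)
        feats = self.features(stereo_pair)   # (B, 32, 1, 1)
        return self.head(feats.flatten(1))   # (B, 6)
```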

To enable continuous experimentation in the real world, we developed an automated reset mechanism consisting of a ramp and an additional robotic arm: the ramp funnels dropped Baoding balls to a known position, which triggers a 7-DoF Franka Emika arm to pick them up with its parallel-jaw gripper and return them to the ShadowHand's palm so that training can resume. An episode terminates when the 10-second task horizon elapses or when the hand drops either ball, which again triggers the automatic reset procedure.
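A minimal sketch of this episode and reset logic, assuming hypothetical `hand`, `tracker`, and `franka_reset` interfaces and an assumed drop-detection threshold:

```python
# Sketch of the episode/termination logic described above; all interfaces
# and the drop threshold are illustrative assumptions.
import time

HORIZON_S = 10.0     # task horizon from the text
DROP_HEIGHT = 0.05   # assumed height (m) below which a ball counts as dropped

def run_episode(hand, tracker, franka_reset):
    start = time.time()
    while time.time() - start < HORIZON_S:
        balls = tracker.get_ball_positions()        # 3D estimates from the RGB tracker
        if any(z < DROP_HEIGHT for (_x, _y, z) in balls):
            break                                   # a ball was dropped: end early
        hand.step()
    # Whether the horizon elapsed or a ball fell, run the automated reset:
    franka_reset()  # ramp funnels the balls; the Franka arm returns them to the palm
```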

TRAINING PROGRESS:

Training progress snapshots (video stills): 0-0.25 hours, 0.25-0.5 hours, 0.5-1.5 hours, ~2 hours of training data.

PDDM's sample efficiency makes it possible to train complex behaviors directly from real-world experience on physical hardware, without requiring sim-to-real transfer or prior system- or environment-specific information. On this Baoding balls task, roughly 2 hours' worth of real-world data suffices to learn a rich dynamics model and plan through it to achieve complex, dynamic, and contact-rich behaviors. The real-world setting also introduces challenges that are absent in simulation, including sensor noise, communication delays, unknown object properties, deformable materials, and effects such as friction that are expensive to model accurately.

The system can very reliably perform 90-degree turns, and somewhat reliably perform 180-degree turns.


*We note that training has not yet plateaued, and we expect performance to continue improving. The ShadowHand is a complex system that incurs significant wear and tear and often requires maintenance; system properties can even change after each repair. We are continuously running the system (in between repairs) and will update the results here as we go.

SIMULATED TASKS:

In order to develop our PDDM algorithm itself (as used above), we first designed a suite of simulated tasks on which we aimed to study the general challenges presented by contact-rich dexterous manipulation tasks. Some of the main challenges in dexterous manipulation involve the high dimensionality of the hand, the prevalence of complex contact forces that must be utilized and balanced to manipulate free floating objects, and the potential for failure from dropping objects in the hand. We identify a set of experimental tasks that specifically highlight these challenges, requiring delicate, precise, and coordinated movement.

The results of online planning through our learned models on these tasks are shown below, followed by some of the benefits of this approach, comparisons to other approaches, and the effects of various design decisions.

  • 9-DoF D'Claw turning a valve to random (green) targets (~20 min of data)
  • 16-DoF D'Hand pulling a weight by manipulating a flexible rope (~1 hour of data)
  • 24-DoF ShadowHand performing in-hand reorientation of a free-floating cube to random (shown) targets (~1 hour of data)
  • 24-DoF ShadowHand following desired trajectories with the tip of a free-floating pencil (~1-2 hours of data)
  • 24-DoF ShadowHand rotating two free-floating Baoding balls (~30-60 min of data)

Model-Reuse:

We find that models learned via PDDM can be repurposed, sometimes even without additional training, to perform related tasks. For example, the model trained for the Baoding task of performing counterclockwise rotations (left) can be repurposed to move a single ball to a goal location in the hand (middle), or to perform clockwise rotations (right) instead of the learned counterclockwise ones; a code sketch of this reward swap follows the captions below.

Trained on: CCW Baoding
Model reuse test: go-to single location
Model reuse test: CW Baoding
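As a concrete illustration of this reuse, the sketch below swaps only the reward function handed to the planner while keeping the trained model fixed. The reward forms and state fields are illustrative assumptions, and `mpc_plan` follows the hypothetical interface from the earlier sketches.

```python
# Reusing one learned dynamics model across tasks by swapping the reward
# function given to the planner; all reward forms are illustrative.
import numpy as np

def ccw_rotation_reward(state):
    # Reward counterclockwise progress of the two balls (assumed state field).
    return state["ccw_angular_progress"]

def go_to_location_reward(state, goal=np.array([0.0, 0.02, 0.0])):
    # Reward proximity of a single ball to a goal location in the palm.
    return -np.linalg.norm(state["ball_position"] - goal)

def cw_rotation_reward(state):
    # Same dynamics model, opposite direction: negate the CCW progress.
    return -state["ccw_angular_progress"]

# The same trained `model` is planned through with different rewards, e.g.:
# action = mpc_plan(model, state, reward_fn=go_to_location_reward)
```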

Task-Flexibility:

We study the flexibility of PDDM by experimenting with handwriting, where the base of the hand is fixed and arbitrary characters must be written through the coordinated movement of the fingers and wrist. Although even writing a fixed trajectory is challenging, writing arbitrary trajectories requires a degree of flexibility and coordination that is exceptionally challenging for prior methods. PDDM's separation of modeling and task-specific control allows for generalization across behaviors, as opposed to discovering and memorizing the answer to a specific task or movement. Below, we render PDDM's handwriting results, where the model was trained on random paths for the green dot and then tested in a zero-shot fashion on numerical digits.
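One way such trajectory following can be expressed, as a hedged sketch, is a per-timestep tracking reward for the pencil tip; the exact cost terms and weights used for the handwriting task are not reproduced here.

```python
# Illustrative trajectory-tracking reward for the handwriting task;
# the form and penalty magnitude are assumptions.
import numpy as np

def handwriting_reward(pencil_tip_xy, desired_path, t, pencil_dropped=False):
    """Negative distance between the pencil tip and the desired path point at time t."""
    target = desired_path[min(t, len(desired_path) - 1)]
    reward = -np.linalg.norm(pencil_tip_xy - target)
    if pencil_dropped:   # assumed large penalty for dropping the pencil
        reward -= 100.0
    return reward
```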

Comparisons:

We compare our method to the following state-of-the-art model-based and model-free RL algorithms:

  • Nagabandi et al. learn a deterministic neural network dynamics model, combined with a random-shooting MPC controller
  • PETS is a state-of-the-art model-based approach that combines uncertainty-aware deep network dynamics models with sampling-based uncertainty propagation
  • NPG is a model-free natural policy gradient method, and has been used in prior work on learning manipulation skills
  • SAC is a state-of-the-art off-policy model-free RL algorithm
  • MBPO is a recent hybrid approach that uses data from its learned model to accelerate policy learning

On our simulated suite of dexterous manipulation tasks, PDDM consistently outperforms these prior methods both in terms of learning speed and final performance, often solving flexible tasks that prior methods cannot.

  • Most algorithms succeed on the valve turning task, although ours slightly outperforms the other methods.
  • Ours solves Baoding using about 2.7 hours' worth of data (10,000 samples), while the other methods fail.
  • SAC (3e6 steps) and NPG (6e6 steps) perform comparably to ours on the cube reorientation task with two goals.
  • Ours significantly outperforms the other methods when many different goals must be achieved.
  • Ours, SAC, and NPG all learn to successfully write a fixed trajectory, although ours does so using considerably less data (as expected).
  • Ours significantly outperforms the other methods on the task of writing arbitrary trajectories.

Analysis of Design Decisions:

Here we present the impact of various design decisions on our model and our online planning method. We use the Baoding balls task for these experiments, though we observed similar trends on other tasks.

  • A sufficiently large architecture is crucial, indicating that the model must have enough capacity to represent the complex dynamical system.
  • PDDM (ours), with action smoothing and soft reward-weighted updates, greatly outperforms other sampling-based planning methods.
  • Warm-starting model weights is sometimes harmful early in training, when the model likely overfits; it makes less of a difference in later stages, when data is plentiful.
  • Too soft a weighting (for the reward-weighted updates to the sampling distribution) leads to minimal movement of the hand, while too hard a weighting leads to aggressive behaviors that frequently drop the objects (see the planner sketch below).
  • Ensembles are helpful, especially early in training, when non-ensembled models can overfit badly and thus exhibit overconfident and harmful behavior.
  • Short planning horizons lead to greedy behavior, while long horizons suffer from compounding errors in the model's predictions.
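To ground these trade-offs, here is a minimal sketch of a reward-weighted, sampling-based MPC update of the kind discussed above (MPPI-style), showing where the smoothing coefficient, the softness temperature gamma, and the planning horizon enter; all constants and interfaces are illustrative assumptions rather than the released code.

```python
# Sketch of a reward-weighted sampling-based planner; `model.predict` and
# `model.action_dim` are hypothetical interfaces, constants are assumed.
import numpy as np

def mpc_plan(model, state, reward_fn, horizon=7, n_samples=200,
             gamma=10.0, beta=0.6, prev_mean=None, rng=np.random):
    act_dim = model.action_dim
    mean = np.zeros((horizon, act_dim)) if prev_mean is None else prev_mean

    # Sample temporally smoothed action sequences around the current mean.
    noise = rng.standard_normal((n_samples, horizon, act_dim)) * 0.1
    for t in range(1, horizon):  # filter the noise for action smoothing
        noise[:, t] = beta * noise[:, t] + (1 - beta) * noise[:, t - 1]
    actions = mean[None] + noise

    # Evaluate each candidate sequence by rolling out the learned model.
    returns = np.zeros(n_samples)
    for i in range(n_samples):
        s = state
        for t in range(horizon):
            s = model.predict(s, actions[i, t])
            returns[i] += reward_fn(s)

    # Soft, reward-weighted update of the sampling mean; gamma sets softness.
    weights = np.exp(gamma * (returns - returns.max()))
    weights /= weights.sum()
    mean = (weights[:, None, None] * actions).sum(axis=0)
    return mean[0]  # execute the first action, then replan at the next step
```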

Comprehensive Videos

Hardware Results Overview

Method & Results Overview

Citation

@INPROCEEDINGS{PDDM,
     AUTHOR = {Anusha Nagabandi AND Kurt Konolige AND Sergey Levine AND Vikash Kumar},
     TITLE = "{Deep Dynamics Models for Learning Dexterous Manipulation}",
     BOOKTITLE = {Conference on Robot Learning (CoRL)},
     YEAR = {2019},
}