Composing Diverse Policies for Temporally Extended Tasks

Abstract

Robot control policies for temporally extended and sequenced tasks are often characterized by discontinuous switches between different local dynamics. These change-points are often exploited in hierarchical motion planning to build approximate models and to facilitate the design of local, region-specific controllers. However, it becomes combinatorially challenging to implement such a pipeline for complex temporally extended tasks, especially when the sub-controllers work on different information streams, time scales and action spaces. In this paper, we introduce a method that can compose diverse policies comprising motion planning trajectories, dynamic motion primitives and neural network controllers. We introduce a global goal scoring estimator that uses local, per-motion-primitive dynamics models and corresponding activation state-space sets to sequence diverse policies in a locally optimal fashion. We use expert demonstrations to convert what is typically viewed as a gradient-based learning process into a planning process, without explicitly specifying pre- and post-conditions. We first illustrate the proposed framework on an MDP benchmark to showcase robustness to action and model dynamics mismatch, and then on a particularly complex physical gear assembly task, solved on a PR2 robot. We show that the proposed approach successfully discovers the optimal sequence of controllers and solves both tasks efficiently.

Supplementary videos

video_composing_diverse_policies.mp4
gear_assembly_fast.avi

Neural Policy Supplementary Material

This material provides additional detail on the controller components used in the experiments. To obtain the neural policies (e.g. Figure 10), we used the following rules of thumb:

  • We train our policy models with a combined Behaviour Cloning and VAE loss, optimised with Adam (α = 0.001, weight decay 1e−6). We obtained 50 demonstrations of each subtask. A minimal sketch of this training setup is given after this list.
  • Our input image has a size of 128×128 pixels. We observe that the neural network policy does not require a sophisticated feature extractor such as ResNet50 or ResNet101 to create the necessary features for the task. Using those extractors leads to the same final performance, but increases the training time significantly. We instead use 5 convolutional layers of 4×4 filters with batch normalization and leaky ReLUs; the comparison is summarised in the table below, and a sketch of the small extractor is given after this list.
    Feature extractor    Sub Task Performance    Full Task Performance
    ResNet50             10/10                   10/10
    ResNet101            10/10                   10/10
    Small Conv           10/10                   10/10

  • Additional “what-if” training examples (where the object is tilted, flipped, not visible, etc.) were detrimental to the performance of the model. In order to incorporate that part of the state space, a full set of overlapping and interpolating examples needs to be provided.
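
For concreteness, the following is a minimal sketch of the training setup described in the first item above, assuming a PyTorch implementation. The module names (encoder, decoder, policy_head), the dataset interface and the loss weighting beta are illustrative placeholders; only the loss composition (Behaviour Cloning plus VAE) and the optimizer settings (Adam, α = 0.001, weight decay 1e−6) follow the text.

    import torch
    import torch.nn.functional as F

    # Hypothetical modules: `encoder` maps a 128x128 image to (mu, logvar),
    # `decoder` reconstructs the image from a latent sample, and `policy_head`
    # predicts an action from that sample. Their architectures are placeholders.
    def training_step(batch, encoder, decoder, policy_head, optimizer, beta=1.0):
        images, expert_actions = batch  # drawn from the 50 demonstrations per subtask

        mu, logvar = encoder(images)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation

        recon = decoder(z)
        pred_actions = policy_head(z)

        # VAE loss: reconstruction plus KL divergence to a unit Gaussian prior
        recon_loss = F.mse_loss(recon, images)
        kl_loss = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

        # Behaviour Cloning loss against the demonstrated actions
        bc_loss = F.mse_loss(pred_actions, expert_actions)

        loss = bc_loss + beta * (recon_loss + kl_loss)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # Optimizer settings from the text:
    # params = [*encoder.parameters(), *decoder.parameters(), *policy_head.parameters()]
    # optimizer = torch.optim.Adam(params, lr=1e-3, weight_decay=1e-6)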
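
Likewise, a minimal sketch of the "Small Conv" feature extractor from the second item: 5 convolutional layers with 4×4 filters, batch normalization and leaky ReLUs on a 128×128 input. The channel widths, strides and padding below are assumptions; the text only fixes the number of layers, the filter size, the normalization and the activation.

    import torch
    import torch.nn as nn

    class SmallConvExtractor(nn.Module):
        # 5 conv layers of 4x4 filters with batch norm and leaky ReLUs.
        # Channel widths and strides are illustrative assumptions.
        def __init__(self, in_channels=3, widths=(32, 64, 128, 256, 256)):
            super().__init__()
            layers, prev = [], in_channels
            for w in widths:
                layers += [
                    nn.Conv2d(prev, w, kernel_size=4, stride=2, padding=1),
                    nn.BatchNorm2d(w),
                    nn.LeakyReLU(0.2, inplace=True),
                ]
                prev = w
            self.net = nn.Sequential(*layers)

        def forward(self, x):   # x: (B, 3, 128, 128)
            return self.net(x)  # (B, 256, 4, 4) with the widths/strides above

    # Quick shape check:
    # feats = SmallConvExtractor()(torch.zeros(1, 3, 128, 128))
    # print(feats.shape)  # torch.Size([1, 256, 4, 4])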