Composing Complex Skills by Learning Transition Policies with Proximity Reward Induction


Abstract

Intelligent creatures acquire complex skills by exploiting previously learned skills and by learning to transition between them. To empower machines with this ability, we propose a modular framework with transition policies that effectively connect primitive skills to perform hierarchical tasks. We introduce proximity predictors that provide rewards specifically designed for training transition policies without handcrafted rewards. The proposed method is evaluated on a diverse set of continuous control experiments in both bipedal locomotion and robotic arm manipulation. The results demonstrate the importance of employing transition policies and verify the effectiveness of reusing existing skills.
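At a high level, the framework executes each pre-trained primitive skill in turn and uses a learned transition policy to bring the agent into a state from which the next primitive can start. The sketch below illustrates this execution loop under assumed interfaces (a Gym-style environment, act/terminated methods on policies, a proximity threshold of 0.9, and a transition step limit); it is an illustration, not the paper's implementation.

```python
def run_task(env, primitives, transitions, proximity, plan,
             proximity_threshold=0.9, max_transition_steps=100):
    """Alternate primitive skills with transition policies (illustrative sketch).

    primitives / transitions / proximity are dicts of policies and predictors
    keyed by skill name (or by a pair of skill names for transitions);
    plan is an ordered list of primitive skill names.
    """
    s = env.reset()
    for i, skill in enumerate(plan):
        # Run the current primitive until it reaches its own termination condition.
        while not primitives[skill].terminated(s):
            s, _, done, _ = env.step(primitives[skill].act(s))
            if done:
                return s

        if i + 1 == len(plan):
            break
        nxt = plan[i + 1]

        # Bridge to the next primitive: run the transition policy until the
        # proximity predictor judges s to be close to the next skill's initiation set.
        for _ in range(max_transition_steps):
            if proximity[nxt](s) >= proximity_threshold:
                break
            s, _, done, _ = env.step(transitions[(skill, nxt)].act(s))
            if done:
                return s
    return s
```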

Highlights

Obstacle Course

This tough environment requires the agent to walk, jump and crawl its way to success.

Serve

Inspired by tennis, this task consists of tossing a ball and then hitting it toward a target.

Patrol

Like a guard on patrol, the agent must repeatedly walk forwards and backwards.

These highlights show our model's performance on a few complex tasks. Supplemental videos for the remaining tasks, including the performance of baselines, are available below.

Training Curves

Quantitative Results

Manipulation and Locomotion tasks.


Ablation Study on Proximity Functions

Transition policies receive rewards based on the outputs of proximity predictors. Before computing the reward at each timestep, we clip the output of the proximity predictor D to [0, 1] via clip(D(s), 0, 1); this value indicates how close the state s is to the initiation set of the following primitive (higher values correspond to closer states). We define the target proximity of a state to an initiation set as δ^step, where step is the smallest number of timesteps required to reach a state in the initiation set. We use δ = 0.95 for all experiments. To make the reward denser, at every timestep t we provide the increase in predicted proximity, D(s_{t+1}) − D(s_t), as the reward for the transition policy.
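The following is a minimal sketch of the proximity target and the dense transition reward described above, assuming a proximity predictor D that maps a state to a scalar; the function names are illustrative, not the paper's code.

```python
import numpy as np

DELTA = 0.95  # discount used for proximity targets in all experiments

def proximity_target(steps_to_initiation_set):
    # Target proximity delta^step for a state that is `steps_to_initiation_set`
    # timesteps away from the next primitive's initiation set.
    return DELTA ** steps_to_initiation_set

def transition_reward(D, s_t, s_next):
    # Dense reward for the transition policy at timestep t: the increase in
    # clipped predicted proximity, D(s_{t+1}) - D(s_t).
    d_t = float(np.clip(D(s_t), 0.0, 1.0))
    d_next = float(np.clip(D(s_next), 0.0, 1.0))
    return d_next - d_t
```

For example, with δ = 0.95 a state 10 timesteps away from the initiation set has target proximity 0.95^10 ≈ 0.60.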

A linearly discounted proximity function is a natural alternative to the exponential one. We compare the exponential and linear proximity functions on a manipulation task (Repetitive Catching) and a locomotion task (Obstacle Course). The results show that our model learns well with both proximity functions and that they perform similarly. In this paper, we use the exponential proximity function for all experiments.
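For reference, the two proximity shapes compared in this ablation can be written as below. The exponential form is δ^step as defined above; the slope (horizon) of the linear variant is an illustrative assumption, since its exact parameterization is not specified here.

```python
def exponential_proximity(step, delta=0.95):
    # Exponentially discounted proximity, delta^step (used in all main experiments).
    return delta ** step

def linear_proximity(step, horizon=100):
    # Linearly discounted proximity; the 100-step horizon is an illustrative
    # assumption, not necessarily the paper's exact setting.
    return max(0.0, 1.0 - step / horizon)
```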

Supplemental Videos

Locomotion (Walker2d)

Patrol: Transition Policy w/ Proximity Reward | Transition Policy w/ Task Reward | TRPO

Obstacle Course: Transition Policy w/ Proximity Reward | Transition Policy w/ Task Reward | TRPO

Hurdle: Transition Policy w/ Proximity Reward | Transition Policy w/ Task Reward | TRPO

Manipulation (Jaco)

Serve: Transition Policy w/ Proximity Reward | Transition Policy w/ Task Reward | TRPO

Repetitive Catching: Transition Policy w/ Proximity Reward | Transition Policy w/ Task Reward | TRPO

Repetitive Picking: Transition Policy w/ Proximity Reward | Transition Policy w/ Task Reward | TRPO