Skill-Critic: Refining Learned Skills for Reinforcement Learning

Ce Hao, Catherine Weaver, Chen Tang, Kenta Kawamoto, Masayoshi Tomizuka, Wei Zhan 

Paper  Code 

Abstract

Hierarchical reinforcement learning (RL) can accelerate long-horizon decision-making by temporally abstracting a policy into multiple levels. Promising results in sparse reward environments have been seen with skills, i.e., sequences of primitive actions. Typically, a skill latent space and policy are discovered from offline data, but the resulting low-level policy can be unreliable due to low-coverage demonstrations or distribution shifts. As a solution, we propose fine-tuning the low-level policy in conjunction with high-level skill selection. Our Skill-Critic algorithm optimizes both the low-level and high-level policies; these policies are initialized with and regularized by the latent space learned from offline demonstrations to guide the joint policy optimization. We validate our approach in multiple sparse RL environments, including a new sparse reward autonomous racing task in Gran Turismo Sport. The experiments show that Skill-Critic's low-level policy fine-tuning and demonstration-guided regularization are essential for optimal performance.

Approach

Hierarchical RL from a demonstration-guided latent space. 

Left: Offline data informs the skill embedding model, which consists of a skill encoder (yellow), skill prior (green), and skill decoder (blue). The hyperparameter σ_â augments the decoder output to define the action prior.

Right: The high-level (HL, red) and low-level (LL, purple) policies are fine-tuned on downstream tasks via our Skill-Critic algorithm. During fine-tuning, the HL and LL policies are regularized by the skill and action priors, respectively.
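The sketch below is a minimal, hypothetical illustration of this structure, not the released implementation: the HL policy selects a skill latent z and is kept close to the learned skill prior, while the LL policy decodes z (together with the state) into primitive actions and is kept close to the action prior. The KL-regularized, SAC-style loss form, network sizes, dimensions, and temperature values are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.distributions as D

class GaussianPolicy(nn.Module):
    """Small MLP that outputs a diagonal Gaussian over its output space."""
    def __init__(self, in_dim, out_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * out_dim),
        )

    def forward(self, x):
        mu, log_std = self.net(x).chunk(2, dim=-1)
        return D.Normal(mu, log_std.clamp(-5, 2).exp())

state_dim, skill_dim, action_dim = 30, 10, 2   # placeholder sizes

# HL policy pi(z | s) picks a skill latent every H environment steps;
# LL policy pi(a | s, z) outputs a primitive action at every step.
hl_policy = GaussianPolicy(state_dim, skill_dim)
ll_policy = GaussianPolicy(state_dim + skill_dim, action_dim)

# Priors learned offline in Stage 1: the skill prior p(z | s) and an action
# prior built from the frozen skill decoder with fixed std sigma_a_hat.
skill_prior = GaussianPolicy(state_dim, skill_dim)
action_prior = GaussianPolicy(state_dim + skill_dim, action_dim)

def actor_losses(s, z, q_hl, q_ll, alpha_hl=0.1, alpha_ll=0.1):
    """Assumed KL-regularized actor losses: maximize the critic estimates
    (q_hl, q_ll) while staying close to the demonstration-guided priors.
    Simplified: Q values are taken as given; a full implementation would
    sample actions with reparameterization and query the critics."""
    sz = torch.cat([s, z], dim=-1)
    kl_hl = D.kl_divergence(hl_policy(s), skill_prior(s)).sum(-1)
    kl_ll = D.kl_divergence(ll_policy(sz), action_prior(sz)).sum(-1)
    loss_hl = (-q_hl + alpha_hl * kl_hl).mean()
    loss_ll = (-q_ll + alpha_ll * kl_ll).mean()
    return loss_hl, loss_ll
```

In this reading, the temperature-like weights on the KL terms control how far fine-tuning may drift from behavior already covered by the demonstrations.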

Environments

We evaluate Skill-Critic on three tasks: maze navigation, autonomous racing, and robotic manipulation. For each environment, we collect an offline dataset for skill training (Stage 1) and test skill improvement on more complex target tasks (Stage 2). In Stage 1, demonstrations inform useful skills for each environment. In Stage 2, the agent faces a sparse reward task that should be completed as quickly as possible. The agent can leverage the offline skills, but must improve them to achieve the highest reward.
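As a concrete illustration of the Stage-2 setup, the following hypothetical wrapper sketches a reward of the kind described above: a sparse bonus on task completion plus a small per-step penalty, so faster completion yields a higher return. The `success` info key, bonus, and penalty values are placeholders; the actual reward definitions differ per environment.

```python
import gymnasium as gym

class SparseTimedReward(gym.Wrapper):
    """Hypothetical Stage-2 reward: sparse success bonus plus a small
    per-step penalty so that faster completion earns a higher return."""
    def __init__(self, env, success_bonus=100.0, step_penalty=0.1):
        super().__init__(env)
        self.success_bonus = success_bonus
        self.step_penalty = step_penalty

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        reward = -self.step_penalty                 # encourages finishing quickly
        if info.get("success", False):              # sparse signal at task completion
            reward += self.success_bonus
        return obs, reward, terminated, truncated, info
```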

Maze Navigation


Gran Turismo Sport

Robotic Manipulation

Results

Maze Navigation and Trajectory Planning

Left: average episode reward over 3 random seeds. Skill-Critic begins training at 1 million steps and is warm-started by SPiRL.

Right: SPiRL and Skill-Critic trajectories at convergence. SPiRL reuses right-angle movements from the offline skills, resulting in slow, jagged paths. Skill-Critic discovers new skills to plan diagonal trajectories.

Autonomous Racing

Autonomous racing results in Gran Turismo Sport. Left: episode reward. Right: cumulative time in contact with the wall at the track boundary per episode.

Finish Times:

Demonstration: The vehicle is controlled by a built-in rule-based controller that follows a fixed reference path.

SAC: The vehicle oscillates and frequently crashes into the wall.

ReSkill: The policy frequently oscillates and is slow to finish the race.

BC+SAC: The vehicle simply stays still since warm-starting SAC with BC hinders exploration.

SPiRL: The vehicle keeps scraping the wall since the skills learned from the built-in AI are not sufficient to avoid it.

Skill-Critic: Our agent is able to improve the learned skills during fine-tuning and avoid collisions.



Skill-Critic can race as well as the high-quality Built-In AI in Gran Turismo Sport, even though its skills were learned from demonstrations that follow a fixed reference path.


Skill-Critic can leverage the offline skills AND improve them to race competitively, whereas existing baselines struggle to improve their performance.

Robotic Manipulation

Slippery Push

Table Cleanup



Skill-Critic can outperform existing hierarchical RL baselines at robotic manipulation tasks.

SPiRL: The low-level policies are not sufficient for the high-level policy to learn to push the block on a more slippery surface.

ReSkill: The agent learns to move the block on the slippery surface, but is slow to place it in the goal area.

Skill-Critic: The agent develops a unique way of placing the block in the goal area as quickly as possible.