Figure 1: [Top] Combined value function learned by deep skill chaining. [Bottom] Value functions learned by discovered skills. In this U-shaped maze, the goal state is in the top-left and the start state is in the bottom-left.
While modern RL algorithms have achieved impressive results on hard problems, they have struggled on long-horizon problems with sparse rewards. Hierarchical reinforcement learning is a promising approach to overcoming these challenges.
Although the benefit of using hierarchies has long been recognized, the question of how useful hierarchies can be discovered autonomously has remained largely unanswered. In this work, we present an algorithm that constructs temporally extended, higher-level actions (called skills) from the primitive actions already available to the RL agent.
Not only is the ability to break down complex problems into simpler sub-problems a hallmark of intelligence, it is also the piece missing from traditional (flat) reinforcement learning techniques. By constructing useful hierarchies, RL agents will be able to combine modular solutions to easy sub-problems to reliably solve hard real-world problems.
We propose Deep Skill Chaining as a step towards realizing the goal of autonomous skill discovery in high-dimensional problems with continuous state and action spaces.
Deep Skill Chaining models skills using the Options framework. It is based on the following intuition: a useful skill is one that either reliably solves the problem for you, or takes you to a state from which you have high confidence of solving the problem. In most interesting problems, a flat policy can only succeed with high probability from some local neighborhood of the goal. As a result, we first learn a modular skill policy that initiates near the goal and reliably takes the agent to the goal. Once such an option is learned, we create another option whose goal is to take the agent to a state from which it can execute the first option. Skills are chained backward in this fashion until the start state of the MDP lies inside the initiation set of some option. The inductive bias of creating sequentially executable skills guarantees that as long as the agent successfully executes each skill in its chain, it will solve the original task.
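To make the chaining loop concrete, here is a toy, self-contained sketch of the idea on a one-dimensional line world. It is an illustration under assumptions, not the authors' implementation: the fixed "move right" policy, the per-option step budget, and the interval-valued initiation sets are stand-ins for the learned option policies and initiation classifiers that deep skill chaining actually uses.

```python
# Toy sketch of backward skill chaining on a 1-D line world.
# Assumptions (not the authors' implementation): a fixed "move right" policy,
# a per-option step budget, and interval-valued initiation sets standing in
# for learned option policies and initiation classifiers.

GOAL_REGION = (0.9, 1.0)   # terminal region of the toy MDP
START_STATE = 0.0
STEP_SIZE = 0.1            # how far one primitive action can move the agent
OPTION_BUDGET = 3          # max primitive steps allowed per option execution


class Option:
    """An option defined by an initiation interval and a termination interval."""

    def __init__(self, name, termination):
        self.name = name
        self.termination = termination  # (low, high) interval the option must reach
        # With a "move right" policy and a fixed step budget, the states from
        # which this option can reach its termination region form an interval.
        reach = STEP_SIZE * OPTION_BUDGET
        self.initiation = (termination[0] - reach, termination[0])

    def can_initiate(self, state):
        low, high = self.initiation
        return low <= state <= high


def chain_skills():
    """Create options backward from the goal until the start state is covered."""
    chain = [Option("overall_goal_policy", GOAL_REGION)]
    while not any(option.can_initiate(START_STATE) for option in chain):
        parent = chain[-1]
        # The new option's sub-goal is the initiation set of the previous option.
        chain.append(Option(f"option_{len(chain)}", parent.initiation))
    return chain


if __name__ == "__main__":
    for option in chain_skills():
        print(f"{option.name}: initiation={option.initiation}, "
              f"termination={option.termination}")
```

Note that the while loop keeps creating options until the start state falls inside some option's initiation set, so the number of skills is an output of the procedure rather than an input (a point we return to below).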
Consider the classic four-rooms problem, but with the following two twists:
1. Continuous state and action spaces.
2. One room has a lock (visualized as a red sphere) and one room has a key (visualized as a blue sphere). The agent must first pick up the key and then navigate to the lock. There is no extrinsic reward for picking up the key; the agent receives a sparse reward only when it solves the overall problem.
The sparsity of the reward function and the long-horizon dependence on the single time step at which the key is picked up make this an incredibly challenging problem for flat policy gradient methods such as DDPG. However, by decomposing the problem into a series of value functions, Deep Skill Chaining is able to solve it reliably and with relative ease.
Figure: Initiation sets of the two discovered options, visualized separately for has_key = False (top row) and has_key = True (bottom row).
In the figure above, the left column corresponds to the first discovered option and the right column to the second. The top row visualizes each option's initiation set when the agent does not have the key in its possession, while the bottom row visualizes it once the agent has picked up the key. We can see that the option named overall_goal_policy has learned that it can only initiate when the agent already has the key; once initiated, its job is to drive the agent to the lock. The option named option_1, on the other hand, has only a small region in the top-right room from which it can initiate without the key, but once the agent has picked up the key, it can initiate anywhere in the top-right room. Its goal is then to take the agent to the top-left room, where it can execute the overall_goal_policy option, which in turn solves the overall problem with high reliability.
Consider an option (shown in green below) whose termination condition is the initiation condition of the option that drives the agent to the goal state (red sphere below). The start states of successful trajectories (top row) are labeled as positive examples, while those of unsuccessful trajectories (bottom row) are labeled as negative examples. These examples are used to train a binary initiation classifier, as sketched in the example that follows the figure. This self-supervision scheme yields initiation regions from which option execution succeeds with high reliability.
Successful option trajectories: example trajectories in which the green option successfully drives the system to its learned sub-goal
Unsuccessful option trajectories: example trajectories in which the green option is unable to reach its sub-goal
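To make the self-supervision step concrete, below is a minimal sketch of fitting a binary initiation classifier from labeled start states. The two-dimensional (x, y) start states, the use of scikit-learn's SVC, and the specific kernel settings are illustrative assumptions rather than the exact choices made in the paper.

```python
# Minimal sketch of fitting an option's initiation classifier from its rollouts.
# Assumptions: 2-D (x, y) start states and scikit-learn's SVC with an RBF kernel;
# these are illustrative choices, not the paper's exact implementation.
import numpy as np
from sklearn.svm import SVC

# Start states of trajectories on which the option reached its sub-goal ...
positive_starts = np.array([[0.8, 0.9], [0.7, 0.8], [0.9, 0.7]])
# ... and start states of trajectories on which it did not.
negative_starts = np.array([[0.1, 0.2], [0.2, 0.1], [0.3, 0.3]])

X = np.vstack([positive_starts, negative_starts])
y = np.concatenate([np.ones(len(positive_starts)), np.zeros(len(negative_starts))])

# A non-linear decision boundary lets the initiation region take a non-convex shape.
initiation_classifier = SVC(kernel="rbf", gamma="scale").fit(X, y)


def can_initiate(state):
    """The option may only be executed from states the classifier labels positive."""
    return bool(initiation_classifier.predict(state.reshape(1, -1))[0])


print(can_initiate(np.array([0.75, 0.85])))  # likely True: near the positive examples
print(can_initiate(np.array([0.15, 0.15])))  # likely False: near the negative examples
```

Using a non-linear classifier is a natural choice here because an option's initiation region need not be convex in domains like the mazes above.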
By explicitly learning initiation set classifiers, deep skill chaining discovers skills that specialize in different regions of the state space. As a result, skills do not bear the burden of learning representations for states that lie far outside their initiation region. The bottom row of Figure 1 shows the value functions learned by the individually discovered skills in the point-maze domain. The value function visualized in the top row is learned by making off-policy updates to an option whose initiation set classifier is true everywhere (called the global option; see the paper for more details).
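One way to picture this division of labor is as experience routing: the global option accepts every transition, while each discovered skill only learns from experience that starts inside its own initiation region. The sketch below is an illustrative assumption about how such routing could look; the SimpleOption class, the transition format, and the route_transition function are hypothetical stand-ins, not the authors' code.

```python
# Illustrative sketch of routing experience: the "global option" accepts every
# transition, while each discovered skill only stores transitions that start
# inside its own initiation region. SimpleOption, route_transition, and the
# transition format are hypothetical stand-ins, not the authors' code.
from dataclasses import dataclass, field


@dataclass
class SimpleOption:
    name: str
    initiation_test: callable             # maps a state to True/False
    replay_buffer: list = field(default_factory=list)

    def can_initiate(self, state):
        return self.initiation_test(state)


def route_transition(transition, options):
    """Give each option only the experience it is allowed to learn from."""
    for option in options:
        if option.can_initiate(transition["state"]):
            option.replay_buffer.append(transition)


if __name__ == "__main__":
    # The global option's initiation classifier is true everywhere, so it sees
    # every transition and can learn a value function over the whole state space.
    global_option = SimpleOption("global_option", lambda s: True)
    # A local skill only sees states inside its initiation region.
    option_1 = SimpleOption("option_1", lambda s: s[0] > 0.5)
    options = [global_option, option_1]

    route_transition({"state": (0.9, 0.1), "action": (0.0,), "reward": 0.0}, options)
    route_transition({"state": (0.1, 0.1), "action": (0.0,), "reward": 0.0}, options)

    # Prints 2 and 1: the global option sees both transitions, option_1 only one.
    print(len(global_option.replay_buffer), len(option_1.replay_buffer))
```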
As a side note, the recently published On Catastrophic Interference in Atari 2600 Games argues that deep RL algorithms struggle on long-horizon problems in the ALE suite because of catastrophic forgetting in neural networks. Furthermore, its authors find that an effective solution to this problem is to permit different parts of the game to correspond to different value functions. Deep skill chaining does exactly that: it learns to solve long-horizon problems by combining modular, locally-valid value functions learned by individual options.
The following video shows some solution trajectories learned by deep skill chaining. As the agent switches between different options, it changes the color of the simulated robot. In the Point-E-Maze domain (the last domain in the video), the agent learns a skill tree: it reuses the goal option but learns qualitatively different skills that drive the robot to states from which it can execute that goal option.
All modern option discovery methods assume that the number of skills required to solve a task is known a priori. Although important, this parameter is highly unintuitive to set. After all, how does one intuitively know how many skills are needed to solve a U-shaped maze with an ant robot versus a swimmer robot? Clearly, the number of skills needed depends on the difficulty of the problem as well as on the policy class used to solve it. Rather than requiring the number of options as an input to the algorithm, deep skill chaining flexibly discovers as many options as it needs to solve the given problem most effectively. In our paper, we show that the DSC agent automatically discovers more options for harder problems.
We compare deep skill chaining (DSC) to a popular non-hierarchical RL algorithm, deep deterministic policy gradients (DDPG). The learning curves below show that even though DDPG usually finds some trajectories that lead to the goal, it is unable to capitalize on those experiences to reliably drive the agent to the goal state. By contrast, DSC capitalizes on a handful of successful DDPG rollouts to construct options. In doing so, DSC quickly and reliably solves problems with sparse rewards and long temporal dependencies.
We also show superior performance to a popular option discovery algorithm, Option Critic. A drawback of the option-critic framework is that all learning signal is derived from extrinsic rewards. In sparse-reward problems, like the ones considered in our paper, this prevents the agent from learning a useful hierarchy. By contrast, each option in the DSC framework is learned using its own reward function, which successfully overcomes the challenge posed by sparse rewards.
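To make "its own reward function" concrete, here is a minimal, self-contained sketch of a per-option, sub-goal-based reward: the option is rewarded for reaching its own termination region (the initiation set of the option it chains into) and pays a small per-step penalty otherwise. The SubgoalRegion class and the specific reward values are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a per-option, sub-goal-based reward function. The
# SubgoalRegion class and the specific reward values (0 on success, -1 per step)
# are illustrative assumptions, not the paper's exact implementation.
import numpy as np


class SubgoalRegion:
    """Stand-in for an option's termination condition, e.g. the initiation set
    of the option it chains into, modeled here as a ball around a target state."""

    def __init__(self, center, radius):
        self.center = np.asarray(center, dtype=float)
        self.radius = radius

    def contains(self, state):
        return np.linalg.norm(np.asarray(state, dtype=float) - self.center) <= self.radius


def option_reward(subgoal, next_state):
    # Each option ignores the task's sparse extrinsic reward and instead optimizes
    # this local signal: 0 for reaching its own sub-goal, -1 per step otherwise.
    return 0.0 if subgoal.contains(next_state) else -1.0


subgoal = SubgoalRegion(center=[0.5, 0.5], radius=0.1)
print(option_reward(subgoal, [0.52, 0.48]))  # 0.0: inside the sub-goal region
print(option_reward(subgoal, [0.0, 0.0]))    # -1.0: still far from the sub-goal
```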
Recently, Hierarchical Actor Critic (HAC), a feudal-style skill discovery algorithm, outperformed other skill discovery algorithms on a wide variety of tasks in the MuJoCo simulator. Deep skill chaining outperforms HAC in four out of five domains, while achieving comparable performance in the fifth. HAC depends on Hindsight Experience Replay to train the different layers of its hierarchy. Deep skill chaining shows the benefits of using hierarchies even in the absence of such data augmentation techniques, but including them should yield additional performance benefits in sparse-reward tasks.
Effective skill discovery lies at the heart of the reinforcement learning problem. In this work, we present an algorithm that solves high-dimensional, goal-oriented tasks more reliably than flat RL agents and other popular hierarchical methods. The proposed algorithm breaks a long-horizon problem down into a series of skills, each of which learns to initiate only from states from which it has a high probability of succeeding. Furthermore, where other deep option discovery techniques have struggled to show consistent improvements over baseline flat agents in the single-task setting, we unequivocally show the necessity of hierarchies for solving challenging problems.
Please refer to our paper for more details on our work and consider using our code base to play around with deep skill chaining.
Feel free to reach out with any questions or feedback you may have!