Abstract
Hierarchical world models can significantly improve model-based reinforcement learning (MBRL) and planning by enabling reasoning across multiple time scales. Nonetheless, the majority of state-of-the-art MBRL methods employ flat, non-hierarchical models. We propose Temporal Hierarchies from Invariant Context Kernels (THICK), an algorithm that learns a world model hierarchy via discrete latent dynamics. The lower level of THICK updates parts of its latent state sparsely in time, forming invariant contexts. The higher level exclusively predicts situations involving context changes. Our experiments demonstrate that THICK learns categorical, interpretable, temporal abstractions on the high level, while maintaining precise low-level predictions. Furthermore, we show that the emergent hierarchical predictive model seamlessly enhances MBRL and planning methods. We believe that THICK contributes to the further development of hierarchical agents with more sophisticated planning and reasoning abilities.
C-RSSM
The lower level of a THICK world model is composed of a context-sensitive RSSM (C-RSSM), a novel extension to the Recurrent State Space Model (RSSM, Hafner et al., 2019). The C-RSSM uses an internal gating mechanism and a second processing pathway to update parts of its latent state, its context c, only sparsely in time.
As a result, the context is only updated to encode crucial situations in which some unobservable information must be memorized to reconstruct the present or predict the future. Depending on the environment, such situations include finding objects, picking up items, moving obstacles, exploring new areas, or changing the map. Below we provide example sequences for different problems. We visualize the 16-dimensional context as a grayscale 4x4 matrix:
MiniHack-KeyRoom: inputs and 16-dim. context
Multiworld-Pusher: inputs and 16-dim. context
MiniHack-EscapeRoom: inputs and 16-dim. context
Multiworld-Door: inputs and 16-dim. context
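The gated sparse update at the heart of the C-RSSM can be sketched as follows. This is an illustrative simplification, not the paper's exact architecture: the weight matrices, the sigmoid gate, and the hard threshold are assumptions standing in for the learned gating mechanism.

```python
import numpy as np

def sparse_context_update(c, h, gate_w, cand_w, threshold=0.5):
    """Update the context c only when a learned gate opens.

    c: (d,) current context vector
    h: (d,) low-level hidden state at this step
    gate_w, cand_w: (d, 2d) illustrative weight matrices (assumptions)
    """
    x = np.concatenate([c, h])
    gate = 1.0 / (1.0 + np.exp(-gate_w @ x))    # sigmoid gate in (0, 1)
    open_ = (gate > threshold).astype(float)    # binarized (hard) gate
    candidate = np.tanh(cand_w @ x)             # proposed new context
    # The context stays exactly constant whenever the gate is closed:
    return open_ * candidate + (1.0 - open_) * c
```

Because the hard gate keeps the context exactly constant between openings, the context trace over a rollout is piecewise constant, which is what makes the 4x4 context visualizations above change only at a few key frames.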
Hierarchical Predictions
The sparse context changes discretize a sequence into segments with a constant context activation. We use this segmentation to create input-target pairs to train a high-level world model W. The high-level model is trained to predict the states immediately before context changes on the low level.
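The segmentation into input-target pairs can be sketched as follows. This is a simplified illustration under the assumption that a context change is detected whenever the context vector differs from the previous step; the actual THICK training pipeline may differ.

```python
import numpy as np

def context_change_pairs(contexts):
    """Split a rollout into segments of constant context and return
    (input_index, target_index) pairs for high-level training.

    contexts: (T, d) array of per-step context vectors.
    For each segment start, the training target is the last step
    before the next context change, i.e. the state immediately
    preceding the switch.
    """
    T = len(contexts)
    # Steps at which the context differs from the previous step:
    changes = [t for t in range(1, T)
               if not np.array_equal(contexts[t], contexts[t - 1])]
    pairs = []
    seg_start = 0
    for t in changes:
        # High-level input: segment start; target: step before change.
        pairs.append((seg_start, t - 1))
        seg_start = t
    return pairs
```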
As a result, THICK world models can make predictions on both levels of the hierarchy. The low-level world model predicts the next state. The high-level world model predicts a future state at which the low-level context is expected to change. We can visualize both predictions via the output heads trained for image reconstruction:
Multiworld-Door: inputs (left), low-level predictions (center) and high-level predictions (right)
MiniHack-River: inputs (left), low-level predictions (center) and high-level predictions (right)
The high-level world model learns to encode meaningful temporal abstractions, such as grasping a door handle (Multiworld-Door), or fetching a boulder, pushing it into water, and exiting the level (MiniHack-River). Below we provide more examples of high-level predictions for a given input at time t.
MiniHack-KeyRoom: inputs and high-level predictions
Multiworld-PickUp: inputs and high-level predictions
MiniHack-EscapeRoom: inputs and high-level predictions
VisualPinPad-Five: inputs and high-level predictions
So far, in all examples the high-level predictions were sampled from a categorical distribution over high-level "actions" A, or temporally abstract outcomes. Below we illustrate for two examples how different high-level actions encode different outcomes (MiniHack-WandOfDeath: attacking a monster vs. exiting the room; VisualPinPad-Five: stepping on different pads).
MiniHack-WandOfDeath: input image (left), high-level predictions for action A1 (center) and A2 (right)
VisualPinPad-Five: input image (left), high-level predictions for action A1 (center) and A2 (right)
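One common way to draw discrete one-hot actions from such a categorical distribution is the Gumbel-max trick, sketched below. This is an illustrative stand-in, not THICK's exact parameterization; the function name and temperature argument are assumptions.

```python
import numpy as np

def sample_high_level_action(logits, rng, temperature=1.0):
    """Sample a one-hot high-level action via the Gumbel-max trick.

    logits: (K,) unnormalized log-probabilities over K outcomes.
    Each one-hot action indexes a different temporally abstract
    outcome (e.g. "exit the room" vs. "attack the monster").
    """
    u = rng.uniform(size=logits.shape)
    gumbel = -np.log(-np.log(u))               # Gumbel(0, 1) noise
    idx = np.argmax(logits / temperature + gumbel)
    one_hot = np.zeros_like(logits)
    one_hot[idx] = 1.0
    return one_hot
```

Feeding different one-hot actions into the high-level predictor then yields the different predicted outcomes shown in the A1 vs. A2 comparisons above.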
Hierarchical Planning
The high-level action encodings can be used to plan hierarchically: the higher level selects high-level actions to reach the overall goal, and the prediction for the best high-level action serves as a subgoal that guides planning on the lower level. Below we illustrate the goals set by the higher level during the planning process:
Multiworld-Pusher: inputs (left) and goals (right)
Multiworld-PickUp: inputs (left) and goals (right)
Multiworld-Door: inputs (left) and goals (right)
Multiworld-PickUp: inputs (left) and goals (right)
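The two-level planning loop described above can be sketched as follows. The function names (`high_model`, `low_planner`, `value_fn`) are assumptions for illustration, not the THICK API.

```python
import numpy as np

def hierarchical_plan(state, high_model, low_planner, n_actions, value_fn):
    """Two-level planning sketch.

    1. Evaluate each high-level action by predicting its outcome state.
    2. Pick the outcome with the highest estimated value.
    3. Hand that outcome to the low-level planner as a subgoal.
    """
    outcomes = [high_model(state, a) for a in range(n_actions)]
    values = [value_fn(o) for o in outcomes]
    subgoal = outcomes[int(np.argmax(values))]
    # The low level plans primitive actions toward the chosen subgoal:
    return low_planner(state, subgoal)
```

In this scheme the higher level never emits primitive actions itself; it only proposes which predicted context-change state the lower level should steer toward, matching the goal images shown on the right above.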
Code
Code is available at https://github.com/CognitiveModeling/THICK