This website contains supplementary material for the paper "Same State, Different Task: Continual Reinforcement Learning without Conflicting Gradients". We show some of the learned policies from our experiments.
As we see in Figure 1, we want to model the multi-modality of the two different objectives by using a mixture of linear regressions: a single regressor averages over the two modes, whereas one component per objective captures each mode separately. This motivates our use of a separate head for each task!
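To make the separate-heads idea concrete, here is a minimal sketch of a network with a shared trunk and one linear head per task. This is not the exact architecture from the paper; the hidden size, task count, and head structure are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiHeadPolicy(nn.Module):
    """Shared feature trunk with one linear head per task.

    Each head acts like one component of a mixture of linear
    regressions on top of the shared features, so the same state
    can map to different outputs depending on the active task.
    (Illustrative sketch, not the paper's exact architecture.)
    """

    def __init__(self, obs_dim: int, act_dim: int, n_tasks: int, hidden: int = 64):
        super().__init__()
        # Shared trunk (sizes chosen for illustration only).
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        # One linear head per task.
        self.heads = nn.ModuleList(nn.Linear(hidden, act_dim) for _ in range(n_tasks))

    def forward(self, obs: torch.Tensor, task_id: int) -> torch.Tensor:
        return self.heads[task_id](self.trunk(obs))


# The same observation produces different action logits per head.
net = MultiHeadPolicy(obs_dim=8, act_dim=4, n_tasks=3)
obs = torch.randn(1, 8)
logits_task0 = net(obs, task_id=0)
logits_task1 = net(obs, task_id=1)
```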
Moving to the MiniGrid experiments, recall that we tested OWL (and Exp-Replay) on three MiniGrid environments. In each, the wall is in a different location, so the agent has to move in a different direction to reach the goal.
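As a rough illustration of how such a task set can be built, the sketch below instantiates a SimpleCrossing environment under different seeds, which place the wall and its gap in different locations. It assumes the older gym_minigrid package and gym's old-style seeding API; the exact environment ids, seeds, and layouts used in the paper may differ.

```python
import gym
import gym_minigrid  # registers the MiniGrid-* environments (older package name)

# Illustrative only: in SimpleCrossing the wall/gap layout is determined by
# the environment seed, so fixing different seeds gives grids whose walls sit
# in different locations. The paper's exact environments may differ.
ENV_ID = "MiniGrid-SimpleCrossingS9N1-v0"
SEEDS = [1, 2, 3]

envs = []
for seed in SEEDS:
    env = gym.make(ENV_ID)
    env.seed(seed)   # old-style gym seeding used by gym-minigrid
    env.reset()
    envs.append(env)
```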
Below we show the final OWL agent solving the three training tasks, demonstrating that the results from Figure 5 lead to efficient RL policies.
We scale to 5 different tasks with a mixture of SimpleCrossing and DoorKey levels. We can see that the multi-armed bandit is still able to pick the correct policy to solve each environment after having learnt continually. Below we demonstrate the results from Figure 11.
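For intuition, here is a minimal sketch of a bandit over policy heads: each arm is one head, and the reward fed back to the bandit is the return obtained when acting with that head. This uses UCB1 as an illustrative stand-in; OWL's actual bandit algorithm, selection frequency, and reward signal may differ, and run_episode below is a hypothetical helper.

```python
import numpy as np

class HeadBandit:
    """UCB1 bandit over policy heads (an illustrative stand-in, not
    necessarily the exact bandit used by OWL)."""

    def __init__(self, n_heads: int, c: float = 2.0):
        self.counts = np.zeros(n_heads)   # pulls per head
        self.values = np.zeros(n_heads)   # running mean reward per head
        self.c = c

    def select(self) -> int:
        # Try every head at least once, then pick by upper confidence bound.
        if np.any(self.counts == 0):
            return int(np.argmin(self.counts))
        t = self.counts.sum()
        ucb = self.values + np.sqrt(self.c * np.log(t) / self.counts)
        return int(np.argmax(ucb))

    def update(self, head: int, reward: float) -> None:
        self.counts[head] += 1
        self.values[head] += (reward - self.values[head]) / self.counts[head]


# Usage sketch: per episode the bandit picks which head drives the policy,
# and the episodic return is fed back as the bandit reward.
# bandit = HeadBandit(n_heads=5)
# head = bandit.select()
# episode_return = run_episode(env, policy, head)  # hypothetical helper
# bandit.update(head, episode_return)
```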
Next, we also tested the final OWL agent on novel tasks, both from the same crossing setup (a single wall, but in different locations) and from more challenging grids with more walls. Below we show successful policies, which clearly switch between fixed behavior policies, demonstrating a link between our approach and hierarchical RL.
Of course, it is often the case that OWL doesn't reach the wall, but encouragingly it still does something sensible...
We are excited by these results as we believe it may be possible to scale OWL to much larger, more complex settings, potentially continually learning behaviors/skills within a single environment.