Goal Misgeneralization
Why Correct Specifications Aren’t Enough For Correct Goals
CoinRun is a simple 2-D video game (platformer) where the goal is to collect the coin while dodging enemies and obstacles.
By default, the agent spawns at the leftmost end of the level, while the coin is always at the rightmost end. We modify CoinRun to allow the coin to be placed at other locations in the level. This allows us to vary the range of positions where the coin can spawn during both training and testing.
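As a concrete illustration of this modification, the coin's spawn position can be drawn from a configurable fraction of the level. The function and parameter names below are ours, not CoinRun's; this is only a minimal sketch of the idea.

```python
# Minimal sketch (assumed names, not the CoinRun implementation) of sampling
# the coin's spawn position from a configurable fraction of the level.
import random

def sample_coin_position(level_width, spawn_range=(1.0, 1.0)):
    """Sample a coin x-position within a fractional range of the level.

    spawn_range=(1.0, 1.0) -> coin always at the rightmost end (default CoinRun)
    spawn_range=(0.8, 1.0) -> coin somewhere in the right 20% of the level
    spawn_range=(0.0, 1.0) -> coin anywhere in the level
    """
    lo, hi = spawn_range
    return int(random.uniform(lo, hi) * (level_width - 1))

# The two training regimes shown below, and the common test distribution:
coin_at_end      = sample_coin_position(64, spawn_range=(1.0, 1.0))
coin_right_20pct = sample_coin_position(64, spawn_range=(0.8, 1.0))
coin_anywhere    = sample_coin_position(64, spawn_range=(0.0, 1.0))
```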
We show cherry-picked videos of the agent trained with the coin only at the end of the level, and tested with the coin position randomly chosen to be anywhere in the level. This agent learns to go to the end of the level rather than picking up the coin. It visibly retains capabilities like dodging monsters and obstacles, but because it ignores the coin, the episode eventually times out with zero reward.
Also see: randomly selected videos.
We show cherry-picked videos of the agent trained with the coin position randomly chosen from the right 20% of the level, and tested with the coin position randomly chosen to be anywhere in the level. This agent learns the intended goal of collecting the coin regardless of where it is placed, and remains capable of dodging obstacles and monsters.
Also see: randomly selected videos.
This RL environment is a fully observed 2D gridworld in which the agent must collect apples (+1 reward) while avoiding monsters (-1 reward) that chase it. The agent may also pick up shields for protection: if a monster touches the agent while it holds a shield, the monster is destroyed and the shield is consumed, with 0 reward.
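As a minimal sketch of this reward structure (the function and event names are hypothetical, not taken from the environment's code):

```python
# Minimal sketch (hypothetical names) of the per-event reward described above.
def step_reward(event, has_shield):
    """Reward for a single interaction; event is "apple", "monster", or "shield"."""
    if event == "apple":
        return +1                       # collecting an apple
    if event == "monster":
        # With a shield, the monster is destroyed and the shield is consumed.
        return 0 if has_shield else -1
    return 0                            # picking up a shield gives no reward
```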
We show cherry-picked videos of the agent trained on episodes of 25 steps and evaluated on episodes of 200 steps. This agent learns to pick up shields early in the episode while monsters are present, but does not switch fully to collecting apples once all the monsters are gone, continuing to collect many unneeded shields.
Also see: randomly selected videos.
We show cherry-picked videos of the agent trained on episodes of 100 steps and evaluated on episodes of 200 steps. This agent learns to pick up shields early in the episode while monsters are present, but switches almost exclusively to collecting apples once the monsters are gone.
Also see: randomly selected videos.
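For reference, the only difference between the two agents above is the episode length used during training; both are evaluated on 200-step episodes. A minimal configuration sketch (names are ours):

```python
# Minimal sketch (hypothetical names): training vs. evaluation episode lengths.
TRAIN_EPISODE_STEPS_SHORT = 25    # agent that keeps collecting extra shields
TRAIN_EPISODE_STEPS_LONG  = 100   # agent that switches to apples once monsters are gone
EVAL_EPISODE_STEPS        = 200   # evaluation horizon used for both agents
```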
This is a 3D simulated environment where the intended behavior is to visit a randomly selected permutation of goal locations (marked by large translucent spheres) in order. During training, agents are paired with an expert bot that always visits the goal locations in the correct order and drops out with some probability partway through the episode. Agents are trained using RL, with a reward for visiting the correct next location and a penalty for visiting an incorrect one.
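As a minimal sketch of this reward scheme (the function name and the penalty magnitude are assumptions, not the environment's actual values):

```python
# Minimal sketch (hypothetical names; penalty magnitude assumed) of the ordered-goal reward.
def goal_visit_reward(visited_goal, goal_order, next_index, penalty=-1.0):
    """Return (reward, updated next_index) when the agent enters a goal sphere."""
    if visited_goal == goal_order[next_index]:
        return 1.0, next_index + 1      # correct next location in the permutation
    return penalty, next_index          # incorrect location; required order unchanged
```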
We show randomly selected videos of an agent paired with an expert bot. The agent does about as well as the expert.
Also see: more randomly selected videos.
We show randomly selected videos of an agent paired with an anti-expert bot that visits the goal locations in a pessimal order. The agent continues to follow this partner, performing far worse than a random policy. Note that since the agent also receives its reward as part of its observation, it could in principle recognize that it is visiting the wrong locations, as the sketch below illustrates.
Also see: more randomly selected videos.
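The sketch below illustrates the point about rewards appearing in the observation; the observation structure and names here are assumptions for illustration, not the environment's actual interface.

```python
# Minimal sketch (assumed observation structure): the previous reward is part of
# the agent's observation, so a wrong-order visit is detectable in principle.
import numpy as np

def build_observation(rgb_frame, last_reward):
    """Bundle the visual frame with the previous step's reward."""
    return {"rgb": rgb_frame, "last_reward": np.float32(last_reward)}

obs = build_observation(np.zeros((72, 96, 3), dtype=np.uint8), last_reward=-1.0)
assert obs["last_reward"] < 0   # negative reward => the previous goal visit was wrong
```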