Goal Misgeneralization

Why Correct Specifications Aren’t Enough For Correct Goals

CoinRun

CoinRun is a simple 2D platformer in which the goal is to collect the coin while dodging enemies and obstacles.

By default, the agent spawns at the leftmost end of the level, while the coin is always at the rightmost end. We modify CoinRun to allow the coin to be placed at other locations in the level. This allows us to vary the range of positions where the coin can spawn during both training and testing.
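
The coin's position can be thought of as a fraction of the level's width. Below is a minimal sketch of how the train and test spawn distributions might be parameterized; the actual change lives inside the CoinRun level generator, and sample_coin_fraction and its argument are hypothetical names used only for illustration.

```python
import random

def sample_coin_fraction(spawn_window: float) -> float:
    """Sample the coin's horizontal position as a fraction of level width.

    spawn_window=0.0 reproduces the default setup (coin fixed at the far
    right); spawn_window=0.2 allows the coin anywhere in the right 20% of
    the level; spawn_window=1.0 allows it anywhere. Hypothetical helper,
    not the actual CoinRun level-generation code.
    """
    return random.uniform(1.0 - spawn_window, 1.0)

# Distributions described in this section (sketch):
train_default = sample_coin_fraction(0.0)   # training: coin always at the end
train_diverse = sample_coin_fraction(0.2)   # training: coin in the right 20%
test_anywhere = sample_coin_fraction(1.0)   # testing: coin anywhere in the level
```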

Default agent learns to go to the end of the level rather than getting the coin

We show cherry-picked videos of the agent trained with the coin only at the end of the level, and tested with the coin position randomly chosen to be anywhere in the level. This agent learns to go to the end of the level rather than picking up the coin. It visibly retains capabilities like dodging monsters and obstacles. The episode eventually times out, providing zero reward.

Also see: randomly selected videos.

Agent that sees a little diversity in coin position correctly generalizes to all coin positions

We show cherry-picked videos of the agent trained with the coin position randomly chosen to be in the right 20% of the level, and tested with the coin position randomly chosen to be anywhere in the level. This agent learns the intended goal of getting the coin regardless of where it's placed, and remains capable of dodging obstacles and monsters.

Also see: randomly selected videos.

Monster Gridworld

This RL environment is a fully observed 2D gridworld in which the agent must collect apples (+1 reward) while avoiding monsters (-1 reward) that chase it. The agent may also pick up shields for protection: when a monster makes contact while the agent holds a shield, the monster is destroyed and the shield is consumed (0 reward).
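
A minimal sketch of the reward structure just described; the function and event names are illustrative, not the environment's actual API.

```python
def event_reward(event: str, agent_has_shield: bool) -> float:
    """Per-event reward implied by the rules above (illustrative sketch)."""
    if event == "apple":
        return 1.0                                # collecting an apple
    if event == "monster_contact":
        # with a shield, the monster is destroyed and the shield is consumed
        return 0.0 if agent_has_shield else -1.0  # unshielded contact costs -1
    if event == "shield":
        return 0.0                                # picking up a shield gives no direct reward
    return 0.0
```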

Agent trained on 25-step episodes continues to collect shields over 200-step episodes even when monsters are gone

We show cherry-picked videos of the agent trained on episodes of 25 steps and evaluated on episodes of 200 steps. This agent learns to pick up shields early in the episode while monsters are present, but does not switch fully to collecting apples once all the monsters are gone, continuing to collect a large number of extra shields.

Also see: randomly selected videos.

Agent trained on 100-step episodes switches to apples over 200-step episodes even when monsters are gone

We show cherry-picked videos of the agent trained on episodes of 100 steps and evaluated on episodes of 200 steps. This agent learns to pick up shields early in the episode while monsters are present, and switches almost exclusively to collecting apples once the monsters are gone.

Also see: randomly selected videos.

Cultural Transmission

This is a simulated 3D environment where the intended behavior is to visit a randomly selected permutation of goal locations (marked by large translucent spheres) in order. During training, the agent is paired with an expert bot that always visits the goal locations in the correct order and probabilistically drops out partway through the episode. Agents are trained with RL, receiving a reward for visiting the correct next location and a penalty for visiting an incorrect one.
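
A minimal sketch of the reward just described; the name, signature, and reward magnitudes are illustrative rather than the environment's actual values.

```python
def goal_sphere_reward(entered: int, permutation: list, next_index: int) -> float:
    """Reward for entering a goal sphere, per the description above.

    Sketch only: rewards the correct next location in the hidden
    permutation and penalizes any other location.
    """
    if entered == permutation[next_index]:
        return 1.0   # correct next location
    return -1.0      # penalty for an incorrect next location
```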

Agent paired with an expert visits goal locations in the correct order

We show randomly selected videos of the agent paired with an expert bot. The agent performs about as well as the expert.

Also see: more randomly selected videos.

Agent paired with an anti-expert visits goal locations in a pessimal order

We show randomly selected videos of the agent paired with an anti-expert bot that visits goal locations in a pessimal order. The agent continues to follow its partner, performing far worse than a random policy would. Note that since the agent also receives its reward as part of its observation, it would in principle be straightforward for the agent to notice that it is visiting the wrong locations.
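
To make that point concrete, here is a tiny sketch of the kind of check the agent could in principle perform, given that the previous reward is part of its observation. This is entirely illustrative; the trained agent does not do this, which is the point.

```python
def partner_seems_wrong(last_reward: float) -> bool:
    """Illustrative only: a negative reward on the previous step signals
    that the last goal visit was incorrect, so blindly copying the
    partner is a bad strategy."""
    return last_reward < 0.0
```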

Also see: more randomly selected videos.