The reaction of an agent trained on Seaquest to the introduction of a new enemy (fish).
Note that the fish is inserted at the pixel level, not at the engine level, so the agent can't actually interact with it.
This video shows the attention maps on Enduro, colored according to whether the query is more spatial (blue), more content-based (red), or balanced between the two (white). For each pixel, we sum the logits over the spatial channels and over the content channels, then take the difference between the two sums. We clip this difference to the range [-log(10), log(10)] and then weight each pixel by the attention weights for the frame. In each frame, the minimum value is assigned bright blue and the maximum bright red.
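The coloring described above can be sketched in NumPy as follows. This is a minimal illustration and not the code used to produce the video. The function name `spatial_content_overlay`, the argument names, and the channel layout (content channels first, spatial channels last) are hypothetical. Taking the difference as content minus spatial is an assumption chosen so that red means content-based. Scaling the negative and positive sides separately so that zero stays white is one possible reading of the per-frame normalization.

```python
import numpy as np

def spatial_content_overlay(logit_terms, attention, n_spatial):
    """Color one frame by how spatial vs. content-based the query is.

    logit_terms: (H, W, C) per-channel contributions to the attention logit
                 at each pixel; the last n_spatial channels are assumed to be
                 the spatial channels (hypothetical layout, n_spatial >= 1).
    attention:   (H, W) attention weights for the frame.
    Returns an (H, W, 3) RGB array with values in [0, 1].
    """
    # Sum the logits over the content channels and over the spatial channels.
    content = logit_terms[..., :-n_spatial].sum(axis=-1)
    spatial = logit_terms[..., -n_spatial:].sum(axis=-1)

    # Assumed sign convention: positive means more content-based (red).
    diff = content - spatial

    # Clip the difference to [-log(10), log(10)].
    bound = np.log(10.0)
    diff = np.clip(diff, -bound, bound)

    # Weight each pixel by its attention weight.
    score = diff * attention

    # Per-frame normalization: the minimum maps to -1 (bright blue) and the
    # maximum to +1 (bright red). Each side is scaled separately (assumption).
    lo, hi = score.min(), score.max()
    t = np.zeros_like(score)
    if hi > 0:
        t = np.where(score > 0, score / hi, t)
    if lo < 0:
        t = np.where(score < 0, score / -lo, t)

    # Diverging blue-white-red map: start from white and remove channels.
    pos = np.clip(t, 0.0, 1.0)
    neg = np.clip(-t, 0.0, 1.0)
    rgb = np.ones(score.shape + (3,))
    rgb[..., 0] -= neg        # blue side loses red
    rgb[..., 1] -= neg + pos  # both sides lose green
    rgb[..., 2] -= pos        # red side loses blue
    return rgb
```

The returned array can be blended with the game frame for display. Pixels with near-zero attention stay close to white regardless of the logit difference, because the difference is multiplied by the attention weight before normalization.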
In this video we show the saliency maps on Ms Pacman for a baseline agent and for our attention agent. For the attention agent, we also show the most similar attention map on the same frame. The frames are not aligned between the agents (because the agents act with different policies), but they go through a range of similar situations.