We embed a few examples comparing the interpretability of our approach, a traditional CNN, and S3TA. Videos for all 57 ATARI games are provided at the end of the website.
We show examples where the S3TA visualization is considered interpretable (there are recognizable correlations between the attention mask and objects in the visual input).
Compared with the traditional CNN or S3TA visualizations, our proposed network produces an accurate, clear, and fully interpretable attention mask that shows "where" and "what" the model focuses on during the decision-making process.
Pong
Road Runner
Space Invaders
Hero
We show examples where S3TA is considered uninterpretable; note that the S3TA agent still performs well on all of these games.
Analyzing S3TA's interpretability, we found games where the model achieves good scores (the Human Normalized Scores of S3TA are (1) Boxing 743.6%, (2) Ice Hockey 64.1%, (3) Chopper Command 12.3%, and (4) Crazy Climber 643.9%) but the attention is uninterpretable (the attention mask appears almost random).
Boxing
Ice Hockey
Chopper Command
Crazy Climber
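For reference, the Human Normalized Score quoted above is commonly computed as (agent − random) / (human − random), expressed as a percentage. A minimal sketch (the example scores below are hypothetical and not taken from our experiments):

```python
def human_normalized_score(agent_score, random_score, human_score):
    """Human Normalized Score (HNS) as a percentage.

    100% corresponds to human-level play, 0% to random play;
    values above 100% indicate super-human performance.
    """
    return 100.0 * (agent_score - random_score) / (human_score - random_score)

# Hypothetical illustration: an agent scoring 80 on a game where
# random play scores 0 and a human scores 100 reaches 80% HNS.
print(human_normalized_score(80.0, 0.0, 100.0))  # → 80.0
```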
We also integrate the approach into the A3C LSTM framework, which differs from Rainbow in several aspects: learning category (on-policy), observation configuration (a single grayscale frame), and network components (an LSTM layer). A similarly accurate and highly interpretable visualization is also observed.
We release visualizations of all 57 games in Gym ATARI. Note that we use OpenAI Gym v0.18.0, which supports Pooyan but does not support Surround.