Variational Dynamic for Self-Supervised Exploration
in Deep Reinforcement Learning
IEEE Transactions on Neural Networks and Learning Systems, 2021
Videos of the last saved policy in standard Atari games. The extrinsic reward is used to measure performance; it is important to note that the extrinsic reward is used only for evaluation, never for training. Our method performs best in most games. A minimal sketch of this evaluation protocol follows the list of games below.
Games shown: MontezumaRevenge, Gravitar, Venture, Solaris, Alien, Asterix, Boxing, Breakout, MsPacman, Seaquest, Hero, Pong, Qbert, Riverraid, SpaceInvaders, Tennis, Robotank, Frostbite.
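The snippet below is a minimal, hypothetical sketch of the evaluation protocol described above; the random policy and the constant bonus are placeholders for the actual VDM agent and its intrinsic reward, and the classic `gym` API is assumed.

```python
import gym
import numpy as np

# Assumes the classic `gym` reset/step API and an installed Atari backend.
env = gym.make("MontezumaRevengeNoFrameskip-v4")
obs = env.reset()
episode_return = 0.0  # extrinsic return, reported only as an evaluation metric

for _ in range(1000):
    action = env.action_space.sample()            # placeholder for the learned policy
    next_obs, extrinsic_reward, done, _ = env.step(action)

    intrinsic_reward = np.random.rand()           # placeholder for the curiosity bonus
    # A real agent would be updated here using `intrinsic_reward` only;
    # the extrinsic reward never enters the training objective.

    episode_return += extrinsic_reward            # logged for evaluation only
    if done:
        print("episode extrinsic return:", episode_return)
        episode_return = 0.0
        obs = env.reset()
    else:
        obs = next_obs
```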
Videos of the last saved policy in Atari games with sticky actions. This environment variant is used to evaluate the robustness of exploration methods by introducing stochasticity into Atari games: at each time step, the environment executes the agent's previous action again with probability 0.25. A minimal sticky-action wrapper is sketched after the list of games below.
Learning is more challenging in sticky Atari games. The performance of VDM is less affected by sticky actions because the latent variables encode the stochasticity of the environment.
Games shown: MontezumaRevenge, Gravitar, Venture, Solaris, Alien, Asterix, Boxing, Breakout, MsPacman, Seaquest, Hero, Pong, Qbert, Riverraid, SpaceInvaders, Tennis, Robotank, Frostbite.
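The wrapper below is a minimal sketch of the sticky-action protocol described above; the class name and the use of the classic `gym` API are illustrative assumptions, not the released code.

```python
import gym
import numpy as np


class StickyActionEnv(gym.Wrapper):
    """Sticky actions: replay the previous action with probability `repeat_prob`."""

    def __init__(self, env, repeat_prob=0.25):
        super().__init__(env)
        self.repeat_prob = repeat_prob
        self.last_action = 0

    def reset(self, **kwargs):
        self.last_action = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        if np.random.rand() < self.repeat_prob:
            action = self.last_action  # ignore the new action and repeat the old one
        self.last_action = action
        return self.env.step(action)


# Usage: wrap a standard Atari environment to obtain its sticky variant.
env = StickyActionEnv(gym.make("MontezumaRevengeNoFrameskip-v4"), repeat_prob=0.25)
```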
Super Mario is a popular game with several levels and different scenarios. We evaluate the transferability of the pure-exploration policy by adapting the policy learned in Level 1 to other levels (a minimal transfer sketch follows the level captions below).
Level 1 of the game has both day and night scenarios.
We train VDM from scratch in Level 1.
Transfer the policy from Level 1 to Level 2
Transfer the policy from Level 1 to Level 3
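Below is a minimal, hypothetical sketch of this transfer setup; the PolicyNetwork architecture and the checkpoint file name are illustrative placeholders, not the actual VDM implementation. The Level-1 weights initialise the policy, which is then adapted in the target level with the same intrinsic-reward-only objective.

```python
import torch
import torch.nn as nn


class PolicyNetwork(nn.Module):
    """Placeholder convolutional policy standing in for the VDM policy network."""

    def __init__(self, n_actions=12):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Linear(64 * 9 * 9, n_actions)  # assumes 84x84 stacked frames

    def forward(self, obs):
        return self.head(self.features(obs))


policy = PolicyNetwork()

# Transfer: initialise from the Level-1 checkpoint instead of training from scratch,
# then continue pure-exploration (intrinsic-reward-only) training in Level 2 or 3.
state_dict = torch.load("vdm_mario_level1.pt")  # hypothetical checkpoint path
policy.load_state_dict(state_dict)
```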
Both sides of the game are controlled by curiosity-driven agents that fight against each other. The stochasticity comes from the opponent, whose policy is also evolving.