Variational Dynamic for Self-Supervised Exploration
in Deep Reinforcement Learning
IEEE Transactions on Neural Networks and Learning Systems, 2021
Videos of the last saved policy in standard Atari games. The extrinsic reward is used to measure performance; it is important to note that the extrinsic reward is used only for evaluation, never for training. Our method performs best in most games. A minimal sketch of this evaluation protocol follows the list of games below.
Games shown: MontezumaRevenge, Gravitar, Venture, Solaris, Alien, Asterix, Boxing, Breakout, MsPacman, Seaquest, Hero, Pong, Qbert, Riverraid, SpaceInvaders, Tennis, Robotank, Frostbite.
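The snippet below is a minimal, hypothetical sketch of the evaluation protocol described above; the random policy and the constant bonus are placeholders for the actual VDM agent and its intrinsic reward, and the classic `gym` API is assumed.

```python
import gym
import numpy as np

# Assumes the classic `gym` reset/step API and an installed Atari backend.
env = gym.make("MontezumaRevengeNoFrameskip-v4")
obs = env.reset()
episode_return = 0.0  # extrinsic return, reported only as an evaluation metric

for _ in range(1000):
    action = env.action_space.sample()            # placeholder for the learned policy
    next_obs, extrinsic_reward, done, _ = env.step(action)

    intrinsic_reward = np.random.rand()           # placeholder for the curiosity bonus
    # A real agent would be updated here using `intrinsic_reward` only;
    # the extrinsic reward never enters the training objective.

    episode_return += extrinsic_reward            # logged for evaluation only
    if done:
        print("episode extrinsic return:", episode_return)
        episode_return = 0.0
        obs = env.reset()
    else:
        obs = next_obs
```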
Videos of the last saved policy in Atari games with sticky actions. This environment variant is used to evaluate the robustness of exploration methods by introducing stochasticity into Atari games: at each time step, the environment executes the agent's previous action again with probability 0.25. A minimal sticky-action wrapper is sketched after the list of games below.
Learning is more challenging in sticky Atari games. The performance of VDM is less affected by sticky actions because the latent variables encode the stochasticity of the environment.
Games shown: MontezumaRevenge, Gravitar, Venture, Solaris, Alien, Asterix, Boxing, Breakout, MsPacman, Seaquest, Hero, Pong, Qbert, Riverraid, SpaceInvaders, Tennis, Robotank, Frostbite.
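The wrapper below is a minimal sketch of the sticky-action protocol described above; the class name and the use of the classic `gym` API are illustrative assumptions, not the released code.

```python
import gym
import numpy as np


class StickyActionEnv(gym.Wrapper):
    """Sticky actions: replay the previous action with probability `repeat_prob`."""

    def __init__(self, env, repeat_prob=0.25):
        super().__init__(env)
        self.repeat_prob = repeat_prob
        self.last_action = 0

    def reset(self, **kwargs):
        self.last_action = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        if np.random.rand() < self.repeat_prob:
            action = self.last_action  # ignore the new action and repeat the old one
        self.last_action = action
        return self.env.step(action)


# Usage: wrap a standard Atari environment to obtain its sticky variant.
env = StickyActionEnv(gym.make("MontezumaRevengeNoFrameskip-v4"), repeat_prob=0.25)
```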
Super Mario is a popular game with several levels and different scenarios. We evaluate the transferability of the pure-exploration policy by adapting the policy learned in Level 1 to other levels (a minimal transfer sketch follows the level captions below).
Level 1 of the game has both day and night scenarios.
We train VDM from scratch in Level 1.
Transfer the policy from Level 1 to Level 2
Transfer the policy from Level 1 to Level 3
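Below is a minimal, hypothetical sketch of this transfer setup; the PolicyNetwork architecture and the checkpoint file name are illustrative placeholders, not the actual VDM implementation. The Level-1 weights initialise the policy, which is then adapted in the target level with the same intrinsic-reward-only objective.

```python
import torch
import torch.nn as nn


class PolicyNetwork(nn.Module):
    """Placeholder convolutional policy standing in for the VDM policy network."""

    def __init__(self, n_actions=12):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Linear(64 * 9 * 9, n_actions)  # assumes 84x84 stacked frames

    def forward(self, obs):
        return self.head(self.features(obs))


policy = PolicyNetwork()

# Transfer: initialise from the Level-1 checkpoint instead of training from scratch,
# then continue pure-exploration (intrinsic-reward-only) training in Level 2 or 3.
state_dict = torch.load("vdm_mario_level1.pt")  # hypothetical checkpoint path
policy.load_state_dict(state_dict)
```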
Both sides of the game are controlled by curiosity-driven agents that fight against each other. The stochasticity comes from the opponent, whose policy is also evolving.