Bohan Zhou, Ke Li, Jiechuan Jiang, Zongqing Lu
PKU BAAI
Learning from visual observation (LfVO), which aims to recover policies from visual observation data alone, is a promising yet challenging problem. Existing LfVO approaches either adopt inefficient online learning schemes or require additional task-specific information such as goal states, making them ill-suited for open-ended tasks.
STG for LfVO
We propose a two-stage framework for learning from visual observation. The first stage pretrains three components concurrently. A feature encoder is trained in a self-supervised manner to provide easily predictable, temporally aligned representations of stacked-image states. The State-to-Go (STG) Transformer is trained adversarially to accurately predict transitions in latent space. A discriminator is updated simultaneously to distinguish predicted state transitions from expert transitions, providing high-quality intrinsic rewards for downstream online reinforcement learning in the second stage.
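The first-stage pretraining can be sketched as a minimal adversarial loop: the STG model predicts the next latent state, while the discriminator learns to tell expert transitions from predicted ones. All module sizes, names, and architectures below are illustrative placeholders (e.g. a linear layer stands in for the convolutional encoder and the GPT-style transformer), not the paper's exact implementation, and the TDR auxiliary loss and WGAN gradient penalty are omitted for brevity.

```python
import torch
import torch.nn as nn

LATENT = 32  # illustrative latent dimension

# Feature encoder: maps flattened stacked-image states to latent embeddings.
encoder = nn.Linear(3 * 84 * 84, LATENT)

# Stand-in for the STG Transformer: predicts the next latent from the current one.
stg = nn.Sequential(nn.Linear(LATENT, 64), nn.ReLU(), nn.Linear(64, LATENT))

# Discriminator: scores a latent transition (z_t, z_{t+1}).
disc = nn.Sequential(nn.Linear(2 * LATENT, 64), nn.ReLU(), nn.Linear(64, 1))

opt_gen = torch.optim.Adam(list(encoder.parameters()) + list(stg.parameters()), lr=1e-4)
opt_disc = torch.optim.Adam(disc.parameters(), lr=1e-4)

def train_step(obs_t, obs_t1):
    """One adversarial update on a batch of expert (o_t, o_{t+1}) pairs."""
    z_t, z_t1 = encoder(obs_t), encoder(obs_t1)
    z_pred = stg(z_t)

    # Critic objective (WGAN-style): expert transitions score high,
    # predicted transitions score low.
    d_loss = disc(torch.cat([z_t, z_pred], -1).detach()).mean() \
           - disc(torch.cat([z_t, z_t1], -1).detach()).mean()
    opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()

    # Generator (encoder + STG): make predicted transitions look expert-like.
    g_loss = -disc(torch.cat([encoder(obs_t), stg(encoder(obs_t))], -1)).mean()
    opt_gen.zero_grad(); g_loss.backward(); opt_gen.step()
    return d_loss.item(), g_loss.item()

# Dummy expert batch of 8 consecutive observation pairs.
obs_t = torch.randn(8, 3 * 84 * 84)
obs_t1 = torch.randn(8, 3 * 84 * 84)
d, g = train_step(obs_t, obs_t1)
```

In the full method this loop would run over sequences from expert videos, with the self-supervised encoder objective and TDR trained alongside.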
State-To-Go Transformer
Built upon GPT, the State-To-Go (STG) Transformer predicts the next state embedding given a sequence of states. An additional self-supervised auxiliary module with 1D attention, the temporal distance regressor (TDR), is devised to ensure temporally aligned visual embeddings. A learned WGAN-based discriminator distinguishes between expert and non-expert transitions without collecting online negative samples, providing an offline way to generate intrinsic rewards for a PPO agent on downstream reinforcement learning tasks.
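At a high level, the second stage only needs the frozen pretrained modules to score the agent's transitions. The sketch below shows one plausible way to turn the discriminator's score into a per-step intrinsic reward for PPO; the module shapes, names, and the use of a raw critic score as the reward are assumptions for illustration, not the paper's exact reward formulation.

```python
import torch
import torch.nn as nn

LATENT = 32  # illustrative latent dimension

# Frozen pretrained components (placeholders for the real encoder/discriminator).
encoder = nn.Linear(3 * 84 * 84, LATENT)
disc = nn.Sequential(nn.Linear(2 * LATENT, 64), nn.Tanh(), nn.Linear(64, 1))

@torch.no_grad()
def intrinsic_reward(obs_t, obs_t1):
    """Score how expert-like the agent's observed transition (o_t, o_{t+1}) looks."""
    z = torch.cat([encoder(obs_t), encoder(obs_t1)], dim=-1)
    return disc(z).squeeze(-1)  # one scalar reward per transition in the batch

# A batch of 4 transitions collected by the online agent.
r = intrinsic_reward(torch.randn(4, 3 * 84 * 84), torch.randn(4, 3 * 84 * 84))
```

These rewards would replace (or supplement) environment rewards in the PPO rollout buffer, so the agent needs no access to expert actions or task rewards.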
STG for Atari Visual Control Tasks
Breakout
Freeway
Qbert
SpaceInvaders
STG for Open-Ended Minecraft Tasks
Pick a flower
Milk a cow
Harvest tallgrass
Gather wool
Empirical results on Atari and Minecraft demonstrate strong performance on LfVO problems, shedding light on the potential of using video-only data, rather than complete offline datasets containing states, actions, and rewards, to solve difficult visual reinforcement learning tasks.