Recent exploration methods have proven effective for improving sample-efficiency in deep reinforcement learning (RL). However, efficient exploration in high-dimensional observation spaces remains a challenge. This paper presents Random Encoders for Efficient Exploration (RE3), an exploration method that utilizes state entropy as an intrinsic reward. To estimate state entropy in environments with high-dimensional observations, we utilize a k-nearest neighbor entropy estimator in the low-dimensional representation space of a convolutional encoder. In particular, we find that state entropy can be estimated in a stable and compute-efficient manner by utilizing a randomly initialized encoder, which is fixed throughout training. Our experiments show that RE3 significantly improves the sample-efficiency of both model-free and model-based RL methods on locomotion and navigation tasks from the DeepMind Control Suite and MiniGrid benchmarks. We also show that RE3 enables learning diverse behaviors without extrinsic rewards, effectively improving sample-efficiency on downstream tasks.
Representation space of randomly initialized encoders
Our main hypothesis is that a randomly initialized encoder can provide a meaningful representation space for state entropy estimation by exploiting the strong prior of convolutional architectures. Ulyanov et al. (2018) and Caron et al. (2018) found that the structure alone of deep convolutional networks is a powerful inductive bias that allows relevant features to be extracted for tasks such as image generation and classification. In our case, we find that the representation space of a randomly initialized encoder effectively captures information about similarity between states, as shown below. Based upon this observation, we propose to maximize a state entropy estimate in the fixed representation space of a randomly initialized encoder.
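The core property this relies on is that a fixed random encoder roughly preserves similarity between states: observations that are close stay close after encoding. The following is a minimal sketch of that idea, using a fixed random linear projection as a hypothetical stand-in for the paper's convolutional encoder (the dimensions and the toy states are illustrative assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the paper's conv encoder: a fixed random
# linear projection from "observation" space (dim 1024) to a
# low-dimensional representation (dim 32). The paper uses a randomly
# initialized CNN; the key property (random weights, frozen forever)
# is the same.
obs_dim, rep_dim = 1024, 32
W = rng.normal(size=(rep_dim, obs_dim)) / np.sqrt(obs_dim)

def encode(obs):
    # W is never updated: the encoder stays fixed throughout training.
    return W @ obs

# Toy check: a state near s0 in observation space stays near it in
# representation space, while a distant state stays distant.
s0 = rng.normal(size=obs_dim)
s_near = s0 + 0.01 * rng.normal(size=obs_dim)
s_far = rng.normal(size=obs_dim)

d_near = np.linalg.norm(encode(s0) - encode(s_near))
d_far = np.linalg.norm(encode(s0) - encode(s_far))
print(d_near < d_far)  # random projections roughly preserve distances
```

This distance-preservation behavior is what makes nearest-neighbor lookups in the random representation space meaningful for entropy estimation.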
Visualization of k-nearest neighbors of states found by measuring distances in the representation space of a randomly initialized encoder (Random Encoder) and ground-truth state space (True State).
We present Random Encoders for Efficient Exploration (RE3), which encourages exploration in high-dimensional observation spaces by maximizing state entropy. The key idea of RE3 is k-nearest neighbor entropy estimation in the low-dimensional representation space of a randomly initialized encoder. To this end, we compute distances between states in the representation space of a random encoder whose parameters are fixed at their randomly initialized values throughout training.
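Concretely, the k-nearest neighbor entropy estimator assigns each state an intrinsic reward that grows with the distance to its k-th nearest neighbor in representation space, so states in sparsely visited regions get larger bonuses. A minimal sketch of this reward computation (the function name, batch shapes, and the brute-force pairwise-distance loop are illustrative assumptions; RE3 uses the log-distance form r_i = log(||y_i - y_i^(k)|| + 1)):

```python
import numpy as np

def re3_intrinsic_reward(reps, k=3):
    """k-NN state-entropy intrinsic reward, one value per state.

    reps: (N, d) array of representations y_i produced by the fixed
    random encoder. The reward for state i is log(||y_i - y_i^(k)|| + 1),
    where y_i^(k) is the k-th nearest neighbor of y_i within `reps`.
    """
    # Pairwise Euclidean distances (N x N); brute force is fine for
    # small minibatches, a k-d tree would scale better.
    diffs = reps[:, None, :] - reps[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Sort each row; index 0 is the point itself (distance 0), so the
    # k-th nearest neighbor sits at sorted index k.
    knn_dist = np.sort(dists, axis=1)[:, k]
    return np.log(knn_dist + 1.0)

# Usage on random toy representations.
rng = np.random.default_rng(1)
reps = rng.normal(size=(16, 8))
rewards = re3_intrinsic_reward(reps, k=3)
print(rewards.shape)  # one intrinsic reward per state
```

Because the encoder is never trained, these representations (and hence the distances) can be computed once per observation and reused, which is what makes the estimate stable and compute-efficient.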
RE3 consistently improves the sample-efficiency of RAD and Dreamer on various tasks and outperforms other exploration methods (ICM, RND) in DeepMind Control Suite.
(a) Comparison to RAD, DrQ, Dreamer
(b) Comparison to other exploration methods
Unsupervised pre-training and fine-tuning
A policy pre-trained with RE3 exhibits more diverse behaviors than one trained with random exploration or APT (Liu & Abbeel, 2021), which pre-trains a policy to maximize a state entropy estimate in a contrastive representation space.
Fine-tuning on Hopper Hop
Fine-tuning on Hopper Stand
RE3 is more effective at improving the sample-efficiency of A2C than other exploration methods such as RND and ICM. Moreover, pre-training a policy with RE3 on a large empty room (Empty-16x16) further improves downstream task performance, even on DoorKey tasks.