D2RL: Deep Dense Architectures in Reinforcement Learning

Samarth Sinha*, Homanga Bharadhwaj*, Aravind Srinivas, and Animesh Garg

University of Toronto, Vector Institute, University of Toronto Robotics Institute

University of California, Berkeley

While improvements in deep learning architectures have played a crucial role in advancing the state of the art in supervised and unsupervised learning for computer vision and natural language processing, neural network architecture choices for reinforcement learning remain relatively under-explored.

We take inspiration from successful architectural choices in computer vision and generative modeling, and investigate the use of deeper networks and dense connections for reinforcement learning on a variety of simulated robotic learning benchmark environments.

Our findings reveal that current methods benefit significantly from dense connections and deeper networks, across a suite of manipulation and locomotion tasks, for both proprioceptive and image-based observations. We hope that our results can serve as a strong baseline and further motivate future research into neural network architectures for reinforcement learning.


Motivation: The effect of increasing the number of fully-connected layers used to parameterize the policy and Q-networks for Soft Actor-Critic on Ant-v2 from the OpenAI Gym suite. Performance drops when depth is increased beyond 2 layers for plain fully-connected networks. However, we want to train deeper networks to enable better feature extraction and learning. Our D2RL agent with 4 layers does not suffer from this performance drop and outperforms all of the fully-connected agents, irrespective of their depth.

The D2RL Architecture

Overview of the D2RL architecture for the policy and Q-networks: the inputs are passed to each layer of the neural network through identity mappings. A forward pass corresponds to moving from left to right in the figure. For state-based environments, s_t is the observed simulator state and there is no convolutional encoder.
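In practice, passing the input to every layer amounts to concatenating the original input to the output of each hidden layer before the next linear map. Below is a minimal PyTorch sketch of such a dense trunk; the class name D2RLTrunk, the hidden width, and the depth are illustrative assumptions, not the exact code from the repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class D2RLTrunk(nn.Module):
    """Illustrative dense MLP trunk: the input is re-concatenated to every hidden layer."""

    def __init__(self, in_dim, hidden_dim=256, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(in_dim, hidden_dim)])
        # After the first layer, each layer sees [hidden features, original input].
        for _ in range(num_layers - 1):
            self.layers.append(nn.Linear(hidden_dim + in_dim, hidden_dim))

    def forward(self, x):
        h = F.relu(self.layers[0](x))
        for layer in self.layers[1:]:
            h = F.relu(layer(torch.cat([h, x], dim=-1)))
        return h
```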

Results on continuous control environments

Results on challenging robot control environments: Comparison of the proposed variation D2RL and baselines on a suite of challenging manipulation and locomotion environments. We apply the D2RL modification to the SAC, HER, and HIRO algorithms and compare relative performance, in terms of average episodic rewards, with respect to the baselines. Task complexity increases from Fetch Reach to Fetch Slide. Jaco Reach is challenging due to its high-dimensional torque-controlled action space, AntMaze requires exploration to solve a temporally extended problem, and Furniture BlockJoin requires solving two sub-tasks, join and lift, sequentially. The error bars are with respect to 5 random seeds.




Results on DeepMind Control Suite benchmark environments from images (CURL) and states (SAC). Results of CURL, CURL-D2RL, SAC, and SAC-D2RL on the standard DM Control Suite benchmark environments. CURL and CURL-D2RL are trained purely from pixel observations, while SAC and SAC-D2RL are trained from proprioceptive features. The results for CURL are taken directly as reported in the original paper. The standard deviation is computed over 5 random seeds.

PyTorch code snippet for the Policy and Q-Network architectures of D2RL. It can be incorporated directly into standard actor-critic RL algorithms such as SAC and DDPG.
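As a sketch of how the pieces fit together, the snippet below wraps the dense trunk shown earlier into a Gaussian policy and a Q-network. The class names, hidden sizes, and log-std clamping range are assumptions for illustration, not the exact code from the repository.

```python
import torch
import torch.nn as nn

class D2RLPolicy(nn.Module):
    """Gaussian policy head on top of the D2RL-style dense trunk (illustrative)."""

    def __init__(self, state_dim, action_dim, hidden_dim=256, num_layers=4):
        super().__init__()
        self.trunk = D2RLTrunk(state_dim, hidden_dim, num_layers)  # trunk sketched above
        self.mu = nn.Linear(hidden_dim, action_dim)
        self.log_std = nn.Linear(hidden_dim, action_dim)

    def forward(self, state):
        h = self.trunk(state)
        mu = self.mu(h)
        log_std = self.log_std(h).clamp(-20, 2)  # assumed clamping range
        return mu, log_std


class D2RLQNetwork(nn.Module):
    """Q-network: state and action are concatenated and densely fed to every layer."""

    def __init__(self, state_dim, action_dim, hidden_dim=256, num_layers=4):
        super().__init__()
        self.trunk = D2RLTrunk(state_dim + action_dim, hidden_dim, num_layers)
        self.q = nn.Linear(hidden_dim, 1)

    def forward(self, state, action):
        return self.q(self.trunk(torch.cat([state, action], dim=-1)))
```

These modules can stand in for the plain MLP actor and critic of an existing SAC or DDPG implementation without changing the rest of the training loop.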

For more details, please check out our paper and refer to the linked GitHub repo.