A Deep Continuous Distributional Actor-Critic Agent with Kalman Fusion of Critics for Reinforcement Learning

What do you find here?

This post is the complementary material of our paper "A Deep Continuous Distributional Actor-Critic Agent with Kalman Fusion of Critics for Reinforcement Learning". Through a continuous distribution, we aim to encourage an RL agent to learn a policy faster and better without any human demonstration. We mitigate the problems related to categorical distributions by presenting an algorithm that is easy to train and delivers effective results. As an additional contribution to mitigate overestimation bias, we present an ensemble of multiple critics fused through a Kalman filter mechanism.


In this post, you can find the reward curves, an analysis of the results, and several videos of the agents solving a range of tasks where the only available information comes directly from IMAGES.

In categorical RL, computing the loss function is not straightforward: the current distribution and the target distribution have disjoint supports. The supports of the target distribution are shifted by the reward and the discount factor. Therefore, minimizing the loss is not always directly possible, and a complex projection or approximation step is needed to match the target supports onto the current prediction supports.
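To make this concrete, the following is a minimal NumPy sketch of the C51-style projection step that categorical algorithms need in order to compare the two distributions; the support bounds, the atom count, and the batch layout are assumptions for illustration and are not taken from our code.

```python
import numpy as np

def project_categorical_target(rewards, dones, target_probs, gamma,
                               v_min=-10.0, v_max=10.0):
    """Project the Bellman-shifted target atoms back onto the fixed support (C51-style)."""
    batch_size, n_atoms = target_probs.shape
    delta_z = (v_max - v_min) / (n_atoms - 1)
    support = np.linspace(v_min, v_max, n_atoms)              # fixed atoms z_j

    # Shift the target support: T z_j = r + gamma * z_j, clipped to [v_min, v_max]
    tz = rewards[:, None] + gamma * (1.0 - dones[:, None]) * support[None, :]
    tz = np.clip(tz, v_min, v_max)

    # Split each shifted atom's probability mass between its two neighbouring fixed atoms
    b = (tz - v_min) / delta_z
    lower = np.floor(b).astype(int)
    upper = np.ceil(b).astype(int)
    projected = np.zeros((batch_size, n_atoms))
    for i in range(batch_size):
        for j in range(n_atoms):
            if lower[i, j] == upper[i, j]:                    # lands exactly on a fixed atom
                projected[i, lower[i, j]] += target_probs[i, j]
            else:
                projected[i, lower[i, j]] += target_probs[i, j] * (upper[i, j] - b[i, j])
                projected[i, upper[i, j]] += target_probs[i, j] * (b[i, j] - lower[i, j])
    return projected
```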


In CTD4, the critic network structure is modified to output the µ and σ that parameterize the normal distribution approximating Z, making it a continuous distributional RL algorithm in which no projection or other complex step is needed to minimize the loss between the current and target estimates.
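As a minimal sketch of what such a critic can look like, the PyTorch snippet below outputs µ and σ for a (state, action) pair and computes a closed-form KL divergence between the predicted and target Gaussians as one possible loss; the layer sizes, the softplus parameterization of σ, and the choice of KL divergence are assumptions for illustration, not necessarily the exact design used in the paper.

```python
import torch
import torch.nn as nn

class GaussianCritic(nn.Module):
    """Critic head that outputs the mean and standard deviation of a normal Z-estimate."""
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.mu_head = nn.Linear(hidden_dim, 1)
        self.sigma_head = nn.Linear(hidden_dim, 1)

    def forward(self, state, action):
        h = self.trunk(torch.cat([state, action], dim=-1))
        mu = self.mu_head(h)
        sigma = nn.functional.softplus(self.sigma_head(h)) + 1e-6   # keep sigma strictly positive
        return mu, sigma


def gaussian_critic_loss(mu, sigma, target_mu, target_sigma):
    """Closed-form divergence between two Gaussians: no projection step is required."""
    pred = torch.distributions.Normal(mu, sigma)
    target = torch.distributions.Normal(target_mu, target_sigma)
    return torch.distributions.kl_divergence(target, pred).mean()
```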

CTD4 Architecture

One of the objectives of our implementation is an algorithm that is easy to train. We present this architecture as an actor with multiple continuous distributional critics. The solid lines represent the forward computation; the dotted lines mark the division between the actor and the critics.


The CTD4 architecture consists of an actor network and N critic networks.
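As a rough PyTorch sketch of how such an ensemble can be wired up (the class names, layer sizes, and default of three critics are assumptions; the critic factory can be, for example, the GaussianCritic sketch above):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy network; the layer sizes and tanh-bounded output are assumptions."""
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)


class ActorCriticEnsemble(nn.Module):
    """One actor and an ensemble of N distributional critics."""
    def __init__(self, state_dim, action_dim, critic_factory, n_critics=3):
        super().__init__()
        self.actor = Actor(state_dim, action_dim)
        self.critics = nn.ModuleList(
            critic_factory(state_dim, action_dim) for _ in range(n_critics)
        )

    def evaluate(self, state, action):
        # Each critic returns its own (mu, sigma) estimate of the return distribution
        return [critic(state, action) for critic in self.critics]
```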

To determine the best fusion method, we ran an experiment where we trained on the same task under the same conditions, changing only the fusion method.
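For reference, the snippet below shows standard scalar Kalman fusion of N Gaussian estimates, where each critic's (µ, σ) is treated as a noisy measurement of the same return; this is an illustrative sketch of the mechanism, and the exact update used in the paper may differ in its details.

```python
import numpy as np

def kalman_fuse(mus, sigmas):
    """Sequentially fuse N Gaussian estimates (mu_i, sigma_i) into a single Gaussian."""
    fused_mu, fused_var = mus[0], sigmas[0] ** 2
    for mu, sigma in zip(mus[1:], sigmas[1:]):
        gain = fused_var / (fused_var + sigma ** 2)       # Kalman gain
        fused_mu = fused_mu + gain * (mu - fused_mu)      # corrected mean
        fused_var = (1.0 - gain) * fused_var              # reduced variance
    return fused_mu, np.sqrt(fused_var)

# Example: three critics disagreeing on the value of the same state-action pair
mu, sigma = kalman_fuse(np.array([10.0, 12.0, 9.0]), np.array([2.0, 1.0, 3.0]))
```

The fused variance is never larger than the smallest individual variance, which is one motivation for combining the critics this way rather than relying on any single estimate.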

In the same way, a comparative analysis was conducted to determine the optimal ensemble size: we trained on the same task under the same conditions, changing only the number of critics in the ensemble.

Hyperparameters and Training Algorithm

Experiments

Acrobot Swingup

Ball in cup Catch

Cartpole Swingup

Cheetah Run

Finger Turn Hard

Fish Swim

Hopper Hop

Humanoid

Walker Walk 

Tested Environments

We train and test our algorithm on ten complex continuous control tasks from the DeepMind Control Suite. Some of these environments have a very sparse reward, while others are complex to solve or involve contact or balance.
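For reference, here is a minimal interaction loop with one of these tasks through the dm_control suite API; the random policy, the 84x84 render size, and the camera id are placeholders for illustration, not the training setup from the paper.

```python
import numpy as np
from dm_control import suite

# Load one of the tested tasks; domain and task names follow the DeepMind Control Suite API
env = suite.load(domain_name="cheetah", task_name="run")
action_spec = env.action_spec()

time_step = env.reset()
episode_return = 0.0
while not time_step.last():
    # Random actions, only to illustrate the interaction loop (not the trained agent)
    action = np.random.uniform(action_spec.minimum, action_spec.maximum,
                               size=action_spec.shape)
    time_step = env.step(action)
    episode_return += time_step.reward

    # Pixel observations, as used in the image-based experiments
    pixels = env.physics.render(height=84, width=84, camera_id=0)

print("Episode return:", episode_return)
```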


Reacher Hard

Results