Image-Based Deep Reinforcement Learning with Intrinsically Motivated Stimuli: On the Execution of Complex Robotic Tasks

What will you find here?

This post is the complementary material for our paper "Image-Based Deep Reinforcement Learning with Intrinsically Motivated Stimuli: On the Execution of Complex Robotic Tasks", where we incorporate, analyse and demonstrate the use of intrinsic motivation signals, such as novelty and surprise, with an image-based deep reinforcement learning method. We aim to use intrinsic stimuli to encourage the agent to learn a faster and better policy, in both simulation and the real world, without any human demonstration. Our method builds on a basic convolutional autoencoder structure and an actor-critic-style model-free reinforcement learning algorithm; the autoencoder and the agent are trained simultaneously.

In this post, you can find the reward curves, an analysis of the results and several videos of the agents solving a variety of tasks where the only available information comes directly from IMAGES.

NaSA-TD3 Architecture

One of the objectives set in our implementation is to have an algorithm that is easy to train. That is why we present this architecture, comprising an encoder, a decoder and an actor-critic structure.

The encoder network consists of four convolutional layers with 32 filters, a kernel size of 3x3 and ReLU as the activation function. The output of the convolutional layers is flattened and routed to a fully connected layer followed by a normalization layer with a Tanh activation function. The decoder network is a deconvolutional mirror of the encoder with Sigmoid as the final activation function. The TD3 network consists of an actor network and two critic networks. All three networks have two hidden fully connected layers with 1024 nodes each and ReLU activations. The actor uses Tanh as the activation function of its output layer. The predictive ensemble model has two hidden layers with 512 nodes and ReLU, and its output layer has the size of the latent z vector.
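For readers who prefer code, below is a minimal PyTorch sketch of these networks. The image resolution (84x84x3), the convolutional strides and the latent size (50) are assumptions for illustration, not values taken from the paper; the decoder is omitted since it simply mirrors the encoder with transposed convolutions and a final Sigmoid.

```python
# Minimal sketch of the NaSA-TD3 networks described above (illustrative only).
import torch
import torch.nn as nn

LATENT_DIM = 50  # assumed latent z size

class Encoder(nn.Module):
    def __init__(self, latent_dim=LATENT_DIM):
        super().__init__()
        # Four 3x3 convolutions with 32 filters and ReLU (strides assumed).
        self.convs = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
        )
        # 35x35 spatial size follows from the assumed 84x84 input and strides.
        self.fc = nn.Linear(32 * 35 * 35, latent_dim)
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, obs):
        h = self.convs(obs).flatten(start_dim=1)
        return torch.tanh(self.norm(self.fc(h)))  # Tanh after the normalization layer

class Actor(nn.Module):
    def __init__(self, action_dim, latent_dim=LATENT_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, action_dim), nn.Tanh(),  # Tanh output layer
        )

    def forward(self, z):
        return self.net(z)

class Critic(nn.Module):
    def __init__(self, action_dim, latent_dim=LATENT_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, 1),
        )

    def forward(self, z, action):
        return self.net(torch.cat([z, action], dim=-1))
```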

The solid lines represent the forward pass, while the dotted lines give a graphic representation of how and where the gradients are propagated. Notice that the actor's gradients are not allowed to update the encoder network, whereas the critic's gradients update both the critic networks and the encoder; the same applies to the decoder network.
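The sketch below shows one way this gradient routing could be implemented in PyTorch, continuing the networks sketched above. The TD target, the mean-squared reconstruction loss and the optimizer grouping are illustrative placeholders, not our exact implementation.

```python
import torch.nn.functional as F

def update_step(encoder, decoder, actor, critic_1, critic_2,
                obs, action, td_target,
                critic_opt, actor_opt, ae_opt):
    """One illustrative update showing where gradients are allowed to flow.

    Assumed optimizer grouping: critic_opt holds the critics and the encoder,
    actor_opt holds the actor only, ae_opt holds the encoder and the decoder.
    """
    # Critic update: gradients reach both critics AND the encoder.
    z = encoder(obs)
    critic_loss = (F.mse_loss(critic_1(z, action), td_target) +
                   F.mse_loss(critic_2(z, action), td_target))
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: the latent is detached, so the actor's gradients
    # never reach the encoder.
    z_detached = encoder(obs).detach()
    actor_loss = -critic_1(z_detached, actor(z_detached)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Reconstruction update: the decoder's loss also trains the encoder
    # (MSE used here as a placeholder reconstruction objective).
    recon_loss = F.mse_loss(decoder(encoder(obs)), obs)
    ae_opt.zero_grad()
    recon_loss.backward()
    ae_opt.step()
```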

Novelty detection diagram. At each time step, an observation is passed to the encoder. The decoder receives the latent representation z and creates a reconstruction of the original observation. The SSIM is then calculated between the reconstruction and the original observation.
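The sketch below shows one way this novelty signal could be computed with scikit-image's SSIM. Mapping SSIM to a reward as 1 - SSIM is an illustrative choice, not necessarily the exact formula used in our implementation.

```python
# Illustrative SSIM-based novelty signal (not the exact paper formula).
import torch
from skimage.metrics import structural_similarity

def novelty_reward(encoder, decoder, obs):
    """obs: float image tensor in [0, 1] with shape (C, H, W)."""
    with torch.no_grad():
        recon = decoder(encoder(obs.unsqueeze(0))).squeeze(0)
    original = obs.permute(1, 2, 0).cpu().numpy()        # HWC layout for skimage
    reconstruction = recon.permute(1, 2, 0).cpu().numpy()
    ssim = structural_similarity(original, reconstruction,
                                 channel_axis=-1, data_range=1.0)
    return 1.0 - ssim  # poor reconstruction => high novelty
```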

Ensemble of Predictive Models Architecture. Each model predicts the next latent representation zt+1, and the mean of the predictions is then calculated.
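A minimal sketch of such an ensemble is shown below. Feeding the current latent z together with the action, and using five ensemble members, are assumptions made for illustration.

```python
# Illustrative predictive ensemble: each member maps (z_t, action) to a
# prediction of z_{t+1}; the mean over members is returned.
import torch
import torch.nn as nn

class PredictiveModel(nn.Module):
    def __init__(self, latent_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, latent_dim),  # output has the size of the latent z
        )

    def forward(self, z, action):
        return self.net(torch.cat([z, action], dim=-1))

class PredictiveEnsemble(nn.Module):
    def __init__(self, latent_dim, action_dim, num_models=5):
        super().__init__()
        self.models = nn.ModuleList(
            PredictiveModel(latent_dim, action_dim) for _ in range(num_models))

    def forward(self, z, action):
        preds = torch.stack([m(z, action) for m in self.models])  # (M, B, latent)
        return preds.mean(dim=0)  # mean prediction of z_{t+1}
```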

To determine the best latent size z, we ran a trial-and-error experiment in which we trained on the same task under the same conditions, changing only the size of the latent vector z.

Experiments

Ball in the cup

Cartpole Balance

Finger Spin

Reacher

Cheetah Run 


Simulated Environments

We train and test our algorithm on six complex continuous-control tasks from the DeepMind Control Suite. Some of these environments have a very sparse reward, while others are complex to solve or involve contact or balance.


Walker

Results From Simulation

Real-World Experiments

Illustration of the manipulation task we are solving

Dexterous Manipulation Task

Why this task?

This task seems easy to solve; in fact, it is FOR HUMANS. However, when analyzed closely, it is a complex task that involves motor and visual coordination: the fingers must move in synchrony to avoid cancelling out the rotational movement. In addition, the agent must identify the position of the valve and rotate it to the desired position.

And we are trying to solve this with a low-cost gripper.

4 DoF Gripper

Our Gripper

Our 4-DoF robot gripper has two identical fingers equipped with Dynamixel XL-320 servomotors and a standard webcam placed on top of the structure. Additionally, an adjustable structure allows the camera or the gripper to be moved to different positions.


If you would like to replicate this gripper for your experiments, the STL files and instructions for 3D printing can be found by clicking the button below.

Results - Real Robot

Performance curves during training. We regularly evaluate the agent's performance during training: every 10K training steps, we compute the average reward over ten episodes. Training is done end-to-end on the real robot for both scenarios (valve reset and valve no-reset), without pre-trained models or human demonstrations.
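For clarity, this evaluation protocol can be summarized with the sketch below. The agent and environment API names are placeholders and do not correspond to our actual code.

```python
# Illustrative evaluation protocol: every 10K training steps, average the
# episode return over ten evaluation episodes (API names are placeholders).
EVAL_INTERVAL = 10_000
EVAL_EPISODES = 10

def evaluate(agent, env, episodes=EVAL_EPISODES):
    returns = []
    for _ in range(episodes):
        obs, done, episode_return = env.reset(), False, 0.0
        while not done:
            action = agent.select_action(obs, exploration=False)
            obs, reward, done = env.step(action)
            episode_return += reward
        returns.append(episode_return)
    return sum(returns) / len(returns)
```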

Top view: valve no-reset

Top view: valve reset

Note: ArUco markers are not used in this experiment. These markers are part of the robot setup, but they are not sensed or used in our implementation.

Hyperparameters and Training Algorithm

Citation
If you use the code, data or files from this blog or our paper in your project, please kindly star our repo and cite our work as: