Team Members: Pooja Sharma, Darshan A.R., Pratheek S Upadhyaya.
The paper we are reviewing, “Playing Atari with Deep Reinforcement Learning” [1], combines deep learning models with reinforcement learning to successfully learn control strategies directly from high-dimensional sensory input. There have been several attempts in the past to combine deep neural networks with reinforcement learning, but this work appears to be the first to reach a meaningful level of performance, with almost human-level play. Specifically, the paper combines a convolutional neural network with Q-learning [6] and introduces a new algorithm called the Deep Q-Network (DQN). This algorithm demonstrates how an agent can learn to play games just by observing the screen, without any prior information about those games.
Before we look at the algorithm, let us discuss the results reported in the paper. They come from applying the proposed algorithm to games from the Atari 2600 console. The paper considers 7 Atari games and uses the newly developed algorithm to make the network learn to play them. On 6 of the 7 games tested, the algorithm exceeded previous methods, and on several it exceeded human-level performance. The tunneling strategy the agent discovers (starting at around 2 minutes and 5 seconds in the demonstration video) is a real masterpiece.
In our experiment, we have taken 2 of the 7 games discussed in the paper and have tried to reproduce the results reported there.
Software Requirements
Before we discuss the implementation of DQN, we will describe the software required to construct and run the network.
OpenAI Gym:
Gym is a library that can simulate many reinforcement learning environments, including Atari games. In our implementation we use the ‘Deterministic-v4’ environments, since this version closely matches the setup used by the authors: ‘Deterministic-v4’ implements the frame skipping mentioned in the paper, so we did not have to implement it ourselves.
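As a minimal sketch (assuming the classic Gym API, in which step() returns four values), this is how one of these deterministic environments is created and queried; Breakout is shown as the example game:

```python
import gym

# Create the frame-skipping Atari environment (Breakout shown as an example).
env = gym.make('BreakoutDeterministic-v4')

obs = env.reset()                               # raw 210x160x3 RGB frame
action = env.action_space.sample()              # pick a random valid action
obs, reward, done, info = env.step(action)      # one agent step (frames are skipped internally)
print(env.action_space.n)                       # number of valid actions for this game
```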
Keras: We use Keras mainly to implement the neural network that predicts the Q-value at each state.
TensorFlow: Keras uses TensorFlow as its backend.
Scikit-Image: This library is used in the preprocessing section to downsample the images generated by the gym environment.
Hardware Specifications:
We used Google Colaboratory to carry out our simulations. Its specifications were:
GPU: 1x Tesla K80 with 2496 CUDA cores, compute capability 3.7, 12 GB (11.439 GB usable) GDDR5 VRAM
CPU: 1x single-core, hyper-threaded (1 core, 2 threads) Xeon processor @ 2.3 GHz (no Turbo Boost), 45 MB cache
RAM: ~12.6 GB available
Disk: ~320 GB available
Preprocessing:
1. Convert the RGB frame to grayscale and downsample it from 210x160 to 110x84 pixels.
2. Crop the image to an 84x84 region that captures the playing area.
3. The DQN algorithm applies this preprocessing to the last 4 frames and stacks them to form the input to the Q-function.
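A minimal sketch of this preprocessing using scikit-image is shown below. The exact crop offsets are our own illustrative assumption; the paper only states that an 84x84 region covering the playing area is kept.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.transform import resize

def preprocess(frame):
    gray = rgb2gray(frame)                              # 210x160x3 RGB -> 210x160 grayscale
    small = resize(gray, (110, 84), mode='constant')    # downsample to 110x84
    return small[18:102, :].astype(np.float32)          # crop an 84x84 playing area (assumed offsets)

def stack_frames(last_four_frames):
    # Stack the 4 most recent preprocessed frames into the 84x84x4 network input.
    return np.stack([preprocess(f) for f in last_four_frames], axis=-1)
```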
Network Architecture:
INPUT - an image of size 84x84x4 produced by the preprocessing step.
1st HIDDEN LAYER - 16 filters of size 8x8 with stride 4, followed by a rectifier non-linearity.
2nd HIDDEN LAYER - 32 filters of size 4x4 with stride 2, followed by a rectifier non-linearity.
FINAL HIDDEN LAYER - fully connected layer of 256 rectifier units.
OUTPUT LAYER - fully connected linear layer with a single output (the Q-value) for each valid action; the number of valid actions varies between 4 and 18 depending on the game.
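The following is a minimal sketch of this network in Keras (tf.keras). The paper trains with RMSProp; the learning rate shown here is an assumption on our part, and the loss could equally be the Huber loss discussed later.

```python
from tensorflow.keras import layers, models, optimizers

def build_q_network(num_actions):
    model = models.Sequential([
        layers.Conv2D(16, kernel_size=8, strides=4, activation='relu',
                      input_shape=(84, 84, 4)),              # stacked preprocessed frames
        layers.Conv2D(32, kernel_size=4, strides=2, activation='relu'),
        layers.Flatten(),
        layers.Dense(256, activation='relu'),                 # final hidden layer
        layers.Dense(num_actions, activation='linear'),       # one Q-value per valid action
    ])
    model.compile(optimizer=optimizers.RMSprop(learning_rate=0.00025),  # assumed learning rate
                  loss='mse')                                 # or the Huber loss discussed later
    return model
```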
Experience Replay: This is the key mechanism that prevents DQN from diverging, unlike previous methods. Experience replay allows greater sample efficiency, since each stored step can potentially be used in many weight updates. Secondly, it breaks the correlation between consecutive samples by selecting them at random from the buffer. This is very important because it reduces the variance between updates and increases efficiency. The downside is that a lot of memory is required to implement it.
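A minimal sketch of such a replay buffer is shown below; the capacity and batch size are illustrative (the paper stores the last one million transitions and samples minibatches of 32).

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)   # old transitions are discarded automatically

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between consecutive transitions.
        return random.sample(self.buffer, batch_size)
```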
Rewards: Here rewards are clipped: all positive rewards are set to +1 and all negative rewards to -1, while rewards of 0 are left unchanged. Though effective, this is limited because it does not differentiate between rewards of different magnitudes. A more effective way to handle this was later found to be the Huber loss [4] (equivalently, clipping the error term).
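A one-line sketch of this clipping (positive rewards become +1, negative rewards become -1, zero stays 0):

```python
import numpy as np

def clip_reward(reward):
    # sign() maps positive rewards to +1, negative rewards to -1, and leaves 0 at 0.
    return float(np.sign(reward))
```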
TD-Gammon [2] trained a multi-layer perceptron and applied it to the game of backgammon, achieving human-level results. However, the same approach did not perform well on other games such as chess and Go, which led people to believe that the algorithm was only suited to the special case of backgammon and was not general.
Essentially, neural networks are used to approximate a non-linear function. Combining a model-free algorithm such as Q-learning with a non-linear function approximator can easily cause the Q-network to diverge. Hence, most prior work used linear function approximation, which has better convergence guarantees.
NFQ [3] optimizes the loss function using the RPROP algorithm to update the network parameters. However, its computational cost is very high compared to the DQN algorithm, since it uses a batch update over the entire dataset rather than stochastic gradient updates.
Another difference is that NFQ separates feature extraction from reinforcement learning: features are extracted before NFQ is trained. DQN, in contrast, is end-to-end, enabling it to learn features that are directly relevant to discriminating action-values.
Huber Loss: The Huber loss is quadratic for small values and linear for large values. MSE is quadratic and emphasizes minimizing large errors, but this can have a perverse effect in DQN: if the error is large, the network changes drastically to minimize it, which in turn radically changes the target value (since the target is computed from the same network), so the error may not decrease as much as expected. An alternative loss function is the mean absolute error (MAE), which prevents such radical changes, but it does not penalize large errors more and is not differentiable at 0. The Huber loss incorporates the advantages of both MSE and MAE.
The Huber loss function is given by the following formula:
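$$
L_\delta(a) =
\begin{cases}
\tfrac{1}{2}a^2 & \text{for } |a| \le \delta \\
\delta\left(|a| - \tfrac{1}{2}\delta\right) & \text{otherwise}
\end{cases}
$$

where a is the TD error and δ is the threshold at which the loss switches from quadratic to linear; clipping the error term to [-1, 1] as in [4] is equivalent to using the Huber loss with δ = 1.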
Target Network:
DQN depends on the current policy to sample actions, but this policy keeps changing as we explore more and learn more about the ground-truth values of states and actions. This means that our target outputs are also changing, which can lead to instability and divergence, since we are trying to learn a mapping for constantly changing inputs and outputs. The target network lets us fix the Q-value targets temporarily, so we do not have a moving target to chase, and this leads to better performance.
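Below is a minimal sketch of how the target network is used: targets for a sampled minibatch are computed with the frozen target network, whose weights are copied from the online network only every few thousand steps. The discount factor and sync interval are illustrative assumptions, and q_net / target_net are assumed to be two Keras models with the architecture sketched earlier.

```python
import numpy as np

GAMMA = 0.99            # discount factor (assumed value)
SYNC_EVERY = 10_000     # steps between target-network updates (assumed value)

def compute_targets(batch, q_net, target_net):
    states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
    targets = q_net.predict(states, verbose=0)                  # current estimates
    next_q = target_net.predict(next_states, verbose=0)         # frozen target-network estimates
    targets[np.arange(len(actions)), actions] = (
        rewards + GAMMA * next_q.max(axis=1) * (1.0 - dones))   # TD target for the taken action
    return states, targets                                      # fed to q_net via fit / train_on_batch

def maybe_sync(step, q_net, target_net):
    if step % SYNC_EVERY == 0:
        target_net.set_weights(q_net.get_weights())             # fix the targets for the next interval
```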
DDQN [5] is an improvement over DQN. Instead of letting the target network both select and evaluate the maximizing action when computing the target, the online network selects the greedy action and the target network evaluates it, since the max operation creates a positive bias in the Q estimates.
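For comparison, here is a sketch of the Double DQN target under the same assumptions as above: the online network (q_net) picks the greedy action and the target network evaluates it.

```python
import numpy as np

def double_dqn_targets(rewards, next_states, dones, q_net, target_net, gamma=0.99):
    # The online network selects the greedy action for each next state...
    best_actions = q_net.predict(next_states, verbose=0).argmax(axis=1)
    # ...and the target network evaluates that action, reducing the positive bias of the max.
    next_q = target_net.predict(next_states, verbose=0)
    chosen = next_q[np.arange(len(rewards)), best_actions]
    return rewards + gamma * chosen * (1.0 - dones)
```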
DQN works best when the action space is discrete and finite. Recently, many algorithms have emerged that handle continuous action spaces, such as actor-critic and policy-gradient methods.
[Figures: training graphs for the two games, and gameplay GIFs under a random policy vs. the trained policy for each game.]
As we can see from the above graphs, our results match those presented in the paper. The authors take the average reward over a number of games as their performance metric and compare it with human players and other methods such as SARSA and HNeat Pixel. DQN outperformed humans on games like Pong, Breakout and Enduro, but fell short on the other games, since these require the network to find a strategy that extends over long time scales. Their approach gave state-of-the-art results in six of the seven games it was tested on, with no adjustment of the architecture or hyperparameters.
In our implementation of the two games we considered (Pong and Breakout), we were able to come satisfactorily close to the results obtained by DeepMind. Each epoch we considered was equivalent to 50,000 parameter updates, or 200,000 frames. We can see from the graphs that we attained a score of 21 in Pong (the maximum score) as the number of epochs increased. On Breakout we attained a score close to 250, which is in the vicinity of the score obtained by DeepMind. The average Q for Breakout has been plotted for the sake of comparison. The GIFs show how the agent plays before training (random policy) and after training with the DQN algorithm.
References
[1] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[2] Tesauro, G. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, 1995.
[3] Riedmiller, M. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In Machine Learning: ECML 2005, pages 317–328. Springer, 2005.
[4] Mnih, V., et al. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
[5] van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. arXiv preprint arXiv:1509.06461, 2015.
[6] Watkins, C. J. C. H., and Dayan, P. Q-learning. Machine Learning, 8(3–4):279–292, 1992.
[7] Sutton, R. S., and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 1998.