Team Members: Pooja Sharma, Darshan A.R., Pratheek S Upadhyaya.
The paper we are reviewing, “Playing Atari with Deep Reinforcement Learning” [1], combines deep learning models with reinforcement learning to successfully learn control strategies directly from high-dimensional sensory input. There have been several attempts in the past to combine deep neural networks with reinforcement learning, but this work appears to be the first to reach a meaningful level of performance, with almost human-level play. Specifically, the paper combines a convolutional neural network with Q-learning [6] and introduces a new algorithm called the Deep Q-Network (DQN). This algorithm demonstrates how an agent can learn to play games just by observing the screen, without any prior information about those games.
Before we look at the algorithm, let us discuss the results reported in the paper. They come from applying the proposed algorithm to games from the Atari 2600 console. The paper considers 7 Atari games and uses the newly developed algorithm to make the network learn to play them. On 6 of the 7 games tested, the algorithm exceeded previous methods, and on several it exceeded human-level performance. The tunneling strategy the agent discovers (starting at around 2 minutes and 5 seconds in the demonstration video) is a real masterpiece.
In our experiment, we have taken 2 of the 7 games discussed in the paper and have tried to reproduce the results reported there.
Software Requirements
Before we discuss the implementation of DQN, we will describe the software required to construct and run the network.
OpenAI Gym:
Gym is a library that can simulate many reinforcement learning environments, including Atari games. In our implementation we use the ‘Deterministic-v4’ environments, since this version closely matches the setup used by the authors: ‘Deterministic-v4’ implements the frame skipping mentioned in the paper, so we did not have to implement it ourselves.
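As a minimal sketch (assuming the classic Gym API, in which step() returns four values), this is how one of these deterministic environments is created and queried; Breakout is shown as the example game:

```python
import gym

# Create the frame-skipping Atari environment (Breakout shown as an example).
env = gym.make('BreakoutDeterministic-v4')

obs = env.reset()                               # raw 210x160x3 RGB frame
action = env.action_space.sample()              # pick a random valid action
obs, reward, done, info = env.step(action)      # one agent step (frames are skipped internally)
print(env.action_space.n)                       # number of valid actions for this game
```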
Keras: We use Keras mainly to implement the neural network that predicts the Q-value at each state.
TensorFlow: Keras uses TensorFlow as its backend.
Scikit-Image: This library is used in the preprocessing section to downsample the images generated by the gym environment.
Hardware Specifications:
We used Google Colaboratory to carry out our simulations. Its specifications were:
GPU: 1x Tesla K80 with 2496 CUDA cores, compute capability 3.7, 12 GB (11.439 GB usable) GDDR5 VRAM
CPU: 1x single-core, hyper-threaded (1 core, 2 threads) Xeon processor @ 2.3 GHz (no Turbo Boost), 45 MB cache
RAM: ~12.6 GB available
Disk: ~320 GB available
Preprocessing:
1. Convert the RGB frame to grayscale and downsample it from 210x160 to 110x84 pixels.
2. Crop the image to an 84x84 region that captures the playing area.
3. The DQN algorithm applies this preprocessing to the last 4 frames and stacks them to form the input to the Q-function.
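A minimal sketch of this preprocessing using scikit-image is shown below. The exact crop offsets are our own illustrative assumption; the paper only states that an 84x84 region covering the playing area is kept.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.transform import resize

def preprocess(frame):
    gray = rgb2gray(frame)                              # 210x160x3 RGB -> 210x160 grayscale
    small = resize(gray, (110, 84), mode='constant')    # downsample to 110x84
    return small[18:102, :].astype(np.float32)          # crop an 84x84 playing area (assumed offsets)

def stack_frames(last_four_frames):
    # Stack the 4 most recent preprocessed frames into the 84x84x4 network input.
    return np.stack([preprocess(f) for f in last_four_frames], axis=-1)
```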
Network Architecture:
INPUT - an image of size 84x84x4 produced by the preprocessing step.
1st HIDDEN LAYER - 16 filters of size 8x8 with stride 4, followed by a rectifier non-linearity.
2nd HIDDEN LAYER - 32 filters of size 4x4 with stride 2, followed by a rectifier non-linearity.
FINAL HIDDEN LAYER - fully connected layer of 256 rectifier units.
OUTPUT LAYER - fully connected linear layer with a single output (the Q-value) for each valid action; the number of valid actions varies between 4 and 18 depending on the game.
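The following is a minimal sketch of this network in Keras (tf.keras). The paper trains with RMSProp; the learning rate shown here is an assumption on our part, and the loss could equally be the Huber loss discussed later.

```python
from tensorflow.keras import layers, models, optimizers

def build_q_network(num_actions):
    model = models.Sequential([
        layers.Conv2D(16, kernel_size=8, strides=4, activation='relu',
                      input_shape=(84, 84, 4)),              # stacked preprocessed frames
        layers.Conv2D(32, kernel_size=4, strides=2, activation='relu'),
        layers.Flatten(),
        layers.Dense(256, activation='relu'),                 # final hidden layer
        layers.Dense(num_actions, activation='linear'),       # one Q-value per valid action
    ])
    model.compile(optimizer=optimizers.RMSprop(learning_rate=0.00025),  # assumed learning rate
                  loss='mse')                                 # or the Huber loss discussed later
    return model
```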
Experience Replay: This is the key mechanism that prevents DQN from diverging, unlike previous methods. Experience replay allows greater sample efficiency, since each stored step can potentially be used in many weight updates. Secondly, it breaks the correlation between consecutive samples by selecting them at random from the buffer. This is very important because it reduces the variance between updates and increases efficiency. The downside is that a lot of memory is required to implement it.
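A minimal sketch of such a replay buffer is shown below; the capacity and batch size are illustrative (the paper stores the last one million transitions and samples minibatches of 32).

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)   # old transitions are discarded automatically

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between consecutive transitions.
        return random.sample(self.buffer, batch_size)
```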
Rewards: Here rewards are clipped: all positive rewards are set to +1 and all negative rewards to -1, while rewards of 0 are left unchanged. Though effective, this is limited because it does not differentiate between rewards of different magnitudes. A more effective way to handle this was later found to be the Huber loss [4] (equivalently, clipping the error term).
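A one-line sketch of this clipping (positive rewards become +1, negative rewards become -1, zero stays 0):

```python
import numpy as np

def clip_reward(reward):
    # sign() maps positive rewards to +1, negative rewards to -1, and leaves 0 at 0.
    return float(np.sign(reward))
```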
TD-Gammon [2] trained a multi-layer perceptron and applied it to the game of backgammon, achieving human-level results. However, the same approach did not perform well on other games such as chess and Go, which led people to believe that the algorithm was only suited to the special case of backgammon and was not general.
Essentially, neural networks are used to approximate a non-linear function. Combining a model-free algorithm such as Q-learning with a non-linear function approximator can easily cause the Q-network to diverge. Hence, most prior work used linear function approximation, which has better convergence guarantees.
NFQ [3] optimizes the loss function using the RPROP algorithm to update the network parameters. However, its computational cost is very high compared to the DQN algorithm, since it uses a batch update over the entire dataset rather than stochastic gradient updates.
Another difference is that NFQ separates feature extraction from reinforcement learning: features are extracted before NFQ is trained. DQN, in contrast, is end-to-end, enabling it to learn features that are directly relevant to discriminating action-values.
Huber Loss: The Huber loss is quadratic for small values and linear for large values. MSE is quadratic and emphasizes minimizing large errors, but this can have a perverse effect in DQN: if the error is large, the network changes drastically to minimize it, which in turn radically changes the target value (since the target is computed from the same network), so the error may not decrease as much as expected. An alternative loss function is the mean absolute error (MAE), which prevents such radical changes, but it does not penalize large errors more and is not differentiable at 0. The Huber loss incorporates the advantages of both MSE and MAE.
The Huber loss function is given by the following formula:
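$$
L_\delta(a) =
\begin{cases}
\tfrac{1}{2}a^2 & \text{for } |a| \le \delta \\
\delta\left(|a| - \tfrac{1}{2}\delta\right) & \text{otherwise}
\end{cases}
$$

where a is the TD error and δ is the threshold at which the loss switches from quadratic to linear; clipping the error term to [-1, 1] as in [4] is equivalent to using the Huber loss with δ = 1.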
Target Network:
DQN depends on the current policy to sample actions, but this policy keeps changing as we explore more and learn more about the ground-truth values of states and actions. This means that our target outputs are also changing, which can lead to instability and divergence, since we are trying to learn a mapping for constantly changing inputs and outputs. The target network lets us fix the Q-value targets temporarily, so we do not have a moving target to chase, and this leads to better performance.
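Below is a minimal sketch of how the target network is used: targets for a sampled minibatch are computed with the frozen target network, whose weights are copied from the online network only every few thousand steps. The discount factor and sync interval are illustrative assumptions, and q_net / target_net are assumed to be two Keras models with the architecture sketched earlier.

```python
import numpy as np

GAMMA = 0.99            # discount factor (assumed value)
SYNC_EVERY = 10_000     # steps between target-network updates (assumed value)

def compute_targets(batch, q_net, target_net):
    states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
    targets = q_net.predict(states, verbose=0)                  # current estimates
    next_q = target_net.predict(next_states, verbose=0)         # frozen target-network estimates
    targets[np.arange(len(actions)), actions] = (
        rewards + GAMMA * next_q.max(axis=1) * (1.0 - dones))   # TD target for the taken action
    return states, targets                                      # fed to q_net via fit / train_on_batch

def maybe_sync(step, q_net, target_net):
    if step % SYNC_EVERY == 0:
        target_net.set_weights(q_net.get_weights())             # fix the targets for the next interval
```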
DDQN [5] is an improvement over DQN. Instead of letting the target network both select and evaluate the maximizing action when computing the target, the online network selects the greedy action and the target network evaluates it, since the max operation creates a positive bias in the Q estimates.
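For comparison, here is a sketch of the Double DQN target under the same assumptions as above: the online network (q_net) picks the greedy action and the target network evaluates it.

```python
import numpy as np

def double_dqn_targets(rewards, next_states, dones, q_net, target_net, gamma=0.99):
    # The online network selects the greedy action for each next state...
    best_actions = q_net.predict(next_states, verbose=0).argmax(axis=1)
    # ...and the target network evaluates that action, reducing the positive bias of the max.
    next_q = target_net.predict(next_states, verbose=0)
    chosen = next_q[np.arange(len(rewards)), best_actions]
    return rewards + gamma * chosen * (1.0 - dones)
```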
DQN works best when the action space is discrete and finite. Recently, many algorithms have emerged that handle continuous action spaces, such as actor-critic and policy-gradient methods.
[Figures: training graphs for the two games, and gameplay GIFs under a random policy vs. the trained policy for each game.]
As we can see from the above graphs, our results match those presented in the paper. The authors take the average reward over a number of games as their performance metric and compare it with human players and other methods such as SARSA and HNeat Pixel. DQN outperformed humans on games like Pong, Breakout and Enduro, but fell short on the other games, since these require the network to find a strategy that extends over long time scales. Their approach gave state-of-the-art results in six of the seven games it was tested on, with no adjustment of the architecture or hyperparameters.
In our implementation of the two games we considered (Pong and Breakout), we were able to come satisfactorily close to the results obtained by DeepMind. Each epoch we considered was equivalent to 50,000 parameter updates, or 200,000 frames. We can see from the graphs that we attained a score of 21 in Pong (the maximum score) as the number of epochs increased. On Breakout we attained a score close to 250, which is in the vicinity of the score obtained by DeepMind. The average Q for Breakout has been plotted for the sake of comparison. The GIFs show how the agent plays before training (random policy) and after training with the DQN algorithm.
References
[1] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[2] Tesauro, G. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, 1995.
[3] Riedmiller, M. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In Machine Learning: ECML 2005, pages 317–328. Springer, 2005.
[4] Mnih, V., et al. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
[5] van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. arXiv preprint arXiv:1509.06461, 2015.
[6] Watkins, C. J. C. H., and Dayan, P. Q-learning. Machine Learning, 8(3–4):279–292, 1992.
[7] Sutton, R. S., and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 1998.