# DEMYSTIFYING DEEP REINFORCEMENT LEARNING

This is the part 1 of my series on deep reinforcement learning. See part 2 “Deep Reinforcement Learning with Neon” for an actual implementation with Neon deep learning toolkit.

Today, exactly two years ago, a small company in London called DeepMind uploaded their pioneering paper “Playing Atari with Deep Reinforcement Learning” to Arxiv. In this paper they demonstrated how a computer learned to play Atari 2600 video games by observing just the screen pixels and receiving a reward when the game score increased. The result was remarkable, because the games and the goals in every game were very different and designed to be challenging for humans. The same model architecture, without any change, was used to learn seven different games, and in three of them the algorithm performed even better than a human!

It has been hailed since then as the first step towards general artificial intelligence – an AI that can survive in a variety of environments, instead of being confined to strict realms such as  playing chess. No wonder DeepMind was immediately bought by Google and has been on the forefront of deep learning research ever since. In February 2015 their paper “Human-level control through deep reinforcement learning” was featured on the cover of Nature, one of the most prestigious journals in science. In this paper they applied the same model to 49 different games and achieved superhuman performance in half of them.

Still, while deep models for supervised and unsupervised learning have seen widespread adoption in the community, deep reinforcement learning has remained a bit of a mystery. In this blog post I will be trying to demystify this technique and understand the rationale behind it. The intended audience is someone who already has background in machine learning and possibly in neural networks, but hasn’t had time to delve into reinforcement learning yet.

1. What are the main challenges in reinforcement learning? We will cover the credit assignment problem and the exploration-exploitation dilemma here.
2. How to formalize reinforcement learning in mathematical terms? We will define Markov Decision Process and use it for reasoning about reinforcement learning.
3. How do we form long-term strategies? We define “discounted future reward”, that forms the main basis for the algorithms in the next sections.
4. How can we estimate or approximate the future reward? Simple table-based Q-learning algorithm is defined and explained here.
5. What if our state space is too big? Here we see how Q-table can be replaced with a (deep) neural network.
6. What do we need to make it actually work? Experience replay technique will be discussed here, that stabilizes the learning with neural networks.
7. Are we done yet? Finally we will consider some simple solutions to the exploration-exploitation problem.

# Reinforcement Learning

Consider the game Breakout. In this game you control a paddle at the bottom of the screen and have to bounce the ball back to clear all the bricks in the upper half of the screen. Each time you hit a brick, it disappears and your score increases – you get a reward.

Figure 1: Atari Breakout game. Image credit: DeepMind.

Suppose you want to teach a neural network to play this game. Input to your network would be screen images, and output would be three actions: left, right or fire (to launch the ball). It would make sense to treat it as a classification problem – for each game screen you have to decide, whether you should move left, right or press fire. Sounds straightforward? Sure, but then you need training examples, and a lots of them. Of course you could go and record game sessions using expert players, but that’s not really how we learn. We don’t need somebody to tell us a million times which move to choose at each screen. We just need occasional feedback that we did the right thing and can then figure out everything else ourselves.

This is the task reinforcement learning tries to solve. Reinforcement learning lies somewhere in between supervised and unsupervised learning. Whereas in supervised learning one has a target label for each training example and in unsupervised learning one has no labels at all, in reinforcement learning one has sparse and time-delayed labels – the rewards. Based only on those rewards the agent has to learn to behave in the environment.

While the idea is quite intuitive, in practice there are numerous challenges. For example when you hit a brick and score a reward in the Breakout game, it often has nothing to do with the actions (paddle movements) you did just before getting the reward. All the hard work was already done, when you positioned the paddle correctly and bounced the ball back. This is called the credit assignment problem – i.e., which of the preceding actions were responsible for getting the reward and to what extent.

Once you have figured out a strategy to collect a certain number of rewards, should you stick with it or experiment with something that could result in even bigger rewards? In the above Breakout game a simple strategy is to move to the left edge and wait there. When launched, the ball tends to fly left more often than right and you will easily score on about 10 points before you die. Will you be satisfied with this or do you want more? This is called the explore-exploit dilemma – should you exploit the known working strategy or explore other, possibly better strategies.

Reinforcement learning is an important model of how we (and all animals in general) learn. Praise from our parents, grades in school, salary at work – these are all examples of rewards. Credit assignment problems and exploration-exploitation dilemmas come up every day both in business and in relationships. That’s why it is important to study this problem, and games form a wonderful sandbox for trying out new approaches.

# Markov Decision Process

Now the question is, how do you formalize a reinforcement learning problem, so that you can reason about it? The most common method is to represent it as a Markov decision process.

Suppose you are an agent, situated in an environment (e.g. Breakout game). The environment is in a certain state (e.g. location of the paddle, location and direction of the ball, existence of every brick and so on). The agent can perform certain actions in the environment (e.g. move the paddle to the left or to the right). These actions sometimes result in a reward (e.g. increase in score). Actions transform the environment and lead to a new state, where the agent can perform another action, and so on. The rules for how you choose those actions are called policy. The environment in general is stochastic, which means the next state may be somewhat random (e.g. when you lose a ball and launch a new one, it goes towards a random direction).

Figure 2: Left: reinforcement learning problem. Right: Markov decision process. Image credit: Wikipedia.

The set of states and actions, together with rules for transitioning from one state to another and for getting rewards, make up a Markov decision process. One episode of this process (e.g. one game) forms a finite sequence of states, actions and rewards:

s0,a0,r1,s1,a1,r2,s2,,sn1,an1,rn,sn

Here si represents the state, ai is the action and ri+1 is the reward after performing the action. The episode ends with terminal state sn (e.g. “game over” screen). A Markov decision process relies on the Markov assumption, that the probability of the next state si+1 depends only on current state si and performed action ai, but not on preceding states or actions.

# Discounted Future Reward

To perform well in long-term, we need to take into account not only the immediate rewards, but also the future awards we are going to get. How should we go about that?

Given one run of Markov decision process, we can easily calculate the total reward for one episode:

R=r1+r2+r3++rn

Given that, the total future reward from time point t onward can be expressed as:

Rt=rt+rt+1+rt+2++rn

But because our environment is stochastic, we can never be sure, if we will get the same rewards the next time we perform the same actions. The more into the future we go, the more it may diverge. For that reason it is common to use discounted future reward instead:

Rt=rt+γrt+1+γ2rt+2+γntrn

Here γ is the discount factor between 0 and 1 – the more into the future the reward is, the less we take it into consideration. It is easy to see, that discounted future reward at time step t can be expressed in terms of the same thing at time step t+1:

Rt=rt+γ(rt+1+γ(rt+2+))=rt+γRt+1

If we set the discount factor γ=0, then our strategy will be short-sighted and we rely only on the immediate rewards. If we want to balance between immediate and future rewards, we should set discount factor to something like γ=0.9. If our environment is deterministic and the same actions always result in same rewards, then we can set discount factor γ=1.

A good strategy for an agent would be to always choose an action, that maximizes the discounted future reward.

# Q-learning

In Q-learning we define a function Q(s,a) representing the discounted future reward when we perform action a in state s, and continue optimally from that point on.

Q(st,at)=maxπRt+1

The way to think about Q(s,a) is that it is “the best possible score at the end of game after performing action a in state s”. It is called Q-function, because it represents the “quality” of certain action in given state.

This may sound quite a puzzling definition. How can we estimate the score at the end of game, if we know just current state and action, and not the actions and rewards coming after that? We really can’t. But as a theoretical construct we can assume existence of such a function. Just close your eyes and repeat to yourself five times: “Q(s,a) exists, Q(s,a)exists, …”. Feel it?

If you’re still not convinced, then consider what the implications of having such a function would be. Suppose you are in state s and pondering whether you should take action a or b. You want to select the action, that results in the highest score at the end of game. Once you have the magical Q-function, the answer becomes really simple – pick the action with the highest Q-value!

π(s)=argmaxaQ(s,a)

Here π represents the policy, the rule how we choose an action in each state.

OK, how do we get that Q-function then? Let’s focus on just one transition <s,a,r,s>. Just like with discounted future rewards in previous section we can express Q-value of state s and action a in terms of Q-value of next state s.

Q(s,a)=r+γmaxaQ(s,a)

This is called the Bellman equation. If you think about it, it is quite logical – maximum future reward for this state and action is the immediate reward plus maximum future reward for the next state.

The main idea in Q-learning is that we can iteratively approximate the Q-function using the Bellman equation. In the simplest case the Q-function is implemented as a table, with states as rows and actions as columns. The gist of Q-learning algorithm is as simple as the following:

initialize Q[numstates,numactions] arbitrarily
observe initial state s
repeat
select and carry out an action a
observe reward r and new state s'
Q[s,a] = Q[s,a] + α(r + γmaxa' Q[s',a'] - Q[s,a])
s = s'
until terminated


α in the algorithm is a learning rate that controls how much of the difference between previous Q-value and newly proposed Q-value is taken into account. In particular, when α=1, then two Q[s,a]-s cancel and the update is exactly the same as Bellman equation.

maxa’ Q[s',a'] that we use to update Q[s,a] is only an estimation and in early stages of learning it may be completely wrong. However the estimations get more and more accurate with every iteration and it has been shown, that if we perform this update enough times, then the Q-function will converge and represent the true Q-value.

# Deep Q Network

The state of the environment in the Breakout game can be defined by the location of the paddle, location and direction of the ball and the existence of each individual brick. This intuitive representation is however game specific. Could we come up with something more universal, that would be suitable for all the games? Obvious choice is screen pixels – they implicitly contain all of the relevant information about the game situation, except for the speed and direction of the ball. Two consecutive screens would have these covered as well.

If we would apply the same preprocessing to game screens as in the DeepMind paper – take four last screen images, resize them to 84×84 and convert to grayscale with 256 gray levels – we would have 25684×84×41067970 possible game states. This means 1067970 rows in our imaginary Q-table – that is more than the number of atoms in the known universe! One could argue that many pixel combinations and therefore states never occur – we could possibly represent it as a sparse table containing only visited states. Even so, most of the states are very rarely visited and it would take a lifetime of the universe for the Q-table to converge. Ideally we would also like to have a good guess for Q-values for states we have never seen before.

This is the point, where deep learning steps in. Neural networks are exceptionally good in coming up with good features for highly structured data. We could represent our Q-function with a neural network, that takes the state (four game screens) and action as input and outputs the corresponding Q-value. Alternatively we could take only game screens as input and output the Q-value for each possible action. This approach has the advantage, that if we want to perform a Q-value update or pick the action with highest Q-value, we only have to do one forward pass through the network and have all Q-values for all actions immediately available.

Figure 3: Left: Naive formulation of deep Q-network. Right: More optimal architecture of deep Q-network, used in DeepMind paper.

The network architecture that DeepMind used is as follows:

 Layer Input Filter size Stride Num filters Activation Output conv1 84x84x4 8×8 4 32 ReLU 20x20x32 conv2 20x20x32 4×4 2 64 ReLU 9x9x64 conv3 9x9x64 3×3 1 64 ReLU 7x7x64 fc4 7x7x64 512 ReLU 512 fc5 512 18 Linear 18

This is a classical convolutional neural network with three convolutional layers, followed by two fully connected layers. People familiar with object recognition networks may notice that there are no pooling layers. But if you really think about that, then pooling layers buy you a translation invariance – the network becomes insensitive to the location of an object in the image. That makes perfectly sense for a classification task like ImageNet, but for games the location of the ball is crucial in determining the potential reward and we wouldn’t want to discard this information!

Input to the network are four 84×84 grayscale game screens. Outputs of the network are Q-values for each possible action (18 in Atari). Q-values can be any real values, which makes it a regression task, that can be optimized with a simple squared error loss.

L=12[r+γmaxaQ(s,a)targetQ(s,a)prediction]2

Given a transition <s,a,r,s>, the Q-table update rule in the previous algorithm must be replaced with the following:

1. Do a feedforward pass for the current state s to get predicted Q-values for all actions.
2. Do a feedforward pass for the next state s and calculate maximum over all network outputs maxaQ(s,a).
3. Set Q-value target for action a to r+γmaxaQ(s,a) (use the max calculated in step 2). For all other actions, set the Q-value target to the same as originally returned from step 1, making the error 0 for those outputs.
4. Update the weights using backpropagation.

# Experience Replay

By now we have an idea how to estimate the future reward in each state using Q-learning and approximate the Q-function using a convolutional neural network. But it turns out that approximation of Q-values using non-linear functions is not very stable. There is a whole bag of tricks that you have to use to actually make it converge. And it takes a long time, almost a week on a single GPU.

The most important trick is experience replay. During gameplay all the experiences <s,a,r,s> are stored in a replay memory. When training the network, random samples from the replay memory are used instead of the most recent transition. This breaks the similarity of subsequent training samples, which otherwise might drive the network into a local minimum. Also experience replay makes the training task more similar to usual supervised learning, which simplifies debugging and testing the algorithm. One could actually collect all those experiences from human gameplay and the train network on these.

# Exploration-Exploitation

Q-learning attempts to solve the credit assignment problem – it propagates rewards back in time, until it reaches the crucial decision point which was the actual cause for the obtained reward. But we haven’t touched the exploration-exploitation dilemma yet…

Firstly observe, that when a Q-table or Q-network is initialized randomly, then its predictions are initially random as well. If we pick an action with the highest Q-value, the action will be random and the agent performs crude “exploration”. As a Q-function converges, it returns more consistent Q-values and the amount of exploration decreases. So one could say, that Q-learning incorporates the exploration as part of the algorithm. But this exploration is “greedy”, it settles with the first effective strategy it finds.

A simple and effective fix for the above problem is ε-greedy exploration – with probability ε choose a random action, otherwise go with the “greedy” action with the highest Q-value. In their system DeepMind actually decreases ε over time from 1 to 0.1 – in the beginning the system makes completely random moves to explore the state space maximally, and then it settles down to a fixed exploration rate.

# Deep Q-learning Algorithm

This gives us the final deep Q-learning algorithm with experience replay:

initialize replay memory D
initialize action-value function Q with random weights
observe initial state s
repeat
select an action a
with probability ε select a random action
otherwise select a = argmaxa’Q(s,a’)
carry out action a
observe reward r and new state s’
store experience <s, a, r, s’> in replay memory D

sample random transitions <ss, aa, rr, ss’> from replay memory D
calculate target for each minibatch transition
if ss’ is terminal state then tt = rr
otherwise tt = rr + γmaxa’Q(ss’, aa’)
train the Q network using (tt - Q(ss, aa))^2 as loss

s = s'
until terminated

There are many more tricks that DeepMind used to actually make it work – like target network, error clipping, reward clipping etc, but these are out of scope for this introduction.

The most amazing part of this algorithm is that it learns anything at all. Just think about it – because our Q-function is initialized randomly, it initially outputs complete garbage. And we are using this garbage (the maximum Q-value of the next state) as targets for the network, only occasionally folding in a tiny reward. That sounds insane, how could it learn anything meaningful at all? The fact is, that it does.

# Final notes

Many improvements to deep Q-learning have been proposed since its first introduction – Double Q-learningPrioritized Experience ReplayDueling Network Architecture and extension to continuous action space to name a few. For latest advancements check out the NIPS 2015 deep reinforcement learning workshop and ICLR 2016 (search for “reinforcement” in title). But beware, that deep Q-learning has been patented by Google.

It is often said, that artificial intelligence is something we haven’t figured out yet. Once we know how it works, it doesn’t seem intelligent any more. But deep Q-networks still continue to amaze me. Watching them figure out a new game is like observing an animal in the wild – a rewarding experience by itself.

# Credits

Thanks to Ardi Tampuu, Tanel Pärnamaa, Jaan Aru, Ilya Kuzovkin, Arjun Bansal and Urs Köster for comments and suggestions on the drafts of this post.

# DEEP REINFORCEMENT LEARNING WITH NEON

This is the part 2 of my series on deep reinforcement learning. See part 1 “Demystifying Deep Reinforcement Learning” for an introduction to the topic.

The first time we read DeepMind’s paper “Playing Atari with Deep Reinforcement Learning” in our research group, we immediately knew that we wanted to replicate this incredible result. It was the beginning of 2014, cuda-convnet2 was the top-performing convolutional network implementation and RMSProp was just one slide in Hinton’s Coursera course. We struggled with debugging, learned a lot, but when DeepMind also published their code alongside their Nature paper “Human-level control through deep reinforcement learning”, we started working on their code instead.

The deep learning ecosystem has evolved a lot since then. Supposedly a new deep learning toolkit was released once every 22 days in 2015. Amongst the popular ones are both the old-timers like TheanoTorch7 and Caffe, as well as the newcomers like NeonKeras and TensorFlow. New algorithms are getting implemented within days of publishing.

At some point I realized that all the complicated parts that caused us headaches a year earlier are now readily implemented in most deep learning toolkits. And when Arcade Learning Environment – the system used for emulating Atari 2600 games – released a native Python API, the time was right for a new deep reinforcement learning implementation. Writing the main code took just a weekend, followed by weeks of debugging. But finally it worked! You can see the result here: https://github.com/tambetm/simple_dqn

I chose to base it on Neon, because it has

I tried to keep my code simple and easy to extend, while also keeping performance in mind. Currently the most notable restriction in Neon is that it only runs on the latest nVidia Maxwell GPUs, but that’s about to change soon.

In the following I will describe

1. how to install it
2. what you can do with it
3. how does it compare to others
4. and finally how to modify it

# How To Install It?

There is not much to talk about here, just follow the instructions in the README. Basically all you need are NeonArcade Learning Environment and simple_dqn itself. For trying out pre-trained models you don’t even need a GPU, because they also run on CPU, albeit slowly.

# What You Can Do With It?

## Running a Pre-trained Model

Once you have everything installed, the first thing to try out is running pre-trained models. For example to run a pre-trained model for Breakout, type:

./play.sh snapshots/breakout_77.pkl

Or if you don’t have a (Maxwell) GPU available, then you can switch to CPU:

./play.sh snapshots/breakout_77.pkl --backend cpu

You should be seeing something like this, possibly even accompanied by annoying sounds

You can switch off the AI and take control in the game by pressing “m” on keyboard – see how long you last! You can give the control back to the AI by pressing “m” again.

It is actually quite useful to slow down the gameplay by pressing “a” repeatedly – this way you can observe what is really happening in the game. Press “s” to speed the game up again.

## Recording a Video

Just as easily you can also record a video of one game (in Breakout until 5 lives are lost):

./record.sh snapshots/breakout_77.pkl

The recorded video is stored as videos/breakout_77.mov and screenshots from the game can be found in videos/breakout. You can watch an example video here:

## Training a New Model

To train a new model, you first need an Atari 2600 ROM file for the game. Once you have the ROM, save it to roms folder and run training like this:

./train.sh roms/pong.rom

As a result of training the following files are created:

• results/pong.csv contains various statistics of the training process,
• snapshots/pong_<epoch>.pkl are model snapshots after every epoch.

## Testing a Model

During training there is a testing phase after each epoch. If you would like to re-test your pre-trained model later, you can use the testing script:

./test.sh snapshots/pong_141.pkl

It prints the testing results to console. To save the results to file, add --csv_file <filename> parameter.

## Plotting Statistics

During and after training you might like to see how your model is doing. There is a simple plotting script to produce graphs from the statistics file. For example:

./plot.sh results/pong.csv

This produces the file results/pong.png.

By default it produces four plots:

1. average score,
2. number of played games,
3. average maximum Q-value of validation set states,
4. average training loss.

For all the plots you can see random baseline (where it makes sense) and result from training phase and testing phase. You can actually plot any field from the statistics file, by listing names of the fields in the --fields parameter. The default plot is achieved with --fields average_reward,meanq,nr_games,meancost. Names of the fields can be taken from the first line of the CSV file.

## Visualizing the Filters

The most exciting thing you can do with this code is to peek into the mind of the AI. For that we are going to use guided backpropagation, that comes out-of-the-box with Neon. In simplified terms, for each convolutional filter it finds an image from a given dataset that activates this filter the most. Then it performs backpropagation with respect to the input image, to see which parts of the image affect the “activeness” of that filter most. This can be seen as a form of saliency detection.

To perform filter visualization run the following:

./nvis.sh snapshots/breakout_77.pkl

It starts by playing a game to collect sample data and then performs the guided backpropagation. The results can be found in results/breakout.html and they look like this:

There are 3 convolutional layers (named 0000, 0002 and 0004) and for this post I have visualized only 2 filters from each (Feature Map 0-1). There is also more detailed file for Breakout which has 16 filters visualized. For each filter an image was chosen that activates it the most. Right image shows the original input, left image shows the guided backpropagation result. You can think of every filter as an “eye” of the AI. The left image shows where this particular “eye” was directed to, given the image on the right. You can use mouse wheel to zoom in!

Because input to our network is a sequence of 4 grayscale images, it’s not very clear how to visualize it. I made a following simplification: I’m using only the last 3 screens of a state and putting them into different RGB color channels. So everything that is gray hasn’t changed over 3 images; blue was the most recent change, then green, then red. You can easily follow the idea if you zoom in to the ball – it’s trajectory is marked by red-green-blue. It’s harder to make sense of the backpropagation result, but sometimes you can guess that the filter tracks movement – the color from one corner to another progresses from red to green to blue.

Filter visualization is an immensely useful tool, you can immediately make interesting observations.

• The first layer filters focus on abstract patterns and cannot be reliably associated with any particular object. They often activate the most on score and lives, possibly because these have many edges and corners.
• As expected, there are filters that track the ball and the paddle. There are also filters that activate when the ball is about to hit a brick or the paddle.
• Also as expected, higher layer filters have bigger receptive fields. This is not so evident in Breakout, but can be clearly seen in this file for Pong. It’s interesting that filters in different layers are more similar in Breakout than in Pong.
The guided backpropagation is implemented in Neon as a callback, called at the end of training. I made a simple wrapperthat allows using it on pre-trained models. One advantage of using the wrapper is that it incorporates guided backpropagation and visualization into one step and doesn’t need a temporary file to write the intermediate results. But for that I needed to make few modifications to the Neon code, that are stored in the nvis folder.

# How Does It Compare To Others?

There are a few other deep reinforcement learning implementations out there and it would be interesting to see how the implementation in Neon compares to them. The most well-known is the original DeepMind code published with their Nature article. Another maintained version is deep_q_rl created by Nathan Sprague from James Madison University in Virginia.

The most common metric used in deep reinforcement learning is the average score per game. To calculate it for simple_dqn I used the same evaluation procedure as in the Nature article – average scores of 30 games, played with different initial random conditions and an ε-greedy policy with ε=0.05. I didn’t bother to implement the 5 minutes per game restriction, because games in Breakout and Pong don’t last that long. For DeepMind I used the values from Nature paper. For deep_q_rl I I asked the deep-q-learning list members to provide the numbers. Their scores are not collected using exactly the same protocol (the particular number below was average of 11 games) and may be therefore considered a little bit inflated.

Another interesting measure is the number of training steps per second. DeepMind and simple_dqn report average number of steps per second for each epoch (250000 steps). deep_q_rl reports number of steps per second on the fly and I just used a perceived average of numbers flowing over the screen  . In all cases I looked at the first epoch, where exploration rate is close to 1 and the results therefore reflect more of the training speed than the prediction speed. All tests were done on nVidia Geforce Titan X.

 Implemented in Breakout average score Pong average score Training steps per second DeepMind Lua + Torch 401.2 (±26.9) 18.9 (±1.3) 330 deep_q_rl Python + Theano + Lasagne + Pylearn2 350.7 ~20 ~300 simple_dqn Python + Neon 338.0 19.7 366

As it can be seen, in terms of speed simple_dqn compares quite favorably against the others. The learning results are not on-par with DeepMind yet, but they are close enough to run interesting experiments with it.

# How Can You Modify It?

The main idea in publishing the simple_dqn code was to show how simple the implementation could actually be and that everybody can extend it to perform interesting research.

There are four main classes in the code: EnvironmentReplayMemeoryDeepQNetwork and Agent. There is also main.py, that handles parameter parsing and instantiates the above classes, and the Statistics class that implements the basic callback mechanism to keep statistics collection separate from the main loop. But the gist of the deep reinforcement learning algorithm is implemented in the aforementioned four classes.

## Environment

Environment is just a lightweight wrapper around the A.L.E Python API. It should be easy enough to add other environments besides A.L.E, for example Flappy Bird or Torcs – you just have to implement a new Environment class. Give it a try and let me know!

## ReplayMemory

Replay memory stores state transitions or experiences. Basically it’s just four big arrays for screens, actions, rewards and terminal state indicators.

Replay memory is stored as a sequence of screens, not as a sequence of states consisting of four screens. This results in a huge decrease in memory usage, without significant loss in performance. Assembling screens into states can be done fast with Numpy array slicing.

Datatype for the screen pixels is uint8, which means 1M experiences take 6.57GB – you can run it with only 8GB of memory! Default would have been float32, which would have taken ~26GB.

If you would like to implement prioritized experience replay, then this is the main class you need to change.

## DeepQNetwork

This class implements the deep Q-network. It is actually the only class that depends on Neon.

Because deep reinforcement learning handles minibatching differently, there was no reason to use Neon’s DataIteratorclass. Therefore the lower level Model.fprop() and Model.bprop() are used instead. A few suggestions for anybody attempting to do the same thing:

1. You need to call Model.initialize() after constructing the Model. This allocates the tensors for layer activations and weights in GPU memory.
2. Neon tensors have dimensions (channels, height, width, batch_size). In particular batch size is the last dimension. This data layout allows for the fastest convolution kernels.
3. After transposing the dimensions of a Numpy array to match Neon tensor requirements, you need to make a copy of that array! Otherwise the actual memory layout hasn’t changed and the array cannot be directly copied to GPU.
4. Instead of accessing single tensor elements with array indices (like tensor[i,j]), try to copy the tensor to a Numpy array as a whole with tensor.set() and tensor.get() methods. This results in less round-trips between CPU and GPU.
5. Consider doing tensor arithmetics in GPU, the Neon backend provides nice methods for that. Also note that these operations are not immediately evaluated, but stored as a OpTree. The tensor will be actually evaluated when you use it or transfer it to CPU.

If you would like to implement double Q-learning, then this is the class that needs modifications.

## Agent

Agent class just ties everything together and implements the main loop.

# Conclusion

I have shown, that using a well-featured deep learning toolkit such as Neon, implementing deep reinforcement learning for Atari video games is a breeze. Filter visualization features in Neon provide important insights into what the model has actually learned.

While not the goal on its own, computer games provide an excellent sandbox for trying out new reinforcement learning approaches. Hopefully my simple_dqn implementation provides a stepping stone for a number of people into the fascinating area of deep reinforcement learning research.

# Credits

Thanks to Ardi Tampuu, Jaan Aru, Tanel Pärnamaa, Raul Vicente, Arjun Bansal and Urs Köster for comments and suggestions on the drafts of this post.

convolutional layers implement translation invariance too, but they don’t throw away the location information the same way as pooling layers do. The feature map produced by convolutional filter still has the information where the filter matched. This opinion about pooling layers has been expressed several times by Geoffrey Hinton:
https://www.quora.com/Why-does-Geoff-Hinton-say-that-its-surprising-that-pooling-in-CNNs-work

Other than that, using convolutional layers makes sense here, because the same object (the ball for example) can occur anywhere on the screen. It would be wasteful to learn separate “ball detector” for each part of the screen. Weight sharing saves a lot of parameters in those layers and allows learning to progress much faster.

This is not the same for all use cases – for example if your task is face recognition and input to your network is pre-cropped face, then the nose doesn’t appear anywhere on the image, it is predominantly in the center. That’s why Facebook uses locally connected layers instead of convolutional layers in their DeepFace implementation.

Arcade Learning Environment has “plugin” for each game, which knows where in memory the game keeps the score. Because Atari 2600 has only 128 bytes of RAM, figuring out the memory location and writing a plugin is not that hard. List of plugins for supported games in A.L.E. can be found here:

I don’t think reinforcement learning helps you with interpretability. Yes, you can say that this particular action was chosen because it had the highest future reward. But how it estimated this future reward still depends on what function approximation technique you use. In case of deep neural networks the guided backpropagation technique (https://github.com/Lasagne/Recipes/blob/master/examples/Saliency%20Maps%20and%20Guided%20Backpropagation.ipynb) has been found useful by some.

I’m using network architecture from their Nature paper “Human-level control through deep reinforcement

I think you are referring to the earlier NIPS workshop paper. My post was based on the later Nature article:

You can check the discussion from Reddit: