References: https://github.com/dennybritz/reinforcement-learning

## Overview

This repository provides code, exercises and solutions for popular Reinforcement Learning algorithms. These are meant to serve as a learning tool to complement the theoretical materials from Sutton & Barto's textbook "Reinforcement Learning: An Introduction" (2nd Edition) and David Silver's Reinforcement Learning Course. Each folder corresponds to one or more chapters of the above textbook and/or course. In addition to exercises and solutions, each folder also contains a list of learning goals, a brief concept summary, and links to the relevant readings.

All code is written in Python 3 and uses RL environments from OpenAI Gym. Advanced techniques use TensorFlow for neural network implementations.

## Table of Contents

- I. Introduction
- II. MDPs and Bellman Equations
- III. Model-Based RL: Policy and Value Iteration using Dynamic Programming
- IV. Model-Free Prediction & Control with Monte Carlo (MC)
- V. Model-Free Prediction & Control with Temporal Difference (TD) and Q-Learning
- VI. Function Approximation
- VII. Deep Q-Learning
- VIII. Policy Gradient Methods

## Resources

- Textbook: Reinforcement Learning: An Introduction (2nd Edition) by Richard S. Sutton and Andrew G. Barto
- Class: David Silver's Reinforcement Learning Course
## I. Introduction

### Learning Goals

- Understand the Reinforcement Learning problem and how it differs from Supervised Learning
### Summary

- Reinforcement Learning (RL) is concerned with goal-directed learning and decision-making.
- In RL an agent learns from experiences it gains by interacting with the environment. In Supervised Learning we cannot affect the environment.
- In RL rewards are often delayed in time and the agent tries to maximize a long-term goal. For example, one may need to make seemingly suboptimal moves to reach a winning position in a game.
- An agent interacts with the environment via states, actions and rewards (see the interaction-loop sketch below).
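As a minimal illustration of this interface, here is a random-agent interaction loop using the classic OpenAI Gym API (pre-0.26 `reset`/`step` signatures) that the exercises in this repo build on. The environment name `CartPole-v0` is just an example choice.

```python
import gym

# A random agent interacting with an environment via states, actions and rewards.
env = gym.make("CartPole-v0")  # example environment; any Gym env works

for episode in range(5):
    state = env.reset()  # initial state S_0
    total_return = 0.0
    done = False
    while not done:
        action = env.action_space.sample()             # a (random) policy picks A_t
        state, reward, done, info = env.step(action)   # environment returns S_{t+1}, R_{t+1}
        total_return += reward
    print("Episode {} return: {}".format(episode, total_return))
```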
### Lectures & Readings

Required:

Optional:

- N/A

### Exercises
## II. MDPs and Bellman Equations

### Learning Goals

- Understand the Agent-Environment interface
- Understand what MDPs (Markov Decision Processes) are and how to interpret transition diagrams
- Understand Value Functions, Action-Value Functions, and Policy Functions
- Understand the Bellman Equations and Bellman Optimality Equations for value functions and action-value functions
### Summary

- Agent & Environment Interface: At each step `t` the agent receives a state `S_t`, performs an action `A_t` and receives a reward `R_{t+1}`. The action is chosen according to a policy function `pi`.
- The total return `G_t` is the sum of all rewards starting from time `t`. Future rewards are discounted at a discount rate `gamma^k`.
- Markov property: The environment's response at time `t+1` depends only on the state and action representations at time `t`. The future is independent of the past given the present. Even if an environment doesn't fully satisfy the Markov property we still treat it as if it did and try to construct the state representation to be approximately Markov.
- Markov Decision Process (MDP): Defined by a state set `S`, an action set `A` and one-step dynamics `p(s', r | s, a)`. If we have complete knowledge of the environment we know the transition dynamics. In practice, we often don't know the full MDP (but we know that it's some MDP).
- The Value Function `v(s)` estimates how "good" it is for an agent to be in a particular state. More formally, it's the expected return `G_t` given that the agent is in state `s`: `v(s) = E[G_t | S_t = s]`. Note that the value function is specific to a given policy `pi`.
- Action-Value Function: `q(s, a)` estimates how "good" it is for an agent to be in state `s` and take action `a`. Similar to the value function, but it also considers the action.
- The Bellman equation expresses the relationship between the value of a state and the values of its successor states. It can be expressed using a "backup" diagram. Bellman equations exist for both the value function and the action-value function (see the backup sketch after this list).
- Value functions define an ordering over policies. A policy `pi_1` is better than `pi_2` if `v_{pi_1}(s) >= v_{pi_2}(s)` for all states `s`. For MDPs, there exist one or more optimal policies that are better than or equal to all other policies.
- The optimal state-value function `v*(s)` is the value function for the optimal policy. The same holds for `q*(s, a)`. The Bellman Optimality Equation defines how the optimal value of a state is related to the optimal value of successor states. It has a "max" instead of an average.
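To make the Bellman expectation equation concrete, here is a minimal sketch of a one-step "full backup" of `v_pi(s)` on a toy MDP. The `policy` and `dynamics` dictionaries are hypothetical stand-ins for `pi(a | s)` and `p(s', r | s, a)`.

```python
# Bellman expectation backup:
# v(s) = sum_a pi(a|s) * sum_{s',r} p(s',r|s,a) * (r + gamma * v(s'))
gamma = 0.9
policy = {"s0": {"left": 0.5, "right": 0.5}}   # hypothetical pi(a | s)
dynamics = {                                    # hypothetical p(s', r | s, a) as (prob, s', r)
    ("s0", "left"):  [(1.0, "s0", 0.0)],
    ("s0", "right"): [(1.0, "s1", 1.0)],
}
v = {"s0": 0.0, "s1": 0.0}                      # current value estimates

def bellman_backup(s):
    """One full backup of v(s) under the given policy."""
    return sum(
        pi_a * prob * (r + gamma * v[s_next])
        for a, pi_a in policy[s].items()
        for prob, s_next, r in dynamics[(s, a)]
    )

print(bellman_backup("s0"))  # 0.5*(0 + 0.9*0) + 0.5*(1 + 0.9*0) = 0.5
```

Replacing the outer sum over actions with a max over actions gives the corresponding Bellman Optimality backup.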
### Lectures & Readings

Required:

### Exercises

This chapter is mostly theory so there are no exercises.
## III. Model-Based RL: Policy and Value Iteration using Dynamic Programming

### Learning Goals

- Understand the difference between Policy Evaluation and Policy Improvement and how these processes interact
- Understand the Policy Iteration Algorithm
- Understand the Value Iteration Algorithm
- Understand the limitations of Dynamic Programming approaches
### Summary

- Dynamic Programming (DP) methods assume that we have a perfect model of the environment's Markov Decision Process (MDP). That's usually not the case in practice, but it's important to study DP anyway.
- Policy Evaluation: Calculates the state-value function `V(s)` for a given policy. In DP this is done using a "full backup": at each state, we look ahead one step at each possible action and next state. We can only do this because we have a perfect model of the environment. Full backups are basically the Bellman equations turned into updates (see the sketch after this list).
- Policy Improvement: Given the correct state-value function for a policy we can act greedily with respect to it (i.e. pick the best action at each state). Then we are guaranteed to improve the policy or keep it fixed if it's already optimal.
- Policy Iteration: Iteratively perform Policy Evaluation and Policy Improvement until we reach the optimal policy.
- Value Iteration: Instead of doing multiple steps of Policy Evaluation to find the "correct" `V(s)` we only do a single step and improve the policy immediately. In practice, this often converges faster.
- Generalized Policy Iteration: The process of iteratively doing policy evaluation and improvement. We can pick different algorithms for each of these steps but the basic idea stays the same.
- DP methods bootstrap: They update estimates based on other estimates (one step ahead).
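A minimal sketch of iterative policy evaluation with full backups. It assumes an environment object exposing the one-step dynamics as `env.P[s][a] = [(prob, next_state, reward, done), ...]` along with `env.nS` and `env.nA` (the convention of the Gridworld environment used in this repo's exercises; treat the exact attribute names as an assumption here).

```python
import numpy as np

def policy_eval(policy, env, discount_factor=0.9, theta=1e-6):
    """Iterative policy evaluation with full backups.

    policy: array of shape [num_states, num_actions] with action probabilities.
    env:    assumed to expose env.nS, env.nA and one-step dynamics
            env.P[s][a] = [(prob, next_state, reward, done), ...].
    """
    V = np.zeros(env.nS)
    while True:
        delta = 0.0
        for s in range(env.nS):
            v = 0.0
            # Full backup: expectation over actions and successor states.
            for a, action_prob in enumerate(policy[s]):
                for prob, next_state, reward, done in env.P[s][a]:
                    v += action_prob * prob * (reward + discount_factor * V[next_state])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:  # stop once values change less than theta
            break
    return V
```

Replacing the expectation over actions in the inner loop with a max over actions turns this into value iteration.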
### Lectures & Readings

Required:

- David Silver's RL Course Lecture 3 - Planning by Dynamic Programming (video, slides)

Optional:

### Exercises

- Implement Policy Evaluation in Python (Gridworld)
- Implement Policy Iteration in Python (Gridworld)
- Implement Value Iteration in Python (Gridworld)
- Implement the Gambler's Problem
## IV. Model-Free Prediction & Control with Monte Carlo (MC)

### Learning Goals

- Understand the difference between Prediction and Control
- Know how to use the MC method for predicting state values and state-action values
- Understand the on-policy first-visit MC control algorithm
- Understand off-policy MC control algorithms
- Understand Weighted Importance Sampling
- Understand the benefits of MC algorithms over the Dynamic Programming approach
### Summary

- Dynamic Programming approaches assume complete knowledge of the environment (the MDP). In practice, we often don't have full knowledge of how the world works.
- Monte Carlo (MC) methods can learn directly from experience collected by interacting with the environment. An episode of experience is a series of `(State, Action, Reward, Next State)` tuples.
- MC methods work based on episodes. We sample episodes of experience and make updates to our estimates at the end of each episode. MC methods have high variance (due to lots of random decisions within an episode) but are unbiased.
- MC Policy Evaluation: Given a policy, we want to estimate the state-value function `V(s)`. Sample episodes of experience and estimate `V(s)` to be the return received from that state onwards, averaged across all of your experience (see the sketch after this list). The same technique works for the action-value function `Q(s, a)`. Given enough samples, this is proven to converge.
- MC Control: The idea is the same as for Dynamic Programming. Use MC Policy Evaluation to evaluate the current policy, then improve the policy greedily. The problem: How do we ensure that we explore all states if we don't know the full environment?
- Solution to the exploration problem: Use epsilon-greedy policies instead of fully greedy policies. When making a decision, act randomly with probability epsilon. This will learn the optimal epsilon-greedy policy.
- Off-Policy Learning: How can we learn about the actual optimal (greedy) policy while following an exploratory (epsilon-greedy) policy? We can use importance sampling, which weighs returns by their probability of occurring under the policy we want to learn about.
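A minimal sketch of first-visit MC prediction, assuming a hypothetical `sample_episode(policy, env)` helper that rolls out one episode and returns a list of `(state, action, reward)` tuples:

```python
from collections import defaultdict

def mc_prediction(policy, env, sample_episode, num_episodes, discount_factor=1.0):
    """First-visit Monte Carlo prediction of V(s) for a given policy.

    sample_episode is a hypothetical helper that rolls out one episode
    and returns [(state, action, reward), ...].
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = sample_episode(policy, env)
        # Remember the first time each state appears in the episode.
        first_visit = {}
        for t, (state, _, _) in enumerate(episode):
            first_visit.setdefault(state, t)
        # Compute the return G_t backwards from the end of the episode.
        returns = [0.0] * len(episode)
        G = 0.0
        for t in reversed(range(len(episode))):
            G = episode[t][2] + discount_factor * G
            returns[t] = G
        # Average returns over first visits only.
        for state, t in first_visit.items():
            returns_sum[state] += returns[t]
            returns_count[state] += 1
            V[state] = returns_sum[state] / returns_count[state]
    return V
```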
### Lectures & Readings

Required:

Optional:

- David Silver's RL Course Lecture 4 - Model-Free Prediction (video, slides)
- David Silver's RL Course Lecture 5 - Model-Free Control (video, slides)

### Exercises

- Get familiar with the Blackjack environment (Blackjack-v0)
- Implement Monte Carlo Prediction to estimate state-action values
- Implement the on-policy first-visit Monte Carlo Control algorithm
- Implement the off-policy every-visit Monte Carlo Control algorithm using Weighted Importance Sampling
## V. Model-Free Prediction & Control with Temporal Difference (TD) and Q-Learning

### Learning Goals

- Understand TD(0) for prediction
- Understand SARSA for on-policy control
- Understand Q-Learning for off-policy control
- Understand the benefits of TD algorithms over MC and DP approaches
- Understand how n-step methods unify MC and TD approaches
- Understand the backward and forward view of TD-Lambda
### Summary

- TD-Learning is a combination of Monte Carlo and Dynamic Programming ideas. Like Monte Carlo, TD works based on samples and doesn't require a model of the environment. Like Dynamic Programming, TD uses bootstrapping to make updates.
- Whether MC or TD is better depends on the problem and there are no theoretical results that prove a clear winner.
- General update rule: `Q[s,a] += learning_rate * (td_target - Q[s,a])`. The quantity `td_target - Q[s,a]` is also called the TD Error.
- SARSA: On-Policy TD Control. TD target for SARSA: `R[t+1] + discount_factor * Q[next_state][next_action]`
- Q-Learning: Off-Policy TD Control. TD target for Q-Learning: `R[t+1] + discount_factor * max(Q[next_state])` (see the sketch after this list).
- Q-Learning has a positive bias because it uses the maximum of estimated Q values to estimate the maximum action value, all from the same experience. Double Q-Learning gets around this by splitting the experience and using different Q functions for maximization and estimation.
- n-step methods unify MC and TD approaches. They make updates based on n steps instead of a single step (TD(0)) or a full episode (MC).
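A minimal sketch of the tabular Q-Learning update with an epsilon-greedy behavior policy, using the classic Gym step API; the hyperparameter values are just example choices.

```python
import numpy as np
from collections import defaultdict

def q_learning(env, num_episodes, discount_factor=0.99, learning_rate=0.5, epsilon=0.1):
    """Tabular Q-Learning (off-policy TD control) with an epsilon-greedy behavior policy."""
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection: act randomly with probability epsilon.
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done, _ = env.step(action)
            # Off-policy TD target: bootstrap with the greedy (max) action.
            # The (not done) factor zeroes the bootstrap at terminal states.
            td_target = reward + discount_factor * np.max(Q[next_state]) * (not done)
            Q[state][action] += learning_rate * (td_target - Q[state][action])
            state = next_state
    return Q
```

Swapping the `max` in the TD target for the Q value of the action actually taken next turns this into SARSA (on-policy control).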
### Lectures & Readings

Required:

Optional:

### Exercises

## VI. Function Approximation

### Learning Goals

- Understand the motivation for Function Approximation over Table Lookup
- Understand how to incorporate function approximation into existing algorithms
- Understand convergence properties of function approximators and RL algorithms
- Understand batching using experience replay
### Summary

- Building a big table, with one value for each state or state-action pair, is memory- and data-inefficient. Function Approximation can generalize to unseen states by using a featurized state representation.
- Treat RL as a supervised learning problem, with the MC- or TD-target as the label and the current state/action as the input. Often the target also depends on the function estimator, but we simply ignore its gradient. That's why these methods are called semi-gradient methods (see the sketch after this list).
- Challenge: We have non-stationary (policy changes, bootstrapping) and non-iid (correlated in time) data.
- Many methods assume that our action space is discrete because they rely on calculating the argmax over all actions. Large and continuous action spaces are ongoing research.
- For Control, very few convergence guarantees exist. For non-linear approximators there are basically no guarantees at all, but they tend to work in practice.
- Experience Replay: Store experience as a dataset, randomize it, and repeatedly apply minibatch SGD.
- Tricks to stabilize non-linear function approximators: Fixed Targets, where the target is calculated based on frozen parameter values from a previous time step.
- For the non-episodic (continuing) case function approximation is more complex and we need to give up discounting and use an "average reward" formulation.
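A minimal sketch of semi-gradient TD(0) prediction with a linear value function, assuming a hypothetical `featurize(state)` helper that maps a state to a fixed-length feature vector. Note how the TD target is treated as a constant: we only differentiate through the current state's value estimate.

```python
import numpy as np

def semi_gradient_td0(env, featurize, num_episodes, num_features,
                      discount_factor=0.99, learning_rate=0.01):
    """Semi-gradient TD(0) prediction with a linear function approximator.

    featurize(state) is a hypothetical helper returning a feature vector of
    length num_features. Here we evaluate a random policy for simplicity.
    """
    w = np.zeros(num_features)  # weights of the linear value function v(s) = w . x(s)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = env.action_space.sample()  # random policy (prediction only)
            next_state, reward, done, _ = env.step(action)
            x = featurize(state)
            # The TD target uses the current estimator, but we ignore its gradient:
            td_target = reward + discount_factor * np.dot(w, featurize(next_state)) * (not done)
            # Semi-gradient update: the gradient of v(s) = w . x is just x.
            w += learning_rate * (td_target - np.dot(w, x)) * x
            state = next_state
    return w
```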
### Lectures & Readings

Required:

Optional:

### Exercises

## VII. Deep Q-Learning

### Learning Goals

- Understand the Deep Q-Learning (DQN) algorithm
- Understand why Experience Replay and a Target Network are necessary to make Deep Q-Learning work in practice
- (Optional) Understand Double Deep Q-Learning
- (Optional) Understand Prioritized Experience Replay
### Summary

- DQN: Q-Learning but with a Deep Neural Network as a function approximator.
- Using a non-linear Deep Neural Network is powerful, but training is unstable if we apply it naively.
- Trick 1 - Experience Replay: Store experience `(S, A, R, S_next)` in a replay buffer and sample minibatches from it to train the network. This decorrelates the data and leads to better data efficiency. In the beginning, the replay buffer is filled with random experience.
- Trick 2 - Target Network: Use a separate network to estimate the TD target. This target network has the same architecture as the function approximator but with frozen parameters. Every `T` steps (a hyperparameter) the parameters from the Q network are copied to the target network. This leads to more stable training because it keeps the target function fixed (for a while). (See the sketch after this list.)
- By using a Convolutional Neural Network as the function approximator on raw pixels of Atari games, where the score is the reward, we can learn to play many of those games at human-like performance.
- Double DQN: Just like regular Q-Learning, DQN tends to overestimate values due to its max operation applied to both selecting and estimating actions. We get around this by using the Q network for selection and the target network for estimation when making updates.
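A minimal sketch of the two tricks in isolation: a replay buffer and a DQN-style TD target computed with a frozen target network. `target_q_values` stands in for the target network's forward pass (a hypothetical callable); the repo's actual exercises implement the networks in TensorFlow.

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Trick 1: store transitions and sample decorrelated minibatches."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # old experience is evicted automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones

def dqn_targets(batch, target_q_values, discount_factor=0.99):
    """Trick 2: compute TD targets with the frozen target network.

    target_q_values(states) -> array [batch_size, num_actions]; a hypothetical
    forward pass of the target network, whose parameters are only refreshed
    every T steps by copying them over from the Q network.
    """
    states, actions, rewards, next_states, dones = batch
    next_q = target_q_values(next_states)
    # Zero out the bootstrap term at terminal transitions.
    return rewards + discount_factor * np.max(next_q, axis=1) * (1.0 - dones)
```

For Double DQN, the target would instead select the argmax action with the online Q network and evaluate that action with the target network.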
### Lectures & Readings

Required:

Optional:

Deep Learning:

### Exercises

## VIII. Policy Gradient Methods

### Learning Goals

- Understand the difference between value-based and policy-based Reinforcement Learning
- Understand the REINFORCE algorithm (Monte Carlo Policy Gradient)
- Understand Actor-Critic (AC) algorithms
- Understand Advantage Functions
- Understand Deterministic Policy Gradients (Optional)
- Understand how to scale up Policy Gradient methods using asynchronous actor-critic and Neural Networks (Optional)
### Summary

- Idea: Instead of parameterizing the value function and doing greedy policy improvement, we parameterize the policy and do gradient ascent in a direction that improves it.
- Sometimes the policy is easier to approximate than the value function. Also, we need a parameterized policy to deal with continuous action spaces and environments where we need to act stochastically.
- Policy Score Function `J(theta)`: Intuitively, it measures how good our policy is. For example, we can use the average value or average reward under a policy as our objective.
- Common choices for the policy function: Softmax for discrete actions, Gaussian parameters for continuous actions.
- Policy Gradient Theorem: `grad(J(theta)) = E[grad(log(pi(s, a))) * Q(s, a)]`. Basically, we move our policy in a direction of more reward.
- REINFORCE (Monte Carlo Policy Gradient): We substitute a sampled return `g_t` from an episode for `Q(s, a)` to make an update. Unbiased but high variance.
- Baseline: Instead of measuring the absolute goodness of an action we want to know how much better than "average" it is to take an action given a state (e.g. some states are naturally bad and always give negative reward). This is called the advantage and is defined as `Q(s, a) - V(s)`. We use that for our policy update, e.g. `g_t - V(s)` for REINFORCE (see the sketch after this list).
- Actor-Critic: Instead of waiting until the end of an episode as in REINFORCE, we use bootstrapping and make an update at each step. To do that we also train a Critic `Q(theta)` that approximates the value function. Now we have two function approximators: one for the policy, one for the critic. This is basically TD, but for Policy Gradients.
- A good estimate of the advantage function in the Actor-Critic algorithm is the TD error. Our update then becomes `grad(J(theta)) = E[grad(log(pi(s, a))) * td_error]`.
- We can use policy gradients with TD-Lambda, eligibility traces, and so on.
- Deterministic Policy Gradients: Useful for high-dimensional continuous action spaces where stochastic policy gradients are expensive to compute. The idea is to update the policy in the direction of the gradient of the action-value function. To ensure exploration we can use an off-policy actor-critic algorithm with added noise in action selection.
- Deep Deterministic Policy Gradients: Apply tricks from DQN to Deterministic Policy Gradients ;)
- Asynchronous Advantage Actor-Critic (A3C): Instead of using an experience replay buffer as in DQN, use multiple agents on different threads to explore the state spaces and make decorrelated updates to the actor and the critic.
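A minimal sketch of a REINFORCE-with-baseline update for a softmax policy with linear action preferences, assuming hypothetical `featurize(state)` and `baseline_v(state)` helpers. It shows the `grad(log(pi)) * (g_t - V(s))` structure; the repo's exercises implement the same idea with neural networks in TensorFlow.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, featurize, baseline_v,
                     discount_factor=1.0, learning_rate=0.01):
    """One REINFORCE-with-baseline update pass for a linear softmax policy.

    theta:      [num_actions, num_features] policy parameters.
    episode:    list of (state, action, reward) tuples from one rollout.
    featurize:  hypothetical helper mapping a state to a feature vector.
    baseline_v: hypothetical state-value estimate used as the baseline V(s).
    """
    G = 0.0
    # Iterate backwards so G accumulates the discounted return g_t.
    for state, action, reward in reversed(episode):
        G = reward + discount_factor * G
        x = featurize(state)
        pi = softmax(theta @ x)              # pi(a | s)
        advantage = G - baseline_v(state)    # g_t - V(s)
        # grad log pi(a|s) wrt theta[a'] = (1[a' == a] - pi[a']) * x
        grad_log_pi = -np.outer(pi, x)
        grad_log_pi[action] += x
        theta += learning_rate * advantage * grad_log_pi  # gradient ascent step
    return theta
```

Replacing `g_t - V(s)` with the TD error and updating at every step instead of at episode end turns this into the one-step Actor-Critic update described above.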
### Lectures & Readings

Required:

- David Silver's RL Course Lecture 7 - Policy Gradient Methods (video, slides)

Optional:

### Exercises

- REINFORCE with Baseline
- Actor-Critic with Baseline
- Actor-Critic with Baseline for Continuous Action Spaces
- Deterministic Policy Gradients for Continuous Action Spaces (WIP)
- Deep Deterministic Policy Gradients (WIP)
- Asynchronous Advantage Actor-Critic (A3C)
