Trang chủ‎ > ‎IT‎ > ‎Reinforcement Learning‎ > ‎

### dennybritz RL Tutorial

Referenceshttps://github.com/dennybritz/reinforcement-learning

### Overview

This repository provides code, exercises and solutions for popular Reinforcement Learning algorithms. These are meant to serve as a learning tool to complement the theoretical materials from

Each folder in corresponds to one or more chapters of the above textbook and/or course. In addition to exercises and solution, each folder also contains a list of learning goals, a brief concept summary, and links to the relevant readings.

All code is written in Python 3 and uses RL environments from OpenAI Gym. Advanced techniques use Tensorflow for neural network implementations.

Textbooks:

Classes:

Talks/Tutorials:

Other Projects:

Selected Papers:

## I. Introduction

### Learning Goals

• Understand the Reinforcement Learning problem and how it differs from Supervised Learning

### Summary

• Reinforcement Learning (RL) is concerned with goal-directed learning and decision-making.
• In RL an agent learns from experiences it gains by interacting with the environment. In Supervised Learning we cannot affect the environment.
• In RL rewards are often delayed in time and the agent tries to maximize a long-term goal. For example, one may need to make seemingly suboptimal moves to reach a winning position in a game.
• An agent interacts with the environment via states, actions and rewards.

Required:

Optional:

N/A

## II. MDPs and Bellman Equations

### Learning Goals

• Understand the Agent-Environment interface
• Understand what MDPs (Markov Decision Processes) are and how to interpret transition diagrams
• Understand Value Functions, Action-Value Functions, and Policy Functions
• Understand the Bellman Equations and Bellman Optimality Equations for value functions and action-value functions

### Summary

• Agent & Environment Interface: At each step `t` the agent receives a state `S_t`, performs an action `A_t` and receives a reward `R_{t+1}`. The action is chosen according to a policy function `pi`.
• The total return `G_t` is the sum of all rewards starting from time t . Future rewards are discounted at a discount rate `gamma^k`.
• Markov property: The environment's response at time `t+1` depends only on the state and action representations at time `t`. The future is independent of the past given the present. Even if an environment doesn't fully satisfy the Markov property we still treat it as if it is and try to construct the state representation to be approximately Markov.
• Markov Decision Process (MDP): Defined by a state set S, action set A and one-step dynamics `p(s',r | s,a)`. If we have complete knowledge of the environment we know the transition dynamic. In practice, we often don't know the full MDP (but we know that it's some MDP).
• The Value Function `v(s)` estimates how "good" it is for an agent to be in a particular state. More formally, it's the expected return `G_t` given that the agent is in state `s``v(s) = Ex[G_t | S_t = s]`. Note that the value function is specific to a given policy `pi`.
• Action Value function: q(s, a) estimates how "good" it is for an agent to be in states and take action a. Similar to the value function, but also considers the action.
• The Bellman equation expresses the relationship between the value of a state and the values of its successor states. It can be expressed using a "backup" diagram. Bellman equations exist for both the value function and the action value function.
• Value functions define an ordering over policies. A policy `p1` is better than `p2` if `v_p1(s) >= v_p2(s)` for all states s. For MDPs, there exist one or more optimal policies that are better than or equal to all other policies.
• The optimal state value function `v*(s)` is the value function for the optimal policy. Same for `q*(s, a)`. The Bellman Optimality Equation defines how the optimal value of a state is related to the optimal value of successor states. It has a "max" instead of an average.

Required:

### Exercises

This chapter is mostly theory so there are no exercises.

## III. Model-Based RL: Policy and Value Iteration using Dynamic Programming

### Learning Goals

• Understand the difference between Policy Evaluation and Policy Improvement and how these processes interact
• Understand the Policy Iteration Algorithm
• Understand the Value Iteration Algorithm
• Understand the Limitations of Dynamic Programming Approaches

### Summary

• Dynamic Programming (DP) methods assume that we have a perfect model of the environment's Markov Decision Process (MDP). That's usually not the case in practice, but it's important to study DP anyway.
• Policy Evaluation: Calculates the state-value function `V(s)` for a given policy. In DP this is done using a "full backup". At each state, we look ahead one step at each possible action and next state. We can only do this because we have a perfect model of the environment.
• Full backups are basically the Bellman equations turned into updates.
• Policy Improvement: Given the correct state-value function for a policy we can act greedily with respect to it (i.e. pick the best action at each state). Then we are guaranteed to improve the policy or keep it fixed if it's already optimal.
• Policy Iteration: Iteratively perform Policy Evaluation and Policy Improvement until we reach the optimal policy.
• Value Iteration: Instead of doing multiple steps of Policy Evaluation to find the "correct" V(s) we only do a single step and improve the policy immediately. In practice, this converges faster.
• Generalized Policy Iteration: The process of iteratively doing policy evaluation and improvement. We can pick different algorithms for each of these steps but the basic idea stays the same.
• DP methods bootstrap: They update estimates based on other estimates (one step ahead).

Required:

• David Silver's RL Course Lecture 3 - Planning by Dynamic Programming (videoslides)

Optional:

### Exercises

• Implement Policy Evaluation in Python (Gridworld)

• Implement Policy Iteration in Python (Gridworld)

• Implement Value Iteration in Python (Gridworld)

• Implement Gambler's Problem

## IV. Model-Free Prediction & Control with Monte Carlo (MC)

### Learning Goals

• Understand the difference between Prediction and Control
• Know how to use the MC method for predicting state values and state-action values
• Understand the on-policy first-visit MC control algorithm
• Understand off-policy MC control algorithms
• Understand Weighted Importance Sampling
• Understand the benefits of MC algorithms over the Dynamic Programming approach

### Summary

• Dynamic Programming approaches assume complete knowledge of the environment (the MDP). In practice, we often don't have full knowledge of how the world works.
• Monte Carlo (MC) methods can learn directly from experience collected by interacting with the environment. An episode of experience is a series of `(State, Action, Reward, Next State)` tuples.
• MC methods work based on episodes. We sample episodes of experience and make updates to our estimates at the end of each episode. MC methods have high variance (due to lots of random decisions within an episode) but are unbiased.
• MC Policy Evaluation: Given a policy, we want to estimate the state-value function V(s). Sample episodes of experience and estimate V(s) to be the reward received from that state onwards averaged across all of your experience. The same technique works for the action-value function Q(s, a). Given enough samples, this is proven to converge.
• MC Control: Idea is the same as for Dynamic Programming. Use MC Policy Evaluation to evaluate the current policy then improve the policy greedily. The Problem: How do we ensure that we explore all states if we don't know the full environment?
• Solution to exploration problem: Use epsilon-greedy policies instead of full greedy policies. When making a decision act randomly with probability epsilon. This will learn the optimal epsilon-greedy policy.
• Off-Policy Learning: How can we learn about the actual optimal (greedy) policy while following an exploratory (epsilon-greedy) policy? We can use importance sampling, which weighs returns by their probability of occurring under the policy we want to learn about.

Required:

Optional:

• David Silver's RL Course Lecture 4 - Model-Free Prediction (videoslides)
• David Silver's RL Course Lecture 5 - Model-Free Control (videoslides)

### Exercises

• Get familiar with the Blackjack environment (Blackjack-v0)
• Implement the Monte Carlo Prediction to estimate state-action values
• Implement the on-policy first-visit Monte Carlo Control algorithm
• Implement the off-policy every-visit Monte Carlo Control using Weighted Important Sampling algorithm

## V. Model-Free Prediction & Control with Temporal Difference (TD) and Q-Learning

### Learning Goals

• Understand TD(0) for prediction
• Understand SARSA for on-policy control
• Understand Q-Learning for off-policy control
• Understand the benefits of TD algorithms over MC and DP approaches
• Understand how n-step methods unify MC and TD approaches
• Understand the backward and forward view of TD-Lambda

### Summary

• TD-Learning is a combination of Monte Carlo and Dynamic Programming ideas. Like Monte Carlo, TD works based on samples and doesn't require a model of the environment. Like Dynamic Programming, TD uses bootstrapping to make updates.
• Whether MC or TD is better depends on the problem and there are no theoretical results that prove a clear winner.
• General Update Rule: `Q[s,a] += learning_rate * (td_target - Q[s,a])``td_target - Q[s,a]` is also called the TD Error.
• SARSA: On-Policy TD Control
• TD Target for SARSA: `R[t+1] + discount_factor * Q[next_state][next_action]`
• Q-Learning: Off-policy TD Control
• TD Target for Q-Learning: `R[t+1] + discount_factor * max(Q[next_state])`
• Q-Learning has a positive bias because it uses the maximum of estimated Q values to estimate the maximum action value, all from the same experience. Double Q-Learning gets around this by splitting the experience and using different Q functions for maximization and estimation.
• N-Step methods unify MC and TD approaches. They making updates based on n-steps instead of a single step (TD-0) or a full episode (MC).

Required:

Optional:

## VI. Function Approximation

### Learning Goals

• Understand the motivation for Function Approximation over Table Lookup
• Understand how to incorporate function approximation into existing algorithms
• Understand convergence properties of function approximators and RL algorithms
• Understand batching using experience replay

### Summary

• Building a big table, one value for each state or state-action pair, is memory- and data-inefficient. Function Approximation can generalize to unseen states by using a featurized state representation.
• Treat RL as supervised learning problem with the MC- or TD-target as the label and the current state/action as the input. Often the target also depends on the function estimator but we simply ignore its gradient. That's why these methods are called semi-gradient methods.
• Challenge: We have non-stationary (policy changes, bootstrapping) and non-iid (correlated in time) data.
• Many methods assume that our action space is discrete because they rely on calculating the argmax over all actions. Large and continuous action spaces are ongoing research.
• For Control very few convergence guarantees exist. For non-linear approximators there are basically no guarantees at all. But they tend to work in practice.
• Experience Replay: Store experience as dataset, randomize it, and repeatedly apply minibatch SGD.
• Tricks to stabilize non-linear function approximators: Fixed Targets. The target is calculated based on frozen parameter values from a previous time step.
• For the non-episodic (continuing) case function approximation is more complex and we need to give up discounting and use an "average reward" formulation.

Required:

Optional:

### Exercises

• Get familiar with the Mountain Car Playground

• Solve Mountain Car Problem using Q-Learning with Linear Function Approximation

## VII. Deep Q-Learning

### Learning Goals

• Understand the Deep Q-Learning (DQN) algorithm
• Understand why Experience Replay and a Target Network are necessary to make Deep Q-Learning work in practice
• (Optional) Understand Double Deep Q-Learning
• (Optional) Understand Prioritized Experience Replay

### Summary

• DQN: Q-Learning but with a Deep Neural Network as a function approximator.
• Using a non-linear Deep Neural Network is powerful, but training is unstable if we apply it naively.
• Trick 1 - Experience Replay: Store experience `(S, A, R, S_next)` in a replay buffer and sample minibatches from it to train the network. This decorrelates the data and leads to better data efficiency. In the beginning, the replay buffer is filled with random experience.
• Trick 2 - Target Network: Use a separate network to estimate the TD target. This target network has the same architecture as the function approximator but with frozen parameters. Every T steps (a hyperparameter) the parameters from the Q network are copied to the target network. This leads to more stable training because it keeps the target function fixed (for a while).
• By using a Convolutional Neural Network as the function approximator on raw pixels of Atari games where the score is the reward we can learn to play many of those games at human-like performance.
• Double DQN: Just like regular Q-Learning, DQN tends to overestimate values due to its max operation applied to both selecting and estimating actions. We get around this by using the Q network for selection and the target network for estimation when making updates.

Required:

Optional:

Deep Learning:

### Learning Goals

• Understand the difference between value-based and policy-based Reinforcement Learning
• Understand the REINFORCE Algorithm (Monte Carlo Policy Gradient)
• Understand Actor-Critic (AC) algorithms
• Understand Deterministic Policy Gradients (Optional)
• Understand how to scale up Policy Gradient methods using asynchronous actor-critic and Neural Networks (Optional)

### Summary

• Idea: Instead of parameterizing the value function and doing greedy policy improvement we parameterize the policy and do gradient descent into a direction that improves it.
• Sometimes the policy is easier to approximate than the value function. Also, we need a parameterized policy to deal with continuous action spaces and environments where we need to act stochastically.
• Policy Score Function `J(theta)`: Intuitively, it measures how good our policy is. For example, we can use the average value or average reward under a policy as our objective.
• Common choices for the policy function: Softmax for discrete actions, Gaussian parameters for continuous actions.
• Policy Gradient Theorem: `grad(J(theta)) = Ex[grad(log(pi(s, a))) * Q(s, a)]`. Basically, we move our policy into a direction of more reward.
• REINFORCE (Monte Carlo Policy Gradient): We substitute a samples return `g_t` form an episode for Q(s, a) to make an update. Unbiased but high variance.
• Baseline: Instead of measuring the absolute goodness of an action we want to know how much better than "average" it is to take an action given a state. E.g. some states are naturally bad and always give negative reward. This is called the advantage and is defined as `Q(s, a) - V(s)`. We use that for our policy update, e.g. `g_t - V(s)` for REINFORCE.
• Actor-Critic: Instead of waiting until the end of an episode as in REINFORCE we use bootstrapping and make an update at each step. To do that we also train a Critic Q(theta) that approximates the value function. Now we have two function approximators: One of the policy, one for the critic. This is basically TD, but for Policy Gradients.
• A good estimate of the advantage function in the Actor-Critic algorithm is the td error. Our update then becomes `grad(J(theta)) = Ex[grad(log(pi(s, a))) * td_error]`.
• Can use policy gradients with td-lambda, eligibility traces, and so on.
• Deterministic Policy Gradients: Useful for high-dimensional continuous action spaces where stochastic policy gradients are expensive to compute. The idea is to update the policy in the direction of the gradient of the action-value function. To ensure exploration we can use an off-policy actor-critic algorithm with added noise in action selection.
• Deep Deterministic Policy Gradients: Apply tricks from DQN to Deterministic Policy Gradients ;)
• Asynchronous Advantage Actor-Critic (A3C): Instead of using an experience replay buffer as in DQN use multiple agents on different threads to explore the state spaces and make decorrelated updates to the actor and the critic.

Required:

• David Silver's RL Course Lecture 7 - Policy Gradient Methods (videoslides)

Optional:

### Exercises

• REINFORCE with Baseline
• Actor-Critic with Baseline
• Actor-Critic with Baseline for Continuous Action Spaces
• Deterministic Policy Gradients for Continuous Action Spaces (WIP)
• Deep Deterministic Policy Gradients (WIP)