Reinforcement Learning

  • In reinforcement learning, you deal with processes in which an agent actively interacts with an environment.
  • RL is active learning: the agent learns only by interacting.
  • It is an online process.
  • There is no ground truth; there is only a weak signal, the reward.


  • In supervised learning, you deal with objects or datasets.
  • There is no interaction with the environment: given a dataset, you are required to predict the target.
  • There is no concept of an agent taking actions and observing the consequences of those actions.
  • Supervised learning is passive learning, where the model learns only by extracting features from a given dataset.
  • It is a batch process.
  • There is a ground truth for each observation.



  • Value Function vπ(s) tells how good it is for the agent to be in a given state
  • Value function indicates the long-term desirability of states after taking into account the states that are likely to follow and the rewards available in those states
  • Value functions reflect the total expected reward the agent can get from that state/state-action pair.
  • Action choices are made based on values and not just on immediate rewards. At each step, the RL agent aims to maximise the value function
  • A Markov state contains all the information necessary to predict the future.
  • Stochastic policies are inherently exploratory in nature, whereas deterministic policies are exploitative
  • model of the environment
    • probability of ending up in state s' and getting immediate reward r, after taking action a from state s
    • p(s',r|s,a)
  • A policy defines which action the RL agent will take when in a given state
  • Rewards are a property of the environment.
  • A stochastic policy gives the agent a chance to explore, whereas a deterministic policy does not
  • A stochastic policy provides an edge over a deterministic one by allowing the agent to occasionally choose less probable actions
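
A minimal sketch, assuming a small discrete action space and made-up preference scores, contrasting a deterministic greedy policy with a stochastic softmax policy; the stochastic one occasionally samples less probable actions, which is exactly its exploratory edge:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical action preferences (e.g. estimated action values) for one state.
    prefs = np.array([1.0, 2.5, 0.3])

    # Deterministic policy: always exploit the highest-preference action.
    greedy_action = int(np.argmax(prefs))

    # Stochastic (softmax) policy: sample actions in proportion to exp(preference),
    # so less probable actions still get chosen occasionally (exploration).
    probs = np.exp(prefs) / np.sum(np.exp(prefs))
    sampled_actions = rng.choice(len(prefs), size=10, p=probs)

    print("greedy action:", greedy_action)          # always 1
    print("softmax probabilities:", probs.round(3))
    print("sampled actions:", sampled_actions)      # mostly 1, sometimes 0 or 2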


Value Function: vπ(s) = E[Rπ(s)]

    • vπ(s) is the weighted sum, under the policy π, of the values of all the actions that can be taken in state s
    • The value function tells you the inherent value of that state
    • Rπ(s) is the total reward earned from state s while following the policy π


Action-Value Function: qπ(s,a)=E[Rπ(s,a)]

    • qπ(s,a) is the value of taking an action a in state s.
    • Rπ(s,a) is the total reward earned from state s after taking an action a under the policy π
    • Rπ(s) = ∑_a π(a|s) Rπ(s,a), and taking expectations gives vπ(s) = ∑_a π(a|s) qπ(s,a)
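
A small numeric illustration of vπ(s) = ∑_a π(a|s) qπ(s,a), with made-up values for π(a|s) and qπ(s,a) in a single state:

    import numpy as np

    # Hypothetical policy probabilities and action values for one state s
    pi_a_given_s = np.array([0.2, 0.5, 0.3])   # pi(a|s), sums to 1
    q_s_a        = np.array([1.0, 4.0, -2.0])  # q_pi(s, a)

    # v_pi(s) is the expectation of q_pi(s, a) under the policy
    v_s = np.dot(pi_a_given_s, q_s_a)
    print(v_s)  # 0.2*1.0 + 0.5*4.0 + 0.3*(-2.0) = 1.6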


model of the environment p(s',r|s,a)

    • vπ(s) = ∑_a π(a|s) qπ(s,a)
      • vπ(s) represents the value of the state s
      • vπ(s) = ∑_a π(a|s) ∑_{s',r} p(s',r|s,a) [r + γ vπ(s')]
    • qπ(s,a) = ∑_{s',r} p(s',r|s,a) [r + γ vπ(s')]
      • The action-value function for an action a taken from state s is the model-weighted sum of the immediate reward plus the discounted value of the next state
      • qπ(s,a) represents the value of performing a particular action a while in state s.
      • qπ(s,a) is the expected value of the future reward obtained
      • qπ(s,a) = ∑_{s',r} p(s',r|s,a) [r + γ ∑_{a'} π(a'|s') qπ(s',a')]
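
A sketch of iterative policy evaluation using this backup on a small randomly generated MDP; the transition model, rewards, policy, and discount factor below are all illustrative assumptions, not from the text:

    import numpy as np

    n_states, n_actions, gamma = 3, 2, 0.9

    # Hypothetical model: P[s, a, s'] = p(s'|s, a), R[s, a] = expected immediate reward
    rng = np.random.default_rng(1)
    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=2, keepdims=True)
    R = rng.random((n_states, n_actions))

    # Hypothetical stochastic policy pi(a|s)
    pi = np.full((n_states, n_actions), 1.0 / n_actions)

    # Iterative policy evaluation:
    # v(s) <- sum_a pi(a|s) [R(s,a) + gamma * sum_s' p(s'|s,a) v(s')]
    v = np.zeros(n_states)
    for _ in range(1000):
        q = R + gamma * P @ v              # q[s, a] = R(s,a) + gamma * E[v(s')]
        v_new = (pi * q).sum(axis=1)       # expectation over the policy
        if np.max(np.abs(v_new - v)) < 1e-8:
            break
        v = v_new
    print(v)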


Model Free

qπ(s,a) = E_model[r + γ vπ(s')]; in the model-free setting the model p(s',r|s,a) is unknown, so this expectation has to be estimated from sampled experience rather than computed directly.
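
In other words, model-free methods replace the model-weighted sum with an average over sampled transitions. A minimal sketch with an invented two-outcome transition, showing the sample average approaching the exact (model-based) value:

    import numpy as np

    rng = np.random.default_rng(2)
    gamma = 0.9

    # Hypothetical ground truth for one (s, a): two possible next states
    next_state_probs = np.array([0.7, 0.3])   # p(s'|s, a)
    rewards          = np.array([1.0, 5.0])   # r received for each s'
    v_next           = np.array([2.0, -1.0])  # v_pi(s') for each s'

    # Model-based: exact expectation (only possible if the model is known)
    q_exact = np.sum(next_state_probs * (rewards + gamma * v_next))

    # Model-free: average over sampled transitions (law of large numbers)
    samples = rng.choice(2, size=10000, p=next_state_probs)
    q_estimate = np.mean(rewards[samples] + gamma * v_next[samples])

    print(q_exact, q_estimate)  # the sample average approaches the exact value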


Bellman Equations of Optimality

      • v*(s) = max_a q*(s,a)
      • q*(s,a) = ∑_{s',r} p(s',r|s,a) [r + γ v*(s')]
      • An optimal policy is a policy that tells you the best action to take in every state so as to maximise the total reward

Bellman Expectation Equations

  • vπ(s) = ∑_a π(a|s) qπ(s,a)
  • qπ(s,a) = ∑_{s',r} p(s',r|s,a) [r + γ vπ(s')]


The Policy Evaluation step evaluates the current policy to get the state-value function.

The Policy Improvement step improves the policy based on the current value function.

Model-Based Method - Dynamic Programming

  • Policy evaluation refers to the iterative computation of the value functions for a given policy.
  • Policy improvement refers to finding an improved policy given the value function for an existing policy.
  • Policy evaluation and policy improvement are essential steps of Policy Iteration.


Policy iteration has three main steps:

  1. Initialize a policy randomly
  2. Policy Evaluation / prediction problem
    • Evaluate the policy π by calculating the state-value function vπ(s)
    • vπ(s) = ∑_a π(a|s) ∑_{s',r} p(s',r|s,a) [r + γ vπ(s')]
    • Measure how good the policy is by calculating the state-values vπ(s) for all the states until the state-values converge.
    • The prediction problem is one of the subproblems in Policy Iteration; the evaluated state-values are stored to be used later for Policy Improvement
    • When do we have the optimal policy?
      • For each state: vπ'(s) = vπ(s)
      • The value of each state under the improved policy should be the same as under the policy we started with
  3. Policy Improvement / control problem
    • Improve the policy by choosing the action with the maximum state-action value qπ(s,a) in each state
    • Make changes to the policy you evaluated in the policy evaluation step.
    • The control problem is the other subproblem in Policy Iteration: improvement is applied repeatedly, alternating with evaluation, until the optimal policy is reached (a code sketch of the full procedure follows this list)
    • π'(a|s) =
            • 1 : if a = argmax_a ∑_{s',r} p(s',r|s,a) [r + γ vπ(s')]
            • 0 : otherwise
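
A compact sketch of the whole procedure on a small randomly generated MDP (model, rewards, and discount are illustrative assumptions): policy evaluation and greedy policy improvement alternate until the policy stops changing.

    import numpy as np

    n_states, n_actions, gamma = 4, 2, 0.9
    rng = np.random.default_rng(3)

    # Hypothetical model: P[s, a, s'] = p(s'|s,a), R[s, a] = expected reward
    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=2, keepdims=True)
    R = rng.random((n_states, n_actions))

    policy = np.zeros(n_states, dtype=int)        # 1. start from an arbitrary policy

    while True:
        # 2. Policy evaluation: iterate the Bellman expectation backup to convergence
        v = np.zeros(n_states)
        while True:
            q = R + gamma * P @ v                 # q[s, a]
            v_new = q[np.arange(n_states), policy]
            if np.max(np.abs(v_new - v)) < 1e-8:
                break
            v = v_new

        # 3. Policy improvement: act greedily w.r.t. the evaluated value function
        new_policy = np.argmax(R + gamma * P @ v, axis=1)
        if np.array_equal(new_policy, policy):    # stable policy => optimal policy
            break
        policy = new_policy

    print("optimal policy:", policy)
    print("state values  :", v.round(3))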


Policy evaluation requires solving a system of linear equations, one per state, so the number of equations equals the size of the state space (a closed-form sketch follows the list below).

The value of a state depends upon:

    • the inherent value of the state
    • the policy being followed
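
Because vπ(s) appears linearly on both sides of the expectation equation, policy evaluation for a fixed policy can also be solved in closed form: vπ = (I − γ Pπ)⁻¹ rπ, one linear equation per state. A sketch with an assumed model and a uniform random policy:

    import numpy as np

    n_states, n_actions, gamma = 3, 2, 0.9
    rng = np.random.default_rng(4)

    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=2, keepdims=True)          # hypothetical p(s'|s, a)
    R = rng.random((n_states, n_actions))      # hypothetical expected rewards
    pi = np.full((n_states, n_actions), 0.5)   # uniform random policy

    # Collapse the model and rewards under the policy:
    # P_pi[s, s'] = sum_a pi(a|s) p(s'|s,a),  r_pi[s] = sum_a pi(a|s) R(s,a)
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = (pi * R).sum(axis=1)

    # v_pi = r_pi + gamma * P_pi v_pi  =>  (I - gamma * P_pi) v_pi = r_pi
    v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    print(v_pi)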


Value Iteration over Policy Iteration

Value iteration is computationally less expensive, because the entire policy-evaluation sweep until the state-value function stabilises does not need to be done before every policy improvement.


In policy iteration, you complete the entire policy evaluation to get estimates of the value function for a given policy and then use those estimates to improve the policy. These two steps are repeated until you arrive at the optimal state-value function.


In value iteration, every state update already performs a policy improvement, i.e., the state value is updated by picking the greedy action using the current estimates of the state-values.
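
A sketch of value iteration on the same kind of made-up MDP: each sweep applies the greedy (max) backup directly, so there is no separate evaluate-until-convergence phase, and the optimal policy is read off at the end.

    import numpy as np

    n_states, n_actions, gamma = 4, 2, 0.9
    rng = np.random.default_rng(5)

    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=2, keepdims=True)          # hypothetical p(s'|s,a)
    R = rng.random((n_states, n_actions))      # hypothetical expected rewards

    # Value iteration: v(s) <- max_a [ R(s,a) + gamma * sum_s' p(s'|s,a) v(s') ]
    v = np.zeros(n_states)
    while True:
        q = R + gamma * P @ v
        v_new = q.max(axis=1)                  # greedy improvement folded into the update
        if np.max(np.abs(v_new - v)) < 1e-8:
            break
        v = v_new

    # Extract the optimal policy once the values have converged
    policy = np.argmax(R + gamma * P @ v, axis=1)
    print(v.round(3), policy)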




Model-Free Methods

The law of large numbers says that the average over a very large sample gives results close to what you would get if you knew the actual distribution from which the samples are drawn.

Model-free methods work directly with the q-function rather than state-value functions. The agent needs to estimate the value of each action to find an optimal policy
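
A minimal Monte-Carlo-style sketch of this idea: for one hypothetical (s, a) pair, q(s, a) is estimated as the average discounted return over many sampled episodes; the episode generator below is invented purely for illustration.

    import numpy as np

    rng = np.random.default_rng(6)
    gamma = 0.9

    def sample_episode_rewards():
        """Hypothetical environment: reward sequence of one episode that starts
        by taking action a in state s."""
        length = rng.integers(3, 8)
        return rng.normal(loc=1.0, scale=0.5, size=length)

    def discounted_return(rewards, gamma):
        return sum(g * r for g, r in zip(gamma ** np.arange(len(rewards)), rewards))

    # Monte-Carlo estimate: average the sampled returns (law of large numbers)
    returns = [discounted_return(sample_episode_rewards(), gamma) for _ in range(5000)]
    q_estimate = np.mean(returns)
    print(q_estimate)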


  • Markov Decision Process
  • Dynamic programming
  • Monte-Carlo methods
  • Q-learning


Deep Q Learning

  • policy function

π(a|s)

  • action value function/ Q-value function

q(s,a)

  • value function / state-value function

v(s)


three classes of deep RL:

  1. Deep-Q learning - value based
  2. Policy Gradient - policy based
  3. Actor-Critic - both value and policy based


The inputs and outputs of the neural network for deep Q-learning are:

  • The (state, action) pair is the input; it is a concatenation of the state and the action
    • s, a
  • Corresponding to each (state, action) pair, the state-action value Q(s, a) is the output: the expected reward (Q-value) for the given state and the action you perform
    • maximum total reward = r + γ max_a' Q(s', a')
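
A sketch of such a network in PyTorch (the framework, layer sizes, and dimensions are assumptions for illustration): the input is the concatenation of a state vector and a one-hot encoded action, and the output is the single scalar Q(s, a).

    import torch
    import torch.nn as nn

    STATE_DIM, N_ACTIONS = 4, 2   # hypothetical dimensions

    class QNetwork(nn.Module):
        def __init__(self):
            super().__init__()
            # Input: concatenated (state, one-hot action); output: a single Q-value
            self.net = nn.Sequential(
                nn.Linear(STATE_DIM + N_ACTIONS, 64),
                nn.ReLU(),
                nn.Linear(64, 1),
            )

        def forward(self, state, action_one_hot):
            return self.net(torch.cat([state, action_one_hot], dim=-1)).squeeze(-1)

    # Example: Q-value of action 1 in a random state
    q_net = QNetwork()
    state = torch.randn(1, STATE_DIM)
    action = nn.functional.one_hot(torch.tensor([1]), N_ACTIONS).float()
    print(q_net(state, action))   # a tensor holding one Q-value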


Challenges:

  • The samples are not independent because the next state is usually highly dependent on the previous state and action
  • identical distribution
    • Towards the start, the agent has not learnt much, so the probability of a given (s, a, r) triplet will be very different from its probability towards the end of the game, once the agent has learnt much more.
    • identical distribution means that every time you draw a data point , the probability of getting a particular data point is the same.

States in Reinforcement Learning

    • The present and the next state are highly correlated.
    • Since the policy changes during training, the samples come from different distributions.

Steps involved in building a deep reinforcement learning model:

  1. Generate the data required for training
    • Given state s and action a, the environment returns the new state s' and the reward r
  2. Train the neural net on the generated data to get the optimal Q-value function
    • Q(s,a) := Q(s,a) + α [r + γ max_a' Q(s',a') − Q(s,a)]
    • loss function: [Q(s,a) − (r + γ max_a' Q(s',a'))]²
    • The predicted value Q(s,a) and the target r + γ max_a' Q(s',a') both change during training.
    • The predicted value Q(s,a) and the target value, which depends on Q(s',a'), change simultaneously
    • The predicted Q-value and the target Q-value (label) are constantly changing
    • Handling the broken i.i.d. assumption
      • Samples from the replay buffer are chosen randomly, which breaks the correlation between consecutive experiences
    • The policy constantly changes during training.
    • After multiple episodes, the policy will improve and find the best policy; the policy becomes stable at that point (a sketch of one training update follows this list).
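
A sketch of one such training update under the same assumptions (PyTorch, a concatenated (state, one-hot action) input, and an invented mini-batch): the target r + γ max_a' Q(s', a') is computed with gradients blocked so it acts as a fixed label for this step, and the network is trained on the squared error against it.

    import torch
    import torch.nn as nn

    STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99   # hypothetical sizes and discount

    # Hypothetical Q-network taking a concatenated (state, one-hot action) vector
    q_net = nn.Sequential(nn.Linear(STATE_DIM + N_ACTIONS, 64), nn.ReLU(), nn.Linear(64, 1))
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

    def q_value(states, action_idx):
        one_hot = nn.functional.one_hot(action_idx, N_ACTIONS).float()
        return q_net(torch.cat([states, one_hot], dim=-1)).squeeze(-1)

    # One invented mini-batch of transitions (s, a, r, s')
    batch = 32
    s  = torch.randn(batch, STATE_DIM)
    a  = torch.randint(0, N_ACTIONS, (batch,))
    r  = torch.randn(batch)
    s2 = torch.randn(batch, STATE_DIM)

    # Target: r + gamma * max_a' Q(s', a'); gradients are blocked so the label stays fixed
    with torch.no_grad():
        q_next = torch.stack(
            [q_value(s2, torch.full((batch,), a2, dtype=torch.long)) for a2 in range(N_ACTIONS)],
            dim=1,
        )
        target = r + GAMMA * q_next.max(dim=1).values

    # Loss: squared error between predicted Q(s, a) and the target, then one gradient step
    loss = nn.functional.mse_loss(q_value(s, a), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(loss.item())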




The deep Q-learning pseudocode is as follows:

  • Initialise replay memory D to a capacity of N (if N = 2000, it can store 2000 experiences); a code sketch of this replay memory and of epsilon-greedy selection follows this list
  • Initialise the action-value function Q (i.e. the neural net) with random weights
  • Total number of episodes is M
  • Each episode is of length (time steps) T
  • For 1 to M episodes, do:
    • For 1 to T time steps, do:
      • Generate an experience of the form <s, a, r, s'>
        • With probability epsilon, select a random action a
        • Else select a = argmax_a Q(s, a; θ)
        • Go to the next state s'
        • Set the next state as the current state
        • Store the experience in the replay memory D
      • Train the model on (say) 100 samples (batch size) randomly selected from the memory
        • Randomly sample a batch of transitions (s, a, r, s') from the replay buffer
        • Calculate the target (y): r + γ max_a' Q(s', a')
        • Calculate the Q-value for this state-action pair (s, a) as predicted by the network
        • Train the model to minimise the 'squared error':
        • (Q(s,a) − y)²
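
A minimal sketch of two supporting pieces of this pseudocode, the bounded replay memory D and epsilon-greedy action selection; the capacity, epsilon, action count, and example values are illustrative assumptions.

    import random
    from collections import deque

    import numpy as np

    CAPACITY, EPSILON, N_ACTIONS = 2000, 0.1, 4   # hypothetical settings

    # Replay memory D: a bounded buffer of (s, a, r, s') experiences
    replay_memory = deque(maxlen=CAPACITY)

    def store(experience):
        replay_memory.append(experience)          # oldest experience is dropped once full

    def sample_batch(batch_size=100):
        pool = list(replay_memory)
        return random.sample(pool, min(batch_size, len(pool)))

    def epsilon_greedy(q_values, epsilon=EPSILON):
        """q_values: array of Q(s, a) for every action in the current state."""
        if random.random() < epsilon:
            return random.randrange(N_ACTIONS)    # explore: pick a random action
        return int(np.argmax(q_values))           # exploit: pick the greedy action

    # Tiny usage example with made-up numbers
    store((np.zeros(4), 2, 1.0, np.ones(4)))
    print(epsilon_greedy(np.array([0.1, 0.5, 0.2, 0.0])))
    print(sample_batch(1))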