Reinforcement Learning



For each state-action pair there is an estimated reward.
- Initially these values are all zeros.
- After many random walks, taking actions and collecting the reward associated with each, the agent becomes more likely to pick the more rewarding actions.

This is called Q-learning.
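
As a rough illustration, here is a minimal tabular Q-learning sketch on a toy chain environment. The environment, reward values, and hyperparameters below are assumptions chosen for illustration, not something prescribed by the article.

```python
import random

# Toy chain environment (assumed for illustration): states 0..4, actions 0 = left, 1 = right.
# Reaching state 4 gives reward +1; every other step gives reward 0.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

# The Q-table starts at all zeros, as described above.
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount, exploration rate (assumed values)

for episode in range(500):
    state, done = 0, False
    while not done:
        # Mostly pick the best-known action, occasionally explore a random one.
        if random.random() < epsilon:
            action = random.randrange(N_ACTIONS)
        else:
            action = max(range(N_ACTIONS), key=lambda a: Q[state][a])
        next_state, reward, done = step(state, action)
        # Q-learning update: nudge the estimate toward reward + discounted best future value.
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print(Q)  # after training, "right" should score higher than "left" in every state
```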

Markov Decision Process:

The mathematical framework for defining a solution to a reinforcement learning problem is called a Markov Decision Process (MDP). It is defined by:

  • Set of states, S
  • Set of actions, A
  • Reward function, R
  • Policy, π
  • Value, V

We take actions (A) to transition from our start state to our end state (S), and in return we get a reward (R) for each action we take. An action can lead to a positive or a negative reward.

The set of actions we take defines our policy (π), and the rewards we collect in return define our value (V). Our task is to maximize our rewards by choosing the correct policy, i.e. to maximize

E[r_t | π, s_t]

for all possible states s_t at time t.

Value is the total cumulative reward you collect when you follow a policy.
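
To make the S, A, R, π and V pieces concrete, here is a minimal sketch that defines a tiny MDP and computes the value of a fixed policy as the cumulative discounted reward. The state names, rewards, and discount factor are assumptions made purely for illustration.

```python
# A tiny MDP (assumed for illustration): three states, two actions.
states = ["start", "middle", "end"]
actions = ["go", "stay"]

# Transition and reward structure: (state, action) -> (next_state, reward).
transitions = {
    ("start", "go"): ("middle", 1.0),
    ("start", "stay"): ("start", 0.0),
    ("middle", "go"): ("end", 5.0),
    ("middle", "stay"): ("middle", -1.0),
}

# A fixed policy π: which action to take in each non-terminal state.
policy = {"start": "go", "middle": "go"}

def value(policy, gamma=0.9):
    """Follow the policy from 'start' and return the discounted cumulative reward V."""
    state, total, discount = "start", 0.0, 1.0
    while state != "end":
        next_state, reward = transitions[(state, policy[state])]
        total += discount * reward
        discount *= gamma
        state = next_state
    return total

# V(start) = 1.0 + 0.9 * 5.0 = 5.5 for this policy.
print(value(policy))
```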


One common action-selection strategy is epsilon-greedy, which is mostly a greedy approach to the problem: the agent usually picks the choice that currently looks most rewarding, but with a small probability epsilon it tries a random action instead, so it does not get stuck on the first route it finds. Once the estimates have settled, if you (the salesman) want to go from place A to place F again, you would mostly follow the same best policy, only occasionally trying an alternative.
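
As a rough sketch of this exploration/exploitation trade-off, here is a minimal epsilon-greedy loop over two candidate routes, modelled as a two-armed bandit. The payoff values and the epsilon setting are assumptions for illustration only.

```python
import random

# Two "routes" with unknown average rewards (assumed values): route 1 is actually better.
true_means = [1.0, 2.0]

estimates = [0.0, 0.0]   # running estimate of each route's reward
counts = [0, 0]
epsilon = 0.1            # probability of exploring a random route

for t in range(1000):
    # Exploit the best-looking route most of the time, explore occasionally.
    if random.random() < epsilon:
        choice = random.randrange(len(true_means))
    else:
        choice = max(range(len(estimates)), key=lambda i: estimates[i])
    reward = random.gauss(true_means[choice], 0.5)
    # Update the running average reward for the chosen route.
    counts[choice] += 1
    estimates[choice] += (reward - estimates[choice]) / counts[choice]

print(estimates)  # the estimate for route 1 should settle near 2.0, and it gets picked most often
```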

There are three major categories of reinforcement learning methods:

  • Policy based, where our focus is to find the optimal policy
  • Value based, where our focus is to find the optimal value, i.e. the cumulative reward
  • Action based, where our focus is on what optimal actions to take at each step

I will try to cover reinforcement learning algorithms in more depth in future articles. Until then, you can refer to the paper "Reinforcement Learning: A Survey" by Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore, JAIR, 1996.
