Reinforcement Learning

Syllabus:

Course Pre-requisites: Probability and Linear Algebra (Basics), Programming Knowledge (preferably Python), Data Structures and Algorithms, Artificial Intelligence, Machine Learning and (Deep) Neural Networks.

Books and References:

Richard S. Sutton and Andrew G. Barto; Reinforcement Learning: An Introduction; 2nd Edition, MIT Press, 2018

U Dinesh Kumar, Business Analytics: The Science of Data-Driven Decision Making, Wiley publication, 1st Edition 2017.

Yuxi Li; Deep Reinforcement Learning: An Overview; ArXiv ePrint, 2018.

David Silver, Lecture Resource on Introduction to Deep Reinforcement Learning (DeepMind site)

Markov Decision Processes: Discrete Stochastic Dynamic Programming by Martin Puterman
Stochastic Approximation: A Dynamical Systems Viewpoint by Vivek Borkar
Neuro-Dynamic Programming by Dimitri Bertsekas and John Tsitsiklis
Markov Chains and Mixing Times by David Asher Levin, Elizabeth Wilmer, and Yuval Peres

Theses:

Safe Reinforcement Learning by Philip Thomas
Breaking the Deadly Triad in Reinforcement Learning by Shangtong Zhang
Actor-Critic Algorithms by Vijaymohan Konda

Notes:

Introduction to discrete-time Markov chains I by Karl Sigman
Markov chains II: recurrence and limiting (stationary) distributions by Karl Sigman

Unit-1 Introduction: Course logistics and overview. Origin and history of Reinforcement Learning research. Its connections with other related fields and with different branches of machine learning. Probability Primer: Brush-up of probability concepts: axioms of probability, concepts of random variables, PMFs, PDFs, CDFs, expectation. Concepts of joint and multiple random variables; joint, conditional and marginal distributions. Correlation and independence.

Unit-2 Markov Decision Process: Introduction to RL terminology, Markov property, Markov chains, Markov reward process (MRP). Introduction to and proof of Bellman equations for MRPs, along with proof of existence of a solution to the Bellman equations in MRPs. Introduction to Markov decision process (MDP), state and action value functions, Bellman expectation equations, optimality of value functions and policies, Bellman optimality equations.
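For reference, the Bellman equations named in this unit are commonly written as below. The notation is an assumption borrowed from common textbook usage rather than fixed by this syllabus: R is the expected immediate reward, P the transition probability, \gamma \in [0, 1) the discount factor, and \pi the policy.

```latex
\begin{align}
% Bellman equation for an MRP
v(s) &= R(s) + \gamma \sum_{s'} P(s' \mid s)\, v(s') \\
% Matrix form; the inverse exists for \gamma < 1
v &= R + \gamma P v \;\Longrightarrow\; v = (I - \gamma P)^{-1} R \\
% Bellman expectation equation for an MDP under policy \pi
v_\pi(s) &= \sum_{a} \pi(a \mid s) \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, v_\pi(s') \Big] \\
% Bellman optimality equation
v_*(s) &= \max_{a} \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, v_*(s') \Big]
\end{align}
```

The matrix form shows why a solution exists for an MRP with \gamma < 1: P is a stochastic matrix, so every eigenvalue of \gamma P has modulus below 1 and I - \gamma P is invertible.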

Unit-3 Prediction and Control by Dynamic Programming: Overview of dynamic programming for MDPs, definition and formulation of planning in MDPs, principle of optimality, iterative policy evaluation, policy iteration, value iteration, Banach fixed point theorem, proof of contraction mapping property of Bellman expectation and optimality operators, proof of convergence of policy evaluation and value iteration algorithms, DP extensions. Monte Carlo Methods for Model-Free Prediction and Control: Overview of Monte Carlo methods for model-free RL, first-visit and every-visit Monte Carlo, Monte Carlo control, on-policy and off-policy learning, importance sampling.
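As an illustration of the value iteration topic in this unit, here is a minimal sketch in plain Python. The two-state MDP, the data layout (P[s][a] as a list of (probability, next_state) pairs, R[s][a] as expected reward) and all function names are hypothetical choices made for this example, not part of the course material.

```python
def q_values(P, R, V, s, gamma):
    """One-step lookahead: Q(s, a) for every action a in state s."""
    return [
        R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
        for a in range(len(P[s]))
    ]


def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Repeatedly apply the Bellman optimality operator until the update is small."""
    V = [0.0] * len(P)
    for _ in range(max_iters):
        new_V = [max(q_values(P, R, V, s, gamma)) for s in range(len(P))]
        delta = max(abs(n - o) for n, o in zip(new_V, V))
        V = new_V
        if delta < tol:
            break
    # Greedy policy with respect to the final value estimate.
    policy = []
    for s in range(len(P)):
        q = q_values(P, R, V, s, gamma)
        policy.append(max(range(len(q)), key=lambda a: q[a]))
    return V, policy


# Hypothetical toy MDP: action 0 = stay, action 1 = move to the other state.
P = [
    [[(1.0, 0)], [(1.0, 1)]],  # transitions from state 0
    [[(1.0, 1)], [(1.0, 0)]],  # transitions from state 1
]
R = [
    [0.0, 1.0],  # rewards in state 0: stay -> 0, move -> 1
    [2.0, 0.0],  # rewards in state 1: stay -> 2, move -> 0
]

V, policy = value_iteration(P, R, gamma=0.9)
print(V, policy)
```

For this toy problem the fixed point can be checked by hand: staying in state 1 forever is worth 2 / (1 - 0.9) = 20, and moving there from state 0 is worth 1 + 0.9 * 20 = 19, so the greedy policy is "move" in state 0 and "stay" in state 1. Convergence follows from the contraction property listed in this unit.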

Unit-4 Function Approximation Methods: Function approximation methods, revisiting risk minimization and gradient descent from Machine Learning, Gradient MC and Semi-gradient TD(0) algorithms, eligibility traces for function approximation, afterstates, control with function approximation, least squares, experience replay in deep Q-Networks. Policy Gradients: Getting started with policy gradient methods, log-derivative trick, naive REINFORCE algorithm, bias and variance in Reinforcement Learning, reducing variance in policy gradient estimates, baselines, advantage function, actor-critic methods.
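To make the log-derivative trick and the naive REINFORCE algorithm from this unit concrete, the sketch below runs REINFORCE on a two-armed bandit with a softmax policy. The bandit, its deterministic rewards, the step size and the function names are all assumptions made for illustration; no baseline is used, so this is the high-variance "naive" form.

```python
import math
import random


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(rewards, steps=2000, alpha=0.1, seed=0):
    """Naive REINFORCE on a bandit whose arm k pays the fixed reward rewards[k]."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)  # one preference parameter per arm
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        ret = rewards[action]  # one-step episode, so the return is the reward
        for k in range(len(theta)):
            # Log-derivative trick for softmax:
            # d/d theta_k log pi(action) = 1[k == action] - pi(k)
            grad_log = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += alpha * ret * grad_log
    return softmax(theta)


final_probs = reinforce_bandit([0.0, 1.0])
print(final_probs)
```

Each update is the sampled return times the gradient of the log-probability of the chosen action. Subtracting a baseline from `ret` would leave the expected update unchanged while reducing its variance, which is the step that leads to the advantage function and actor-critic methods listed above.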

Reinforcement Learning Interview Questions

ADAPTED FROM

https://www.mlstack.cafe/interview-questions/reinforcement-learning

Q1: What is Reinforcement Learning? How does it compare with other ML techniques?

Q2: How to define States in Reinforcement Learning?

Q3: Name some approaches or algorithms you know to solve a problem in Reinforcement Learning

Q4: Provide an intuitive explanation of what a Policy is in Reinforcement Learning

Q5: What are the steps involved in a typical Reinforcement Learning algorithm?

Q6: What is a Markov Decision Process?

Q7: What is the difference between Off-Policy and On-Policy Learning?

Q8: What is the difference between a Reward and a Value for a given State?

Q9: What is the role of the Discount Factor in Reinforcement Learning?

Q10: Are there any problems when using the Epsilon-Greedy method to find the Optimal Policy?

Q11: Can the Monte Carlo Method be applicable to all tasks?

Q12: Can you think of an example of an Epsilon-Greedy Policy in real life?

Q13: Compare Reinforcement Learning and Supervised Learning

Q14: How does the Monte Carlo prediction method compute the Value Function?

Q15: How to choose the values of Gamma and Lambda in generalised temporal differencing algorithms?

Q16: Name some advantages of using Temporal difference vs Monte Carlo methods for Reinforcement Learning

Q17: What type of Neural Networks does Deep Reinforcement Learning use?

Q18: What types of Reinforcement Learning Environments do you know?

Q19: What's the difference between Q-Learning and Policy Gradients methods?

Q20: What's the difference between a Deterministic and a Stochastic policy?

Q21: Why would you use a Deep Q-Network?

Q22: Why would you use a Policy-based method instead of a Value-based method?

****************************

Q1: What is Reinforcement Learning? How does it compare with other ML techniques?

Q2: What is a Markov Decision Process?

Q3: Provide an intuitive explanation of what a Policy is in Reinforcement Learning

Q4: What is the role of the Discount Factor in Reinforcement Learning?

Q5: Name some approaches or algorithms you know to solve a problem in Reinforcement Learning

Q6: How to define States in Reinforcement Learning? Related To: Q-Learning

Q7: What is the difference between a Reward and a Value for a given State?

Q8: How do you know when a Q-Learning Algorithm converges? Related To: Q-Learning

Q9: What do Stationary Dynamics and a Stationary Policy mean in the context of Reinforcement Learning?

Q10: What are the steps involved in a typical Reinforcement Learning algorithm?

Q11: What is the difference between Off-Policy and On-Policy Learning?

Q12: What do the Alpha and Gamma parameters represent in Q Learning? Related To: Q-Learning

Q13: What type of Neural Networks does Deep Reinforcement Learning use? Related To: Neural Networks

Q14: Compare Reinforcement Learning and Supervised Learning. Related To: Supervised Learning

Q15: What's the difference between a Deterministic and a Stochastic policy?

Q16: How does the Q function differ from the Value function in Reinforcement Learning?

Q17: What is the difference between Q-Learning and SARSA and when would you use each one? Related To: Q-Learning

Q18: Can you think of an example of an Epsilon-Greedy Policy in real life?

Q19: What types of Reinforcement Learning Environments do you know?

Q20: What's the advantage of using Policy Iteration vs Value iteration?

Q21: Can the Monte Carlo Method be applicable to all tasks? Related To: Monte Carlo Method

Q22: How to distinguish Episodic Tasks vs Continuing Tasks?

Q23: How does the Monte Carlo prediction method compute the Value Function? Related To: Monte Carlo Method

Q24: What types of Monte Carlo Prediction Algorithms do you know? Related To: Monte Carlo Method

Q25: Name some advantages of using Temporal difference vs Monte Carlo methods for Reinforcement Learning Related To: Monte Carlo Method

Q26: Name some advantages of using Monte Carlo vs Dynamic Programming methods in Reinforcement Learning Related To: Monte Carlo Method

Q27: Why would you use a Policy-based method instead of a Value-based method?

Q28: Why would you use a Deep Q-Network?

Q29: What is the difference between episode and epoch in Deep Q-Learning? Related To: Q-Learning

Q30: Are there any problems when using REINFORCE to obtain the optimal policy?

Q31: What's the difference between Learning Rate Decay and Epsilon Decay? What is the context of each one?

Q32: Are there any problems when using the Epsilon-Greedy method to find the Optimal Policy?

Q33: What's the difference between a Deep Q-Network and a categorical Deep Q-Network? Related To: Q-Learning

Q34: How to choose the values of Gamma and Lambda in generalised temporal differencing algorithms?

Q35: Can Q-learning be used for continuous (state or action) spaces? If not, then what would you use? Related To: Q-Learning

Q36: What's the difference between Q-Learning and Policy Gradients methods? Related To: Q-Learning

Q37: Can you apply Value Iteration and Policy Iteration in any environment?

Q38: What's the difference between Deep Q-Learning and Policy Gradient Method? Related To: Q-Learning

Q39: What is Sample Efficiency, and how can Importance Sampling be used to achieve it?

Q40: What is the difference between vanilla policy gradient (VPG) with a baseline as value function and advantage actor-critic (A2C)?

Q41: What are some best practices when trying to design a Reward Function?

Q42: How can policy gradients be applied in the case of multiple continuous actions?

Q43: When would you use a Deep Recurrent Q-Network? Related To: Deep Learning

Q44: Is the optimal policy always Stochastic if the environment is also Stochastic?

Q45: How does a Double Deep Q-Network differ from a Deep Q-Network?

Q46: Are there any problems when using a Softmax Function to select actions in a Deep Q-Network?

Q47: Why do we need the target network in a Deep Q-Network? Related To: Q-Learning

Q48: What are some advantages of Quantile Regression DQN over Categorical DQN? Related To: Q-Learning

Q49: What is the effect of Parallel Environments in Reinforcement Learning?

Q50: How does the Actor-Critic method differ from the Policy Gradient with the Baseline method?

Q51: What is Experience Replay and what are its benefits?

Q52: Why do regular Q-Learning and DQN overestimate the Q values?

Q53: What's the difference between Advantage Actor-Critic (A2C) and Asynchronous Advantage Actor-Critic (A3C)? Related To: Q-Learning

Q54: Can SARSA be used in a Partially Observable Markov Decision Process? If yes (or not), why?

Assignment 1

Assignment 2