DA 671: Introduction to Reinforcement Learning (Jan 2024)
The course is open to Ph.D. and M.Tech. students.
Lectures:
Mon: 5:00 PM - 5:55 PM, Tue: 4:00 PM - 4:55 PM, Wed: 3:00 PM - 3:55 PM
Venue: 5002 (Core 5)
Instructor:
Dr. Arghyadip Roy, Assistant Professor, MFSDS&AI, IIT Guwahati
Course Objective:
The course covers the fundamentals of Reinforcement Learning. It starts with an introduction to Markov chains and Markov decision processes as a prelude to Reinforcement Learning. The course is designed to provide theoretical as well as application-oriented perspectives. Case studies and examples will include many real-life applications from various domains. It is expected that students will be able to apply Reinforcement Learning, the main takeaway of this course, in their pursuits in machine learning and diverse application areas.
Course Content/ Syllabus:
Introduction to Reinforcement Learning (RL); Markov Chains: Discrete-time Markov Chains, Stationary Distribution, Continuous-time Markov Chains; Markov Decision Process (MDP): Terminologies & Fundamentals; Finite Horizon MDP: Dynamic Programming (DP), Bellman Equation; Infinite Horizon Discounted Cost Problems: DP Equation, Fixed Point & Contraction Mapping, Value Iteration (VI), Policy Iteration (PI), Linear Programming Formulation; RL Techniques: Multi-armed Bandits: Exploration & Exploitation, Optimism in the Face of Uncertainty, Upper Confidence Bound, Thompson Sampling; Monte-Carlo Methods: First-visit and Every-visit, Monte-Carlo Control, Off-policy Monte-Carlo Control; Temporal Difference Learning: TD Prediction, Optimality of TD(0), SARSA, Q-learning, n-step TD Prediction, TD(lambda) Algorithm, Linear Function Approximation, Linear TD(lambda); Policy Search Methods: Policy Gradient (PG) Theorem, REINFORCE, Variance Reduction in PG Methods, Actor-critic Methods, Batch RL.
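To give a flavour of the DP material above, here is a minimal sketch of value iteration: repeatedly applying the Bellman optimality operator for an infinite-horizon discounted-cost/reward MDP until it reaches (approximately) its fixed point, which the contraction-mapping argument guarantees exists. The 2-state, 2-action MDP below (matrices P, rewards R, discount gamma) is entirely illustrative and not taken from the course.

```python
import numpy as np

# Toy 2-state, 2-action MDP (all numbers are illustrative assumptions).
# P[a][s, s'] = transition probability under action a; R[a][s] = expected reward.
P = {0: np.array([[0.9, 0.1], [0.4, 0.6]]),
     1: np.array([[0.2, 0.8], [0.7, 0.3]])}
R = {0: np.array([1.0, 0.0]),
     1: np.array([0.0, 2.0])}
gamma = 0.9  # discount factor

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate the Bellman optimality operator until the sup-norm change < tol."""
    n = len(next(iter(R.values())))
    V = np.zeros(n)
    while True:
        # Q[a] = R[a] + gamma * P[a] @ V  (one-step lookahead for each action)
        Q = np.stack([R[a] + gamma * P[a] @ V for a in sorted(P)])
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            # Return the (near-)fixed point and the greedy policy w.r.t. it
            return V_new, Q.argmax(axis=0)
        V = V_new

V_star, policy = value_iteration(P, R, gamma)
print("V* =", V_star, " greedy policy =", policy)
```

Because the Bellman operator is a gamma-contraction in the sup norm, the loop converges geometrically regardless of the starting V; policy iteration, also covered in the course, reaches the same fixed point in fewer (but more expensive) iterations.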
Texts:
M.L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st Edition, Wiley, 2005.
R.S. Sutton and A.G. Barto, Reinforcement Learning: An Introduction, 2nd Edition, MIT Press, 2018.
References:
D.P. Bertsekas, Dynamic Programming and Optimal Control, Vol. I, 3rd Edition, Athena Scientific, 2005.
D.P. Bertsekas, Dynamic Programming and Optimal Control, Vol. II: Approximate Dynamic Programming, 4th Edition, Athena Scientific, 2012.
T. Lattimore and C. Szepesvári, Bandit Algorithms, 1st Edition, Cambridge University Press, 2020.
M. Harchol-Balter, Performance Modeling and Design of Computer Systems: Queueing Theory in Action, 1st Edition, Cambridge University Press, 2013.
Some more references: Lecture Notes by Lodewijk Kallenberg (Markov Decision Process), Shivaram Kalyanakrishnan (Foundations of Intelligent and Learning Agents), R. Srikant (Control of Stochastic Systems), Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen (A Tutorial on Thompson Sampling), and Richard Weber (Optimization & Control).
Grading Scheme:
2 Quizzes: 20%
Midsem: 20%
Endsem: 40%
Course project (Modeling + Programming): 20%
Students auditing the course must score above 50% in the course to be awarded an audit grade.
Lecture Notes:
Lecture 1-2: Introduction to Reinforcement Learning
Lecture 3: Probability Review & Discrete Time Markov Chain
Lecture 4: Stationary Distribution
Lecture 5: Ergodicity Theory
Lecture 6: Poisson Process & Exponential Distribution
Lecture 7: Continuous Time Markov Chain
Lecture 8: Markov Decision Process (MDP) Basics
Lecture 9: Finite Horizon Dynamic Programming
Lecture 10: Finite Horizon Stochastic Control
Lecture 11: Examples of Stochastic Dynamic Programming
Lecture 12: Infinite Horizon Discounted Cost MDP basics
Lecture 13: Existence of Optimal Policy
Lecture 14: Contraction mapping
Lecture 15: Value Iteration Algorithm
Lecture 16: Policy Iteration Algorithm
Lecture 17: Threshold Policy
Lecture 18: Multi-armed Bandit Basics
Lecture 19: Bandit Algorithms
Lecture 20: Upper Confidence Bound (UCB) Basics
Lecture 21: UCB Regret Analysis and KL-UCB Basics
Lecture 22: Thompson Sampling
Lecture 23: Monte Carlo Methods
Lecture 24: Off-policy Monte Carlo Methods
Lecture 25: TD Learning
Lecture 26: SARSA & Q Learning
Lecture 27: n-step TD Learning
Lecture 28: Function Approximation
Lecture 29: Policy Gradient Theorem
Lecture 30: REINFORCE
Lecture 31: Actor-critic & Batch RL
Assignment:
Quiz:
Midsem:
Endsem: