
DA 671: Introduction to Reinforcement Learning (Jan 2024)

The course is open to Ph.D. and M.Tech. students.

Lectures:

Mon: 5:00 PM - 5:55 PM, Tue: 4:00 PM - 4:55 PM, Wed: 3:00 PM - 3:55 PM

Venue: 5002 (Core 5)

Instructor:

Dr. Arghyadip Roy, Assistant Professor, MFSDS&AI, IIT Guwahati



Course Objective:

The course covers the fundamentals of Reinforcement Learning. It begins with an introduction to Markov chains and Markov decision processes as a prelude to Reinforcement Learning, and is designed to provide both theoretical and application-oriented perspectives. Case studies and examples will draw on many real-life applications from various domains. It is expected that students will be able to apply Reinforcement Learning, the main takeaway of this course, in their pursuits in machine learning and diverse application areas.


Course Content/ Syllabus:

Introduction to Reinforcement Learning (RL); Markov Chains: Discrete-time Markov Chains, Stationary Distribution, Continuous-time Markov Chains; Markov Decision Process (MDP): Terminologies & Fundamentals; Finite Horizon MDP: Dynamic Programming (DP), Bellman Equation; Infinite Horizon Discounted Cost Problems: DP Equation, Fixed Point & Contraction Mapping, Value Iteration (VI), Policy Iteration (PI), Linear Programming Formulation; RL Techniques: Multi-armed Bandits, Exploration & Exploitation, Optimism in the Face of Uncertainty, Upper Confidence Bound, Thompson Sampling; Monte-Carlo Methods: First-visit and Every-visit, Monte-Carlo Control, Off-policy Monte-Carlo Control; Temporal Difference Learning: TD Prediction, Optimality of TD(0), SARSA, Q-learning, n-step TD Prediction, TD(lambda) Algorithm, Linear Function Approximation, Linear TD(lambda); Policy Search Methods: Policy Gradient (PG) Theorem, REINFORCE, Variance Reduction in PG Methods, Actor-critic Methods, Batch RL.
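
To give a flavour of how the Dynamic Programming topics above translate into code, here is a minimal Python sketch of Value Iteration on a toy two-state, two-action discounted-cost MDP. The transition probabilities, costs, and discount factor are illustrative assumptions, not course data.

    import numpy as np

    # Toy discounted-cost MDP (hypothetical numbers, for illustration only).
    # P[a, s, s'] = probability of moving from state s to s' under action a.
    P = np.array([
        [[0.9, 0.1], [0.2, 0.8]],   # action 0
        [[0.5, 0.5], [0.7, 0.3]],   # action 1
    ])
    # C[a, s] = immediate cost of taking action a in state s.
    C = np.array([
        [1.0, 2.0],
        [1.5, 0.5],
    ])
    gamma = 0.9          # discount factor (contraction modulus)
    V = np.zeros(2)      # initial value-function estimate

    # Repeatedly apply the Bellman optimality operator; since it is a
    # contraction for gamma < 1, the iterates converge to the optimal V.
    for _ in range(1000):
        Q = C + gamma * P @ V        # Q[a, s] = C(s, a) + gamma * E[V(s')]
        V_new = Q.min(axis=0)        # minimize expected cost over actions
        if np.max(np.abs(V_new - V)) < 1e-8:
            break
        V = V_new

    policy = (C + gamma * P @ V).argmin(axis=0)   # greedy policy w.r.t. V
    print("Optimal values:", V)
    print("Optimal policy:", policy)

The same backup with min replaced by max gives the reward-maximization form used in Sutton & Barto.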


Texts:

M.L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st Edition, Wiley, 2005.
R.S. Sutton and A.G. Barto, Reinforcement Learning: An Introduction, 2nd Edition, MIT Press, 2018.

References:

D.P. Bertsekas, Dynamic Programming and Optimal Control, Vol. I, 3rd Edition, Athena Scientific, 2005.
D.P. Bertsekas, Dynamic Programming and Optimal Control, Vol. II: Approximate Dynamic Programming, 4th Edition, Athena Scientific, 2012.
T. Lattimore and C. Szepesvári, Bandit Algorithms, 1st Edition, Cambridge University Press, 2020.
M. Harchol-Balter, Performance Modeling and Design of Computer Systems: Queueing Theory in Action, 1st Edition, Cambridge University Press, 2013.
Some more references: Lecture Notes by Lodewijk Kallenberg (Markov Decision Processes), Shivaram Kalyanakrishnan (Foundations of Intelligent and Learning Agents), R. Srikant (Control of Stochastic Systems), Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen (A Tutorial on Thompson Sampling), and Richard Weber (Optimization & Control).

Grading Scheme:

2 Quizzes: 20%

Midsem: 20%

Endsem: 40%

Course project (Modeling + Programming): 20%

Students auditing the course must score above 50% to be awarded an audit grade.


Lecture Notes:

Lecture 1-2: Introduction to Reinforcement Learning

Lecture 3: Probability Review & Discrete Time Markov Chain

Lecture 4: Stationary Distribution

Lecture 5: Ergodicity Theory 

Lecture 6: Poisson Process & Exponential Distribution

Lecture 7: Continuous Time Markov Chain

Lecture 8: Markov Decision Process (MDP) Basics

Lecture 9: Finite Horizon Dynamic Programming 

Lecture 10: Finite Horizon Stochastic Control 

Lecture 11: Examples of Stochastic Dynamic Programming

Lecture 12: Infinite Horizon Discounted Cost MDP basics

Lecture 13: Existence of Optimal Policy 

Lecture 14: Contraction mapping

Lecture 15: Value Iteration Algorithm

Lecture 16: Policy Iteration Algorithm

Lecture 17: Threshold Policy

Lecture 18: Multi-armed Bandit Basics

Lecture 19: Bandit Algorithms

Lecture 20: Upper Confidence Bound (UCB) Basics (see the code sketch after this list)

Lecture 21: UCB Regret Analysis and KL-UCB Basics

Lecture 22: Thompson Sampling

Lecture 23: Monte Carlo Methods

Lecture 24: Off-policy Monte Carlo Methods

Lecture 25: TD Learning

Lecture 26: SARSA & Q-learning

Lecture 27: n-step TD Learning

Lecture 28: Function Approximation

Lecture 29: Policy Gradient Theorem

Lecture 30: REINFORCE

Lecture 31: Actor-critic & Batch RL
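
To complement the bandit lectures above (Lectures 18-22), here is a minimal Python sketch of the UCB1 index policy on a Bernoulli multi-armed bandit. The arm means and horizon are hypothetical values chosen only for illustration.

    import math
    import random

    true_means = [0.3, 0.5, 0.7]   # hypothetical Bernoulli arm means
    K = len(true_means)
    counts = [0] * K               # number of pulls of each arm
    sums = [0.0] * K               # cumulative reward of each arm
    T = 10_000                     # horizon

    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1            # pull each arm once to initialize
        else:
            # UCB1 index: empirical mean plus exploration bonus
            # sqrt(2 ln t / n_i), i.e., optimism in the face of uncertainty.
            arm = max(range(K), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if random.random() < true_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward

    print("Pull counts per arm:", counts)

Running this, the pull counts concentrate on the arm with the highest mean while suboptimal arms are sampled only logarithmically often, which is the behaviour quantified by the regret analysis in Lecture 21.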


Assignment:

Quiz:

Midsem:

Endsem: