ELE524: Foundations of Reinforcement Learning
Spring 2020, Tue./Thu. 9:30-10:50 am, Friend Center 006.
Instructor: Chi Jin. Office hours: Tue. 4:00-5:00 pm, C-332 Equad.
TA: Zhiyuan Li. Office hours: Wed. 3:30-4:30 pm, 315 CS Building.
Contents: Mathematical foundations of reinforcement learning (RL), with an emphasis on theorems and proofs.
Grades: five problem sets (60%), one scribe note (10%), and one final exam (30%). For students who do not complete a scribe note, the final counts for 40%.
No late homework will be accepted. If two students scribe the same lecture, they should submit a merged version within one week of the lecture.
Scribe note [sign up sheet] and [template].
Lecture Notes
2/6. Concentration inequalities [draft][note]. (see also Chapter 2 of [Ver 2020])
2/13. MDP planning [draft][note]. (see also Chapter 1 of [AJK 2019])
2/18. Generative model, value iteration (coarse analysis) [draft][note]. (see also Chapter 2 of [AJK 2019])
2/20. Value iteration (refined analysis), Q-learning [draft][note].
2/25. Generative model summary, multi-armed bandits [draft][note]. (see also Part II of [LS 2018])
3/5. Q-learning with UCB, MDP summary. [draft][note]. (see also [JABJ 2018])
3/31. PPO, TRPO, and natural policy gradient algorithms. [video][draft][note].
4/7. Function approximation overview, linear MDP. [video][draft][note]
4/9. Least-Squares Value Iteration (LSVI). [video][draft][note]
4/14. LSVI with UCB, Fitted Q-Iteration (FQI). [video][draft][note]
4/16. Analysis for FQI and Bellman rank intro. [video][draft][note]
4/21. (Guest lecture by Akshay Krishnamurthy) Bellman rank and OLIVE algorithm. [video]
4/30. Partially Observable MDP, Predictive State Representation (PSR). [video1][video2][draft][note]
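The 2/6 lecture covers concentration inequalities. As a quick numerical illustration (not course material), a minimal sanity check of Hoeffding's inequality on Bernoulli samples; the sample size, deviation, and mean below are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hoeffding: for n i.i.d. samples in [0, 1],
#   P(|sample mean - E[X]| >= eps) <= 2 exp(-2 n eps^2).
n, eps, trials = 200, 0.1, 10_000
p = 0.5  # Bernoulli(1/2) arms; an illustrative choice, not from the notes

samples = rng.random((trials, n)) < p            # trials x n Bernoulli draws
deviations = np.abs(samples.mean(axis=1) - p)    # |sample mean - true mean|
empirical = np.mean(deviations >= eps)           # empirical failure rate
bound = 2 * np.exp(-2 * n * eps**2)              # Hoeffding upper bound
```

The empirical failure rate should fall (typically well) below the Hoeffding bound, since the bound is not tight for Bernoulli tails.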
Schedule (weekly)
Basics (tabular MDP):
Intro, MAB and MDP basics, concentration inequalities.
MDP Planning.
Generative models, TD algorithms.
Exploration in MAB: epsilon-greedy and UCB. [Homework 1 due]
Exploration in RL.
Minimax lower bound. [Homework 2 due]
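The "Exploration in MAB" topic above centers on the UCB index. A toy sketch of the UCB1 rule on a Bernoulli bandit follows; this is an illustrative implementation, not course code, and the arm means and horizon are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Bernoulli bandit instance (means made up for illustration).
means = np.array([0.3, 0.5, 0.7])
T = 5000

n_arms = len(means)
counts = np.zeros(n_arms)   # number of pulls per arm
sums = np.zeros(n_arms)     # cumulative reward per arm

for t in range(1, T + 1):
    if t <= n_arms:
        arm = t - 1         # pull each arm once to initialize
    else:
        # UCB1 index: empirical mean + bonus sqrt(2 log t / n_i)
        ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
        arm = int(np.argmax(ucb))
    reward = rng.random() < means[arm]   # Bernoulli reward
    counts[arm] += 1
    sums[arm] += reward

# Regret relative to always pulling the best arm.
regret = T * means.max() - sums.sum()
```

The confidence bonus shrinks as an arm is pulled more, so exploration concentrates on the empirically best arm while still revisiting undersampled ones.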
Advanced Topics:
Policy optimization.
Linear quadratic regulator. [Homework 3 due]
Linear function approximation.
General function approximation. [Homework 4 due]
Off-policy evaluation / optimization.
Markov games, partially observable MDPs. [Homework 5 due]
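Several schedule topics (MDP planning, TD algorithms) build on tabular value iteration. A minimal sketch on a made-up two-state, two-action discounted MDP; this is an illustrative toy, not course-provided code:

```python
import numpy as np

# Hypothetical MDP (transition kernels and rewards invented for the example).
# P[a][s, s'] = transition probability under action a; R[s, a] = expected reward.
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),   # action 0
     np.array([[0.5, 0.5], [0.7, 0.3]])]   # action 1
R = np.array([[1.0, 0.0], [0.0, 2.0]])
gamma = 0.9

def value_iteration(P, R, gamma, tol=1e-8):
    """Apply the Bellman optimality operator until the value function converges."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_{s'} P[a][s, s'] V[s']
        Q = np.stack([R[:, a] + gamma * P[a] @ V for a in range(n_actions)], axis=1)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # optimal values, greedy policy
        V = V_new

V_star, pi_star = value_iteration(P, R, gamma)
```

Since the Bellman operator is a gamma-contraction in the sup norm, the loop converges geometrically, which is the "coarse analysis" viewpoint from the 2/18 lecture.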
Reference Readings
Reinforcement Learning: Theory and Algorithms (draft), by Alekh Agarwal, Nan Jiang, Sham M. Kakade
Reinforcement learning: an introduction, by Richard S. Sutton, Andrew G. Barto
Algorithms for Reinforcement Learning, by Csaba Szepesvári
Bandit Algorithms, by Tor Lattimore, Csaba Szepesvári
Mathematical Tools
High-Dimensional Probability: An Introduction with Applications in Data Science, by Roman Vershynin
Concentration inequalities and martingale inequalities: a survey, by Fan Chung, Linyuan Lu
Related Courses
Alekh Agarwal and Sham Kakade, Reinforcement Learning and Bandits
Nan Jiang, Statistical Reinforcement Learning
Alekh Agarwal and Alex Slivkins, Bandits and Reinforcement Learning
More practical/empirical courses (not covered in this course):
Sergey Levine, Deep Reinforcement Learning
Shipra Agrawal, Reinforcement Learning