Instructor: Shubhada Agrawal
Time: Mondays and Wednesdays, 8:15 - 9:45 am
Room: EC 1.07
Teams group: If you have an IISc email ID, you can join the Teams group using the code: 3lwoz2a 
Course description: This course introduces the theory and algorithms underlying the multi-armed bandit (MAB) problem in various settings, along with the fundamental limits of the framework (lower bounds). The first part focuses on regret minimization in stochastic MABs, covering algorithms such as UCB, KL-UCB, IMED, and Thompson Sampling, and their regret analysis, modern design principles like Information-Directed Sampling and the Decision Estimation Coefficient, and lower bound techniques using both classical change-of-measure arguments and modern information-theoretic tools.
The second part of the course focuses on sequential active hypothesis testing, which includes the best-arm identification (BAI) setting of MAB. Topics include Wald’s power-one sequential testing framework, along with modern lower bounds and sample complexity analysis for optimal algorithms in point-vs-point, composite-vs-point, and composite-vs-composite settings. We will then cover the classical BAI problem under fixed-confidence and fixed-budget settings, including state-of-the-art lower bounds, impossibility results, and algorithms such as LUCB, Track-and-Stop, and Bayesian approaches like top-two sampling and its connection to Thompson Sampling. We will cover connections between BAI and Wald’s testing framework, the gamification approach to BAI, and BAI with multiple correct answers, and epsilon-BAI.
In the final part of the course, we will explore linear bandits and optimal experimental design principles, and some generalizations such as dueling bandits (e.g., Relative UCB), stochastic partial monitoring (an application of the Information-Directed Sampling principle), and restless bandits (a generalization of MABs with time-evolving arms).
Pre-requisite: A course on either probability theory, random processes, or measure theory.
Evaluation (tentative): 50% assignments, 30% final exam, 20% student presentation/in-class quizzes.
References:
Introduction to Multi-Armed Bandits, Aleksandrs Slivkins (link)
Bandit Algorithms, Tor Lattimore and Csaba Szepesvári (link)
A Tutorial on Thompson Sampling: (link)
Learning to Optimize via Information-Directed Sampling, Russo, Daniel, and Benjamin Van Roy, Operations Research 66.1 (2018): 230-252
Topics on decision estimation coefficient from a course at MIT: (link)
Multi‐Armed Bandit Allocation Indices, John C. Gittins, Kevin D. Glazebrook, and Richard Weber