In this course, we will study data-driven decision problems: optimisation problems for which the objective function (i.e., the relation between decision and outcome) is unknown upfront and has to be learned from accumulating data. These problems exhibit an intrinsic tension between statistical and optimisation goals: learning how the system behaves (the statistical goal) is accelerated by experimenting with different actions, whereas making good decisions (the optimisation goal) calls for limiting experimentation and acting on the current estimate of the optimal decision. We will study this exploration-exploitation trade-off in the context of so-called multi-armed bandit problems, the paradigmatic framework for dynamic optimisation under incomplete information. We will discuss standard building blocks of the state-of-the-art theory, as well as applications such as dynamic pricing and assortment optimisation.
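To make the exploration-exploitation trade-off concrete, the following is a minimal sketch (not part of the course materials) of the UCB1 index policy, one of the standard building blocks alluded to above, run on a Bernoulli bandit. The arm means, horizon, and seed are illustrative assumptions chosen only for this example.

```python
import math
import random


def ucb1(reward_means, horizon, seed=0):
    """Run UCB1 on a Bernoulli bandit with the given (unknown-to-the-learner) arm means.

    Returns the total reward collected over `horizon` rounds.
    """
    rng = random.Random(seed)
    k = len(reward_means)
    counts = [0] * k      # number of times each arm has been pulled
    sums = [0.0] * k      # cumulative reward per arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # pull each arm once to initialise the estimates
        else:
            # Index = empirical mean (exploitation) + confidence bonus (exploration).
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < reward_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total


if __name__ == "__main__":
    means = [0.3, 0.5, 0.7]   # hypothetical arm means, unknown to the learner
    horizon = 10_000
    total = ucb1(means, horizon)
    # Regret: shortfall relative to always playing the best arm in hindsight.
    print(f"total reward: {total:.0f}, regret: {max(means) * horizon - total:.0f}")
```

The confidence bonus shrinks as an arm is pulled more often, so the policy automatically shifts from experimentation towards the estimated optimal decision; the printed regret measures the cumulative cost of the exploration that remains.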
Bandit Algorithms by Tor Lattimore and Csaba Szepesvári
Solutions to Selected Exercises in Bandit Algorithms