Introduction to Bandits: Algorithms and Theory

Jean-Yves Audibert, Rémi Munos


    This tutorial is an introduction to stochastic and adversarial multi-armed bandit algorithms and a survey of some recent advances. In the multi-armed bandit problem, at each stage an agent (or decision maker) chooses one action (or arm) and receives a reward from it. The agent aims at maximizing its cumulative reward. Since it does not know the process generating the rewards, it needs to explore (try) the different actions and yet exploit (concentrate its draws on) the seemingly most rewarding arms.
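The interaction protocol just described can be made concrete with a minimal simulation. The sketch below pairs Bernoulli arms with a simple epsilon-greedy agent; the function name, the arm means, and the epsilon parameter are illustrative choices, not part of the tutorial itself.

```python
import random

def play(means, horizon, epsilon=0.1, seed=0):
    """Simulate the bandit loop: at each stage pick an arm, observe a reward.

    `means` are the (unknown to the agent) Bernoulli success probabilities.
    The agent explores with probability `epsilon`, otherwise exploits the
    arm with the best empirical mean so far.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k          # pulls per arm
    sums = [0.0] * k          # cumulative reward per arm
    total = 0.0
    for _ in range(horizon):
        if rng.random() < epsilon or 0 in counts:
            arm = rng.randrange(k)                 # explore
        else:
            arm = max(range(k), key=lambda i: sums[i] / counts[i])  # exploit
        reward = 1.0 if rng.random() < means[arm] else 0.0  # Bernoulli draw
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total
```

Epsilon-greedy is only a baseline; the policies surveyed in the tutorial (UCB, EXP3, INF) choose the arm in a more principled way, which is what the content outline below covers.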

    The bandit problem has become increasingly popular in the machine learning community. It is the simplest setting in which one encounters the exploration-exploitation dilemma. It has a wide range of applications, including advertisement [1, 6], economics [2, 12], games [7], optimization [10, 5, 9, 3], model selection, and machine learning algorithms themselves [13, 4]. It can be a central building block of larger systems, as in evolutionary programming [8] and reinforcement learning [14], in particular in large-state-space Markovian Decision Problems [11].

Target audience

    This tutorial is for researchers interested in an introduction to bandit problems and an overview of the growing literature around them. No prior knowledge is required from the participants, and we target a rather broad audience.

Content details

    1. Introduction/Motivation

    2. Bandit with small set of actions (slides)

    • Stochastic bandits: the "optimism in the face of uncertainty" principle, Upper Confidence Bound (UCB) policies [15,16,17,18,19,20]
    • Adversarial bandits: the exponential-weight algorithm for exploration and exploitation (EXP3), implicitly normalized forecasters (INF) [21,18]
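To make the "optimism in the face of uncertainty" principle concrete, here is a sketch of UCB1 on Bernoulli arms: each arm is scored by its empirical mean plus a confidence width, and the highest index wins. The function name, arm means, and horizon are illustrative assumptions, not from the tutorial.

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms; return (total reward, pull counts)."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k          # pulls per arm
    sums = [0.0] * k          # cumulative reward per arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1       # pull each arm once to initialize
        else:
            # optimism: empirical mean + confidence width sqrt(2 ln t / n_i)
            arm = max(range(k), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2.0 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total, counts
```

As the confidence width shrinks for frequently pulled arms, suboptimal arms are pulled only logarithmically often, which is the behavior analyzed in [15,16,17].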

    3. Bandit with large set of actions (slides)

    • Unstructured set of actions:
      • Many-armed bandit problem: the UCB Arm-Increasing Rule [23]
    • Structured set of actions:
      • Linear bandits: Confidence-Ball algorithms [24]
      • Lipschitz bandits: Hierarchical Optimistic Optimization and Zooming algorithms [3], [9]
      • Even more structure: Bandits in trees [11], [5], [22]
    • Conclusions / Extensions: contextual-bandits, adversarial bandits
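The adversarial setting mentioned above (EXP3, item 2) can also be sketched briefly: exponential weights combined with a uniform exploration mixture and importance-weighted reward estimates. The reward function, the exploration parameter `gamma`, and all other names are illustrative assumptions, not from the tutorial.

```python
import math
import random

def exp3(reward_fn, k, horizon, gamma=0.1, seed=0):
    """EXP3 with bandit feedback: only the chosen arm's reward is observed.

    `reward_fn(t, arm)` returns a reward in [0, 1] for pulling `arm` at time t.
    """
    rng = random.Random(seed)
    weights = [1.0] * k
    total = 0.0
    for t in range(horizon):
        wsum = sum(weights)
        # mix exponential weights with uniform exploration of rate gamma
        probs = [(1 - gamma) * w / wsum + gamma / k for w in weights]
        arm = rng.choices(range(k), weights=probs)[0]
        reward = reward_fn(t, arm)
        total += reward
        # importance-weighted estimate keeps the reward estimate unbiased
        est = reward / probs[arm]
        weights[arm] *= math.exp(gamma * est / k)
        # rescale occasionally to avoid floating-point overflow
        m = max(weights)
        if m > 1e100:
            weights = [w / m for w in weights]
    return total
```

Because the probabilities are bounded below by gamma/k, the importance-weighted estimates stay bounded, which is what makes the regret analysis of [21] go through even against an adversarial reward sequence.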


    [1] M. Babaioff, Y. Sharma, and A. Slivkins, Characterizing truthful multi-armed bandit mechanisms: extended abstract, Proceedings of the tenth ACM conference on Electronic commerce, 2009, pp. 79-88.
    [2] D. Bergemann and J. Valimaki, Bandit problems, In The New Palgrave Dictionary of Economics, 2nd ed., Macmillan Press, 2008.
    [3] S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvari, Online optimization in X-armed bandits, Advances in Neural Information Processing Systems 21, 2008, pp. 201-208.

    [4] R. Busa-Fekete and B. Kegl, Fast boosting using adversarial bandits, Proceedings of the 27th International Conference on Machine Learning (ICML), 2010, Israel, 2010, pp. 143-150.
    [5] P.A. Coquelin and R. Munos, Bandit algorithms for tree search, Uncertainty in Artificial Intelligence, 2007.
    [6] N.R. Devanur and S.M. Kakade, The price of truthfulness for pay-per-click auctions, Proceedings of the tenth ACM conference on Electronic commerce, ACM, 2009, pp. 99-106.
    [7] S. Gelly and Y. Wang, Exploration exploitation in Go: UCT for Monte-Carlo Go, Online trading between exploration and exploitation Workshop, NIPS, 2006.
    [8] J.H. Holland, Adaptation in natural and artificial systems, MIT press Cambridge, MA, 1992.
    [9] R. Kleinberg, A. Slivkins, and E. Upfal, Multi-armed bandits in metric spaces, Proceedings of the 40th annual ACM symposium on Theory of computing, 2008, pp. 681-690.
    [10] R. D. Kleinberg, Nearly tight bounds for the continuum-armed bandit problem, Advances in Neural Information Processing Systems 17, 2005, pp. 697-704.
    [11] L. Kocsis and Cs. Szepesvari, Bandit based Monte-Carlo planning, Proceedings of the 17th European Conference on Machine Learning (ECML-2006), 2006, pp. 282-293.
    [12] D. Lamberton, G. Pages, and P. Tarres, When can the two-armed bandit algorithm be trusted?, Annals of Applied Probability 14 (2004), no. 3, 1424-1454.
    [13] V. Mnih, Cs. Szepesvari, and J.-Y. Audibert, Empirical Bernstein stopping, Proceedings of the 25th International Conference on Machine Learning (ICML), vol. 307, 2008, pp. 672-679.
    [14] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction, MIT Press, 1998.
    [15] R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Adv. in Appl. Probab., 27(4):1054–1078, 1995.
    [16] P. Auer, N. Cesa-Bianchi, Y. Freund, and R.E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77, 2002.
    [17] J.-Y. Audibert, R. Munos, and C. Szepesvari. Exploration-exploitation trade-off using variance estimates in multi-armed bandits. Theoretical Computer Science, 410:1876–1902, 2009.
    [18] J.-Y. Audibert and S. Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11:2635–2686, 2010.
    [19] O.-A. Maillard, R. Munos, and G. Stoltz. A finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences. In COLT 2011.
    [20] A. Garivier and O. Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In COLT 2011.
    [21] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. The non-stochastic multi-armed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002.
    [22] S. Bubeck and R. Munos. Open Loop Optimistic Planning, In COLT 2010.
    [23] Y. Wang, J.-Y. Audibert, and R. Munos. Algorithms for infinitely many-armed bandits. In NIPS 2008.
    [24] V. Dani, T. Hayes, and S. Kakade. Stochastic linear optimization under bandit feedback. In COLT 2008.