Week 2: Self-play and Curriculum
Papers for Lecture 1:
AlphaGo: Silver, Huang, Maddison, Guez, ..., Hassabis (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529, 484–489.
AlphaGo Zero: Silver, Schrittwieser, Simonyan, ..., Hassabis (2017). Mastering the Game of Go without Human Knowledge. Nature, 550, 354–359.
AlphaZero: Silver, Hubert, Schrittwieser, ..., Hassabis (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362, 1140–1144. See also the preprint, with methods and tables.
Papers for Student Presenter 1:
Bansal, Pachocki, Sidor, Sutskever, and Mordatch (2018). Emergent Complexity via Multi-Agent Competition. ICLR-18.
Baker, Kanitscheider, Markov, Wu, Powell, McGrew, and Mordatch (2020). Emergent tool use from multi-agent autocurricula. ICLR-20. (See also the OpenAI Research Blog article, with lots of videos.)
Papers for Student Presenter 2:
Sukhbaatar, Lin, Kostrikov, Synnaeve, Szlam, and Fergus (2018). Intrinsic Motivation and Automatic Curricula via Asymmetric Self-Play. ICLR-18.
Dennis, Jaques, Vinitsky, Bayen, Russell, Critch, and Levine (2020). Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design. NeurIPS-20.
Papers for Further Reading:
Arthur Samuel (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3, 210–229.
Arthur Samuel (1967). Some studies in machine learning using the game of checkers II—Recent progress. IBM Journal of Research and Development, 11, 601–617.
Gerald Tesauro (1992). Practical Issues in Temporal Difference Learning. Machine Learning, 8, 257–277.
Gerald Tesauro and Gregory Galperin (1996). On-line Policy Improvement using Monte-Carlo Search. NeurIPS-96.
Bruce Abramson and Richard Korf (1987). A Model of Two-Player Evaluation Functions. AAAI-87.
Bruce Abramson (1990). Expected-outcome: A general model of static evaluation. IEEE PAMI, 12, 182–193.
Bernd Brügmann (1993). Monte Carlo Go. AAAI Fall Symposium on Games: Playing, Planning, and Learning.
Bruno Bouzy and Bernard Helmstetter (2004). Monte Carlo Go developments. In H. J. van den Herik et al. (eds.), Advances in Computer Games.
Levente Kocsis and Csaba Szepesvári (2006). Bandit-based Monte-Carlo planning. ECML-06.
Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58, 527–535.
Thomas Ferguson (undated). Optimal stopping and applications. (See esp. Ch. 7 on Bandit problems.)
Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6, 4–22.
Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25, 285–294.
D. Russo, B. Van Roy, A. Kazerouni, I. Osband, and Z. Wen (2018). A Tutorial on Thompson Sampling. Foundations and Trends in Machine Learning, 11, 1–96.
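For intuition, here is a minimal Thompson sampling sketch for Bernoulli bandits, in the spirit of the tutorial above; the arm means and horizon are illustrative assumptions, not taken from any of these papers:

    import random

    def thompson_sampling(true_means, horizon):
        # Thompson sampling for Bernoulli arms with Beta(1, 1) priors.
        # Each round: sample a plausible mean from each arm's posterior,
        # pull the arm with the largest sample, update its posterior.
        n_arms = len(true_means)
        alpha = [1] * n_arms  # posterior successes + 1
        beta = [1] * n_arms   # posterior failures + 1
        total = 0
        for _ in range(horizon):
            samples = [random.betavariate(alpha[i], beta[i])
                       for i in range(n_arms)]
            arm = max(range(n_arms), key=lambda i: samples[i])
            reward = 1 if random.random() < true_means[arm] else 0
            alpha[arm] += reward
            beta[arm] += 1 - reward
            total += reward
        return total

    # Illustrative run: pulls should concentrate on the 0.7 arm over time.
    print(thompson_sampling([0.3, 0.5, 0.7], horizon=10000))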
Bradt, R. N., Johnson, S. M., and Karlin, S. (1956). On sequential designs for maximizing the sum of n observations. Annals of Mathematical Statistics, 27, 1060–1074. Early results on bandit problems. Includes the comment that "In Section 3 it is shown that several properties which one intuitively expects the optimum strategy to possess are not, in general, characteristics of the optimum strategy."
Gittins, J. C. and Jones, D. M. (1974). A dynamic allocation index for the sequential design of experiments. In Gani, J. (Ed.), Progress in Statistics. North-Holland. (The original Gittins index paper; seems not to be easily available online.)
John C. Gittins (1979). Bandit Processes and Dynamic Allocation Indices. Journal of the Royal Statistical Society, Series B, 41, 148–177.
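For reference, the Gittins index of a reward process in state s (the standard definition, not quoted from the paper) is the best achievable ratio of expected discounted reward to expected discounted time, over stopping times tau:

    \nu(s) \;=\; \sup_{\tau > 0}
      \frac{\mathbb{E}\left[\sum_{t=0}^{\tau-1} \gamma^t R_t \;\middle|\; S_0 = s\right]}
           {\mathbb{E}\left[\sum_{t=0}^{\tau-1} \gamma^t \;\middle|\; S_0 = s\right]}

Gittins's index theorem states that always playing the arm with the largest current index is optimal for the discounted multi-armed bandit.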
Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47, 235–256. (Introduces the UCB heuristic for regret minimization in bandits.)
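To make the UCB heuristic concrete, here is a minimal UCB1 sketch; the arm means and horizon are illustrative assumptions:

    import math
    import random

    def ucb1(true_means, horizon):
        # UCB1 (Auer et al., 2002): pull the arm maximizing
        # empirical mean + sqrt(2 ln t / n_i); the bonus shrinks
        # as arm i accumulates pulls, trading off exploration
        # against exploitation.
        n_arms = len(true_means)
        counts = [0] * n_arms   # pulls of each arm
        sums = [0.0] * n_arms   # total reward from each arm
        for t in range(1, horizon + 1):
            if t <= n_arms:
                arm = t - 1  # initialize: pull each arm once
            else:
                arm = max(range(n_arms),
                          key=lambda i: sums[i] / counts[i]
                          + math.sqrt(2 * math.log(t) / counts[i]))
            reward = 1 if random.random() < true_means[arm] else 0
            counts[arm] += 1
            sums[arm] += reward
        return counts

    # Illustrative run: most pulls should go to the 0.7 arm.
    print(ucb1([0.3, 0.5, 0.7], horizon=10000))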
Sébastien Bubeck, Rémi Munos, and Gilles Stoltz (2009). Pure Exploration in Multi-armed Bandits Problems. ALT-09. (A version of bandit theory more appropriate for computational branch selection, with an exploration cost but no exploitation cost.)
Bechhofer, R. (1954). A single-sample multiple decision procedure for ranking means of normal populations with known variances. Annals of Mathematical Statistics, 25, 16–39. (Introduces selection problems, which are a better match than bandit problems for computational branch selection.)
Elwyn Berlekamp, John Conway, and Richard Guy (1982). Winning Ways for Your Mathematical Plays, Vol. 1. (Introduces combinatorial game theory, which involves multiple game instances coupled by the constraint that each player can play in only one game on their turn. Their investigation led to the discovery of a new class of numbers, the surreal numbers, which extend those on the real line.)
Dylan Hadfield-Menell and Stuart Russell (2015). Multitasking: Efficient Optimal Planning for Bandit Superprocesses. UAI-15. (Includes references to the key papers on bandit superprocesses, or BSPs, in which each arm is an MDP rather than a Markov reward process. An important property of BSPs is that the optimal solution is NOT composed of optimal solutions to the individual MDPs; in other words, within an MDP, the optimal behavior is affected by what other problems you need to work on in addition to this one. So, unless you have only one problem in the world, standard MDP theory won't necessarily help you.)