References
(more references will be added soon)
Courses & Textbooks
- Reinforcement Learning: An Introduction (Sutton & Barto): http://incompleteideas.net/book/bookdraft2018jan1.pdf
- Algorithms for Reinforcement Learning (Szepesvári): https://sites.ualberta.ca/~szepesva/RLBook.html
- http://rll.berkeley.edu/deeprlcourse/#lectures
- http://www.cs.cornell.edu/courses/cs6783/2018sp/
- http://cs332.stanford.edu/#!index.md
- http://web.stanford.edu/class/cs234/index.html
- http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
- https://djrusso.github.io/RLCourse/index
- http://alekhagarwal.net/bandits_and_rl/
- http://www.yisongyue.com/courses/cs159/
- http://www-bcf.usc.edu/~haipengl/courses/CSCI699/
Online Learning
- (Course Notes on Online Learning) Online Learning, by Gábor Bartók, Dávid Pál, Csaba Szepesvári, and István Szita.
- (Perceptron mistake bound) Perceptron Mistake Bounds, by Mehryar Mohri and Afshin Rostamizadeh. CoRR abs/1305.0208, 2013.
- (Survey Paper on Online Learning) Online Learning and Online Convex Optimization, by Shai Shalev-Shwartz. Foundations and Trends in Machine Learning, 4(2), 107-194, 2012.
Bandits
- (Survey Paper) Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, by Sébastien Bubeck and Nicolò Cesa-Bianchi. Foundations and Trends in Machine Learning, 5(1), 1-122, 2012.
- UCB
- Finite-time Analysis of the Multiarmed Bandit Problem, by Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Machine Learning, 47, 235-256, 2002.
- Linear-UCB
- Improved Algorithms for Linear Stochastic Bandits, by Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Neural Information Processing Systems, 2011.
- Contextual bandits
- A Contextual-Bandit Approach to Personalized News Article Recommendation, by Lihong Li, Wei Chu, John Langford, and Robert Schapire. International World Wide Web Conference, 2010.
- Thompson Sampling
- Analysis of Thompson Sampling for the Multi-armed Bandit Problem, by Shipra Agrawal and Navin Goyal. Conference on Learning Theory, 2012.
- An Empirical Evaluation of Thompson Sampling, by Olivier Chapelle and Lihong Li. Neural Information Processing Systems, 2011.
- Dueling bandits
- The K-armed Dueling Bandits Problem, by Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. Journal of Computer and System Sciences, DOI:10.1016/j.jcss.2011.12.028, 2012.
- Interactively Optimizing Information Retrieval Systems as a Dueling Bandits Problem, by Yisong Yue and Thorsten Joachims. International Conference on Machine Learning, 2009.
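As a pocket illustration of the UCB entries above, here is a minimal sketch of the UCB1 index rule from Auer et al. (2002) on simulated Bernoulli arms; the function name and simulation setup are illustrative, not taken from any of the cited papers.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Minimal UCB1 sketch on Bernoulli arms.

    Pulls each arm once, then repeatedly plays the arm maximizing
    empirical mean + sqrt(2 ln t / n_i). Returns per-arm pull counts.
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k    # n_i: times arm i was pulled
    sums = [0.0] * k    # cumulative reward of arm i

    def pull(i):
        reward = 1.0 if rng.random() < arm_means[i] else 0.0
        counts[i] += 1
        sums[i] += reward

    for i in range(k):                    # initialization: one pull per arm
        pull(i)
    for t in range(k + 1, horizon + 1):   # index rule for the rest
        ucb = [sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
               for i in range(k)]
        pull(max(range(k), key=lambda i: ucb[i]))
    return counts
```

On a two-armed instance such as `ucb1([0.3, 0.7], horizon=2000)`, the pull counts concentrate on the better arm while the suboptimal arm is still sampled logarithmically often, which is exactly the behavior the finite-time analysis bounds.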
Coactive Learning
- Coactive Learning, by Pannaga Shivaswamy and Thorsten Joachims. Journal of Artificial Intelligence Research, 53, 1-40, 2015.
- Stable Coactive Learning via Perturbation, by Karthik Raman, Thorsten Joachims, Pannaga Shivaswamy, and Tobias Schnabel. International Conference on Machine Learning, 2013.
- Learning to Diversify from Implicit Feedback, by Karthik Raman, Pannaga Shivaswamy, and Thorsten Joachims. ACM Conference on Web Search and Data Mining, 2012.
- Learning Trajectory Preferences for Manipulators via Iterative Improvement, by Ashesh Jain, Brian Wojcik, Thorsten Joachims, and Ashutosh Saxena. Neural Information Processing Systems, 2013.
Behavioral Cloning
- A Game-Theoretic Approach to Apprenticeship Learning, by Umar Syed and Robert Schapire. NIPS 2008.
- Apprenticeship Learning using Linear Programming, by Umar Syed, Michael Bowling, and Robert Schapire. ICML 2008.
- Hierarchical Policy Networks
- Generative Multi-Agent Behavioral Cloning
More Imitation Learning
- Forward Training
- Efficient Reductions for Imitation Learning, by Stephane Ross, Drew Bagnell. AISTATS 2010.
- DAgger & Follow-up Work
- A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning, by Stephane Ross, Geoff Gordon, and Drew Bagnell. International Conference on Artificial Intelligence and Statistics, 2011.
- Learning Policies for Contextual Submodular Prediction, by Stephane Ross, Jiaji Zhou, Yisong Yue, Debadeepta Dey, J. Andrew Bagnell. International Conference on Machine Learning, 2013.
- Deeply AggreVaTeD: Differentiable Imitation Learning for Sequential Prediction, by Wen Sun, Arun Venkatraman, Geoff Gordon, Byron Boots, J. Andrew Bagnell. International Conference on Machine Learning, 2017.
- SEARN & Follow-up Work
- Search-based Structured Prediction, by Hal Daumé III, John Langford, Daniel Marcu. Machine Learning Journal, 2009.
- Learning to Search Better than Your Teacher, by Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé III, John Langford. ICML 2015.
- Generative Adversarial Imitation Learning
- Reduction of Behavioral Cloning to PAC Learning
- A Game-Theoretic Approach to Apprenticeship Learning, by Umar Syed and Robert Schapire. NIPS 2008.
- Learning to Search
- Learning to Search in Branch and Bound Algorithms, by He He, Hal Daumé III, Jason M. Eisner. NIPS 2014.
- Learning to Search via Self-Imitation, by Jialin Song, Ravi Lanka, Albert Zhao, Yisong Yue, Masahiro Ono.
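The DAgger loop from Ross et al. (2011) is compact enough to sketch in a few lines. The `env_reset`, `env_step`, `expert`, and `train` callables below are hypothetical placeholders for whatever environment, expert, and supervised learner one plugs in; this is a sketch of the dataset-aggregation idea, not any paper's reference implementation.

```python
def dagger(env_reset, env_step, expert, train, n_iters=5, horizon=20):
    """Minimal DAgger sketch: iterate rollout -> expert labeling ->
    dataset aggregation -> supervised retraining.
    """
    data = []
    policy = expert  # first rollout follows the expert (beta = 1)
    for _ in range(n_iters):
        s = env_reset()
        for _ in range(horizon):
            data.append((s, expert(s)))   # expert labels the visited states
            s = env_step(s, policy(s))    # ...but the learner chooses actions
        policy = train(data)              # retrain on the aggregated dataset
    return policy
```

The key difference from plain behavioral cloning is that later iterations collect expert labels on the states the *learner's* policy visits, which is what yields the no-regret reduction.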
Inverse reinforcement learning
- Maximum entropy inverse reinforcement learning (Ziebart et al., AAAI 2008)
- Guided cost learning: deep inverse optimal control (Finn et al., ICML 2016)
- Apprenticeship learning via inverse reinforcement learning (Abbeel & Ng, ICML 2004)
- Exploration and Apprenticeship Learning in Reinforcement Learning, by Pieter Abbeel and Andrew Ng. International Conference on Machine Learning, 2005.
- Convergence of Value Aggregation for Imitation Learning, by Ching-An Cheng, Byron Boots. AISTATS 2018.
Basic Reinforcement Learning
- (survey) Bayesian Reinforcement Learning: A Survey, by Mohammad Ghavamzadeh, Shie Mannor, Joelle Pineau, and Aviv Tamar. Foundations and Trends in Machine Learning, 8(5-6), 359-483, 2015.
- A3C: Deep learning + Actor-critic (Mnih et al., ICML 2016)
- Policy gradient theorem (Sutton et al., ICML 1999)
- A Natural Policy Gradient, by Sham Kakade. Neural Information Processing Systems, 2002.
- Deep deterministic policy gradient (Lillicrap et al., ICLR 2016)
- Playing Atari with Deep Reinforcement Learning, by Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. NIPS Deep Learning Workshop, 2013.
- Guided Policy Search, by Sergey Levine and Vladlen Koltun. International Conference on Machine Learning, 2013.
- An Application of Reinforcement Learning to Aerobatic Helicopter Flight, by Pieter Abbeel, Adam Coates, Morgan Quigley, Andrew Ng. Neural Information Processing Systems, 2007.
- Self-Optimizing Memory Controllers: A Reinforcement Learning Approach, by Engin Ipek, Onur Mutlu, Jose Martinez, and Rich Caruana. International Symposium on Computer Architecture, 2008.
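For readers new to the policy-gradient entries above, here is a minimal REINFORCE sketch with a softmax policy over k arms, the simplest setting where the policy gradient theorem applies. The function name, hyperparameters, and the omission of a baseline are illustrative choices, not from the cited papers.

```python
import math
import random

def reinforce_bandit(arm_means, episodes=3000, lr=0.1, seed=0):
    """Minimal REINFORCE sketch: softmax policy pi_i = exp(theta_i)/Z,
    Monte Carlo update theta_i += lr * r * (1[i==a] - pi_i),
    where (1[i==a] - pi_i) is grad log pi(a) w.r.t. theta_i.
    """
    rng = random.Random(seed)
    k = len(arm_means)
    theta = [0.0] * k
    for _ in range(episodes):
        z = [math.exp(t) for t in theta]
        total = sum(z)
        probs = [p / total for p in z]
        # sample an arm from the softmax policy
        a, acc, u = k - 1, 0.0, rng.random()
        for i in range(k):
            acc += probs[i]
            if u < acc:
                a = i
                break
        r = 1.0 if rng.random() < arm_means[a] else 0.0
        # score-function (likelihood-ratio) gradient step
        for i in range(k):
            theta[i] += lr * r * ((1.0 if i == a else 0.0) - probs[i])
    return theta
```

After training on an instance like `[0.2, 0.8]`, the logit of the better arm dominates, so the softmax policy concentrates on it; subtracting a baseline (as in actor-critic methods such as A3C) would reduce the variance of the same update.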
Sparse Feedback in RL
- Residual Loss Prediction: Reinforcement Learning With No Incremental Feedback, by Hal Daumé III, John Langford, Amr Sharaf. International Conference on Learning Representations, 2018.
- Hierarchical Imitation and Reinforcement Learning, by Hoang M. Le, Nan Jiang, Alekh Agarwal, Miro Dudík, Yisong Yue, Hal Daumé III. ICML 2018.
Learning + Control
- On the sample complexity of the Linear Quadratic Regulator (Dean et al., arxiv 2017)
- PAC adaptive control of linear systems (Fiechter, COLT 1997)
- Least-squares temporal difference learning for LQR (Tu & Recht, arxiv 2017)
- Smooth Imitation Learning for Online Sequence Prediction, by Hoang Le, Andrew Kang, Yisong Yue, Peter Carr. International Conference on Machine Learning, 2016.
Safe Reinforcement Learning
- Safe Exploration in Markov Decision Processes, by Teodor Mihai Moldovan and Pieter Abbeel. International Conference on Machine Learning, 2012.
- Safe Exploration in Finite Markov Decision Processes with Gaussian Processes, by Matteo Turchetta, Felix Berkenkamp, Andreas Krause. NIPS 2016.
- Safe Model-based Reinforcement Learning with Stability Guarantees, by Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, Andreas Krause. NIPS 2017
- Safe Exploration and Optimization of Constrained MDPs using Gaussian Processes, by Akifumi Wachi, Yanan Sui, Yisong Yue, Masahiro Ono. AAAI 2018.
- High Confidence Policy Improvement, by Philip Thomas, Georgios Theocharous, Mohammad Ghavamzadeh. ICML 2015.
Constrained Policy Search in Reinforcement Learning
- Conservative policy iteration (Kakade & Langford, ICML 2002)
- Safe Policy Iteration, by Matteo Pirotta, Marcello Restelli, Alessio Pecorino, Daniele Calandriello. ICML 2013.
- Trust Region Policy Optimization, by John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel. ICML 2015.
- Constrained Policy Optimization, by Joshua Achiam, David Held, Aviv Tamar, Pieter Abbeel. ICML 2017.
Multi-Task & Transfer in RL & IL
- Modular multitask reinforcement learning with policy sketches, by Jacob Andreas, Dan Klein, Sergey Levine. ICML 2017.
- One-Shot Visual Imitation Learning via Meta-Learning, by Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, Sergey Levine. Conference on Robot Learning 2017.
Off-policy learning
- Exploration Scavenging, by John Langford, Alexander Strehl, and Jenn Wortman Vaughan. International Conference on Machine Learning, 2008.
- Doubly Robust Policy Evaluation and Learning, by Miro Dudik, John Langford, and Lihong Li. International Conference on Machine Learning, 2011.
- Counterfactual Risk Minimization: Learning from Logged Bandit Feedback, by Adith Swaminathan and Thorsten Joachims. International Conference on Machine Learning, 2015.
- Doubly Robust Off-policy Value Evaluation for Reinforcement Learning, by Nan Jiang and Lihong Li. ICML 2016.
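The off-policy papers above build on the inverse propensity score (IPS) estimator, which is worth seeing once in code. The `logs` tuple format and function name below are assumptions for illustration; `target_policy(context, action)` returns the target policy's probability of the logged action.

```python
def ips_estimate(logs, target_policy):
    """IPS estimate of a target policy's value from logged bandit
    feedback. Each log entry is (context, action, reward, logging_prob),
    where logging_prob is the logging policy's probability of the action.
    Unbiased when logging_prob > 0 wherever the target policy acts.
    """
    total = 0.0
    for context, action, reward, logging_prob in logs:
        # reweight each logged reward by the importance ratio
        total += reward * target_policy(context, action) / logging_prob
    return total / len(logs)
```

Plain IPS is unbiased but high-variance when the importance ratios are large; the doubly robust estimators cited above combine it with a reward model to cut variance while keeping unbiasedness.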
Monte Carlo Tree Search
- A Survey of Monte Carlo Tree Search Methods by Cameron Browne, Edward Powley, Daniel Whitehouse, Simon Lucas, Peter I. Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis and Simon Colton. IEEE Transactions on Computational Intelligence and AI in Games, 4(1), 2012.
- (Applying Monte Carlo Tree Search to Go) Mastering the game of Go with deep neural networks and tree search, by David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Nature, 529, 484–489, doi:10.1038/nature16961, 2016.
Other Forward Search in RL
- Truncated Horizon Policy Search: Combining Reinforcement Learning and Imitation Learning, by Wen Sun, Drew Bagnell, Byron Boots. ICLR 2018.
Theory
- On the Sample Complexity of Reinforcement Learning, by Sham Kakade. PhD thesis, University College London, 2003.
- Near-Optimal Reinforcement Learning in Polynomial Time, by Michael Kearns and Satinder Singh. Machine Learning 49, 209-232, 2002.
- Contextual-MDP
Partially Observable RL
- Planning and Acting in Partially Observable Stochastic Domains, by Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Artificial Intelligence, 101, 99-134, 1998.
Adversarial & Multi-Agent
- Counterfactual regret minimization
- Regret Minimization in Games with Incomplete Information, by Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. NIPS 2008.
- Safe and Nested Subgame Solving for Imperfect-Information Games, by Noam Brown and Tuomas Sandholm. NIPS 2017.
- Multi-Agent Imitation Learning
- Coordinated Multi-Agent Imitation Learning, by Hoang Le, Yisong Yue, Peter Carr, Patrick Lucey. International Conference on Machine Learning 2017.