Week 1: Reward-Free Pre-Training and Exploration
Papers for Student Presenter 1:
Pathak, Agrawal, Efros, Darrell (2017). Curiosity-driven Exploration by Self-supervised Prediction
Sekar*, Rybkin*, Daniilidis, Abbeel, Hafner, Pathak (2020). Planning to Explore via Self-Supervised World Models
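Both of Presenter 1's papers derive their intrinsic reward from a learned forward dynamics model: prediction error in Pathak et al. (the ICM curiosity bonus) and disagreement across an ensemble of models in Sekar et al. (Plan2Explore). The sketch below shows both bonuses under illustrative assumptions (small MLP forward models, a low-dimensional state space); it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    """Predicts the next state (or state embedding) from (state, action)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def curiosity_reward(model, state, action, next_state):
    # ICM-style bonus: squared prediction error of a single forward model.
    with torch.no_grad():
        pred = model(state, action)
    return ((pred - next_state) ** 2).mean(dim=-1)

def disagreement_reward(ensemble, state, action):
    # Plan2Explore-style bonus: variance across an ensemble's predictions.
    # Needs no next_state, so it can be evaluated on imagined rollouts
    # inside a world model, before the transition is ever taken.
    with torch.no_grad():
        preds = torch.stack([m(state, action) for m in ensemble])
    return preds.var(dim=0).mean(dim=-1)

# Toy usage on a random batch (state_dim=8, action_dim=2, batch=32):
ensemble = [ForwardModel(8, 2) for _ in range(5)]
s, a, s2 = torch.randn(32, 8), torch.randn(32, 2), torch.randn(32, 8)
print(curiosity_reward(ensemble[0], s, a, s2).shape)  # torch.Size([32])
print(disagreement_reward(ensemble, s, a).shape)      # torch.Size([32])
```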
Papers for Student Presenter 2:
Sharma, Gu, Levine, Kumar, Hausman (2019). DADS: Dynamics-Aware Unsupervised Discovery of Skills
Warde-Farley, Van de Wiele, Kulkarni, Ionescu, Hansen, Mnih (2018). DISCERN: Unsupervised Control through Non-Parametric Discriminative Rewards
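Both of Presenter 2's papers reward behavior for being discriminable from states alone: DISCERN learns a discriminative reward over achieved goal observations, while DADS prefers skills whose transitions are predictable given the skill but diverse across skills. The sketch below shows the generic variational skill-discovery reward underlying this family (closest in form to VIC/DIAYN from the fuller list below), assuming discrete skills and an MLP discriminator; all names and sizes are illustrative.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkillDiscriminator(nn.Module):
    """q(z | s): infers which of n_skills skills produced a state."""
    def __init__(self, state_dim: int, n_skills: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_skills),
        )

    def forward(self, state):
        return self.net(state)  # logits over skills

def skill_reward(disc, state, skill, n_skills):
    # Variational lower bound on I(S; Z) with a uniform skill prior p(z):
    #   r(s, z) = log q(z | s) - log p(z) = log q(z | s) + log n_skills
    with torch.no_grad():
        log_q = F.log_softmax(disc(state), dim=-1)
    return log_q.gather(-1, skill.unsqueeze(-1)).squeeze(-1) + math.log(n_skills)

# The discriminator is trained by maximum likelihood on (state, skill)
# pairs collected by the skill-conditioned policy:
disc = SkillDiscriminator(8, 16)
s = torch.randn(32, 8)
z = torch.randint(0, 16, (32,))
disc_loss = F.cross_entropy(disc(s), z)     # discriminator update
print(skill_reward(disc, s, z, 16).shape)   # torch.Size([32])
```

DADS replaces q(z | s) with a skill-conditioned dynamics model q(s' | s, z) and estimates the marginal over skills by sampling, which makes the discovered skills directly usable for model-based planning.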
A fuller list of references for diving into this field (including the shortlist above):
Schmidhuber (1991). Curious Model-Building Control Systems
Oudeyer & Kaplan (2007). What is intrinsic motivation? A typology of computational approaches
Stadie, Levine, Abbeel (2015). Incentivizing Exploration in Reinforcement Learning with Deep Predictive Models
Achiam & Sastry (2017). Surprise-based Intrinsic Motivation for Deep Reinforcement Learning
Pathak, Agrawal, Efros, Darrell (2017). Curiosity-driven Exploration by Self-supervised Prediction
Burda*, Edwards*, Pathak*, Storkey, Darrell, Efros (2018). Large-Scale Study of Curiosity-Driven Learning
Sun, Gomez, Schmidhuber (2011). Planning to be Surprised: Optimal Bayesian Exploration in Dynamic Environments
Houthooft, Chen, Duan, Schulman, De Turck, Abbeel (2016). VIME: Variational Information Maximizing Exploration
Pathak, Gandhi, Gupta (2019). Self-Supervised Exploration via Disagreement
Sekar*, Rybkin*, Daniilidis, Abbeel, Hafner, Pathak (2020). Planning to Explore via Self-Supervised World Models
Klyubin, Polani, Nehaniv (2005). Empowerment: A Universal Agent-Centric Measure of Control
Salge, Glackin, Polani (2013). Empowerment -- An Introduction
Florensa, Duan, Abbeel (2017). SSN4HRL: Stochastic Neural Networks for Hierarchical Reinforcement Learning
Gregor, Rezende, Wierstra (2016). VIC: Variational Intrinsic Control
Eysenbach, Gupta, Ibarz, Levine (2018). DIAYN: Diversity Is All You Need
Achiam, Edwards, Amodei, Abbeel (2018). VALOR: Variational Option Discovery Algorithms
Warde-Farley, Van de Wiele, Kulkarni, Ionescu, Hansen, Mnih (2018). DISCERN: Unsupervised Control through Non-Parametric Discriminative Rewards
Hansen, Dabney, Barreto, Warde-Farley, Van de Wiele, Mnih (2019). VISR: Fast Task Inference with Variational Intrinsic Successor Features
Gupta*, Eysenbach*, Finn, Levine (2018). Unsupervised Meta-Learning for RL
Baumli et al (2020). Relative VIC: Relative Variational Intrinsic Control
Sharma, Gu, Levine, Kumar, Hausman (2019). DADS: Dynamics-Aware Unsupervised Discovery of Skills
Zhao, Gao, Abbeel, Tresp, Xu (2021). MUSIC: Mutual Information State Intrinsic Control
Sukhbaatar, Lin, Kostrikov, Synnaeve, Szlam, Fergus (2017). ASP: Intrinsic Motivation and Automatic Curricula via Asymmetric Self-Play
OpenAI (2021). Asymmetric Self-Play for Automatic Goal Discovery in Robotic Manipulation
Strehl & Littman (2008). MBIE-EB: Model-Based Interval Estimation Exploration Bonus
Bellemare et al (2016). Unifying Count-based Exploration and Intrinsic Motivation
Ostrovski et al (2017). Count-based Exploration with Neural Density Models
Tang et al (2017). #Exploration: A Study of Count-based Exploration for Deep RL
Burda et al (2018). RND: Exploration by Random Network Distillation (a minimal sketch of the RND bonus follows the full list below)
Mutti, Pratissoli, Restelli (2020). MEPOL: Task-Agnostic Exploration via Policy Gradient of a Non-Parametric State Entropy Estimate
Liu & Abbeel (2020). APT: Unsupervised Active Pre-Training
Yarats, Fergus, Lazaric, Pinto (2021). ProtoRL: RL with Prototypical Representations
Liu & Abbeel (2021). APS: Active Pre-Training with Successor Features
Badia et al (2020). Never Give Up: Learning Directed Exploration Strategies
Hazan, Kakade, Singh, Van Soest (2019). Provably Efficient Maximum Entropy Exploration
Misra, Henaff, Krishnamurthy, Langford (2019). Kinematic State Abstraction and Provably Efficient Rich-Observation Reinforcement Learning
Lai & Robbins (1985). UCB: Upper Confidence Bounds
Kaelbling (1993). Interval Exploration
Kearns & Singh (2002). E3: Near-Optimal RL in Polynomial Time
Brafman & Tennenholtz (2002). R-MAX: A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning
Auer (2002). UCB Regret Bounds
Osband, Blundell, Pritzel, Van Roy (2016). Deep Exploration via Bootstrapped DQN
Russo, Van Roy, Kazerouni, Osband, Wen (2018). A Tutorial on Thompson Sampling
Chen, Sidor, Abbeel, Schulman (2017). Q-UCB: UCB Exploration via Q-Ensembles
Lowrey*, Rajeswaran*, Kakade, Todorov, Mordatch (2019). POLO: Plan Online, Learn Offline
Lee, Laskin, Srinivas, Abbeel (2020). SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep RL
Andrychowicz et al (2017). HER: Hindsight Experience Replay
Schaul et al (2015). Universal Value Function Approximators
Kaelbling (1993). Learning to Achieve Goals
Nair, Pong, Dalal, Bahl, Lin, Levine (2018). RIG: Visual Reinforcement Learning with Imagined Goals
Pong*, Dalal*, Lin*, Nair, Bahl, Levine (2019). Skew-Fit: State-Covering Self-Supervised RL
Florensa*, Held*, Geng*, Abbeel (2017). GoalGAN: Automatic Goal Generation for RL Agents
Colas, Fournier, Chetouani, Oudeyer (2019). CURIOUS: Intrinsically Motivated Modular Multi-Goal Reinforcement Learning
Zhang, Abbeel, Pinto (2020). Automatic Curriculum Learning through Value Disagreement
Ecoffet, Huizinga, Lehman, Stanley, Clune (2019). Go-Explore: A New Approach for Hard-Exploration Problems
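For the count-based and density-model bonuses above (Strehl & Littman through Burda et al), RND is the simplest to sketch: a predictor network is trained to match a frozen, randomly initialized target network, and its prediction error is the intrinsic reward. The sketch below uses assumed MLP shapes; it is an illustration, not the published implementation.

```python
import torch
import torch.nn as nn

def make_net(in_dim: int, out_dim: int) -> nn.Module:
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

target = make_net(8, 32)     # fixed, randomly initialized embedding network
predictor = make_net(8, 32)  # trained to imitate the target
for p in target.parameters():
    p.requires_grad_(False)

def rnd_bonus(obs):
    # Error is low on frequently visited states (the predictor has fit them)
    # and high on novel ones, giving a count-like bonus without counts.
    with torch.no_grad():
        t = target(obs)
    return ((predictor(obs) - t) ** 2).mean(dim=-1)

obs = torch.randn(32, 8)
bonus = rnd_bonus(obs)
bonus.mean().backward()            # the same error is the predictor's training loss
intrinsic_reward = bonus.detach()  # detached copy goes to the RL algorithm
print(intrinsic_reward.shape)      # torch.Size([32])
```

Because the target is a deterministic function of the observation, stochastic transitions ("noisy TVs") cannot produce an irreducible bonus, unlike naive forward-model prediction error.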
Also relevant, but deferred to the meta-computation week:
Coulom (2006). MCTS: Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search
Kocsis & Szepesvári (2006). UCT: Bandit Based Monte-Carlo Planning
Gelly, Wang, Munos, Teytaud (2006). Modification of UCT with Patterns in Monte-Carlo Go
Silver et al (2016). AlphaGo: Mastering the Game of Go with Deep Neural Networks and Tree Search