Tutorial on Model-Based Methods in Reinforcement Learning

By Igor Mordatch and Jessica Hamrick

Presented at the International Conference on Machine Learning (ICML) 2020

Abstract

This tutorial presents a broad overview of the field of model-based reinforcement learning (MBRL), with a particular emphasis on deep methods. MBRL methods utilize a model of the environment to make decisions—as opposed to treating the environment as a black box—and present unique opportunities and challenges beyond model-free RL. We discuss methods for learning transition and reward models, ways in which those models can effectively be used to make better decisions, and the relationship between planning and learning. We also highlight ways that models of the world can be leveraged beyond the typical RL setting, and what insights might be drawn from human cognition when designing future MBRL systems.
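
To make the model-learning and planning loop concrete, the following is a minimal, self-contained Python/NumPy sketch (illustrative only, not the tutorial's example code): it fits a linear transition model to randomly collected experience on an assumed toy point-mass task, then plans through that learned model with random-shooting model-predictive control. The environment, the linear model class, and all hyperparameters are assumptions made for brevity.

    # Minimal model-based RL sketch: (1) collect transitions, (2) fit a dynamics
    # model by supervised regression, (3) plan through the learned model with
    # random-shooting model-predictive control (MPC).
    import numpy as np

    rng = np.random.default_rng(0)

    def step(state, action):
        # Toy 1-D point-mass: state = [position, velocity]; reward favors position 1.0.
        pos, vel = state
        vel = vel + 0.1 * action
        pos = pos + 0.1 * vel
        return np.array([pos, vel]), -abs(pos - 1.0)

    # (1) Collect transitions with random exploration.
    states, actions, next_states = [], [], []
    s = np.zeros(2)
    for _ in range(500):
        a = rng.uniform(-1.0, 1.0)
        s_next, _ = step(s, a)
        states.append(s)
        actions.append([a])
        next_states.append(s_next)
        s = s_next

    X = np.hstack([np.array(states), np.array(actions)])   # model input: (s, a)
    Y = np.array(next_states) - np.array(states)           # target: state change

    # (2) Fit a linear transition model by least squares: delta_s ~= [s, a] @ W.
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)

    def model(state, action):
        return state + np.concatenate([state, [action]]) @ W

    # (3) Random-shooting MPC: sample action sequences, roll them out inside the
    # learned model, and execute the first action of the best sequence.
    def plan(state, horizon=10, n_candidates=200):
        best_a, best_return = 0.0, -np.inf
        for _ in range(n_candidates):
            seq = rng.uniform(-1.0, 1.0, size=horizon)
            sim_s, total = state.copy(), 0.0
            for a in seq:
                sim_s = model(sim_s, a)
                total += -abs(sim_s[0] - 1.0)   # assumes the reward function is known
            if total > best_return:
                best_return, best_a = total, seq[0]
        return best_a

    s, episode_return = np.zeros(2), 0.0
    for _ in range(50):
        a = plan(s)
        s, r = step(s, a)
        episode_return += r
    print(f"return under learned-model MPC: {episode_return:.2f}")

Swapping the linear least-squares fit for an ensemble of neural networks, and the random-shooting planner for the cross-entropy method, recovers the structure of several deep MBRL approaches in the bibliography below (e.g., Chua et al., 2018; Nagabandi et al., 2019).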

Goals

The field of reinforcement learning has produced impressive results in recent years, but has largely focused on model-free methods. However, the community recognizes the limitations of purely model-free methods, ranging from high sample complexity and the need to sample unsafe outcomes to stability and reproducibility issues. By contrast, model-based methods have been under-explored (though growing fast) in the machine learning community, despite being very influential in robotics, engineering, and the cognitive and neural sciences. They provide a distinct set of advantages and challenges, as well as complementary mathematical tools. The aim of this tutorial is to make model-based methods more widely recognized and accessible to the machine learning community. Given recent successful applications of model-based planning, such as AlphaGo, we believe there is a timely demand for a comprehensive understanding of this topic. By the end of the tutorial, the audience should gain:


  • Mathematical background to read and follow up on the literature on the topic.

  • An intuitive understanding of the algorithms involved (and have access to lightweight example code they can use and experiment with).

  • Awareness of the tradeoffs and challenges involved in applying model-based methods.

  • Appreciation for the diversity of problems to which model-based reasoning can be applied.

  • Understanding of how these methods fit into the broader context of reinforcement learning and theories of decision-making, as well as how they relate to model-free methods (a Dyna-style sketch of this relationship appears after this list).
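
The Dyna-style sketch below complements the planning example above by showing one basic way model-based and model-free methods interact, in the spirit of Dyna (Sutton, 1990; entry 108 in the bibliography): real transitions update a Q-function directly, while a learned table-lookup model generates additional simulated updates. The five-state chain task and all hyperparameters are illustrative assumptions, not code from the tutorial.

    # Minimal Dyna-Q sketch: model-free Q-learning from real experience, plus
    # planning updates generated from a learned table-lookup model.
    import random

    N_STATES, ACTIONS = 5, (0, 1)     # chain MDP: action 1 moves right, 0 moves left
    GAMMA, ALPHA, EPS, N_PLANNING = 0.95, 0.5, 0.1, 20

    def env_step(s, a):
        s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
        reward = 1.0 if s_next == N_STATES - 1 else 0.0
        return s_next, reward, s_next == N_STATES - 1

    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    model = {}                        # learned model: (s, a) -> (reward, next state)

    def q_update(s, a, r, s_next):
        target = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])

    for episode in range(50):
        s, done = 0, False
        while not done:
            # Epsilon-greedy action selection (model-free control).
            a = random.choice(ACTIONS) if random.random() < EPS else \
                max(ACTIONS, key=lambda b: Q[(s, b)])
            s_next, r, done = env_step(s, a)
            q_update(s, a, r, s_next)     # learning: update from real experience
            model[(s, a)] = (r, s_next)   # model learning: remember the transition
            for _ in range(N_PLANNING):   # planning: replay simulated experience
                (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
                q_update(ps, pa, pr, ps_next)
            s = s_next

    print({s: round(max(Q[(s, a)] for a in ACTIONS), 2) for s in range(N_STATES)})

Deep analogues of this pattern, in which rollouts from a learned model augment a model-free learner, appear in several works listed in the bibliography (e.g., Janner et al., 2019).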

Target audience and required background

This tutorial will be accessible to the general machine learning audience, but is specifically targeted at the following groups, each with their own learning goals:

  • Reinforcement learning researchers and practitioners who have primarily worked with model-free methods and are looking to acquire a new set of techniques and background to complement or address the challenges they are currently facing.

  • Supervised or unsupervised learning researchers who are looking to learn how their work can be applied in the reinforcement learning setting.

  • Cognitive science researchers who may be familiar with the core ideas of the topic, but are looking to learn about algorithms and implementation guidelines that are practical in complex high-dimensional settings.

  • Robotics researchers and practitioners who are familiar with model-based control, but are looking for background and advice on how to combine it with learning methods.

Familiarity with basic supervised learning methods will be expected, and some familiarity with the reinforcement learning formulation and model-free methods will be beneficial but not required.

Bibliography

  1. Abraham (2020). The Cambridge Handbook of the Imagination.

  2. Agostinelli et al. (2019). Solving the Rubik's Cube with Deep Reinforcement Learning and Search. Nature Machine Intelligence.

  3. Allen, Smith, & Tenenbaum (2019). The tools challenge: Rapid trial-and-error learning in physical problem solving. CogSci 2019.

  4. Amos et al. (2018). Differentiable MPC for End-to-end Planning and Control. NeurIPS 2018.

  5. Amos et al. (2019). The Differentiable Cross-Entropy Method. arXiv.

  6. Anthony et al. (2017). Thinking Fast and Slow with Deep Learning and Tree Search. NeurIPS.

  7. Bellemare et al. (2016). Unifying count-based exploration and intrinsic motivation. NeurIPS.

  8. Bellemare et al. (2013). The Arcade Learning Environment: An Evaluation Platform for General Agents. JAIR.

  9. Blundell et al. (2015). Weight Uncertainty in Neural Networks. ICML 2015.

  10. Buesing et al. (2018). Learning and Querying Fast Generative Models for Reinforcement Learning. ICML 2018.

  11. Burgess et al. (2019). MONet: Unsupervised Scene Decomposition and Representation. arXiv.

  12. Byravan et al. (2019). Imagined Value Gradients. CoRL 2019.

  13. Chiappa, Racaniere, Wierstra, & Mohamed (2017). Recurrent environment simulators. ICLR 2017.

  14. Choromanski et al. (2019). Provably Robust Blackbox Optimization for Reinforcement Learning. CoRL 2019.

  15. Chua, Calandra, McAllister, & Levine (2018). Deep reinforcement learning in a handful of trials using probabilistic dynamics models. NeurIPS 2018.

  16. Clement (2009). The role of imagistic simulation in scientific thought experiments. Topics in Cognitive Science, 1(4), 686-710.

  17. Corneil et al. (2018). Efficient Model-Based Deep Reinforcement Learning with Variational State Tabulation. ICML.

  18. Craik (1943). The Nature of Explanation. Cambridge University Press.

  19. Dasgupta, Smith, Schulz, Tenenbaum, & Gershman (2018). Learning to act by integrating mental simulations and physical experiments. CogSci 2018.

  20. Deisenroth & Rasmussen (2011). PILCO: A Model-Based and Data-Efficient Approach to Policy Search. ICML 2011.

  21. Depeweg, Hernández-Lobato, Doshi-Velez, & Udluft (2017). Learning and policy search in stochastic dynamical systems with Bayesian neural networks. ICLR 2017.

  22. Du et al. (2019). Model-Based Planning with Energy Based Models. CoRL 2019.

  23. Dubey, Agrawal, Pathak, Griffiths, & Efros (2018). Investigating human priors for playing video games. ICML 2018.

  24. Ebert, Finn, et al. (2018). Visual foresight: Model-based deep reinforcement learning for vision-based robotic control. arXiv.

  25. Ecoffet et al. (2019). Go-explore: a new approach for hard-exploration problems. arXiv.

  26. Edwards, Downs, & Davidson (2018). Forward-Backward Reinforcement Learning. arXiv.

  27. Ellis et al. (2019). Write, Execute, Assess: Program Synthesis with a REPL. NeurIPS.

  28. Eysenbach, Salakhutdinov, & Levine (2019). Search on the replay buffer: Bridging planning and reinforcement learning. NeurIPS.

  29. Farquhar et al. (2017). TreeQN and ATreeC: Differentiable Tree-Structured Models for Deep Reinforcement Learning. ICLR 2018.

  30. Fazeli et al. (2019). See, feel, act: Hierarchical learning for complex manipulation skills with multisensory fusion. Science Robotics, 4(26).

  31. Finke & Slayton (1988). Explorations of creative visual synthesis in mental imagery. Memory & Cognition, 16(3).

  32. Finn & Levine (2017). Deep visual foresight for planning robot motion. ICRA.

  33. Finn, Goodfellow, & Levine (2016). Unsupervised learning for physical interaction through video prediction. NeurIPS.

  34. Fisac et al. (2019). A General Safety Framework for Learning-Based Control in Uncertain Robotic Systems. IEEE Transactions on Automatic Control.

  35. Gal (2016). Uncertainty in Deep Learning. PhD Thesis.

  36. Gal, McAllister, & Rasmussen (2016). Improving PILCO with Bayesian neural network dynamics models. In Data-Efficient Machine Learning workshop, ICML.

  37. Griffiths & Tenenbaum (2009). Theory-based causal induction. Psychological Review, 116(4).

  38. Gu, Lillicrap, Sutskever, & Levine (2016). Continuous Deep Q-Learning with Model-based Acceleration. ICML 2016.

  39. Guez et al. (2019). An Investigation of Model-Free Planning. ICML 2019.

  40. Ha & Schmidhuber (2018). World Models. NeurIPS 2018.

  41. Hafner et al. (2020). Dream to Control: Learning Behaviors by Latent Imagination. ICLR 2020.

  42. Hamrick (2019). Analogues of mental simulation and imagination in deep learning. Current Opinion in Behavioral Sciences, 29, 8-16.

  43. Hamrick et al. (2020). Combining Q-Learning and Search with Amortized Value Estimates. ICLR 2020.

  44. Hamrick et al. (2017). Metacontrol for adaptive imagination-based optimization. ICLR 2017.

  45. Heess et al. (2015). Learning Continuous Control Policies by Stochastic Value Gradients. NeurIPS 2015.

  46. Houthooft et al. (2016). VIME: Variational Information Maximizing Exploration. NeurIPS 2016.

  47. Jacobson & Mayne (1970). Differential Dynamic Programming.

  48. Jaderberg et al. (2017). Reinforcement learning with unsupervised auxiliary tasks. ICLR 2017.

  49. Jang, Gu, & Poole (2017). Categorical Reparameterization with Gumbel-Softmax. ICLR 2017.

  50. Janner et al. (2019). When to Trust Your Model: Model-Based Policy Optimization. NeurIPS 2019.

  51. Jurgenson et al. (2019). Sub-Goal Trees -- A Framework for Goal-Directed Trajectory Prediction and Optimization. arXiv.

  52. Kaelbling & Lozano-Pérez (2011). Hierarchical Task and Motion Planning in the Now. ICRA.

  53. Kahneman (2011). Thinking, Fast and Slow.

  54. Kalakrishnan et al. (2011). STOMP: Stochastic trajectory optimization for motion planning. ICRA 2011.

  55. Kidambi et al. (2020). MOReL: Model-Based Offline Reinforcement Learning. arXiv.

  56. Konidaris, Kaelbling, & Lozano-Pérez (2018). From Skills to Symbols: Learning Symbolic Representations for Abstract High-Level Planning. Journal of Artificial Intelligence Research, 61.

  57. Kovar, Gleicher, & Pighin (2002). Motion graphs. ACM Transactions on Graphics, 21(3).

  58. Kurutach et al. (2018). Learning Plannable Representations with Causal InfoGAN. NeurIPS.

  59. Kwakernaak & Sivan (1972). Linear Optimal Control Systems.

  60. Laskin, Emmons, Jain, Kurutach, Abbeel, & Pathak (2020). Sparse Graphical Memory for Robust Planning. arXiv.

  61. Levine & Abbeel (2014). Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics. NeurIPS 2014.

  62. Levine, Wagener, & Abbeel (2015). Learning Contact-Rich Manipulation Skills with Guided Policy Search. ICRA 2015.

  63. Levine et al. (2020). Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. arXiv.

  64. Lin et al. (2020). Model-based Adversarial Meta-Reinforcement Learning. arXiv.

  65. Lowrey et al. (2019). Plan Online, Learn Offline: Efficient Learning and Exploration via Model-Based Control. ICLR 2019.

  66. Lu, Mordatch, & Abbeel (2019). Adaptive Online Planning for Continual Lifelong Learning. NeurIPS Deep RL Workshop.

  67. Maddison, Mnih, & Teh (2017). The Concrete Distribution. ICLR 2017.

  68. Mannor et al. (2003). The Cross-Entropy Method for fast policy search. ICML 2003.

  69. Markman, Klein, & Suhr (2008). Handbook of Imagination and Mental Simulation.

  70. Mordatch et al. (2012). Discovery of Complex Behaviors through Contact-Invariant Optimization. ACM Transactions on Graphics.

  71. Mordatch et al. (2015). Ensemble-CIO: Full-body dynamic motion planning that transfers to physical humanoids. IROS 2015.

  72. Mordatch et al. (2015). Interactive Control of Diverse Complex Characters with Neural Networks. NeurIPS 2015.

  73. Nagabandi et al. (2019). Deep Dynamics Models for Learning Dexterous Manipulation. CoRL 2019.

  74. Nagabandi et al. (2019). Learning to Adapt in Dynamic, Real-World Environments through Meta-Reinforcement Learning. ICLR.

  75. Nair, Babaeizadeh, Finn, Levine, & Kumar (2020). Time Reversal as Self-Supervision. ICRA 2020.

  76. Nair, Pong, et al. (2018). Visual Reinforcement Learning with Imagined Goals. NeurIPS.

  77. Nasiriany et al. (2019). Planning with Goal-Conditioned Policies. NeurIPS.

  78. Neal (1995). Bayesian Learning for Neural Networks.

  79. Oh, Guo, Lee, Lewis, & Singh (2015). Action-Conditional Video Prediction using Deep Networks in Atari Games. NeurIPS 2015.

  80. Oh et al. (2017). Value Prediction Network. NeurIPS.

  81. OpenAI et al. (2020). Learning Dexterous In-Hand Manipulation. International Journal of Robotics Research, 39(1), 3-20.

  82. OpenAI et al. (2019). Solving Rubik's Cube with a Robot Hand. arXiv.

  83. Osband et al. (2018). Randomized Prior Functions for Deep Reinforcement Learning. NeurIPS 2018.

  84. Parascandolo, Buesing, et al. (2020). Divide-and-Conquer Monte Carlo Tree Search For Goal-Directed Planning. arXiv.

  85. Pascanu, Li, et al. (2017). Learning model-based planning from scratch. arXiv.

  86. Pathak et al. (2017). Curiosity-driven exploration by self-supervised prediction. ICML.

  87. Peters, Mulling, & Altun (2010). Relative Entropy Policy Search. AAAI 2010.

  88. Posa, Cantu, & Tedrake (2013). A Direct Method for Trajectory Optimization of Rigid Bodies Through Contact. WAFR 2012.

  89. Rajeswaran et al. (2017). EPOpt: Learning Robust Neural Network Policies Using Model Ensembles. ICLR 2017.

  90. Rajeswaran et al. (2020). A Game Theoretic Framework for Model Based Reinforcement Learning. arXiv.

  91. Ross, Gordon, & Bagnell (2011). A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. AISTATS 2011.

  92. Sadigh et al. (2016). Planning for autonomous cars that leverage effects on human actions. RSS 2016.

  93. Salas & Powell (2013). Benchmarking a scalable approximate dynamic programming algorithm for stochastic control of multidimensional energy storage problems.

  94. Sanchez-Gonzalez et al. (2018). Graph Networks as Learnable Physics Engines for Inference and Control. ICML 2018.

  95. Sandberg (2018). Blueberry Earth. arXiv.

  96. Savinov, Dosovitskiy, & Koltun (2018). Semi-parametric topological memory for navigation. ICLR 2018.

  97. Schmidhuber (1990). Making the world differentiable: On using fully recurrent self-supervised neural networks for dynamic reinforcement learning and planning in non-stationary environments. Institut für Informatik, Technische Universität München. Technical Report FKI-126-90.

  98. Schmidhuber (1991). Curious model-building control systems. IJCNN.

  99. Schrittwieser et al. (2019). Mastering Atari, Go, Chess and Shogi by planning with a learned model. arXiv.

  100. Segler, Preuss, & Waller (2018). Planning chemical syntheses with deep neural networks and symbolic AI. Nature, 555(7698).

  101. Sharma et al. (2020). Dynamics-Aware Unsupervised Discovery of Skills. ICLR.

  102. Shen et al. (2019). M-Walk: Learning to Walk over Graphs using Monte Carlo Tree Search. NeurIPS.

  103. Silver, van Hasselt, Hessel, Schaul, Guez, Harley, Dulac-Arnold, Reichert, Rabinowitz, Barreto, Degris (2017). The Predictron: End-To-End Learning and Planning. ICML 2017.

  104. Silver et al. (2017). Mastering the game of Go without human knowledge. Nature, 550, 354-359.

  105. Silver et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484.

  106. Spelke & Kinzler (2007). Core knowledge. Developmental science, 10(1).

  107. Stulp & Sigaud (2012). Path Integral Policy Improvement with Covariance Matrix Adaptation. ICML 2012.

  108. Sutton (1990). Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4).

  109. Sutton & Barto (2018). Reinforcement Learning: An Introduction.

  110. Tamar et al. (2016). Value iteration networks. NeurIPS 2016.

  111. Tamar et al. (2017). Learning from the Hindsight Plan – Episodic MPC Improvement. ICRA 2017.

  112. Tang et al. (2017). #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning. NeurIPS 2017.

  113. Tassa et al. (2012). Synthesis and Stabilization of Complex Behaviors through Online Trajectory Optimization. IROS 2012.

  114. Theodorou et al. (2010). Learning policy Improvements with path integrals. AISTATS 2010.

  115. Todorov & Li (2005). A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. American Control Conference.

  116. Todorov, Erez, & Tassa (2012). MuJoCo: A physics engine for model-based control. IROS.

  117. Tzeng et al. (2017). Adapting Deep Visuomotor Representations with Weak Pairwise Constraints. In: Goldberg K., Abbeel P., Bekris K., Miller L. (eds) Algorithmic Foundations of Robotics XII. Springer Proceedings in Advanced Robotics, Vol 13.

  118. van den Oord, Li, & Vinyals (2019). Representation Learning with Contrastive Predictive Coding. arXiv.

  119. van Hasselt, Hessel, & Aslanides (2019). When to use parametric models in reinforcement learning? NeurIPS 2019.

  120. Veerapaneni, Co-Reyes, Chang, et al. (2019). Entity Abstraction in Visual Model-Based Reinforcement Learning. CoRL 2019.

  121. Venkatraman et al. (2014). Data as demonstrator with applications to system identification.

  122. Watter, Springenberg, Boedecker, & Riedmiller (2015). Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images. NeurIPS 2015.

  123. Weber, Racanière, et al. (2017). Imagination-augmented agents for deep reinforcement learning. NeurIPS 2017.

  124. Williams (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229-256.

  125. Williams et al. (2017). Information Theoretic MPC for Model-Based Reinforcement Learning. ICRA 2017.

  126. Wu et al. (2015). Galileo: Perceiving Physical Object Properties by Integrating a Physics Engine with Deep Learning. NeurIPS 2015.

  127. Yip & Camarillo (2014). Model-Less Feedback Control of Continuum Manipulators in Constrained Environments. IEEE Transactions on Robotics, 30(4).

  128. Yu et al. (2020). MOPO: Model-based Offline Policy Optimization. arXiv.

  129. Zhang, Lerer, et al. (2018). Composable Planning with Attributes. ICML 2018.

Questions or Feedback?

Contact us at jhamrick@ or imordatch@ (followed by google.com)