Sample Efficient Reinforcement Learning
Abstract:
In many real-world reinforcement learning applications, training an agent on the real environment is often prohibitively expensive, if not impossible. We have therefore studied various ways of speeding up the convergence of learning with limited samples. In this talk, we will present recently completed work as well as an ongoing effort to improve the efficiency of reinforcement learning. Specifically, we will share our methods (1) to utilize advice from multiple experts of varying quality, (2) to transfer previously gained knowledge to a novel task, and (3) to train on a virtual environment with limited samples from the physical one. Finally, we will share our efforts on multi-agent reinforcement learning with a particular focus on mechanism design.
Bio:
Chi-Guhn Lee is a Professor of Industrial Engineering and the Director of the Centre for Maintenance Optimization and Reliability Engineering (C-MORE) at the University of Toronto. His research interests include reinforcement learning, Markov decision processes, deep learning, supply chain optimization, and physical asset management. Recent and ongoing projects cover topics such as transfer learning, domain adaptation, and Bayesian learning, with applications in supply chains and equipment diagnosis. He has focused on both applications and theory, and has published in machine learning conferences such as NeurIPS, ICLR, and UAI, as well as in journals such as Operations Research, IEEE Transactions on Industrial Electronics, and Mechanical Systems and Signal Processing. He has also worked closely with private firms including Nestlé, LG, IBM, General Motors, Magna International, Fujitsu, and State Grid Corporation of China, to name a few.
E-mail: cglee@mie.utoronto.ca
Summary
Reinforcement learning (RL):
Doesn't need previously collected data, but does need a specified environment to interact with
Data collected during interactions with environment
In practice, collecting this interaction data ends up being quite expensive
Goal: reduce the cost of data collection in RL
MDP & RL
Dynamic optimization problem setting: Markov Decision Process (MDP)
State -> decision -> updated state (drawn from a new probability distribution) -> next decision, and so on
Examples: truck maintenance (repair -> workload -> repair -> ...), robot locomotion, games
The space of possible moves is small, but the space of possible games is enormous (e.g., on the order of 10^120 possible chess games)
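To make the MDP structure concrete, here is a minimal Python sketch of a toy machine-repair MDP in the spirit of the truck example; all states, actions, probabilities, and rewards are invented for illustration and are not from the talk.

    import random

    # Hypothetical machine-repair MDP (illustrative numbers only).
    # transitions[state][action] -> list of (probability, next_state, reward)
    transitions = {
        "good": {
            "operate": [(0.9, "good", 10.0), (0.1, "worn", 10.0)],
            "repair":  [(1.0, "good", -5.0)],
        },
        "worn": {
            "operate": [(0.6, "worn", 6.0), (0.4, "broken", 6.0)],
            "repair":  [(1.0, "good", -5.0)],
        },
        "broken": {
            "operate": [(1.0, "broken", 0.0)],
            "repair":  [(1.0, "good", -20.0)],
        },
    }

    def step(state, action):
        # Sample a next state and reward from the transition distribution.
        entries = transitions[state][action]
        probs = [p for p, _, _ in entries]
        _, next_state, reward = random.choices(entries, weights=probs)[0]
        return next_state, reward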
Reinforcement learning algorithms explore the states of an MDP by learning a function that predicts the outcomes of decisions in different states of the environment
Example: Q-learning builds a table of the estimated values of possible actions in each state (a minimal sketch follows below)
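The following is a minimal tabular Q-learning sketch that runs against the toy MDP above; the hyperparameters, episode count, and fixed start state are illustrative assumptions.

    import random
    from collections import defaultdict

    def q_learning(env_step, actions, episodes=500,
                   alpha=0.1, gamma=0.95, epsilon=0.1, horizon=50):
        # Q[(state, action)] -> current estimate of the action's long-run value
        Q = defaultdict(float)
        for _ in range(episodes):
            state = "good"  # assumed start state for the toy MDP
            for _ in range(horizon):
                # Epsilon-greedy: mostly exploit the table, sometimes explore.
                if random.random() < epsilon:
                    action = random.choice(actions)
                else:
                    action = max(actions, key=lambda a: Q[(state, a)])
                next_state, reward = env_step(state, action)
                # Q-learning update toward the bootstrapped one-step target.
                target = reward + gamma * max(Q[(next_state, a)] for a in actions)
                Q[(state, action)] += alpha * (target - Q[(state, action)])
                state = next_state
        return Q

    Q = q_learning(step, actions=["operate", "repair"])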
Challenge: existing algorithms require a large number of samples, so the data-collection cost for many real use cases is unacceptably high
Sample Efficient Algorithms
Scenario: Transfer learning
Learn how to ride a tricycle and transfer the skills to riding a bicycle
Questions:
Which tasks are useful for which other tasks?
How do you transfer knowledge across tasks?
Study: Transfer RL with Multiple Experts
Extend traditional RL with Bayesian Inference
Multiple experts advise the agent; their quality varies
The agent needs to learn which expert to trust (a minimal sketch follows below)
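One simple way to decide whom to trust is to maintain a Bayesian posterior over each expert's reliability and sample an expert to follow (Thompson sampling with Beta posteriors). This is an illustrative sketch under the assumption of binary "the advice paid off" feedback, not necessarily the exact inference used in the work.

    import random

    class ExpertPool:
        # Beta(successes, failures) posterior over each expert's reliability.
        def __init__(self, n_experts):
            self.successes = [1.0] * n_experts  # uniform Beta(1, 1) prior
            self.failures = [1.0] * n_experts

        def choose_expert(self):
            # Thompson sampling: draw a reliability per expert, trust the best draw.
            draws = [random.betavariate(s, f)
                     for s, f in zip(self.successes, self.failures)]
            return max(range(len(draws)), key=lambda i: draws[i])

        def update(self, expert_idx, advice_was_good):
            # Binary feedback: did following this expert's advice pay off?
            if advice_was_good:
                self.successes[expert_idx] += 1.0
            else:
                self.failures[expert_idx] += 1.0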
Extension: situation-aware transfer learning
Choose which expert to trust in a given region of the state space
E.g. Lunar lander navigation
Different agents are good at hovering, landing, etc.
Choice of which agent to trust depends on the current state
E.g. COVID shut-down policies
Experts: SIR models of disease spread trained on different COVID variants
Combine these experts to derive a policy for a new COVID variant (a state-dependent selection sketch follows below)
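The situation-aware variant can be sketched by keeping a separate posterior per region of the state space, so trust can shift between, say, hovering and landing. The region_fn discretization and the reuse of the ExpertPool class above are illustrative assumptions.

    class SituationAwareExperts:
        # One ExpertPool (see sketch above) per discrete region of the state space.
        def __init__(self, n_experts, region_fn):
            self.n_experts = n_experts
            self.region_fn = region_fn  # assumed mapping: state -> region id
            self.pools = {}

        def choose_expert(self, state):
            pool = self.pools.setdefault(self.region_fn(state),
                                         ExpertPool(self.n_experts))
            return pool.choose_expert()

        def update(self, state, expert_idx, advice_was_good):
            pool = self.pools.setdefault(self.region_fn(state),
                                         ExpertPool(self.n_experts))
            pool.update(expert_idx, advice_was_good)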
Scenario: sim-to-real transfer
Idea: train an RL policy in a simulation, then transfer the policy to the real system
There is a mismatch between the dynamics of the simulation and those of the real system
Approach:
Run a basic simulation and then adapt it to reduce its prediction error
The adaptation is an extra learned layer that takes the simulator's state transition as input and outputs a corrected (possibly stochastic) state transition
Goal: learn a better state transition function
This can produce a more accurate simulation from fewer samples of the real system (e.g., a robot), which are expensive to collect
We can then train our policy on this improved simulation (a minimal sketch follows below).
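A minimal sketch of the adaptation idea, under stated assumptions: a residual model is fit on a small batch of real transitions to correct the simulator's next-state prediction. The linear least-squares correction and all names here are illustrative; a small neural network is a common alternative.

    import numpy as np

    class AdaptedSimulator:
        # Wraps a base simulator with a learned residual correction fit
        # on a few real transitions (illustrative linear residual model).
        def __init__(self, sim_step):
            self.sim_step = sim_step  # sim_step(state, action) -> predicted next state
            self.W = None

        def fit(self, states, actions, real_next_states):
            # states: (n, ds), actions: (n, da), real_next_states: (n, ds) arrays.
            preds = np.array([self.sim_step(s, a) for s, a in zip(states, actions)])
            X = np.hstack([states, actions, preds])
            residuals = real_next_states - preds
            # Least-squares fit of residual ~= X @ W.
            self.W, *_ = np.linalg.lstsq(X, residuals, rcond=None)

        def step(self, state, action):
            pred = self.sim_step(state, action)
            if self.W is None:
                return pred
            x = np.concatenate([state, action, pred])
            return pred + x @ self.W

The policy can then be trained against AdaptedSimulator.step instead of the raw simulator, touching the real system only to collect the small fitting batch.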