Sample Efficient Reinforcement Learning
Abstract:
In many real-world reinforcement learning applications, training an agent on the real environment is often prohibitively expensive, if not impossible. We have therefore studied various ways of speeding up the convergence of learning with limited samples. In this talk, we will present recently completed work as well as an ongoing effort to improve the efficiency of reinforcement learning. Specifically, we will share our methods (1) to utilize advice from multiple experts of varying quality, (2) to transfer previously gained knowledge to a novel task, and (3) to train on a virtual environment with limited samples from the physical one. Finally, we will share our efforts on multi-agent reinforcement learning with a particular focus on mechanism design.
Bio:
Chi-Guhn Lee is a Professor of Industrial Engineering and the Director of the Centre for Maintenance Optimization and Reliability Engineering (C-MORE) at the University of Toronto. His research interests include reinforcement learning, Markov decision processes, deep learning, supply chain optimization, and physical asset management. Recent and ongoing projects cover topics such as transfer learning, domain adaptation, and Bayesian learning, with applications in supply chains and equipment diagnosis. He has focused on both applications and theory, and has published in machine learning conferences such as NeurIPS, ICLR, and UAI, as well as in journals such as Operations Research, IEEE Transactions on Industrial Electronics, and Mechanical Systems and Signal Processing. He has also worked closely with private firms including Nestlé, LG, IBM, General Motors, Magna International, Fujitsu, and State Grid Corporation of China, to name a few.
E-mail: cglee@mie.utoronto.ca
Summary
Reinforcement learning (RL):
Doesn't need previously collected data, but does need a specified environment to interact with
Data collected during interactions with environment
In practice, collecting this interaction data ends up being quite expensive
Goal: reduce the cost of data collection in RL
MDP & RL
Dynamic optimization problem setting: Markov Decision Process (MDP)
State -> decision -> updated state (drawn from a new probability distribution) -> next decision, and so on
Examples: truck maintenance (repair -> workload -> repair -> ...), robot locomotion, games
The space of possible moves is small, but the space of possible games is enormous (e.g., on the order of 10^120 possible chess games)
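To make the MDP structure concrete, here is a minimal Python sketch of a toy machine-repair MDP in the spirit of the truck example; all states, actions, probabilities, and rewards are invented for illustration and are not from the talk.

    import random

    # Hypothetical machine-repair MDP (illustrative numbers only).
    # transitions[state][action] -> list of (probability, next_state, reward)
    transitions = {
        "good": {
            "operate": [(0.9, "good", 10.0), (0.1, "worn", 10.0)],
            "repair":  [(1.0, "good", -5.0)],
        },
        "worn": {
            "operate": [(0.6, "worn", 6.0), (0.4, "broken", 6.0)],
            "repair":  [(1.0, "good", -5.0)],
        },
        "broken": {
            "operate": [(1.0, "broken", 0.0)],
            "repair":  [(1.0, "good", -20.0)],
        },
    }

    def step(state, action):
        # Sample a next state and reward from the transition distribution.
        entries = transitions[state][action]
        probs = [p for p, _, _ in entries]
        _, next_state, reward = random.choices(entries, weights=probs)[0]
        return next_state, reward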
Reinforcement learning algorithms explore the states of an MDP by learning a function that predicts the outcomes of decisions in different states of the environment
Example: Q-learning builds a table of the estimated values of possible actions in each state (a minimal sketch follows below)
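The following is a minimal tabular Q-learning sketch that runs against the toy MDP above; the hyperparameters, episode count, and fixed start state are illustrative assumptions.

    import random
    from collections import defaultdict

    def q_learning(env_step, actions, episodes=500,
                   alpha=0.1, gamma=0.95, epsilon=0.1, horizon=50):
        # Q[(state, action)] -> current estimate of the action's long-run value
        Q = defaultdict(float)
        for _ in range(episodes):
            state = "good"  # assumed start state for the toy MDP
            for _ in range(horizon):
                # Epsilon-greedy: mostly exploit the table, sometimes explore.
                if random.random() < epsilon:
                    action = random.choice(actions)
                else:
                    action = max(actions, key=lambda a: Q[(state, a)])
                next_state, reward = env_step(state, action)
                # Q-learning update toward the bootstrapped one-step target.
                target = reward + gamma * max(Q[(next_state, a)] for a in actions)
                Q[(state, action)] += alpha * (target - Q[(state, action)])
                state = next_state
        return Q

    Q = q_learning(step, actions=["operate", "repair"])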
Challenge: existing algorithms require a large number of samples, so the data-collection cost for many real use cases is unacceptably high
Sample Efficient Algorithms
Scenario: Transfer learning
Learn how to ride a tricycle and transfer the skills to riding a bicycle
Questions:
Which tasks are useful for which other tasks?
How do you transfer knowledge across tasks?
Study: Transfer RL with Multiple Experts
Extend traditional RL with Bayesian Inference
Multiple experts advise the agent; their quality varies
The agent needs to learn which expert to trust (a minimal sketch follows below)
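One simple way to decide whom to trust is to maintain a Bayesian posterior over each expert's reliability and sample an expert to follow (Thompson sampling with Beta posteriors). This is an illustrative sketch under the assumption of binary "the advice paid off" feedback, not necessarily the exact inference used in the work.

    import random

    class ExpertPool:
        # Beta(successes, failures) posterior over each expert's reliability.
        def __init__(self, n_experts):
            self.successes = [1.0] * n_experts  # uniform Beta(1, 1) prior
            self.failures = [1.0] * n_experts

        def choose_expert(self):
            # Thompson sampling: draw a reliability per expert, trust the best draw.
            draws = [random.betavariate(s, f)
                     for s, f in zip(self.successes, self.failures)]
            return max(range(len(draws)), key=lambda i: draws[i])

        def update(self, expert_idx, advice_was_good):
            # Binary feedback: did following this expert's advice pay off?
            if advice_was_good:
                self.successes[expert_idx] += 1.0
            else:
                self.failures[expert_idx] += 1.0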
Extension: situation-aware transfer learning
Choose which expert to trust in a given region of the state space
E.g. Lunar lander navigation
Different agents are good at hovering, landing, etc.
Choice of which agent to trust depends on the current state
E.g. COVID shut-down policies
Experts: SIR models of disease spread trained on different COVID variants
Combine these experts to derive a policy for a new COVID variant (a state-dependent selection sketch follows below)
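The situation-aware variant can be sketched by keeping a separate posterior per region of the state space, so trust can shift between, say, hovering and landing. The region_fn discretization and the reuse of the ExpertPool class above are illustrative assumptions.

    class SituationAwareExperts:
        # One ExpertPool (see sketch above) per discrete region of the state space.
        def __init__(self, n_experts, region_fn):
            self.n_experts = n_experts
            self.region_fn = region_fn  # assumed mapping: state -> region id
            self.pools = {}

        def choose_expert(self, state):
            pool = self.pools.setdefault(self.region_fn(state),
                                         ExpertPool(self.n_experts))
            return pool.choose_expert()

        def update(self, state, expert_idx, advice_was_good):
            pool = self.pools.setdefault(self.region_fn(state),
                                         ExpertPool(self.n_experts))
            pool.update(expert_idx, advice_was_good)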
Scenario: sim-to-real transfer
Idea: train an RL policy in a simulation, then transfer the policy to the real system
There is a mismatch between the dynamics of the simulation and those of the real system
Approach:
Run a basic simulation and then adapt it to reduce its prediction error
The adaptation is an extra learned layer that takes the simulator's state transition as input and outputs a corrected (possibly stochastic) state transition
Goal: learn a better state transition function
This can produce a more accurate simulation from fewer samples of the real system (e.g., a robot), which are expensive to collect
We can then train our policy on this improved simulation (a minimal sketch follows below).
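A minimal sketch of the adaptation idea, under stated assumptions: a residual model is fit on a small batch of real transitions to correct the simulator's next-state prediction. The linear least-squares correction and all names here are illustrative; a small neural network is a common alternative.

    import numpy as np

    class AdaptedSimulator:
        # Wraps a base simulator with a learned residual correction fit
        # on a few real transitions (illustrative linear residual model).
        def __init__(self, sim_step):
            self.sim_step = sim_step  # sim_step(state, action) -> predicted next state
            self.W = None

        def fit(self, states, actions, real_next_states):
            # states: (n, ds), actions: (n, da), real_next_states: (n, ds) arrays.
            preds = np.array([self.sim_step(s, a) for s, a in zip(states, actions)])
            X = np.hstack([states, actions, preds])
            residuals = real_next_states - preds
            # Least-squares fit of residual ~= X @ W.
            self.W, *_ = np.linalg.lstsq(X, residuals, rcond=None)

        def step(self, state, action):
            pred = self.sim_step(state, action)
            if self.W is None:
                return pred
            x = np.concatenate([state, action, pred])
            return pred + x @ self.W

The policy can then be trained against AdaptedSimulator.step instead of the raw simulator, touching the real system only to collect the small fitting batch.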