You Only Live Once: Single-Life Reinforcement Learning
Annie S. Chen¹, Archit Sharma¹, Sergey Levine², Chelsea Finn¹
¹Stanford University, ²UC Berkeley
Neural Information Processing Systems (NeurIPS) 2022
Single-Life Reinforcement Learning
Motivation: RL algorithms are designed to learn a performant policy that can repeatedly complete a task, but many real-world situations involve on-the-fly adaptation that requires solving a task successfully once without interventions.
Example: A disaster relief robot tasked with retrieving an item has a single trial to complete its mission and may encounter novel obstacles in a previously-experienced building.
We model these situations with the following problem setting:
Given some prior data, the agent has a single "life", i.e., one trial, in which to autonomously adapt to a novel scenario and complete the task once.
We call this problem setting single-life reinforcement learning (SLRL). SLRL provides a natural setting in which to study autonomous adaptation to novel situations in reinforcement learning.
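To make the setting concrete, the single-life interaction can be sketched as one uninterrupted rollout with no resets. This is a minimal sketch with a hypothetical `env_reset`/`env_step` interface (the actual benchmarks use standard Gym-style environments):

```python
def run_single_life(env_reset, env_step, policy, max_steps=1000):
    """Roll out one uninterrupted trial ("life"): no resets until the
    task is solved or the step budget runs out."""
    state = env_reset()
    for t in range(max_steps):
        action = policy(state)
        state, done = env_step(state, action)
        if done:  # task completed once -- the single life is over
            return t + 1  # number of steps taken
    return None  # budget exhausted; the agent never completed the task
```

Unlike episodic evaluation, there is no averaging over trials: the agent either completes the task within this one life or it does not.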
Core challenge of SLRL: Adapting to the novelty online in the agent's single life requires recovering from out-of-distribution states without interventions.
When informative shaped rewards are unavailable, as is the case in many real-world situations, episodic RL methods will not encourage the agent to recover from out-of-distribution states.
For example, after pre-training on data from an environment without hurdles, the novelty introduced by hurdles causes the cheetah agent to get stuck when naively fine-tuning with RL within its single life:
Experimental Domains
Tabletop-Organization
Task: Move mug to goal position
Target env: New initial mug positions
Pointmass
Task: Navigate from (0, 0) to (100, 0)
Target env: Wind added in the y-direction
Cheetah
Source task: Run a certain distance
Target task: Include hurdles
Kitchen
Source task: Close the cabinet or microwave
Target task: Close both
How to approach single-life RL?
Resets in episodic RL prevent algorithms from needing to recover, whereas single-life RL demands the agent find its way back to good states on its own.
One potential option for providing the desired guidance is reward shaping toward the agent's distribution of prior experience. Adversarial imitation learning (AIL) approaches can provide such reward shaping, but they have two shortcomings in the SLRL setting:
They assume access to expert demonstrations.
They aim to match the entire distribution of the prior data, rather than guiding the agent toward task completion.
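As a sketch of how AIL-style reward shaping works, the snippet below trains a logistic-regression discriminator (a simple stand-in for the neural network used in practice) to distinguish prior-data states from online states, and uses its logit as the shaped reward. The function names are illustrative, not from the released code:

```python
import numpy as np

def train_discriminator(prior_states, online_states, lr=0.1, iters=500):
    """Logistic regression with labels 1 = prior data, 0 = online
    experience, trained by batch gradient descent on the BCE loss."""
    X = np.vstack([prior_states, online_states])
    y = np.concatenate([np.ones(len(prior_states)),
                        np.zeros(len(online_states))])
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # D(s)
        g = p - y                               # dBCE/dlogit
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def shaped_reward(state, w, b):
    """AIL-style reward log D(s) - log(1 - D(s)), which simplifies to
    the discriminator logit: higher for states that look like prior data."""
    return float(state @ w + b)
```

States resembling the prior data receive higher reward, which is what pulls the agent back from out-of-distribution states.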
We propose Q-Weighted Adversarial Imitation Learning (QWALE).
Key idea: weight states in the prior data by estimated Q-value.
Incentivizes recovering from out-of-distribution states by guiding the agent toward states in the prior data with high Q-values, i.e., states from which the task can still be completed.
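One simple way to realize this weighting (an illustrative scheme; see the paper for the exact formulation) is to convert Q-value estimates over the prior-data states into softmax weights, which can then re-weight the discriminator's positive examples:

```python
import numpy as np

def q_weights(q_values, temperature=1.0):
    """Turn Q-value estimates for prior-data states into normalized
    weights via a softmax, so higher-value states count more when
    training the discriminator."""
    z = np.asarray(q_values, dtype=float) / temperature
    z -= z.max()  # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum()
```

With such weights, the shaped reward pulls the agent not just toward previously visited states, but preferentially toward those closer to task completion.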
Distribution matching is just one possible class of approaches for this problem setting. QWALE provides a baseline for future work on algorithms that better adapt to novelty online and recover from out-of-distribution states.
Does QWALE help agents learn to recover from novel situations?
Key takeaway: the agent gets stuck with naive SAC fine-tuning, but QWALE guides it back toward the prior data and on to task completion.
In the Cheetah environment, QWALE enables the agent to recover toward the prior data and make progress toward the goal, as seen below: