Variational Inverse Control with Events: A General Framework for Data-Driven Reward Definition

Justin Fu*, Avi Singh*, Dibya Ghosh, Larry Yang, Sergey Levine

University of California, Berkeley

{justinfu, avisingh, dibyaghosh, larrywyang, svlevine}@berkeley.edu

(* equal contribution)


Abstract: The design of a reward function often poses a major practical challenge to real-world applications of reinforcement learning. Approaches such as inverse reinforcement learning attempt to overcome this challenge, but require expert demonstrations, which can be difficult or expensive to obtain in practice. We propose variational inverse control with events (VICE), which generalizes inverse reinforcement learning methods to cases where full demonstrations are not needed, such as when only samples of desired goal states are available. Our method is grounded in an alternative perspective on control and reinforcement learning, where an agent's goal is to maximize the probability that one or more events will happen at some point in the future, rather than maximizing cumulative rewards. We demonstrate the effectiveness of our methods on continuous control tasks, with a focus on high-dimensional observations like images, where rewards are hard or even impossible to specify.


Event-Based Control

In event-based control, we replace the traditional notion of reward with events: binary random variables that denote the occurrence of some desired outcome* (such as reaching a goal location, or maintaining a safety constraint). To obtain a control policy, we condition on the event variables and perform an inference query over the actions. For example, traditional reinforcement learning most closely resembles the query in which we condition on the event happening at all timesteps (see * below for prior work). However, we can also ask the model to select actions such that the event happens at least once, on any timestep, or at a specific timestep, and so on.

* A similar model has been used in previous work (such as Ziebart '10, Rawlik '12, Kappen '09, Toussaint '09, ...) to draw connections between inference and control.
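As a rough formalization (the notation here loosely follows the control-as-inference literature mentioned in the footnote, and is not copied verbatim from the paper): write a trajectory as $\tau = (s_1, a_1, \ldots, s_T, a_T)$ and attach to each timestep a binary event variable $e_t$ with likelihood $p(e_t = 1 \mid s_t, a_t)$. Conditioning on the event holding at every timestep then gives the trajectory posterior

$$ p(\tau \mid e_{1:T} = 1) \;\propto\; p(\tau) \prod_{t=1}^{T} p(e_t = 1 \mid s_t, a_t), $$

which recovers the familiar reward-maximizing query when $\log p(e_t = 1 \mid s_t, a_t)$ plays the role of a per-timestep reward. Other queries are obtained by conditioning on different outcomes for the $e_t$ variables.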

Event Queries: ANY vs. ALL

We believe two queries to be particularly useful:

    • The ALL query, which asks for the event to happen at all timesteps. This is useful for maintaining some desired configuration, such as a robot keeping its balance, or for enforcing a safety constraint.
    • The ANY query, which asks for the event to happen at least once, on any timestep. This is useful for achieving some specified goal, such as navigating to a location or accomplishing a task (a small numerical comparison of the two queries is sketched below).
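To make the distinction concrete, here is a small numerical sketch (our own illustration, not code from the paper): given per-timestep event probabilities $p_t = p(e_t = 1 \mid s_t, a_t)$ along a single trajectory, and treating the event variables as conditionally independent given that trajectory, the two queries score the trajectory quite differently. The function names are hypothetical.

    import numpy as np

    def all_query_log_prob(p_event):
        # log p(e_1 = 1, ..., e_T = 1 | trajectory): the event must hold at every timestep.
        p = np.asarray(p_event, dtype=float)
        return np.sum(np.log(p))

    def any_query_log_prob(p_event):
        # log p(e_t = 1 for at least one t | trajectory) = log(1 - prod_t (1 - p_t)).
        p = np.asarray(p_event, dtype=float)
        return np.log1p(-np.prod(1.0 - p))

    # A trajectory that passes through the goal once and then leaves it:
    p = [0.01, 0.05, 0.95, 0.05, 0.01]
    print(all_query_log_prob(p))  # very negative: the event fails on most timesteps
    print(any_query_log_prob(p))  # near zero: the event almost surely happens at least once

A trajectory that briefly reaches the goal scores well under the ANY query but poorly under the ALL query, which is exactly the qualitative difference we observe in the experiment below.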

We optimized a policy with TRPO for each query on a robotic lobbing task, in which the robot must throw a blue block to a pink goal. We immediately see a qualitative difference between the two policies: the ANY-query policy tends to throw the block through the goal, which maximizes the chance that the block reaches the goal location at least once. The ALL-query policy instead performs a short toss that keeps the block near the goal at all timesteps, but it tends not to reach the goal as often as the ANY-query policy.

ANY Query: Avg Distance 0.61, Min Distance 0.25

ALL Query: Avg Distance 0.59, Min Distance 0.36

VICE: Variational Inverse Control with Events

Traditional inverse reinforcement learning (IRL) allows us to automatically construct reward functions when they are difficult to specify by hand, such as when the observations are complicated (for example, images). However, IRL requires full demonstrations of the task, meaning we must already know how to perform it. One workaround is to gather examples of the desired outcome and train a classifier to detect the goal; however, this comes with its own issues, such as how to mine negatives or how to balance the dataset. Moreover, a clever RL agent might learn to maximize the classifier's reward without actually achieving our desired objective.

In the event-based framework, we can formalize this problem as learning the event probability, using data corresponding to the states and actions at which the event occurs. We see that VICE is able to learn policies that correspond to our true objective (pushing the block to the target), while the pre-trained classifier baseline drives its log-probability objective to the limit without achieving the desired goal, indicating that naive classifiers can easily lead to task misspecification. The binary event indicator baseline (which observes the true event, and is similar to RL from sparse rewards) is able to learn the desired behavior, but it is significantly less sample-efficient and requires heavy supervision (a label indicating whether or not the event happened for every state visited). All videos shown here are from policies trained for 1000 iterations; a minimal sketch of the training procedure is given below the videos.

VICE (our method)

Naive Classifier

Binary Event Indicator
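For concreteness, here is a minimal sketch of the alternating procedure described above, assuming the success examples are given as states: a learned event model for p(e = 1 | s) is fit to distinguish the user-provided success states from states visited by the current policy, and a reward derived from it is handed to a standard RL algorithm. This is a simplified PyTorch illustration with hypothetical names (make_event_model, collect_states, rl_update, and so on); it omits the particular discriminator parameterization and reward form used in the full method, which depend on the chosen query.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def make_event_model(obs_dim, hidden=128):
        # Outputs a logit for p(e = 1 | s).
        return nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def update_event_model(event_model, optimizer, success_states, policy_states):
        # One classifier step: user-provided success examples are positives,
        # states visited by the current policy are negatives.
        pos_logits = event_model(success_states)
        neg_logits = event_model(policy_states)
        loss = F.binary_cross_entropy_with_logits(pos_logits, torch.ones_like(pos_logits)) \
             + F.binary_cross_entropy_with_logits(neg_logits, torch.zeros_like(neg_logits))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    def event_reward(event_model, states):
        # Illustrative reward for the policy update: log p(e = 1 | s) under the learned model.
        with torch.no_grad():
            return F.logsigmoid(event_model(states))

    # Hypothetical outer loop (collect_states and rl_update stand in for a rollout
    # routine and an RL algorithm such as TRPO):
    #
    # for _ in range(num_iterations):
    #     policy_states = collect_states(policy, env)
    #     update_event_model(event_model, optimizer, success_states, policy_states)
    #     policy = rl_update(policy, policy_states, event_reward(event_model, policy_states))

The key difference from the naive pre-trained classifier baseline is that the negatives are re-drawn from the current policy at every iteration, so the event model is continually corrected wherever the policy finds ways to fool it.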