Daniel Shin, Daniel S. Brown, Anca Dragan
TL;DR: We propose a novel framework, Offline Preference-based Apprenticeship Learning (OPAL), for the offline reinforcement learning setting in which we assume access to a dataset of transitions but no reward labels. We first query a human for preferences over behaviors drawn from the dataset and use these preference labels to train a reward function. Using this reward function, we relabel all transitions in the offline dataset with rewards. Finally, we use offline RL to optimize a policy. Importantly, the entire OPAL framework is completely offline and requires no interaction with the environment.
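To make the reward-learning and relabeling steps concrete, here is a minimal sketch, assuming a PyTorch reward network and pairwise preferences over trajectory segments scored with a Bradley-Terry model (a common choice in preference-based reward learning). The names `RewardNet`, `preference_loss`, and `relabel` are illustrative, not the OPAL codebase.

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Maps a (state, action) pair to a scalar reward estimate."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        # obs: (..., obs_dim), act: (..., act_dim) -> reward: (...)
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def preference_loss(reward_net, seg_a, seg_b, label):
    """Bradley-Terry loss on a batch of segment pairs.

    seg_a, seg_b: (obs, act) tensors of shape (batch, T, dim);
    label: (batch,) with 1 if segment A is preferred, 0 otherwise.
    """
    ret_a = reward_net(*seg_a).sum(dim=-1)        # summed predicted reward of A
    ret_b = reward_net(*seg_b).sum(dim=-1)        # summed predicted reward of B
    logits = torch.stack([ret_b, ret_a], dim=-1)  # class 1 <=> "A preferred"
    return nn.functional.cross_entropy(logits, label.long())

@torch.no_grad()
def relabel(reward_net, obs, act):
    """Replace the dataset's missing rewards with learned reward estimates."""
    return reward_net(obs, act)
```

After training on the labeled pairs, `relabel` is run over every transition in the offline dataset, and the relabeled data is handed to an off-the-shelf offline RL algorithm.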
Given an offline dataset of behavior-agnostic data, we wanted to see whether we could optimize different policies corresponding to different human preferences. We optimized the following policies: balance, where the supervisor prefers the pole to be balanced upright and the cart to stay in the middle of the track; clockwise windmill, where the supervisor prefers the pole to swing around as fast as possible in the clockwise direction; and counterclockwise windmill, which is identical to clockwise windmill except that the preference is for the pole to swing in the counterclockwise direction. OPAL learns policies for all three behaviors.
We took the Maze2d-Open environment and dataset and used it to teach an agent to patrol the domain in counterclockwise orbits. The dataset only contains the agent moving to randomly chosen goal locations, so this domain highlights both the benefit of stitching together data from an offline dataset and the benefit of learning a shaped reward function rather than simply using a goal classifier as the reward.
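As a hypothetical illustration of the relabeling step on this domain, the sketch below loads the Maze2d-Open data through the D4RL interface and overwrites its reward channel with the output of a learned reward model such as the `RewardNet` sketched above; the variable names and the untrained model are assumptions made purely so the snippet runs.

```python
import gym
import d4rl  # registers maze2d-open-v0 and its offline dataset
import numpy as np
import torch

env = gym.make("maze2d-open-v0")
data = env.get_dataset()  # dict with 'observations', 'actions', 'rewards', ...

obs = torch.as_tensor(data["observations"], dtype=torch.float32)
act = torch.as_tensor(data["actions"], dtype=torch.float32)

# reward_net is a preference-trained model such as the RewardNet sketched
# above (instantiated untrained here only so the snippet runs end to end).
reward_net = RewardNet(obs.shape[-1], act.shape[-1])

with torch.no_grad():
    data["rewards"] = reward_net(obs, act).numpy().astype(np.float32)

# The relabeled transitions can now be passed to any offline RL algorithm
# exactly as if the rewards had come from the environment.
```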
We created a new variant of the Maze2d-Medium task in which there is a constraint region in the middle of the maze that the supervisor does not want the robot to enter. The figure below shows this scenario: the highlighted yellow region is traversable by the agent, but entering it is undesirable. After only 25 active queries selected via ensemble disagreement, we obtain the behavior shown below, where the offline RL policy has learned to reach the goal while avoiding the constraint region.
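For the active querying step, the sketch below shows one standard way to pick queries by ensemble disagreement: train several reward networks and ask the human about the candidate segment pair on which the ensemble's predicted preference probabilities vary the most. The variance-of-probability measure and the helper names are illustrative assumptions, not necessarily the exact criterion used in our experiments.

```python
import torch

def predicted_pref_prob(reward_net, seg_a, seg_b):
    """P(segment A preferred over B) under one ensemble member (Bradley-Terry)."""
    ret_a = reward_net(*seg_a).sum(dim=-1)
    ret_b = reward_net(*seg_b).sum(dim=-1)
    return torch.sigmoid(ret_a - ret_b)

@torch.no_grad()
def select_query(ensemble, candidate_pairs):
    """Return the index of the candidate (seg_a, seg_b) pair with the highest
    disagreement, measured as the variance of P(A > B) across the ensemble."""
    disagreements = []
    for seg_a, seg_b in candidate_pairs:
        probs = torch.stack(
            [predicted_pref_prob(m, seg_a, seg_b) for m in ensemble]
        )
        disagreements.append(probs.var(dim=0).mean())
    return int(torch.argmax(torch.stack(disagreements)))
```

The selected pair is shown to the supervisor, the new label is added to the preference dataset, and the reward ensemble is retrained before the next query.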