- Reward and goal specification in deep reinforcement learning and robotics remains a major challenge.
- Learn a distance function that represents the number of environment steps required to reach a specific state in an MDP.
- Use the distance function for dense learning signal and goal setting.
- Dynamical distance learning from preferences (DDLfP): Given a goal, use distance as a reward signal for training goal-reaching policies.
- Dynamical distance learning - unsupervised (DDLUS): Use distance as a way of setting the goals. Learn policy using DDL.
Dynamical Distance Learning (DDL)
1. Sample trajectories from the environment
Sample a trajectory 𝜏 using the current policy π and save it in a replay pool.
2. Train the distance function
Update the distance function by minimizing the distance loss. In the following experiments, we use mean squared error.
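The regression step above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: for every ordered pair of states in a trajectory, the regression target is simply the number of steps separating them, and the loss is the mean squared error between the predicted and actual step counts. The function names here are hypothetical.

```python
import numpy as np

def distance_targets(trajectory):
    """For every ordered pair (i, j) with i <= j in a trajectory,
    the regression target is the step count j - i."""
    T = len(trajectory)
    pairs, targets = [], []
    for i in range(T):
        for j in range(i, T):
            pairs.append((trajectory[i], trajectory[j]))
            targets.append(j - i)
    return pairs, np.array(targets, dtype=np.float32)

def distance_loss(d, pairs, targets):
    """Mean squared error between the learned distance d(s_i, s_j)
    and the actual number of environment steps."""
    preds = np.array([d(si, sj) for si, sj in pairs], dtype=np.float32)
    return float(np.mean((preds - targets) ** 2))
```

In practice `d` would be a neural network trained by gradient descent on this loss over minibatches sampled from the replay pool.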
3. Choose new goal
Choose a goal state from the recent experience buffer D. Below, we propose two different strategies for choosing goals: semi-supervised (DDLfP) and unsupervised (DDLUS).
4. Improve policy
Update the policy π by minimizing the policy loss. The update depends on the Reinforcement Learning algorithm of our choice. In practice, we use Soft Actor-Critic.
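The four steps above can be written as one outer loop. The sketch below is schematic: the rollout, distance-training, goal-proposal, and policy-update routines are passed in as callables, standing in for the environment interaction, the MSE regression, the DDLfP/DDLUS goal choice, and the Soft Actor-Critic update respectively.

```python
def ddl_loop(rollout, train_distance, choose_goal, improve_policy,
             num_iterations):
    """Schematic DDL outer loop. Each argument is a placeholder
    callable for the corresponding step of the algorithm."""
    replay_pool = []
    goal = None
    for _ in range(num_iterations):
        # 1. Sample a trajectory with the current policy, save it.
        trajectory = rollout(goal)
        replay_pool.append(trajectory)
        # 2. Fit the distance function (e.g. MSE on step counts).
        train_distance(replay_pool)
        # 3. Propose a new goal from recent experience (DDLfP or DDLUS).
        goal = choose_goal(replay_pool)
        # 4. Improve the policy with an off-policy RL update (e.g. SAC),
        #    using the negated distance to the goal as the reward.
        improve_policy(replay_pool, goal)
    return replay_pool, goal
```

The structure mirrors the numbered steps exactly; any off-policy RL algorithm could replace SAC in step 4.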
DDLfP: Dynamical Distance Learning from Preferences
A human operator chooses the most preferred state among previously seen states. Inferring only the goal state requires far fewer operator queries than inferring the complete reward function (Christiano et al., 2017).
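A preference query of this kind reduces to picking one candidate by index. The sketch below assumes a hypothetical `ask_operator` interface (in the real experiments this is a text prompt showing images); it is an illustration of the query structure, not the authors' code.

```python
def query_preference(recent_final_states, last_goal, ask_operator):
    """Show the final states of the most recent rollouts plus the
    previous goal, and let the operator pick the new goal by index.
    `ask_operator` abstracts the (text- or image-based) interface."""
    candidates = list(recent_final_states) + [last_goal]
    index = ask_operator(candidates)  # operator inputs an index
    return candidates[index]
```

Because the operator only ranks states the policy has already visited, no reward labels or demonstrations are needed.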
DDLUS: Dynamical Distance Learning - Unsupervised
Many challenging skills amount to reaching a distant goal state, and simple target-proposal heuristics often lead to exploration of useful skills. By choosing the goal to be the furthest explored state under the current distance function, DDLUS acquires effective running gaits and pole-balancing skills in a variety of simulated settings:
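The DDLUS goal-proposal rule is a one-line argmax over the experience buffer. A minimal sketch, with `distance` standing in for the learned distance function:

```python
def ddlus_goal(start_state, candidate_states, distance):
    """DDLUS goal proposal: pick the state in the buffer that is
    farthest from the start state under the learned dynamical distance."""
    return max(candidate_states, key=lambda s: distance(start_state, s))
```

As the policy reaches farther states and the distance estimates improve, the proposed goals move progressively outward, which is what drives the unsupervised exploration.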
Vision-Based Manipulation from Human Preferences
Finally, we applied DDLfP to a real-world vision-based robotic manipulation task in a 9-DoF “DClaw” hand domain. The manipulation task requires the hand to rotate a valve 180 degrees. The operator is queried for a new goal every 10K environment steps. Both the vision- and state-based experiments with the real robot use 10 human queries over an 8-hour training period, which is low enough that the queries could easily be provided by a human user. To make the experiment systematically reproducible and easier to run, we evaluate the queries automatically by choosing the state with the lowest angle error to the goal, though an extension to use queries from a real human user would be straightforward.
Rollouts from a DDLfP policy trained in DClaw environment. The goals are chosen by querying a human operator for a preference once every 10K environment steps.
Observations seen by the human operator during the preference queries.
A series of observations shown to the human operator during the preference queries are presented on the left. Each image row presents the set of images shown to the human operator on a single query round. On each row, the first 10 images correspond to the last states of the most recent rollouts and the right-most image corresponds to the last goal. For each query, the human operator picks a new goal by inputting its index (between 0 and 10) into a text-based interface. The goals selected by the human operator are highlighted with white borders.
Learning Locomotion from Preferences
We can guide learning in a desired direction by incorporating human-chosen goals into the distance-learning loop. This result is substantially better than prior methods based on learning from preferences. In (Christiano et al., 2017), the authors learned a reward function from preferences and used it to learn a policy. Their method required 750 preference queries, whereas ours used 100 for all tasks.
All videos are available at this URL.
Rollouts from a policy trained with DDLfP in Hopper-v3 environment using goals provided by computational expert. Preferences guide the exploration into preferred direction (in this case, into the direction of positive x-axis). Videos for the rest of the seeds are available at this URL.
Rollouts from a policy trained with DDLfP in Ant-v3 environment using goals provided by computational expert. Preferences guide the exploration into preferred direction (in this case, into the direction of positive x-axis). Videos for the rest of the seeds are available at this URL.
Rollouts from a policy trained with DDLUS in Ant-v3 environment without supervision.