Dynamical Distance Learning for
Semi-Supervised and Unsupervised Skill Discovery
Anonymous Author(s)
Under Review for the Eighth International Conference on Learning Representations (ICLR 2020)
Sample a trajectory 𝜏 using the current policy π and save it in a replay pool.
Update the distance function by minimizing the distance loss. In the following experiments, we use mean squared error.
Choose a goal state from the recent experience buffer D. Below, we propose two different strategies for choosing the goal: semi-supervised (DDLfP) and unsupervised (DDLuS).
Update the policy π by minimizing the policy loss. The update depends on the reinforcement learning algorithm of choice; in practice, we use Soft Actor-Critic. A minimal sketch of one iteration of the full loop is given below.
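The following sketch shows one iteration of this training loop, with the replay pool, distance network, goal proposer, and policy update passed in as duck-typed arguments. All names and interfaces are illustrative assumptions, not the released implementation.

```python
import numpy as np

def ddl_iteration(sample_trajectory, replay_pool, distance_fn, propose_goal,
                  policy_update, batch_size=256):
    """One iteration of the dynamical distance learning loop (illustrative sketch).

    sample_trajectory: callable returning a trajectory from the current policy pi.
    replay_pool:       object with .add(traj) and .sample_pairs(n) -> (s_i, s_j, dt).
    distance_fn:       regressor with .fit(inputs, targets) and .predict(inputs).
    propose_goal:      callable implementing DDLfP or DDLuS goal selection.
    policy_update:     callable running e.g. a Soft Actor-Critic update with the
                       given reward function.
    """
    # 1. Sample a trajectory using the current policy and save it in the replay pool.
    replay_pool.add(sample_trajectory())

    # 2. Update the distance function: regress d(s_i, s_j) onto the temporal
    #    separation dt of state pairs from the same trajectory (mean squared error).
    s_i, s_j, dt = replay_pool.sample_pairs(batch_size)
    distance_fn.fit(np.concatenate([s_i, s_j], axis=-1), dt)

    # 3. Choose a goal state from recent experience (DDLfP or DDLuS, see below).
    goal = propose_goal(replay_pool, distance_fn)

    # 4. Update the policy by minimizing the policy loss; here, a Soft Actor-Critic
    #    update with reward r(s) = -d(s, goal).
    policy_update(reward_fn=lambda s: -distance_fn.predict(
        np.concatenate([s, np.broadcast_to(goal, s.shape)], axis=-1)))
```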
A human operator chooses the most preferred state among previously seen states. Inferring only the goal state requires far fewer operator queries than inferring the complete reward function (Christiano et al., 2017).
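As an illustration, here is a minimal sketch of such a preference-based goal proposal, assuming a text-based interface in which the operator picks one candidate by index; the function name and interface are illustrative, not the released implementation.

```python
def propose_goal_from_preferences(recent_final_states, current_goal):
    """DDLfP goal proposal (illustrative sketch): ask a human operator to pick
    the most preferred state among the last states of recent rollouts and the
    current goal, via a simple text-based interface."""
    candidates = list(recent_final_states) + [current_goal]
    for i, state in enumerate(candidates):
        print(f"[{i}] candidate state: {state}")
    index = int(input("Index of the preferred state: "))
    return candidates[index]
```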
Many challenging skills amount to reaching a distant goal state, and simple goal proposal heuristics often lead to the discovery of useful skills. DDLuS can acquire effective running gaits and pole-balancing skills in a variety of simulated settings by choosing the goal to be the furthest explored state under the current distance function:
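Concretely, a minimal sketch of this unsupervised goal proposal, assuming `distance_fn` predicts the learned distance for a concatenated state pair and `states` holds the recent experience (names are illustrative):

```python
import numpy as np

def propose_goal_unsupervised(states, initial_state, distance_fn):
    """DDLuS goal proposal (illustrative sketch): pick the state in recent
    experience that is furthest from the initial state under the current
    learned dynamical distance, i.e. g = argmax_s d(s_0, s)."""
    pairs = np.concatenate(
        [np.broadcast_to(initial_state, states.shape), states], axis=-1)
    distances = np.ravel(distance_fn.predict(pairs))  # d(s_0, s) for each s
    return states[np.argmax(distances)]
```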
We applied DDLfP to a real-world vision-based robotic manipulation task in a 9-DoF “DClaw” hand domain. The manipulation task requires the hand to rotate a valve 180 degrees. The operator is queried for a new goal every 10K environment steps. Both the vision- and state-based experiments with the real robot use 10 human queries during an 8-hour training period.
Rollouts from a DDLfP policy trained in the DClaw environment. The goals are chosen by querying a human operator for a preference once every 10K environment steps.
Observations seen by the human operator during the preference queries.
A series of observations shown to the human operator during the preference queries is presented on the left. Each image row shows the set of images presented to the human operator on a single query round. On each row, the first 10 images correspond to the last states of the most recent rollouts, and the right-most image corresponds to the latest goal. For each query, the human operator picks a new goal by inputting its index (between 0 and 10) into a text-based interface. The goals selected by the human are highlighted with white borders.
We can guide learning in a desired direction by incorporating human-chosen goals into the distance learning loop. This result is substantially better than prior methods based on learning from preferences: Christiano et al. (2017) learned a reward function from preferences and used it to learn a policy. Their method required 750 preference queries, whereas we used 100 for all the tasks.
All the videos are available at this URL.
Rollouts from a policy trained with DDLfP in the Hopper-v3 environment using goals provided by a computational expert. The preferences guide exploration in the preferred direction (in this case, along the positive x-axis). Videos for the rest of the seeds are available at this URL.
Rollouts from a policy trained with DDLfP in the HalfCheetah-v3 environment using goals provided by a computational expert. The preferences guide exploration in the preferred direction (in this case, along the positive x-axis). Videos for the rest of the seeds are available at this URL.
Rollouts from a policy trained with DDLfP in the Ant-v3 environment using goals provided by a computational expert. The preferences guide exploration in the preferred direction (in this case, along the positive x-axis). Videos for the rest of the seeds are available at this URL.
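In these simulated experiments, the computational expert replaces the human operator in the preference queries. A minimal sketch of such a synthetic preference oracle, assuming it simply prefers the candidate that has progressed furthest along the positive x-axis (the preferred direction in the captions above); the exact criterion used in the experiments may differ:

```python
def synthetic_preference_oracle(candidate_states, x_index=0):
    """Illustrative stand-in for the computational expert: return the index of
    the candidate state with the largest x-coordinate. The x_index argument
    (assumed here to be the first state dimension) selects that coordinate."""
    return max(range(len(candidate_states)),
               key=lambda i: candidate_states[i][x_index])
```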
We trained DDLuS in four OpenAI Gym environments.
All the videos are available at this URL.
Rollouts from a policy trained with DDLuS in the Hopper-v3 environment without supervision. 4/10 randomly chosen seeds. Videos for the rest of the seeds are available at this URL.
Rollouts from a policy trained with DDLuS in the HalfCheetah-v3 environment without supervision. 4/5 randomly chosen seeds. Videos for the rest of the seeds are available at this URL.
Rollouts from a policy trained with DDLuS in the Ant-v3 environment without supervision. 4/5 randomly chosen seeds. Videos for the rest of the seeds are available at this URL.