A Dual Representation Framework for Robot Learning with Human Guidance
Ruohan Zhang*, Dhruva Bansal*, Yilun Hao*, Ayano Hiranaka, Jialu Gao, Chen Wang
Roberto Martín-Martín, Li Fei-Fei, Jiajun Wu (*equal contribution)
Stanford Vision and Learning Lab (contact: zharu@stanford.edu)
Conference on Robot Learning (CoRL) 2022
Summary
Learning interactively from human evaluation and preference calls for sample-efficient robot learning algorithms.
Learning is more efficient if the agent is augmented with a representation that aligns better with the human's internal representation, in the form of a symbolic scene graph.
The robot uses this high-level representation to model human guidance behaviors, which enables the robot to actively query humans.
The robot keeps a low-level, fine-grained state and action space for learning continuous control policies using human guidance.
The Dual Representation Framework
The ability to interactively learn from human evaluation and human preference is important for robot learning. But human guidance is an expensive resource, calling for methods that can learn efficiently. We argue that learning is more efficient if the agent is equipped with a representation that aligns better with the human's internal representation.
In this framework, the robotic learning agent uses a low-level, fine-grained state and action space for learning continuous control policies (evaluative feedback) or reward functions (preference learning).
Meanwhile, the agent keeps a symbolic scene graph as a high-level representation of human internal states, in which objects are represented as nodes and relations between objects are represented as edges. The agent uses this high-level representation to actively query human trainers for their guidance during training.
We showcase this framework in five continuous control tasks shown below.
Shown above is the scene graph for the placing tasks. This scene graph can be used to define an abstract state: a binary vector in which each dimension (True or False) represents a unary state of a single object or a pairwise semantic relation between two objects.
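As a concrete illustration, here is a minimal Python sketch of how such an abstract state could be flattened out of a scene graph. The predicate names and helper signatures are illustrative assumptions, not the paper's exact vocabulary.

```python
from itertools import permutations

UNARY = ["grasped", "moving"]        # hypothetical per-object predicates
BINARY = ["on_top_of", "inside"]     # hypothetical pairwise relations

def abstract_state(objects, unary_pred, binary_pred):
    """Flatten a scene graph into a tuple of booleans.

    unary_pred(p, o)     -> bool: does predicate p hold for object o?
    binary_pred(r, a, b) -> bool: does relation r hold between a and b?
    """
    bits = []
    for o in objects:
        bits.extend(bool(unary_pred(p, o)) for p in UNARY)
    for a, b in permutations(objects, 2):
        bits.extend(bool(binary_pred(r, a, b)) for r in BINARY)
    return tuple(bits)  # hashable, so it can index per-abstract-state statistics
```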
Dual Representation-based Evaluative Feedback (DREF)
For evaluative feedback, we choose Deep TAMER + SAC as the backbone for learning the policy. In addition to the networks that predict Q-values, we add another network that predicts the human feedback value H(s, a).
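A minimal PyTorch sketch of such a feedback head follows; the architecture and layer sizes are placeholder assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FeedbackPredictor(nn.Module):
    """Regresses the human's evaluative feedback signal H(s, a)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        # Concatenate state and action, as in a standard Q-network input
        return self.net(torch.cat([state, action], dim=-1))

# Trained alongside the SAC critics by regressing toward the feedback labels
# collected from the human trainer, e.g. with an MSE loss.
```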
This active learning problem can be formulated as a multi-armed bandit problem with abstract states.
For each abstract state g, DREF estimates the upper confidence bound (UCB1) of the human feedback prediction error (FPE).
FPE is calculated as the average feedback prediction error over all the low-level states encountered that belong to this abstract state.
DREF queries for feedback only in the abstract states with the largest UCB1 values, i.e., those with the most uncertainty.
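Putting these pieces together, the query rule can be sketched as a UCB1 bandit over abstract states. The exploration constant and the top-k thresholding below are assumptions for illustration; the source only specifies that states with the largest UCB1 values are queried.

```python
import math
from collections import defaultdict

class QueryScheduler:
    def __init__(self, c=math.sqrt(2), top_k=3):
        self.c, self.top_k = c, top_k
        self.count = defaultdict(int)      # visits to each abstract state g
        self.fpe_sum = defaultdict(float)  # accumulated |H_pred - H_true| per g
        self.total = 0                     # total visits across all states

    def update(self, g, fpe):
        """Record the feedback prediction error observed in abstract state g."""
        self.count[g] += 1
        self.fpe_sum[g] += fpe
        self.total += 1

    def ucb1(self, g):
        if self.count[g] == 0:
            return float("inf")            # unseen abstract states get queried first
        mean_fpe = self.fpe_sum[g] / self.count[g]
        return mean_fpe + self.c * math.sqrt(math.log(self.total) / self.count[g])

    def should_query(self, g, known_states):
        """Query the human only if g ranks among the top-k UCB1 scores."""
        if not known_states:
            return True
        scores = sorted((self.ucb1(x) for x in known_states), reverse=True)
        threshold = scores[min(self.top_k, len(scores)) - 1]
        return self.ucb1(g) >= threshold
```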
Results: Evaluative Feedback
Early on in Training
All algorithms struggle to learn this challenging task. Both DREF and EF-50% frequently stop and ask for human feedback.
Middle of Training
The SAC agent has only learned not to drop the ball. The EF-50% agent has achieved a non-zero success rate, but still frequently queries humans for feedback. Our algorithm, DREF, has learned a better policy and only queries humans in a few states where it is uncertain about human feedback.
End of Training
The SAC agent is still unable to achieve any success. The EF-50% agent has learned a suboptimal policy and still frequently queries for human feedback. The DREF agent has learned a robust policy (with only 15.7% feedback) that successfully performs the task every episode, and it has stopped asking for human feedback.
Cumulative rewards gained during training for the 5 tasks. The percentage corresponds to the fraction of feedback provided by the oracle during training. The proposed algorithm, DREF, achieves better or comparable performance with much less feedback. Error bars indicate the standard error of the means.
Dual Representation-based Preference Learning (DRPL)
Challenges
Trajectories in preference queries can be too long, making an apples-to-apples comparison difficult to achieve.
How do we segment and select trajectories to make both human decision-making and robot reward learning easier?
Segmentation
DRPL uses the scene graph to perform trajectory segmentation: abstract state transitions naturally define the starting and ending points of a meaningful segment.
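A minimal sketch of this segmentation rule, assuming a hypothetical abstract_state_of helper that maps a low-level state to its abstract state (e.g., the abstract_state function above):

```python
def segment(trajectory, abstract_state_of):
    """Split a list of (state, action) steps at abstract state transitions."""
    segments, current = [], []
    prev_g = None
    for step in trajectory:
        g = abstract_state_of(step[0])
        if prev_g is not None and g != prev_g:
            segments.append(current)   # abstract state changed: close the segment
            current = []
        current.append(step)
        prev_g = g
    if current:
        segments.append(current)
    return segments
```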
Selection
Additionally, DRPL uses the scene graph to perform trajectory selection: pick two trajectories whose starting abstract states differ in only one dimension. The remaining dimensions are thus held constant, leading to a controlled comparison.
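The selection criterion can be sketched as a Hamming-distance check over the starting abstract states; the greedy first-match strategy below is an illustrative assumption, not the paper's exact procedure.

```python
from itertools import combinations

def hamming(g1, g2):
    return sum(a != b for a, b in zip(g1, g2))

def select_pair(segments, abstract_state_of):
    """Return the first pair of segments whose starting abstract states
    differ in exactly one dimension (Hamming distance 1)."""
    for s1, s2 in combinations(segments, 2):
        g1 = abstract_state_of(s1[0][0])   # abstract state at each segment's start
        g2 = abstract_state_of(s2[0][0])
        if hamming(g1, g2) == 1:
            return s1, s2
    return None
```

In effect, each selected pair forms a controlled experiment: only one symbolic factor varies between the two alternatives, so the human's preference can be attributed to that factor.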
Results: Preference Learning
Full
Let's first look at the queries selected by the baseline methods. Full trajectories are lengthy and difficult to compare.
Random
Randomly segmented trajectories are short, but they are often meaningless and hence difficult to compare.
DRPL-SS
DRPL-SS, which selects a pair of trajectories that start from the same abstract state, often selects trajectories that are very similar and hence difficult to compare.
In contrast, the queries selected by DRPL are more intuitive to humans; these unambiguous preference choices lead to better estimates of the reward function.
In the first pair of trajectories, the second one is preferred since it moves closer to the center.
In the second pair, the first trajectory is preferred since it moves towards the center region.
In the third pair, the second trajectory ends up in the target region and is preferred.
Reward alignment scores for the 5 tasks. DRPL performs best upon convergence. Error bars indicate the standard error of the means.
Reference
Zhang, Ruohan, Dhruva Bansal, Yilun Hao, Ayano Hiranaka, Jialu Gao, Chen Wang, Roberto Martín-Martín, Li Fei-Fei, and Jiajun Wu. "A Dual Representation Framework for Robot Learning with Human Guidance." In 6th Annual Conference on Robot Learning (CoRL), 2022.
@inproceedings{zhangdual,
  title={A Dual Representation Framework for Robot Learning with Human Guidance},
  author={Zhang, Ruohan and Bansal, Dhruva and Hao, Yilun and Hiranaka, Ayano and Gao, Jialu and Wang, Chen and Mart{\'\i}n-Mart{\'\i}n, Roberto and Fei-Fei, Li and Wu, Jiajun},
  booktitle={6th Annual Conference on Robot Learning},
  year={2022}
}