Feedback-efficient Active Preference Learning for Socially Aware Robot Navigation


Ruiqi Wang, Weizheng Wang, and Byung-Cheol Min

SMART Lab, Purdue University

Accepted at IROS 2022

[Code] [Video] [Paper]


Abstract

Socially aware robot navigation, where a robot is required to optimize its trajectory to maintain comfortable and compliant spatial interactions with humans in addition to reaching its goal without collisions, is a fundamental yet challenging task in the context of human-robot interaction. While existing learning-based methods have achieved better performance than the preceding model-based ones, they still have drawbacks: reinforcement learning depends on a handcrafted reward that is unlikely to effectively quantify broad social compliance and can lead to reward exploitation problems, while inverse reinforcement learning suffers from the need for expensive human demonstrations. In this paper, we propose a feedback-efficient active preference learning approach, FAPL, that distills human comfort and expectation into a reward model to guide the robot agent in exploring latent aspects of social compliance. We further introduce hybrid experience learning to improve the efficiency of human feedback and samples, and evaluate the benefits of robot behaviors learned with FAPL through extensive simulation experiments and a user study (N=10) in which a physical robot navigates with human subjects in real-world scenarios.

Framework of FAPL

FAPL is composed of two parts: 1) Hybrid Experience Learning (left) and 2) Agent Learning (right). The robot agent starts with curious exploration, in which it is encouraged to take diverse actions and reach diverse states by a primitive reward based on maximum state entropy. The collected exploration experiences are stored in a replay buffer together with expert experiences from human demonstrations. Human teachers then express preferences between pairs of robot navigation trajectories from the buffer, and a reward model is learned from these preferences. All samples in the buffer are relabeled with reward values from each newly trained reward model. Finally, the robot agent uses the relabeled samples and off-policy RL to optimize a policy that maximizes the return under the reward model distilled from human preferences.
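To make the preference-learning step concrete, below is a minimal sketch (not the released FAPL code) of training a reward model from pairwise human preferences and relabeling the replay buffer, assuming a simple MLP reward model r(s, a) and a Bradley-Terry style preference loss as is common in preference-based RL; the names RewardModel, preference_loss, relabel_replay_buffer, and the replay-buffer API are illustrative assumptions.

import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Per-step reward estimate r(s, a) (illustrative architecture)."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def preference_loss(model, seg_a, seg_b, label):
    """Bradley-Terry loss on a pair of trajectory segments.

    seg_a, seg_b: tuples (obs, act) with shapes (batch, T, obs_dim) and
    (batch, T, act_dim); label is 1 where the human preferred segment A.
    """
    (obs_a, act_a), (obs_b, act_b) = seg_a, seg_b
    ret_a = model(obs_a, act_a).sum(dim=1)   # predicted return of segment A
    ret_b = model(obs_b, act_b).sum(dim=1)   # predicted return of segment B
    # P(A preferred) = exp(ret_a) / (exp(ret_a) + exp(ret_b))
    logits = torch.stack([ret_b, ret_a], dim=-1)
    return nn.functional.cross_entropy(logits, label.long())

def relabel_replay_buffer(model, buffer):
    """Overwrite stored rewards with the latest reward model's estimates,
    so off-policy RL trains on rewards distilled from human preferences.
    (buffer.iterate_batches is a hypothetical replay-buffer API.)"""
    with torch.no_grad():
        for batch in buffer.iterate_batches():
            batch.reward = model(batch.obs, batch.act)

In this sketch, relabel_replay_buffer corresponds to the step in which all stored samples receive new reward values from the latest reward model before the off-policy policy update.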

Learning Curves


Learning curves of CADRL, RGL, SARL, APL, and FAPL: (left) success rate per episode and (right) discomfort frequency, averaged over every five hundred episodes.

Simulation Experiments

We compare the proposed FAPL with four other state-of-the-art methods: ORCA as the model-based baseline; CADRL, RGL, and SARL as learning-based baselines; and one ablation model, APL, which removes the hybrid experience learning module from FAPL. All models are trained for 10,000 episodes.

Demo videos: ORCA, CADRL, RGL, SARL, APL, FAPL

Real-World Experiments

Since social compliance is not directly quantifiable and goes beyond maintaining a comfortable distance, evaluating it solely through the discomfort frequency indicator in simulation is insufficient. To further and more intuitively evaluate the social compliance of robot trajectories learned by our method, we recruited human participants for real-world experiments and collected their feedback after walking alongside a robot controlled by different models as an additional indicator.

The ablation model APL and SARL, the best-performing baseline in terms of discomfort frequency in simulation, were selected as baselines for the real-world experiments.

Note: The robot is forced to stop whenever the action from its policy requires a turning angle greater than 90 degrees.
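A minimal sketch of how such a safety guard might be applied, assuming a holonomic velocity action (vx, vy) and access to the robot's current heading; the function name and action representation are illustrative assumptions rather than the deployed controller.

import numpy as np

MAX_TURN = np.deg2rad(90.0)  # stop threshold from the note above

def guarded_action(vx, vy, current_heading):
    """Return a stop command if the commanded velocity implies a turn
    larger than 90 degrees from the current heading (illustrative)."""
    desired_heading = np.arctan2(vy, vx)
    # Smallest signed angle between desired and current heading.
    turn = np.arctan2(np.sin(desired_heading - current_heading),
                      np.cos(desired_heading - current_heading))
    if abs(turn) > MAX_TURN:
        return 0.0, 0.0  # enforced stop
    return vx, vy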