Utsav Singh*, Wesley A. Suttle, Brian M. Sadler, Vinay P. Namboodiri, Amrit Singh Bedi
*IIT Kanpur, U.S. Army Research Laboratory, University of Texas, University of Bath, UCF
In this work, we introduce PIPER: Primitive-Informed Preference-based Hierarchical reinforcement learning via Hindsight Relabeling, a novel approach that leverages preference-based learning to learn a reward model, and subsequently uses this reward model to relabel higher-level replay buffers. Since this reward is unaffected by lower primitive behavior, our relabeling-based approach is able to mitigate non-stationarity, which is common in existing hierarchical approaches, and demonstrates impressive performance across a range of challenging sparse-reward tasks. Since obtaining human feedback is typically impractical, we propose to replace the human-in-the-loop approach with our primitive-in-the-loop approach, which generates feedback using sparse rewards provided by the environment. Moreover, in order to prevent infeasible subgoal prediction and avoid degenerate solutions, we propose primitive-informed regularization that conditions higher-level policies to generate feasible subgoals for lower-level policies. We perform extensive experiments to show that PIPER mitigates non-stationarity in hierarchical reinforcement learning and achieves greater than 50% success rates in challenging, sparse-reward robotic environments, where most other baselines fail to achieve any significant progress.
The higher-level policy predicts subgoals for the lower primitive, which executes actions in the environment. We propose to learn a preference-based reward model using our PiL feedback on higher-level trajectories sampled from the higher-level replay buffer, and subsequently use it to relabel the replay buffer transitions, thereby mitigating non-stationarity in HRL.
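To make the relabeling step concrete, here is a minimal PyTorch sketch. The reward-model architecture and the replay-buffer layout (`RewardModel`, `buffer["state"]`, `buffer["subgoal"]`, `buffer["reward"]`) are illustrative assumptions, not the authors' exact code.

```python
# Minimal sketch: relabel higher-level replay-buffer rewards with a learned
# preference reward model that depends only on (state, subgoal).
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Predicts a scalar reward for a (state, subgoal) pair (assumed MLP)."""

    def __init__(self, state_dim, subgoal_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + subgoal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, subgoal):
        return self.net(torch.cat([state, subgoal], dim=-1)).squeeze(-1)


@torch.no_grad()
def relabel_higher_buffer(buffer, reward_model):
    """Overwrite stored higher-level rewards with the current reward model.

    Because the relabeled reward does not depend on the lower primitive's
    current behavior, the higher-level learning targets stay consistent as
    the lower level improves, which is how relabeling mitigates
    non-stationarity.
    """
    states = torch.as_tensor(buffer["state"], dtype=torch.float32)
    subgoals = torch.as_tensor(buffer["subgoal"], dtype=torch.float32)
    buffer["reward"] = reward_model(states, subgoals).numpy()
```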
Primitive-in-the-loop (PiL) scheme
We propose the primitive-in-the-loop (PiL) scheme to autonomously determine preferences between trajectories, using sparse rewards provided by the environment instead of human input.
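A hedged sketch of how such PiL feedback could be generated and used: each segment is assumed to store the sparse environment rewards collected while the lower primitive pursued the predicted subgoals, and the Bradley-Terry style loss is a standard choice in preference-based RL; the paper's exact objective may differ.

```python
# Sketch: preference labels from sparse environment returns (no human in the loop),
# plus a standard Bradley-Terry style loss for training the reward model.
import numpy as np
import torch


def pil_preference(segment_a, segment_b):
    """Return 1.0 if segment_a is preferred, 0.0 if segment_b is, 0.5 on ties."""
    ret_a = np.sum(segment_a["sparse_rewards"])
    ret_b = np.sum(segment_b["sparse_rewards"])
    if ret_a > ret_b:
        return 1.0
    if ret_b > ret_a:
        return 0.0
    return 0.5


def preference_loss(reward_model, seg_a, seg_b, label):
    """Cross-entropy between the preference label and the softmax over the
    segments' summed predicted rewards (Bradley-Terry model)."""
    r_a = reward_model(seg_a["states"], seg_a["subgoals"]).sum()
    r_b = reward_model(seg_b["states"], seg_b["subgoals"]).sum()
    logits = torch.stack([r_a, r_b])
    target = torch.tensor([label, 1.0 - label])
    return -(target * torch.log_softmax(logits, dim=0)).sum()
```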
Primitive-informed regularization
We propose primitive-informed regularization to encourage the higher-level policy to generate efficient, feasible subgoals for the lower primitive.
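One plausible form of such a regularizer, sketched under the assumption that the lower primitive's value function is used to score how reachable a predicted subgoal is; the function names and the exact penalty are illustrative, not the paper's definitive formulation.

```python
# Sketch: augment the higher-level actor loss with a feasibility term computed
# from the lower primitive's value function (assumed interface).
import torch


def higher_actor_loss(higher_actor, higher_critic, lower_value_fn,
                      state, goal, reg_weight=0.1):
    subgoal = higher_actor(state, goal)                    # predicted subgoal
    q_loss = -higher_critic(state, goal, subgoal).mean()   # standard actor term
    # Regularizer: a low lower-level value suggests the subgoal is infeasible,
    # so the higher-level policy is pushed toward reachable subgoals.
    feasibility = lower_value_fn(state, subgoal).mean()
    return q_loss - reg_weight * feasibility
```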
HER and Target Networks
We use Hindsight Experience Replay to densify the preference-based rewards, and target networks to mitigate training instability caused by the non-stationary preference reward.
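The sketch below illustrates both stabilizers under assumed data layouts: a hindsight relabeling step that replaces the desired goal with an achieved one, and a frozen target copy of the reward model updated by Polyak averaging. The dictionary keys and the sparse-reward convention (0 on success, -1 otherwise) are assumptions.

```python
# Sketch: (i) hindsight goal relabeling to densify otherwise-sparse rewards,
# (ii) a slowly updated target copy of the reward model for stable targets.
import copy
import torch


def her_relabel(transition, achieved_goal):
    """Replace the desired goal with a goal actually achieved later in the
    episode ('future' strategy), so the relabeled transition counts as a success."""
    relabeled = dict(transition)
    relabeled["goal"] = achieved_goal
    relabeled["reward"] = 0.0  # assumed sparse convention: 0 on success, -1 otherwise
    return relabeled


def make_target(reward_model):
    """Create a frozen copy of the reward model to serve as the target network."""
    target = copy.deepcopy(reward_model)
    for p in target.parameters():
        p.requires_grad_(False)
    return target


@torch.no_grad()
def polyak_update(target, online, tau=0.005):
    """Slowly track the online reward model to smooth out reward non-stationarity."""
    for tp, op in zip(target.parameters(), online.parameters()):
        tp.mul_(1.0 - tau).add_(tau * op)
```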
Experimental Results
Environments
We perform experiments on five sparse-reward environments: Maze Navigation, Pick and Place, Push, Hollow, and Franka Kitchen.
Baseline Comparison
This figure compares the success rates of PIPER against multiple baselines on sparse-reward maze navigation and robotic manipulation environments. The solid lines and shaded regions represent the mean and standard deviation across 5 seeds. PIPER shows impressive performance and significantly outperforms the baselines.
Panels (left to right): Maze Navigation, Pick and Place, Push, Hollow, Franka Kitchen.