In this work we focus on improving the efficiency and generalisation of learned navigation strategies when they are transferred from their training environment to previously unseen ones. We present an extension of the residual reinforcement learning framework from the robotic manipulation literature and adapt it to the vast and unstructured environments that mobile robots can operate in. The concept is based on learning a residual control effect that is added to a typical sub-optimal classical controller in order to close the performance gap, whilst guiding the exploration process during training for improved data efficiency. We exploit this tight coupling and propose a novel deployment strategy, switching Residual Reactive Navigation (sRRN), which yields efficient trajectories whilst probabilistically switching to a classical controller in cases of high policy uncertainty. Our approach achieves improved performance over end-to-end alternatives when directly transferred to the real world, and can be incorporated as part of a complete navigation stack for cluttered indoor navigation tasks.
The task requires a reactive navigation agent, with no prior knowledge of the environment, to successfully avoid obstacles and reach a target goal in the shortest time possible. The agent perceives the environment using a 2D laser scanner, and the global positions of the robot and the goal are assumed to be known at all times.
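As a minimal sketch of what such an observation could look like, the snippet below combines a laser scan with the goal expressed in the robot's frame. The function names, the normalisation by `max_range`, and the layout of the observation vector are assumptions made for illustration; the page does not specify the actual state representation.

```python
import math

def goal_in_robot_frame(robot_x, robot_y, robot_yaw, goal_x, goal_y):
    """Return (distance, bearing) of the goal relative to the robot.

    Poses are assumed to be given in a shared global frame, with yaw in radians.
    """
    dx, dy = goal_x - robot_x, goal_y - robot_y
    distance = math.hypot(dx, dy)
    bearing = math.atan2(dy, dx) - robot_yaw
    # Wrap the bearing into [-pi, pi] so the representation is continuous.
    bearing = math.atan2(math.sin(bearing), math.cos(bearing))
    return distance, bearing

def build_observation(scan_ranges, robot_pose, goal_xy, max_range=10.0):
    """Hypothetical observation: normalised laser ranges followed by goal distance and bearing."""
    # Clip each range reading to [0, max_range], then scale it to [0, 1].
    ranges = [min(max(r, 0.0), max_range) / max_range for r in scan_ranges]
    distance, bearing = goal_in_robot_frame(*robot_pose, *goal_xy)
    return ranges + [distance, bearing]

# Example: robot at the origin facing +y, goal two metres straight ahead.
obs = build_observation([0.5, 20.0], (0.0, 0.0, math.pi / 2), (0.0, 2.0))
```

Expressing the goal relative to the robot, rather than in global coordinates, is one common choice for reactive agents because it makes the observation independent of where the episode takes place.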
We utilise a sub-optimal prior controller and close the performance gap by learning an additive residual action which modifies the prior's action. The prior in this work was an Artificial Potential Fields controller for reactive navigation. This coupled system is trained in simulation using the residual reinforcement learning framework. The resulting action from this system is termed the hybrid action (prior + residual), which exhibits dextrous behaviours that are difficult to hand-craft, similar to those attained by learning the entire task end-to-end. It additionally requires significantly fewer training samples than learning the task end-to-end, as the exploration process is guided by the prior. The policy is trained with a sparse reward signal, whose difficulty is alleviated by this guided exploration, and which motivates the system to learn the most efficient navigation strategy. As the entire system is trained in simulation and transferred to the real world, we mitigate the chances of failure due to poor policy generalisation to real-world states by exploiting this tight coupling between the two systems. We extract an epistemic uncertainty measure from the policy for a given state and stochastically switch to the prior in cases of high policy uncertainty. Note that the prior, whilst being sub-optimal, is still capable of achieving the underlying task. The stochastic switching removes the need for a hard threshold, and it enabled the system to navigate over a large cluttered indoor environment successfully without any fine-tuning.
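The hybrid action and the stochastic fallback to the prior can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the uncertainty is approximated here by the spread of several residual predictions for the same state (for example from an ensemble or repeated stochastic forward passes), the mapping from uncertainty to switching probability (`1 - exp(-u / scale)`) is a hypothetical choice, and actions are assumed to be bounded in `[-1, 1]`.

```python
import numpy as np

def hybrid_action(prior_action, residual_action, low=-1.0, high=1.0):
    """Hybrid action = prior + residual, clipped to the assumed action bounds."""
    total = np.asarray(prior_action, dtype=float) + np.asarray(residual_action, dtype=float)
    return np.clip(total, low, high)

def epistemic_uncertainty(residual_samples):
    """Assumed proxy: mean per-dimension standard deviation across residual predictions."""
    return float(np.std(np.asarray(residual_samples, dtype=float), axis=0).mean())

def switch_probability(uncertainty, scale=0.5):
    """Hypothetical mapping from uncertainty to the probability of using the prior alone."""
    return 1.0 - np.exp(-max(uncertainty, 0.0) / scale)

def select_action(prior_action, residual_samples, rng, scale=0.5):
    """Stochastically pick between the prior and the hybrid action.

    Higher uncertainty makes falling back to the prior more likely,
    so no hard uncertainty threshold is needed.
    """
    samples = np.asarray(residual_samples, dtype=float)
    p_prior = switch_probability(epistemic_uncertainty(samples), scale)
    if rng.random() < p_prior:
        return np.asarray(prior_action, dtype=float), "prior"
    return hybrid_action(prior_action, samples.mean(axis=0)), "hybrid"

rng = np.random.default_rng(0)
# Residual predictions agree: zero uncertainty, so the hybrid action is used.
confident = select_action([0.5, 0.0], [[0.2, 0.1], [0.2, 0.1]], rng)
# Residual predictions disagree strongly: the prior is used almost surely.
uncertain = select_action([0.5, 0.0], [[-10.0, -10.0], [10.0, 10.0]], rng)
```

Because the decision is sampled rather than thresholded, states of moderate uncertainty still execute the hybrid action some of the time, while the prior, which can complete the task on its own, takes over more often as uncertainty grows.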
The simulation environments used for training the agent are shown below. The environment was created using Box2D, a 2-dimensional physics simulation engine which is well suited to efficiently learning policies using reinforcement learning. We provide this environment for public use on our GitHub page. Note that the environment does not capture all the intricacies of the various obstacle forms that the agent may encounter in the real world.
We deploy the simulation-trained system on a PatrolBot mobile base and integrate it as part of the local planner within the ROS navigation stack for large-scale indoor navigation. The trajectories taken by the three different systems (sRRN, prior only, end-to-end) are shown below.
sRRN (Ours) Prior Only
Trajectories showing sRRN operating in the real world vs. the prior executed alone. Orange ROI: note the oscillatory behaviour of the prior-only system, as indicated by the more pronounced purple trajectory. The residual in the hybrid action learned to mitigate this behaviour, yielding a smoother trajectory, as indicated by the red regions in the trajectory. This resulted in lower execution times for a given trajectory.
sRRN (Ours) End-to-end
Trajectories showing sRRN operating in the real world vs. an end-to-end trained policy. Blue ROI: note that the typical regions of end-to-end policy failure were regions where the sRRN system fell back onto the prior, indicating regions of higher uncertainty. Both of these systems exhibited similar traits in these regions given their exposure to the same training environments in simulation. The switching behaviour allowed the system to move beyond this point and continue navigating with the hybrid action in cases of lower uncertainty.
The authors would like to thank Vibhavari Dasagi for the TD3 implementation used in this work, Jake Bruce for the development of the simulation environment, and Robert Lee and Serena Mou for valuable and insightful discussions.
@misc{rana2019residual,
    title={Residual Reactive Navigation: Combining Classical and Learned Navigation Strategies For Deployment in Unknown Environments},
    author={Krishan Rana and Ben Talbot and Michael Milford and Niko Sünderhauf},
    year={2019},
    eprint={1909.10972},
    archivePrefix={arXiv},
    primaryClass={cs.RO}
}