Learning-based approaches often outperform hand-coded algorithmic solutions for many problems in robotics. However, learning long-horizon tasks on real robot hardware can be intractable, and transferring a learned policy from simulation to reality is still extremely challenging. We present a novel approach to model-free reinforcement learning that can leverage existing sub-optimal solutions as an algorithmic prior during training and deployment. During training, our gated fusion approach enables the prior to guide the initial stages of exploration, increasing sample efficiency and enabling learning from sparse long-horizon reward signals. Importantly, the policy can learn to improve beyond the performance of the sub-optimal prior, since the prior's influence is annealed gradually. During deployment, the policy's uncertainty provides a reliable strategy for transferring a simulation-trained policy to the real world by falling back to the prior controller in uncertain states. We show the efficacy of our Multiplicative Controller Fusion approach on the task of robot navigation and demonstrate safe transfer from simulation to the real world without any fine-tuning.
Multiplicative Controller Fusion Overview
Multiplicative Controller Fusion (MCF) provides a strategy to effectively leverage classical controllers to aid reinforcement learning agents both during training and deployment. MCF focuses on the stochastic policy and controller setting, where the output of each system is represented as a conditional distribution over possible actions given a state. The general form of MCF is given as follows:
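For intuition, when both the policy and the prior output Gaussian action distributions, their normalised product is itself a Gaussian with a precision-weighted mean. A minimal sketch of this fusion step (function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def fuse_gaussians(mu_policy, sigma_policy, mu_prior, sigma_prior):
    """Multiplicative fusion of two 1-D Gaussian action distributions.

    The normalised product of two Gaussians is again Gaussian: its
    precision is the sum of the input precisions, and its mean is the
    precision-weighted average of the input means.
    """
    prec_policy = 1.0 / sigma_policy ** 2
    prec_prior = 1.0 / sigma_prior ** 2
    prec_fused = prec_policy + prec_prior
    mu_fused = (prec_policy * mu_policy + prec_prior * mu_prior) / prec_fused
    sigma_fused = np.sqrt(1.0 / prec_fused)
    return mu_fused, sigma_fused
```

Note that the fused distribution is always at least as confident (lower variance) as either input, and its mean shifts towards whichever distribution has the smaller variance.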
We show how this composite distribution can be leveraged both during training and deployment of reinforcement learning agents, exhibiting superior performance in both domains when compared to the typical reinforcement learning setup, which assumes no prior knowledge about a given task.
Training: Guided Exploration
Exploration is difficult in sparse long-horizon settings for standard reinforcement learning techniques, requiring large amounts of environment interaction. To this end, we leverage the prior controller during training to guide exploration. We utilise a gated variant of the general MCF form for Gaussian exploration using the composite distribution. The gating function biases the composite distribution towards the prior early on during training, exposing the policy to the most relevant parts of the state-action space. As the policy becomes more capable, the gating function gradually shifts the composite distribution towards the policy distribution by the end of training. This allows the system to fully exploit its learned policy and improve beyond the prior. The multiplicative fusion constrains exploration such that the policy does not deviate far from the prior into regions of unwanted behaviour. The gated MCF variant used during training is given below:
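One way to realise the gating described above is as an exponent-weighted product of the two Gaussians, with a scalar gate annealed from 1 (follow the prior) to 0 (follow the policy). This is a sketch under that assumption, not the paper's exact parameterisation; raising a Gaussian to a power simply scales its precision, so the gated product stays Gaussian:

```python
import numpy as np

def gate(step, total_steps):
    """Illustrative linear annealing schedule: alpha starts at 1
    (composite follows the prior) and decays to 0 (composite follows
    the learned policy)."""
    return max(0.0, 1.0 - step / total_steps)

def gated_fusion(mu_policy, sigma_policy, mu_prior, sigma_prior, alpha):
    """Gated multiplicative fusion: pi_prior^alpha * pi_policy^(1-alpha).

    Exponentiating a Gaussian scales its precision, so the gated
    product is Gaussian with gate-weighted precisions.
    """
    prec_policy = (1.0 - alpha) / sigma_policy ** 2
    prec_prior = alpha / sigma_prior ** 2
    prec = prec_policy + prec_prior
    mu = (prec_policy * mu_policy + prec_prior * mu_prior) / prec
    return mu, np.sqrt(1.0 / prec)
```

With alpha = 1 this recovers the prior distribution exactly, and with alpha = 0 it recovers the pure policy, matching the behaviour the gating function is intended to produce over the course of training.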
We show the impact of the gating on the multiplicative composition as training progresses in the Figure below.
The Figure below illustrates the state space coverage during exploration by standard Gaussian exploration, the baseline approach, and MCF in an environment with a fixed start and goal location. Standard Gaussian exploration performs poorly in sparse long-horizon reward settings, rarely moving far beyond its initial position. The baseline, whilst benefiting from the demonstrations, spends time exploring unnecessary regions of the state space. MCF, on the other hand, exhibits structured exploration around the deterministic path of the prior controller (indicated by the dashed line), allowing it to focus on the parts of the state space most relevant to the task whilst exploring the surrounding state-action regions for potential improvements.
Deployment: Uncertainty-Aware Navigation
At deployment, we represent the policy's distribution over actions using an estimate of its epistemic uncertainty, and utilise Monte Carlo sampling to derive a distribution over the prior controller's actions based on the noise present in the robot's laser scanner. This allows us to utilise the general form of MCF to fuse these two distributions such that the resulting distribution biases towards the controller exhibiting the least uncertainty in a given state. The system can thus demonstrate the complex navigational skills attained by the learned policy whilst exhibiting the risk-averse behaviour of the prior in states of high policy uncertainty.
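The deployment-time fusion can be sketched as follows. Assume we have drawn action samples from each source (e.g. stochastic forward passes of the policy for epistemic uncertainty, and the prior controller run on noise-perturbed laser scans), fit a Gaussian to each, and multiply them; the fused action then automatically biases towards the less uncertain controller. All names here are illustrative:

```python
import numpy as np

def fit_gaussian(action_samples):
    """Fit a Gaussian to sampled actions, e.g. from multiple stochastic
    forward passes of the policy, or from the prior controller evaluated
    on noise-perturbed laser scans."""
    samples = np.asarray(action_samples, dtype=float)
    return samples.mean(), samples.std(ddof=1)

def deploy_action(policy_samples, prior_samples):
    """Multiplicatively fuse the two empirical distributions; the fused
    mean is precision-weighted, so it leans towards whichever source is
    less uncertain in the current state."""
    mu_pi, sigma_pi = fit_gaussian(policy_samples)
    mu_pr, sigma_pr = fit_gaussian(prior_samples)
    prec_pi = 1.0 / sigma_pi ** 2
    prec_pr = 1.0 / sigma_pr ** 2
    return (prec_pi * mu_pi + prec_pr * mu_pr) / (prec_pi + prec_pr)
```

For example, if the policy samples agree tightly while the prior samples are widely spread, the fused action stays close to the policy's mean; in an out-of-distribution state where the policy samples disagree, the fused action falls back towards the prior.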
The trajectories achieved by MCF and both the prior and policy alone are shown below: