# Hierarchical Policy Blending As Optimal Transport

## An T. Le, Kay Hansel, Jan Peters, Georgia Chalvatzaki

## Abstract

We present hierarchical policy blending as optimal transport (HiPBOT). This hierarchical framework adapts the weights of low-level reactive expert policies, adding a look-ahead planning layer on the parameter space of a product of expert policies and agents. Our high-level planner realizes policy blending via unbalanced optimal transport, consolidating the scaling of underlying Riemannian motion policies, effectively adjusting their Riemannian matrix, and deciding over the priorities between experts and agents, guaranteeing safety and task success. Our experimental results in a range of application scenarios, from low-dimensional navigation to high-dimensional whole-body control, showcase the efficacy and efficiency of HiPBOT, which outperforms state-of-the-art baselines that either perform probabilistic inference or define a tree structure of experts, paving the way for new applications of optimal transport to robot control. The implementation will be released later.

## I. TIAGo++ Whole-Body Control Videos

The speeds of all videos are unmodified, and the experiments are recorded as-is. We demonstrate the capabilities of HiPBOT versus RMPflow in the MEMA setting: a high-dimensional, multi-objective, and highly dynamic environment in which the TIAGo++ must track two potentially conflicting reference trajectories while avoiding self-collision and an obstacle.

Video 1: Demonstration runs of HiPBOT (h=2). HiPBOT is able to compromise between objectives thanks to its ability to adapt expert priorities online.

Video 2: Demonstration runs of RMPflow. RMPflow struggles to find good situational actions and eventually collides.

## II. Other Exemplar Cases

Video 3: (Left) Demonstration run of HiPBOT in an extremely dynamic and dense maze environment. (Right) Demonstration run of HiPBOT in the Panda case, with dynamic obstacles hindering the way to the green box.

## III. HiPBOT Tuning Tips

For highly dynamic environments, there is a trade-off between a fast planning rate and a sufficiently long exploratory horizon. Depending on whether the environment features more dynamic obstacles or more difficult local minima, the horizon length can range from 5-15 for dynamic cases and from 20-40 for hard local-minima cases.
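This guidance can be summarized in a small helper; the function name, argument, and thresholds below are our illustration of the tuning rule, not part of HiPBOT's released code.

```python
# Hypothetical helper reflecting the tuning guidance above: short horizons
# keep the planning rate high for dynamic obstacles, long horizons explore
# enough to escape hard local minima. Ranges are the ones quoted in the text.
def suggest_horizon(env_type: str) -> range:
    if env_type == "dynamic":        # fast-moving obstacles: replan quickly
        return range(5, 16)          # horizon lengths 5-15
    if env_type == "local_minima":   # hard local minima: look further ahead
        return range(20, 41)         # horizon lengths 20-40
    raise ValueError(f"unknown env_type: {env_type}")

print(min(suggest_horizon("dynamic")), max(suggest_horizon("dynamic")))  # 5 15
```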

The cost matrix also plays a large role in the stability of the Sinkhorn-Knopp algorithm. We have to tune the cost weights so that the cost magnitudes are on a value scale of roughly 0-10: for anything larger than 20, we found that the temperature starts to underflow due to the exponential terms inside the Sinkhorn-Knopp algorithm. Balancing the magnitudes of cost components, such as the goal cost and the collision-avoidance cost, is also recommended.
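The underflow mechanism can be seen directly in the Gibbs kernel used inside Sinkhorn-Knopp. The snippet below is a minimal sketch with variable names and scales of our own choosing (not the paper's code): once cost entries are too large relative to the entropic scalar, kernel entries become exactly zero in float32, which then breaks the row/column normalizations.

```python
import numpy as np

lam = 0.5  # entropic regularization scalar (our assumed value)

# Costs on the recommended ~0-10 scale vs. costs inflated well beyond it.
C_ok = np.linspace(0.0, 10.0, 16, dtype=np.float32).reshape(4, 4)
C_bad = C_ok * 10.0  # magnitudes up to 100, far past the ~20 danger zone

# Gibbs kernel K = exp(-C / lam), the quantity Sinkhorn-Knopp iterates on.
K_ok = np.exp(-C_ok / lam)
K_bad = np.exp(-C_bad / lam)

print(K_ok.min() > 0.0)      # True: kernel stays strictly positive
print((K_bad == 0.0).any())  # True: underflow produces exact zeros
```

Since Sinkhorn-Knopp repeatedly divides by matrix-vector products of this kernel, exact zeros (or all-zero rows) make the scaling updates blow up, which matches the instability described above.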

In most cases in our experiments, we simply set the KL regularization scalar of the entropic-regularized UOT to $\lambda_{KL} = 1.0$ for 5-10 experts, and gradually reduce it to 0.1 to relax the normalization constraint further when there are more than 10 experts and more agents. The entropic regularization scalar $\lambda$ can be set in the range 0.1-1.0 depending on how blurred we want the temperature vector to be (higher is blurrier but faster to optimize, and vice versa).
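To make the roles of the two scalars concrete, here is a minimal sketch of entropic-regularized unbalanced Sinkhorn in the standard formulation, where the KL relaxation of the marginals enters as a softening exponent on the scaling updates. The function name, shapes, and toy numbers are our assumptions, not the paper's API.

```python
import numpy as np

def unbalanced_sinkhorn(C, a, b, lam=0.5, lam_kl=1.0, n_iter=200):
    """Entropic-regularized UOT via Sinkhorn scaling iterations (sketch).

    lam    : entropic regularization scalar (lambda in the text)
    lam_kl : KL relaxation of the marginal constraints (lambda_KL);
             smaller values relax the normalization more.
    """
    K = np.exp(-C / lam)          # Gibbs kernel
    fi = lam_kl / (lam_kl + lam)  # KL-softened exponent; fi -> 1 recovers balanced OT
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = (a / (K @ v)) ** fi   # soft row-marginal update
        v = (b / (K.T @ u)) ** fi # soft column-marginal update
    return u[:, None] * K * v[None, :]  # transport plan

# Toy example: 3 experts, 2 agents, uniform priors.
C = np.array([[0.5, 2.0], [1.0, 0.2], [3.0, 1.0]])
a = np.ones(3) / 3.0  # prior over experts
b = np.ones(2) / 2.0  # prior over agents
P = unbalanced_sinkhorn(C, a, b, lam=0.5, lam_kl=1.0)
print(P.shape)  # (3, 2)
```

With a large `lam_kl` the exponent approaches 1 and the plan's marginals snap back to the priors (balanced OT), while `lam_kl = 0.1` lets mass deviate from them, which is the relaxation suggested above for many experts.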

The prior temperature vectors can be set either as one-vectors or as uniform histograms when we prioritize the experts equally a priori. Otherwise, the prior temperature value for each expert can be set in the range 0.1-10.
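For completeness, the three prior choices look as follows; the array values besides the ranges quoted above are our own illustrative picks.

```python
import numpy as np

n_experts = 5

# Equal priorities without normalizing total mass (one-vector prior).
prior_ones = np.ones(n_experts)

# Equal priorities with total mass fixed to 1 (uniform histogram prior).
prior_hist = np.full(n_experts, 1.0 / n_experts)

# Non-uniform prior in the suggested 0.1-10 range, e.g. favoring expert 0.
prior_biased = np.array([10.0, 1.0, 1.0, 0.1, 0.1])
```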