Causal MoMa: Causal Policy Gradient for
Whole-Body Mobile Manipulation
Jiaheng Hu, Peter Stone, Roberto Martín-Martín
The University of Texas at Austin
RSS 2023 (Daegu, Korea) | Paper | Code | Tweet & Video
Causal MoMa Overview
Developing the next generation of household robot helpers requires combining locomotion and interaction capabilities, which is generally referred to as mobile manipulation (MoMa). MoMa tasks are difficult due to the large action space of the robot and the common multi-objective nature of the task, e.g., efficiently reaching a goal while avoiding obstacles. In this work, we introduce Causal MoMa, a new framework to train policies for typical MoMa tasks that makes use of the most favorable subspace of the robot's action space to address each sub-objective. Causal MoMa automatically discovers the causal dependencies between actions and terms of the reward function and exploits these dependencies in a factored policy learning procedure that reduces gradient variance compared to previous state-of-the-art policy gradient algorithms, improving convergence and results.
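To give a concrete picture of what such causal dependencies look like, the sketch below shows a hypothetical causal matrix for a mobile manipulator. The specific action dimensions, reward terms, and dependency pattern are illustrative assumptions, not the paper's actual task decomposition:

```python
import numpy as np

# Illustrative example (not the paper's actual task setup): a mobile
# manipulator with 2 base action dimensions and 3 arm joint dimensions,
# and a reward function decomposed into three terms.
action_dims  = ["base_lin_vel", "base_ang_vel", "arm_j1", "arm_j2", "arm_j3"]
reward_terms = ["navigation_progress", "collision_penalty", "arm_reaching"]

# Causal matrix C[k, d] = 1 if reward term k causally depends on action dim d.
# Navigation progress depends only on the base, arm reaching only on the arm,
# and collisions can be caused by any part of the body.
causal_matrix = np.array([
    [1, 1, 0, 0, 0],   # navigation_progress <- base actions
    [1, 1, 1, 1, 1],   # collision_penalty   <- all actions
    [0, 0, 1, 1, 1],   # arm_reaching        <- arm actions
])
```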
Method
Two-step procedure in Causal MoMa for policy training in MoMa tasks with factored reward functions. Top: Causal MoMa infers the causal dependencies between reward terms and action dimensions through a causal discovery procedure on randomly collected data. Bottom: Causal MoMa trains a policy that generates whole-body action commands from onboard sensor signals and task information by exploiting the discovered Causal Matrix through a factored policy gradient.
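The snippet below is a minimal sketch of how a factored policy gradient can use such a causal matrix: each action dimension receives credit only from the reward terms it causally influences. It uses simple Monte Carlo returns rather than the learned advantages used in practice, and the function name, shapes, and absence of a value baseline are simplifying assumptions for illustration:

```python
import numpy as np

# Hypothetical shapes: T timesteps, K reward terms, D action dimensions.
# rewards[t, k]        = value of reward term k at timestep t.
# log_prob_grads[t, d] = gradient of log pi(a_t[d] | s_t), collapsed to a
#                        scalar per dimension purely for illustration.
# causal_matrix[k, d]  = 1 if reward term k depends on action dimension d.

def factored_policy_gradient(rewards, log_prob_grads, causal_matrix, gamma=0.99):
    """Per-dimension policy gradient where each action dimension is credited
    only with the reward terms it causally influences."""
    T, K = rewards.shape
    D = causal_matrix.shape[1]

    # Discounted return-to-go, computed separately for every reward term.
    returns = np.zeros_like(rewards)
    running = np.zeros(K)
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        returns[t] = running

    # For each action dimension, sum only the causally relevant return terms.
    # (A learned value baseline would normally be subtracted here.)
    per_dim_returns = returns @ causal_matrix                # shape (T, D)

    # Factored gradient estimate: one credit signal per action dimension,
    # instead of one scalar return shared by the whole action vector.
    grad = (log_prob_grads * per_dim_returns).mean(axis=0)   # shape (D,)
    return grad
```

Restricting each dimension's credit signal in this way is what reduces the variance of the gradient estimate relative to a standard policy gradient that scales every action dimension by the full summed return.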
Experiments & Results
Training curves for Fetch (left) and HSR (right), five seeds each, shown as mean and standard deviation. Causal MoMa consistently outperforms the baselines and achieves a higher return thanks to the reduced gradient variance of the causal policy gradient.
We evaluate the performance of Causal MoMa policies trained in simulation when transferred zero-shot to control a real robot, an HSR mobile manipulator, and compare against two baselines based on the sampling-based planner CBiRRT2, with and without replanning. Our method outperforms the baselines in all but one scenario (static goal with static obstacles) and has a significant advantage over the baselines in all dynamic cases.
Please check out our paper to learn more!