Xiaohai (Bob) Hu¹,², Bruce Huang², Xiao Chen³, Yang Chen², Xu Chen¹, Sean Yuan², Luke Jia², Chao Guo²
¹ University of Washington ² Google ³ Chinese University of Hong Kong
We introduce ARHumanoid, a full-stack framework designed to bridge the gap between high-level cognitive reasoning and high-frequency motor execution. The core of our approach is the decoupling of slow-rate semantic skill grounding from fast whole-body control. Using a multimodal architecture, the system maps language instructions and egocentric visual context to discrete skill selections, while local reinforcement learning policies ensure dynamic stability and precise motion tracking.
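The decoupling above can be read as two loops running at different rates: a slow loop that grounds language and egocentric vision into a discrete skill, and a fast loop in which an RL policy tracks the active skill. The following is a minimal sketch of that structure only; all names (SKILLS, ground_skill, policy_act) and the 1 Hz / 50 Hz rates are illustrative assumptions, not the system's actual interface.

```python
SKILLS = ("idle", "walk", "pick", "place")   # hypothetical discrete skill set
CONTROL_HZ = 50                              # fast whole-body control rate (assumed)
GROUNDING_HZ = 1                             # slow semantic grounding rate (assumed)

def ground_skill(instruction: str, ego_image) -> str:
    """Slow path: multimodal reasoning maps language + egocentric vision to a skill ID."""
    # Stand-in for a cloud multimodal/LLM call; real grounding is far richer.
    return "pick" if "pick" in instruction.lower() else "idle"

def policy_act(skill: str, proprio_state):
    """Fast path: RL policy produces joint targets tracking the skill's reference motion."""
    return [0.0] * len(proprio_state)        # placeholder joint targets

def control_loop(instruction: str, ego_image, proprio_state, steps: int = 200):
    skill = "idle"
    for t in range(steps):
        # Re-ground the skill only once per slow-loop period.
        if t % (CONTROL_HZ // GROUNDING_HZ) == 0:
            skill = ground_skill(instruction, ego_image)
        action = policy_act(skill, proprio_state)
        # apply `action` to the robot here; omitted in this sketch
    return skill

if __name__ == "__main__":
    print(control_loop("Pick and place the red box", ego_image=None, proprio_state=[0.0] * 29))
```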
To enable robust real-world deployment, we incorporate geometric contact correction to align human motion references with physically feasible configurations, paired with contact-aware training for stable interactions. Our results demonstrate that this decoupled strategy provides a scalable and reliable template for deploying complex, semantically guided behaviors on humanoid platforms.
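As an illustration of what geometric contact correction involves, the sketch below snaps reference foot keypoints detected as in-contact onto a flat ground plane at z = 0 and clamps penetration. The contact threshold, data layout, and function name are assumptions for this example and are not taken from the paper.

```python
import numpy as np

def correct_foot_contacts(foot_z: np.ndarray, contact_thresh: float = 0.02) -> np.ndarray:
    """foot_z: (T, 2) array of left/right foot heights in the reference motion [m]."""
    corrected = foot_z.copy()
    in_contact = foot_z < contact_thresh     # heuristic contact detection (assumed)
    corrected[in_contact] = 0.0              # snap contacting feet onto the ground plane
    return np.maximum(corrected, 0.0)        # clamp any residual ground penetration

# Example: a reference with slight float (0.010 m) and penetration (-0.005 m)
ref = np.array([[0.010, 0.150],
                [-0.005, 0.080],
                [0.000, 0.030]])
print(correct_foot_contacts(ref))
```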
System Overview: (a) Human video demonstrations are processed and retargeted to generate physically feasible robot reference motions. (b) The robot learns whole-body skills via reinforcement learning and deploys policies in real time. (c) An AR headset streams egocentric vision to a cloud LLM, which reasons and dispatches skills online.
"Move Black Box On Top of Blue Box"
"Pick and Place the Red box"
"Pick and Place the black Box"