Xiaohai (Bob) Hu¹,², Bruce Huang², Xiao Chen³, Yang Chen², Xu Chen¹, Sean Yuan², Luke Jia², Chao Guo²
¹ University of Washington ² Google ³ Chinese University of Hong Kong
We introduce ARHumanoid, a full-stack framework designed to bridge the gap between high-level cognitive reasoning and high-frequency motor execution. The core of our approach is the decoupling of slow-rate semantic skill grounding from fast whole-body control. Using a multimodal architecture, the system maps language instructions and egocentric visual context to discrete skill selections, while local reinforcement learning policies ensure dynamic stability and precise motion tracking.
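The decoupling above can be read as two loops running at different rates: a slow loop that grounds language and egocentric vision into a discrete skill, and a fast loop in which an RL policy tracks the active skill. The following is a minimal sketch of that structure only; all names (SKILLS, ground_skill, policy_act) and the 1 Hz / 50 Hz rates are illustrative assumptions, not the system's actual interface.

```python
SKILLS = ("idle", "walk", "pick", "place")   # hypothetical discrete skill set
CONTROL_HZ = 50                              # fast whole-body control rate (assumed)
GROUNDING_HZ = 1                             # slow semantic grounding rate (assumed)

def ground_skill(instruction: str, ego_image) -> str:
    """Slow path: multimodal reasoning maps language + egocentric vision to a skill ID."""
    # Stand-in for a cloud multimodal/LLM call; real grounding is far richer.
    return "pick" if "pick" in instruction.lower() else "idle"

def policy_act(skill: str, proprio_state):
    """Fast path: RL policy produces joint targets tracking the skill's reference motion."""
    return [0.0] * len(proprio_state)        # placeholder joint targets

def control_loop(instruction: str, ego_image, proprio_state, steps: int = 200):
    skill = "idle"
    for t in range(steps):
        # Re-ground the skill only once per slow-loop period.
        if t % (CONTROL_HZ // GROUNDING_HZ) == 0:
            skill = ground_skill(instruction, ego_image)
        action = policy_act(skill, proprio_state)
        # apply `action` to the robot here; omitted in this sketch
    return skill

if __name__ == "__main__":
    print(control_loop("Pick and place the red box", ego_image=None, proprio_state=[0.0] * 29))
```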
To enable robust real-world deployment, we incorporate geometric contact correction to align human motion references with physically feasible configurations, paired with contact-aware training for stable interactions. Our results demonstrate that this decoupled strategy provides a scalable and reliable template for deploying complex, semantically guided behaviors on humanoid platforms.
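As an illustration of what geometric contact correction involves, the sketch below snaps reference foot keypoints detected as in-contact onto a flat ground plane at z = 0 and clamps penetration. The contact threshold, data layout, and function name are assumptions for this example and are not taken from the paper.

```python
import numpy as np

def correct_foot_contacts(foot_z: np.ndarray, contact_thresh: float = 0.02) -> np.ndarray:
    """foot_z: (T, 2) array of left/right foot heights in the reference motion [m]."""
    corrected = foot_z.copy()
    in_contact = foot_z < contact_thresh     # heuristic contact detection (assumed)
    corrected[in_contact] = 0.0              # snap contacting feet onto the ground plane
    return np.maximum(corrected, 0.0)        # clamp any residual ground penetration

# Example: a reference with slight float (0.010 m) and penetration (-0.005 m)
ref = np.array([[0.010, 0.150],
                [-0.005, 0.080],
                [0.000, 0.030]])
print(correct_foot_contacts(ref))
```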
System Overview: (a) Human video demonstrations are processed and retargeted to generate physically feasible robot reference motions. (b) The robot learns whole-body skills via reinforcement learning and deploys policies in real time. (c) An AR headset streams egocentric vision to a cloud LLM, which reasons and dispatches skills online.
"Move Black Box On Top of Blue Box"
"Pick and Place the Red box"
"Pick and Place the black Box"