Abstract
Large-scale pre-training on videos has proven effective for robot learning, as it enables models to acquire task knowledge from first-person human operation data that reveals how humans perform tasks and interact with their environment. However, models pre-trained on such data can be suboptimal for robot learning due to the significant visual gap between human hands and the arms of different robots. To remedy this, we propose H2R, a simple data augmentation technique for robot pre-training from videos, which extracts human hands from first-person videos and replaces them with the arms of different robots to generate new video data for pre-training. Specifically, we first detect the 3D position and keypoints of the human hand, which serve as the basis for generating a robot in a simulation environment that exhibits a similar motion posture. We then calibrate the intrinsic parameters of the simulator camera to match the camera of the original video and render images of the generated robot. Finally, we overlay these rendered images onto the original video to replace the human hand. This procedure bridges the visual gap between the human hand and the robotic arm and produces an augmented dataset for pre-training. We conduct extensive experiments on a variety of robotic tasks, ranging from standard simulation benchmarks to real-world robotic tasks, with varying pre-training strategies, video datasets, and policy learning methods. The results show that H2R improves the representation capability of visual encoders pre-trained with various methods. In imitation learning, H2R consistently raises the average success rate across different pre-training methods, with improvements ranging from 0.9% to 10.2%, and the gains remain stable across settings. In reinforcement learning, most pre-training methods also benefit from H2R. Our real-world evaluations across diverse manipulation tasks show that H2R-enhanced visual representations consistently outperform baseline models, with success rate improvements ranging from 6.7% to 15% across all model-task configurations.
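To make the camera-calibration step concrete, the sketch below builds a standard pinhole intrinsic matrix and projects 3D hand keypoints into the image plane; giving the same intrinsics to the simulator camera is what lets the rendered robot land where the human hand appeared. The variable names and example values are illustrative only, not taken from the H2R codebase.

import numpy as np

def pinhole_intrinsics(fx, fy, cx, cy):
    """3x3 intrinsic matrix of a pinhole camera (the same K is assigned to the simulator camera)."""
    return np.array([[fx, 0.0, cx],
                     [0.0, fy, cy],
                     [0.0, 0.0, 1.0]])

def project_points(points_cam, K):
    """Project Nx3 camera-frame 3D points to Nx2 pixel coordinates."""
    uvw = (K @ points_cam.T).T           # (N, 3) homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:3]      # perspective divide

# Illustrative values: intrinsics estimated from the source video, keypoints from the hand detector.
K = pinhole_intrinsics(fx=600.0, fy=600.0, cx=320.0, cy=240.0)
hand_keypoints_cam = np.array([[0.05, 0.02, 0.45],    # wrist (meters, camera frame)
                               [0.08, 0.00, 0.43]])   # fingertip
pixels = project_points(hand_keypoints_cam, K)
# Rendering the simulated robot with the same K keeps it aligned with these pixel locations.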
H2R Pipeline
H2R replaces human hands with robot arms in first-person video frames. It first uses the HaMeR model to detect the hand pose, keypoints, and camera parameters. The human hand is then segmented out with SAM, and the inpainting model LaMa fills in the removed region. A robot hand is posed according to the detected pose and keypoints, with the camera perspective adjusted to match the original image. Finally, the rendered robot hand is overlaid onto the frame, ensuring accurate alignment with the original human hand. A per-frame sketch of this pipeline is given below.
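The following is a minimal sketch of one augmentation step. The callables detect_hand_hamer, segment_hand_sam, inpaint_lama, and render_robot_overlay are hypothetical wrappers standing in for HaMeR, SAM, LaMa, and the simulator renderer respectively; they are not the actual APIs of those models.

import numpy as np

def augment_frame(frame,
                  detect_hand_hamer,     # hypothetical: frame -> (hand_pose, keypoints_3d, intrinsics)
                  segment_hand_sam,      # hypothetical: frame, keypoints -> binary hand mask (H, W)
                  inpaint_lama,          # hypothetical: frame, mask -> hand-free background image
                  render_robot_overlay): # hypothetical: pose, keypoints, intrinsics -> (robot_rgb, robot_mask)
    """Replace the human hand in a single first-person frame with a rendered robot hand."""
    # 1. Recover 3D hand pose, keypoints, and camera parameters (HaMeR in the paper).
    hand_pose, keypoints_3d, intrinsics = detect_hand_hamer(frame)

    # 2. Segment the hand region (SAM) and inpaint it away (LaMa).
    hand_mask = segment_hand_sam(frame, keypoints_3d)
    background = inpaint_lama(frame, hand_mask)

    # 3. Pose a robot in simulation to mimic the hand and render it with a camera
    #    whose intrinsics match the original video camera.
    robot_rgb, robot_mask = render_robot_overlay(hand_pose, keypoints_3d, intrinsics)

    # 4. Composite the rendered robot over the inpainted background.
    alpha = (robot_mask[..., None] > 0).astype(np.float32)
    augmented = (alpha * robot_rgb + (1.0 - alpha) * background).astype(np.uint8)
    return augmented

Running this per frame over a first-person video yields the augmented clips used for pre-training.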
Real-world Demonstrations
place_and_close_box
pick_and_place
stack_cube