Learning robot manipulation from abundant human videos offers a scalable alternative to costly robot-specific data collection. However, visual, morphological, and physical domain gaps hinder direct imitation. To bridge these gaps effectively, we propose ImMimic, an embodiment-agnostic co-training framework that leverages both human videos and a small set of teleoperated robot demonstrations. ImMimic uses Dynamic Time Warping (DTW) with either action- or visual-based mapping to map retargeted human hand poses to robot joints, followed by MixUp interpolation between paired human and robot trajectories. Our key insights are (1) retargeted human hand trajectories provide informative action labels, and (2) interpolation over the mapped data creates intermediate domains that facilitate smooth domain adaptation during co-training. Evaluations on four real-world manipulation tasks (Pick and Place, Push, Hammer, Flip) across four robotic embodiments (Robotiq, Fin Ray, Allegro, Ability) show that ImMimic improves task success rates and execution smoothness, highlighting its efficacy in bridging the domain gap for robust robot manipulation.
Our key insights are:
Beyond visual context, retargeted human hand trajectories can serve as action labels for human demonstrations.
Creating intermediate domains via interpolation leads to robust adaptation.
Establishing an effective mapping between human and robot data for interpolation is essential for co-training.
We collect, map, and interpolate human and robot data through the following steps:
(a) Robot demonstrations are collected using visual teleoperation.
(b) Human actions are extracted and retargeted from videos.
(c, d) Using visual or action-based Dynamic Time Warping (DTW), we map the retargeted human and robot trajectories.
(e) Through MixUp, the mapped human-robot pairs are interpolated in both the latent space and the action space, to generate new interpolated human data for training.
Finally, the interpolated human data is co-trained alongside the robot data. See the diagram below for an overview of the co-training pipeline.
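The alignment in steps (c, d) can be sketched with classic dynamic programming over a pairwise distance matrix. The following is a minimal, pure-Python illustration of action-based DTW; the function and variable names are ours and do not come from the authors' implementation:

```python
import math

def dtw_mapping(human_traj, robot_traj):
    """Align two action sequences with classic DTW; return matched index pairs.

    human_traj: list of retargeted human actions (each a tuple of floats).
    robot_traj: list of teleoperated robot actions in the same action space.
    """
    n, m = len(human_traj), len(robot_traj)
    INF = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning the first i human
    # steps with the first j robot steps.
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(human_traj[i - 1], robot_traj[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # match both steps
                                 cost[i - 1][j],      # advance human only
                                 cost[i][j - 1])      # advance robot only
    # Backtrack from (n, m) to recover the optimal alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]]
        k = moves.index(min(moves))
        if k == 0:
            i, j = i - 1, j - 1
        elif k == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

Each returned pair `(i, j)` matches human timestep `i` with robot timestep `j`; these matched pairs are what the subsequent MixUp interpolation operates on. The visual variant would replace the Euclidean action distance with a distance between visual features.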
For robot demonstrations, we train the policy on agent-view and wrist-view images encoded by a ResNet, along with proprioception. These are combined into the observation condition used to predict future actions.
For human demonstrations, we train the same diffusion policy using human videos.
A hand pose retargeting module generates retargeted actions, which serve as both the future actions and the proprioception for training.
After mapping with DTW, we apply MixUp to each human sample and its paired robot data.
This interpolation enables the human data to adapt smoothly toward the robot data.
The model is optimized with the sum of the reconstruction losses.
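The MixUp step above amounts to a convex combination of each DTW-matched human/robot pair, applied identically in the latent (observation-condition) space and the action space. Below is a hedged, stdlib-only sketch; the function names and the Beta-distribution parameter are illustrative assumptions, not the authors' code:

```python
import random

def mixup_pair(human_latent, human_action, robot_latent, robot_action, alpha=0.5):
    """Interpolate one DTW-matched human/robot pair with a shared weight.

    Samples a mixing coefficient lam ~ Beta(alpha, alpha) in (0, 1) and
    blends latents and actions with the same lam, producing an
    intermediate-domain training sample.
    """
    lam = random.betavariate(alpha, alpha)
    def mix(h, r):
        return [lam * a + (1 - lam) * b for a, b in zip(h, r)]
    return mix(human_latent, robot_latent), mix(human_action, robot_action)
```

Because `lam` is resampled per pair, repeated epochs populate a continuum of intermediate domains between human and robot data, which is what enables the smooth domain flow described below.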
t-SNE visualization of input conditions at each timestep from human and robot datasets during training shows that ImMimic-A generates a smooth domain flow for the human data compared to Vanilla Co-Training, enabling effective domain adaptation.
We collect both human and robot demos for the four tasks.
For human demonstrations, we directly record videos with the agent-view camera.
For robot demonstrations, vision-based teleoperation is used to control the robot.
Human: Pick and Place
Human: Push
Human: Hammer
Human: Flip
Robotiq: Pick and Place
Robotiq: Push
Robotiq: Hammer
Robotiq: Flip
Fin Ray: Pick and Place
Fin Ray: Push
Fin Ray: Hammer
Fin Ray: Flip
Allegro: Pick and Place
Allegro: Push
Allegro: Hammer
Allegro: Flip
Ability: Pick and Place
Ability: Push
Ability: Hammer
Ability: Flip
We validate ImMimic on four diverse manipulation tasks (Pick and Place, Push, Hammer, Flip) across four robotic embodiments (Robotiq Gripper, Fin Ray Gripper, Allegro Hand, Ability Hand). Below are the results for ImMimic-A (action-based mapping):
Blue indicates success, and Red indicates failure.
Robotiq: Pick and Place
Fin Ray: Pick and Place
Allegro: Pick and Place
Ability: Pick and Place
Robotiq: Push
Fin Ray: Push
Allegro: Push
Ability: Push
Robotiq: Hammer
Fin Ray: Hammer
Allegro: Hammer
Ability: Hammer
Robotiq: Flip
Fin Ray: Flip
Allegro: Flip
Ability: Flip
Success rates of Robot-Only, Co-Training, and ImMimic-A across four embodiments and four tasks. (100 human demos, 5 robot demos)
Blue indicates success, and Red indicates failure.
Co-Training
Robotiq: Pick and Place
Random Mapping
Robotiq: Pick and Place
Robot-Only
Robotiq: Pick and Place
ImMimic-A (Ours)
Robotiq: Pick and Place
Robotiq: Flip
Robotiq: Flip
Robotiq: Flip
Robotiq: Flip
Ability: Pick and Place
Ability: Pick and Place
Ability: Pick and Place
Ability: Pick and Place
Ability: Flip
Ability: Flip
Ability: Flip
Ability: Flip
Success rates of Co-Training, Random Mapping, Robot-Only, and ImMimic-A. (100 human demos, 5 robot demos)
Blue indicates success, and Red indicates failure.
ImMimic-V
Robotiq: Pick and Place
ImMimic-A
Robotiq: Pick and Place
Robotiq: Flip
Robotiq: Flip
Ability: Pick and Place
Ability: Pick and Place
Ability: Flip
Ability: Flip
Success rates of ImMimic-V and ImMimic-A. (100 human demos, 5 robot demos)
Sample efficiency of ImMimic-A with 0, 50, 100, 200 human demonstrations on Pick and Place and Flip.
Sample efficiency of Robot-Only and ImMimic-A with 1, 5, 20 robot demonstrations on Pick and Place and Flip.