Learning robot manipulation from abundant human videos offers a scalable alternative to costly robot-specific data collection. However, visual, morphological, and physical domain gaps hinder direct imitation. To bridge these gaps effectively, we propose ImMimic, an embodiment-agnostic co-training framework that leverages both human videos and a small set of teleoperated robot demonstrations. ImMimic uses Dynamic Time Warping (DTW) with either action- or visual-based mapping to map retargeted human hand poses to robot joints, followed by MixUp interpolation between paired human and robot trajectories. Our key insights are (1) retargeted human hand trajectories provide informative action labels, and (2) interpolation over the mapped data creates intermediate domains that facilitate smooth domain adaptation during co-training. Evaluations on four real-world manipulation tasks (Pick and Place, Push, Hammer, Flip) across four robotic embodiments (Robotiq, Fin Ray, Allegro, Ability) show that ImMimic improves task success rates and execution smoothness, highlighting its efficacy in bridging the domain gap for robust robot manipulation.
Our key insights are:
Beyond visual context, retargeted human hand trajectories can serve as action labels for human demonstrations.
Creating intermediate domains via interpolation leads to robust adaptation.
Establishing an effective mapping between human and robot data for interpolation is essential for co-training.
We collect, map, and interpolate human and robot data through the following steps:
(a) Robot demonstrations are collected using visual teleoperation.
(b) Human actions are extracted and retargeted from videos.
(c, d) Using visual or action-based Dynamic Time Warping (DTW), we map the retargeted human and robot trajectories.
(e) Through MixUp, the mapped human-robot pairs are interpolated in both the latent space and the action space, to generate new interpolated human data for training.
Finally, the interpolated human data is co-trained alongside the robot data. See the diagram below for an overview of the co-training pipeline.
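The alignment in steps (c, d) can be sketched with classic dynamic programming over a pairwise distance matrix. The following is a minimal, pure-Python illustration of action-based DTW; the function and variable names are ours and do not come from the authors' implementation:

```python
import math

def dtw_mapping(human_traj, robot_traj):
    """Align two action sequences with classic DTW; return matched index pairs.

    human_traj: list of retargeted human actions (each a tuple of floats).
    robot_traj: list of teleoperated robot actions in the same action space.
    """
    n, m = len(human_traj), len(robot_traj)
    INF = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning the first i human
    # steps with the first j robot steps.
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(human_traj[i - 1], robot_traj[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # match both steps
                                 cost[i - 1][j],      # advance human only
                                 cost[i][j - 1])      # advance robot only
    # Backtrack from (n, m) to recover the optimal alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]]
        k = moves.index(min(moves))
        if k == 0:
            i, j = i - 1, j - 1
        elif k == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

Each returned pair `(i, j)` matches human timestep `i` with robot timestep `j`; these matched pairs are what the subsequent MixUp interpolation operates on. The visual variant would replace the Euclidean action distance with a distance between visual features.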
For robot demonstrations, we train the policy on agent-view and wrist-view images encoded by a ResNet, along with proprioception. These are combined into the observation condition used to predict future actions.
For human demonstrations, we train the same diffusion policy using human videos.
A hand pose retargeting module generates retargeted actions, which serve as both the future actions and the proprioception for training.
After mapping with DTW, we apply MixUp to each human sample and its paired robot data.
This interpolation enables the human data to adapt smoothly toward the robot data.
The model is optimized with the sum of the reconstruction losses.
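The MixUp step above amounts to a convex combination of each DTW-matched human/robot pair, applied identically in the latent (observation-condition) space and the action space. Below is a hedged, stdlib-only sketch; the function names and the Beta-distribution parameter are illustrative assumptions, not the authors' code:

```python
import random

def mixup_pair(human_latent, human_action, robot_latent, robot_action, alpha=0.5):
    """Interpolate one DTW-matched human/robot pair with a shared weight.

    Samples a mixing coefficient lam ~ Beta(alpha, alpha) in (0, 1) and
    blends latents and actions with the same lam, producing an
    intermediate-domain training sample.
    """
    lam = random.betavariate(alpha, alpha)
    def mix(h, r):
        return [lam * a + (1 - lam) * b for a, b in zip(h, r)]
    return mix(human_latent, robot_latent), mix(human_action, robot_action)
```

Because `lam` is resampled per pair, repeated epochs populate a continuum of intermediate domains between human and robot data, which is what enables the smooth domain flow described below.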
t-SNE visualization of input conditions at each timestep from human and robot datasets during training shows that ImMimic-A generates a smooth domain flow for the human data compared to Vanilla Co-Training, enabling effective domain adaptation.
We collect both human and robot demos for the four tasks.
For human demonstrations, we directly record videos with the agent-view camera.
For robot demonstrations, vision-based teleoperation is used to control the robot.
Human: Pick and Place
Human: Push
Human: Hammer
Human: Flip
Robotiq: Pick and Place
Robotiq: Push
Robotiq: Hammer
Robotiq: Flip
Fin Ray: Pick and Place
Fin Ray: Push
Fin Ray: Hammer
Fin Ray: Flip
Allegro: Pick and Place
Allegro: Push
Allegro: Hammer
Allegro: Flip
Ability: Pick and Place
Ability: Push
Ability: Hammer
Ability: Flip
We validate ImMimic on four diverse manipulation tasks (Pick and Place, Push, Hammer, Flip) across four robotic embodiments (Robotiq Gripper, Fin Ray Gripper, Allegro Hand, Ability Hand). Below are the results for ImMimic-A (action-based mapping):
Blue indicates success, and Red indicates failure.
Robotiq: Pick and Place
Fin Ray: Pick and Place
Allegro: Pick and Place
Ability: Pick and Place
Robotiq: Push
Fin Ray: Push
Allegro: Push
Ability: Push
Robotiq: Hammer
Fin Ray: Hammer
Allegro: Hammer
Ability: Hammer
Robotiq: Flip
Fin Ray: Flip
Allegro: Flip
Ability: Flip
Success rates of Robot-Only, Co-Training, and ImMimic-A across four embodiments and four tasks. (100 human demos, 5 robot demos)
Blue indicates success, and Red indicates failure.
Co-Training
Robotiq: Pick and Place
Random Mapping
Robotiq: Pick and Place
Robot-Only
Robotiq: Pick and Place
ImMimic-A (Ours)
Robotiq: Pick and Place
Robotiq: Flip
Robotiq: Flip
Robotiq: Flip
Robotiq: Flip
Ability: Pick and Place
Ability: Pick and Place
Ability: Pick and Place
Ability: Pick and Place
Ability: Flip
Ability: Flip
Ability: Flip
Ability: Flip
Success rates of Co-Training, Random Mapping, Robot-Only, and ImMimic-A. (100 human demos, 5 robot demos)
Blue indicates success, and Red indicates failure.
ImMimic-V
Robotiq: Pick and Place
ImMimic-A
Robotiq: Pick and Place
Robotiq: Flip
Robotiq: Flip
Ability: Pick and Place
Ability: Pick and Place
Ability: Flip
Ability: Flip
Success rates of ImMimic-V and ImMimic-A. (100 human demos, 5 robot demos)
Sample efficiency of ImMimic-A with 0, 50, 100, 200 human demonstrations on Pick and Place and Flip.
Sample efficiency of Robot-Only and ImMimic-A with 1, 5, 20 robot demonstrations on Pick and Place and Flip.