Semantics-aware Motion Retargeting with Vision-Language Models

Capturing and preserving motion semantics is essential to motion retargeting between animation characters. However, most previous works neglect semantic information or rely on human-designed joint-level representations. Here, we present a novel Semantics-aware Motion reTargeting (SMT) method that leverages vision-language models to extract and maintain meaningful motion semantics. We utilize a differentiable module to render 3D motions; the high-level motion semantics are then incorporated into the retargeting process by feeding the rendered images to the vision-language model and aligning the extracted semantic embeddings. To preserve both fine-grained motion details and high-level semantics, we adopt a two-stage pipeline consisting of skeleton-aware pre-training and fine-tuning with semantics and geometry constraints. Experimental results demonstrate that the proposed method produces high-quality retargeting results while accurately preserving motion semantics.
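As a rough illustration of the semantics-alignment idea described above, here is a minimal PyTorch-style sketch. The names `retargeter`, `renderer`, `vlm_encoder`, and `geometry_loss` are hypothetical placeholders (for the skeleton-aware network, the differentiable renderer, a frozen vision-language image encoder such as the vision tower of BLIP-2 or CLIP, and the geometry constraints, respectively); the paper's actual modules and losses may differ.

```python
import torch
import torch.nn.functional as F

def semantic_alignment_loss(src_images, tgt_images, vlm_encoder):
    """Align VLM embeddings of the source and retargeted renderings.

    `vlm_encoder` is a frozen image encoder returning (B, D) embeddings.
    Gradients reach the retargeted motion through the differentiable
    renderer that produced `tgt_images`, not through the encoder weights.
    """
    with torch.no_grad():  # the source rendering is a fixed reference
        src_emb = F.normalize(vlm_encoder(src_images), dim=-1)
    tgt_emb = F.normalize(vlm_encoder(tgt_images), dim=-1)
    return (1.0 - (src_emb * tgt_emb).sum(dim=-1)).mean()  # cosine distance

def finetune_step(retargeter, renderer, vlm_encoder, src_motion, src_images,
                  optimizer, w_sem=1.0, w_geo=1.0, geometry_loss=None):
    """One fine-tuning step (stage two): semantics + geometry constraints."""
    tgt_motion = retargeter(src_motion)   # retarget to the new skeleton
    tgt_images = renderer(tgt_motion)     # differentiable 3D -> 2D rendering
    loss = w_sem * semantic_alignment_loss(src_images, tgt_images, vlm_encoder)
    if geometry_loss is not None:         # e.g. interpenetration/contact terms
        loss = loss + w_geo * geometry_loss(tgt_motion)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the semantic loss is only applied during the fine-tuning stage; the skeleton-aware pre-training stage would train `retargeter` on joint-level reconstruction terms alone.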

Framework

New Cases (Updated)

whatever_0.5.mp4
react_0.5.mp4
hand_up0.5.mp4
cry_0.5.mp4
baseball_front_0.5.mp4
baseball_side_0.5.mp4

New Ablation Study: SAN+VLM

(Note that SAN neglects the finger movements.)

ablation_baseball_0.5.mp4
ablation_greeting_0.5.mp4
ablation_salute.mp4

New Ablation Study: Different Numbers of Views

1月29日 (2).mp4
1月29日.mp4
1月29日 (1).mp4

BLIP-2 Responses to Failure Cases of Baseline Methods

Comparison with State-of-the-Art Methods

Greeting_final.mp4
Pitching_final.mp4
Salute_final.mp4

Ablation Studies

ablation_final.mp4
feature_ablation.mp4

Video Motion Retargeting

video.mp4

More Cases

react.avi
crazy.avi
clapping.avi
fireball.avi