SADiff

Skill-Aware Diffusion for Generalizable Robotic Manipulation

Anonymous Submission

Dataset

Code (Coming Soon)

SADiff Framework

Overview of the proposed Skill-Aware Diffusion (SADiff). The pipeline is structured into three distinct phases: (1) The encoding phase, where the skill-aware encoding module utilizes learnable skill tokens to interact with multimodal inputs and extract skill-specific information; (2) The generation phase, in which a skill-constrained diffusion model generates object-centric motion flow conditioned on the skill-aware token sequences, optimized by both denoising and two skill-specific auxiliary losses; and (3) The execution phase, which employs a skill-retrieval transformation strategy to translate the generated 2D motion flow into executable 3D trajectories by leveraging skill-specific priors.

Section A: Skill-Aware Encoding Module

To integrate skill-specific information with multimodal inputs, we design a skill-aware encoding module. It fuses the image, the language instruction, and the bounding boxes of relevant objects with learnable skill tokens through attention-based interactions, producing skill-aware token sequences.
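The attention-based interaction can be pictured with a minimal sketch in which each learnable skill token acts as a query over the multimodal tokens. This is an illustration under stated assumptions, not the module's actual architecture: the single head, the identity projections, the 4-dimensional toy embeddings, and the names `skill_cross_attention`, `image_tok`, `lang_tok`, and `bbox_tok` are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def skill_cross_attention(skill_tokens, input_tokens):
    """Let each skill token (query) attend over the multimodal tokens.

    Assumption: one head and identity key/value projections, for illustration.
    Returns one attended vector per skill token.
    """
    dim = len(skill_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in skill_tokens:
        # Scaled dot-product score between this skill token and every input token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in input_tokens]
        weights = softmax(scores)
        # Weighted sum of the input tokens (used directly as values here).
        attended = [
            sum(w * v[d] for w, v in zip(weights, input_tokens)) for d in range(dim)
        ]
        outputs.append(attended)
    return outputs


# Hypothetical toy embeddings for the three input modalities.
image_tok = [1.0, 0.0, 0.0, 0.0]
lang_tok = [0.0, 1.0, 0.0, 0.0]
bbox_tok = [0.0, 0.0, 1.0, 0.0]

# Two hypothetical skill tokens; in practice these would be learned parameters.
skill_tokens = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0]]

skill_aware = skill_cross_attention(skill_tokens, [image_tok, lang_tok, bbox_tok])
```

In this toy setup the first skill token is most similar to the image embedding, so its output places the largest weight on the image dimension. A real module would likely use learned projections and multiple heads.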

Section B: Skill-Constrained Flow Generation

To generate a precise 2D object motion flow aligned with a specific skill, we propose a skill-constrained diffusion model. The model is trained by jointly optimizing a skill classification loss, a skill contrastive loss, and a denoising loss, which encourage accurate skill selection, semantic alignment, and precise flow reconstruction, respectively.
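As a rough sketch of how the three terms might be combined, the snippet below sums a denoising loss with weighted classification and contrastive terms. The specific forms are assumptions rather than the formulation used in SADiff: mean-squared error on the predicted noise, softmax cross-entropy over skill logits, an InfoNCE-style contrastive term with cosine similarity, and the weights `w_cls` and `w_con` are all hypothetical choices.

```python
import math


def mse(pred, target):
    """Mean-squared error, used here as the denoising loss (assumed form)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(feature, skill_embeddings, positive_idx, temperature=0.1):
    """InfoNCE-style loss: pull the feature toward its own skill embedding."""
    logits = [cosine(feature, e) / temperature for e in skill_embeddings]
    return cross_entropy(logits, positive_idx)


def total_loss(noise_pred, noise, skill_logits, skill_label,
               feature, skill_embeddings, w_cls=0.1, w_con=0.1):
    """Denoising loss plus weighted auxiliary skill losses (hypothetical weights)."""
    l_denoise = mse(noise_pred, noise)
    l_cls = cross_entropy(skill_logits, skill_label)
    l_con = info_nce(feature, skill_embeddings, skill_label)
    return l_denoise + w_cls * l_cls + w_con * l_con


# Toy example with two skills; all values are made up for illustration.
noise = [0.1, -0.2]
skill_embeddings = [[1.0, 0.0], [0.0, 1.0]]
loss_correct = total_loss(noise, noise, [5.0, 0.0], 0, [1.0, 0.0], skill_embeddings)
loss_wrong = total_loss(noise, noise, [5.0, 0.0], 1, [1.0, 0.0], skill_embeddings)
```

With a perfect noise prediction, the total is driven entirely by the auxiliary terms, so a sample whose logits and features agree with its skill label (`loss_correct`) scores lower than one labeled with the other skill (`loss_wrong`).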

Section C: Retrieval-Enhanced Transformation

To achieve an accurate transformation from 2D flow to executable 3D actions, we introduce skill-specific trajectory priors into the optimization framework, leveraging them as high-level constraints to guide the optimization toward skill-consistent motion patterns with improved accuracy and physical consistency.
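One plausible way to write such a prior-guided optimization is as a reprojection objective with a regularizer toward the retrieved prior. The formulation below is an illustrative assumption, not the objective used in SADiff. Every symbol is hypothetical: $T_t$ is the object pose at step $t$, $p_i$ are 3D points on the object, $\pi$ is the camera projection, $\hat{u}_{i,t}$ is the 2D location of point $i$ given by the generated flow, $\bar{T}^{(k)}_t$ is the retrieved trajectory prior for skill $k$, $d(\cdot,\cdot)$ is a distance between poses, and $\lambda$ weights the prior.

```latex
\min_{T_{1:H}} \;
\sum_{t=1}^{H} \sum_{i=1}^{N}
\left\| \pi\!\left(T_t\, p_i\right) - \hat{u}_{i,t} \right\|_2^2
\;+\;
\lambda \sum_{t=1}^{H} d\!\left(T_t,\ \bar{T}^{(k)}_t\right)^2
```

Under this reading, the first term keeps the recovered 3D motion consistent with the generated 2D flow, while the second term pulls the solution toward skill-consistent motion patterns and helps resolve the depth ambiguity that the 2D flow alone leaves open.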

Experiment Demonstration

Simulation Experiments

1. Within-Distribution Experiment