Human Demonstration → Robot Manipulation
Despite the prevalence of transparent objects in everyday environments, research on robotic manipulation of transparent objects remains limited to short-horizon grasping and generalizes poorly to novel objects. To address this, we propose DeLTa (Demonstration and Language-Guided Novel Transparent Object Manipulation), a framework that integrates transparent object perception, demonstration-based trajectory retargeting, and vision-language planning for precise long-horizon manipulation from natural language instructions. A key advantage of our method is that it needs only a single demonstration: it retargets object-centric 6D trajectories to novel transparent objects without category-level priors or additional training. We further present a task planner that refines VLM-generated plans under single-arm, eye-in-hand constraints and composes 15 primitive actions derived from 6 human-demonstrated skill trajectories for long-horizon manipulation. Across 122 real-world trials on 6 tasks with 31 objects under varied lighting, occlusion, and tight-space scenarios, DeLTa significantly outperforms existing methods, particularly in long-horizon tasks requiring precise manipulation.
Overview of our DeLTa framework. Our framework comprises three key components for precisely manipulating transparent objects in long-horizon tasks from a language instruction. Using vision foundation models, we extract 6D object trajectories from single-object demonstration videos. These trajectories transfer to novel objects through our last-inch motion planner without requiring additional training.
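To make the idea of retargeting an object-centric 6D trajectory concrete, the following is a minimal sketch, not the DeLTa implementation. It assumes poses are 4x4 homogeneous transforms and that the demonstrated trajectory is stored relative to a target object (for example, a coaster), so it can be replayed wherever that target appears in a new scene. The helper names `make_pose`, `to_object_centric`, and `retarget` are hypothetical and chosen only for illustration.

```python
import numpy as np


def make_pose(yaw_rad, translation):
    """Build a 4x4 homogeneous transform from a yaw angle (about z) and a translation."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    pose = np.eye(4)
    pose[:3, :3] = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    pose[:3, 3] = translation
    return pose


def to_object_centric(world_traj, world_target_demo):
    """Express each demonstrated world-frame pose in the target object's frame."""
    target_inv = np.linalg.inv(world_target_demo)
    return [target_inv @ pose for pose in world_traj]


def retarget(relative_traj, world_target_new):
    """Replay a target-relative trajectory at the target's pose in a new scene."""
    return [world_target_new @ pose for pose in relative_traj]


# Hypothetical demonstration: an object is lowered onto a target located at (0.5, 0, 0).
demo_target = make_pose(0.0, [0.5, 0.0, 0.0])
demo_traj = [make_pose(0.0, [0.5, 0.0, z]) for z in (0.20, 0.10, 0.02)]

# Hypothetical new scene: the target has moved and is rotated by 90 degrees.
new_target = make_pose(np.pi / 2, [0.2, 0.3, 0.0])
new_traj = retarget(to_object_centric(demo_traj, demo_target), new_target)

# The final pose of the retargeted trajectory sits just above the new target.
final_position = new_traj[-1][:3, 3]
print(final_position)
```

Because the trajectory is stored relative to the target, the same demonstration can be reused for any new target pose with no retraining; estimating those poses for transparent objects is the perception problem the framework addresses and is outside the scope of this sketch.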
Place the cola on the coaster
Place the water on the coaster
Place the blue bottle on the coaster
Arrange the cola on the shelf
Arrange the cola and water on the shelf
Make turquoise liquid in the measuring bowl
Make green liquid in the cylinder
All experimental videos were played at 5× speed.