Human Demonstration → Robot Manipulation
Despite the prevalence of transparent objects in everyday human life, research on robotic manipulation of transparent objects remains limited to short-horizon tasks and basic grasping capabilities. Although some methods have partially addressed these challenges, most remain limited in their generalizability to novel objects and are insufficient for precise, long-horizon robot manipulation. To address these limitations, we propose DeLTa (Demonstration and Language-Guided Novel Transparent Object Manipulation), a novel framework that integrates depth estimation, 6D pose estimation, and vision-language planning for precise, long-horizon manipulation of transparent objects guided by natural language task instructions. A key advantage of our method is its single-demonstration approach, which generalizes 6D trajectories to novel transparent objects without requiring category-level priors or additional training. In addition, we present a VLM-based planner that refines plans for long-horizon object manipulation tasks. Through comprehensive evaluations, we demonstrate that our method significantly outperforms existing transparent object manipulation approaches, particularly in long-horizon scenarios requiring precise manipulation.
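For illustration, the sketch below shows one way these components could compose at inference time. Every name in it (plan_subtasks, estimate_pose_6d, transfer_demo_trajectory) is a hypothetical stand-in rather than DeLTa's actual API; each stub returns dummy data so the control flow runs end to end.

```python
from typing import List, Tuple

Pose = Tuple[float, float, float, float, float, float]  # x, y, z, roll, pitch, yaw

def plan_subtasks(instruction: str) -> List[str]:
    # Hypothetical VLM planner stub: a real system would query a
    # vision-language model to decompose the long-horizon instruction.
    return [f"approach: {instruction}", f"grasp: {instruction}", f"place: {instruction}"]

def estimate_pose_6d(subtask: str) -> Pose:
    # Hypothetical perception stub standing in for depth estimation plus
    # 6D pose estimation of the transparent target object.
    return (0.3, 0.0, 0.1, 0.0, 0.0, 0.0)

def transfer_demo_trajectory(target: Pose) -> List[Pose]:
    # Hypothetical transfer stub: re-anchors a single demonstrated 6D
    # trajectory at the target pose (see the numpy sketch further below).
    return [target]

def run(instruction: str) -> None:
    for subtask in plan_subtasks(instruction):
        target = estimate_pose_6d(subtask)
        for waypoint in transfer_demo_trajectory(target):
            print(subtask, "->", waypoint)  # a real system would execute on the robot

run("Place the cola on the coaster")
```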
Overview of our DeLTa framework. Our framework comprises three key components for precisely manipulating transparent objects in long-horizon tasks specified by language instructions. By leveraging vision foundation models, 6D object trajectories are extracted from a single demonstration video. These trajectories are transferred to novel objects through our last-inch motion planner without requiring additional training.
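As a concrete illustration of the transfer step, the following is a minimal numpy sketch of re-anchoring a demonstrated 6D trajectory at a novel object's estimated pose. It assumes a common object-centric formulation (poses as 4x4 homogeneous matrices, motion expressed relative to the demo object's initial pose); this is an assumption for illustration, not necessarily the paper's exact method.

```python
import numpy as np

def transfer_trajectory(T_demo: np.ndarray, T_novel: np.ndarray) -> np.ndarray:
    """Re-anchor a demonstrated object trajectory at a novel object's pose.

    T_demo:  (N, 4, 4) homogeneous object poses over the demonstration.
    T_novel: (4, 4)    estimated 6D pose of the novel object.
    Returns: (N, 4, 4) poses applying the same relative motion to the novel object.
    """
    # Motion of each demo pose relative to the demo object's initial pose.
    rel = np.linalg.inv(T_demo[0]) @ T_demo
    # Replay that relative motion starting from the novel object's pose.
    return T_novel @ rel

# Example: a 4-step vertical lift transferred to an object at a new location.
def pose(x: float, y: float, z: float) -> np.ndarray:
    T = np.eye(4)
    T[:3, 3] = [x, y, z]
    return T

demo = np.stack([pose(0.0, 0.0, 0.05 * i) for i in range(4)])
novel = pose(0.3, 0.2, 0.0)
print(transfer_trajectory(demo, novel)[:, :3, 3])  # same lift, replayed at (0.3, 0.2)
```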
Place the cola on the coaster
Place the water on the coaster
Place the blue bottle on the coaster
Make turquoise liquid in the measuring bowl
Make green liquid in the cylinder
Arrange the cola on the shelf
Arrange the cola and water on the shelf
All videos show experiments at 5× speed.