DeLTa: Demonstration and Language-Guided Novel Transparent Object Manipulation

Our goal is to enable robots to execute manipulation tasks on novel transparent objects from task instructions, leveraging trajectories from single-object human demonstrations.

Human Demonstration → Robot Manipulation

Abstract

Despite the prevalence of transparent object interactions in everyday human life, robotic manipulation research on transparent objects remains limited to short-horizon tasks and basic grasping capabilities. Although some methods have partially addressed these issues, most generalize poorly to novel objects and are insufficient for precise long-horizon robot manipulation. To address these limitations, we propose DeLTa (Demonstration and Language-Guided Novel Transparent Object Manipulation), a novel framework that integrates depth estimation, 6D pose estimation, and vision-language planning for precise long-horizon manipulation of transparent objects guided by natural language task instructions. A key advantage of our method is its single-demonstration approach, which generalizes 6D trajectories to novel transparent objects without requiring category-level priors or additional training. Additionally, we present a VLM-based planner that refines planning for long-horizon object manipulation tasks. Through comprehensive evaluation, we demonstrate that our method significantly outperforms existing transparent object manipulation approaches, particularly in long-horizon scenarios requiring precise manipulation capabilities.

Long-Horizon Task Demonstration

Method

Overview of our DeLTa framework. Our framework comprises three key components for precisely manipulating transparent objects in long-horizon tasks from a language instruction. Using vision foundation models, it extracts 6D object trajectories from single-object demonstration videos. These trajectories are transferred to novel objects by our last-inch motion planner without requiring additional training.
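To give a rough sense of what transferring a demonstrated 6D trajectory to a new scene can involve, the sketch below re-expresses a sequence of object poses relative to a target object and replays it at a new target pose. This is a generic rigid-transform retargeting step written under our own assumptions, not the DeLTa last-inch motion planner. The function names (`make_pose`, `retarget_trajectory`) and all poses are hypothetical and chosen only for illustration.

```python
import numpy as np


def make_pose(yaw_rad, translation):
    """Build a 4x4 homogeneous transform from a yaw angle and a translation.

    Hypothetical helper for this illustration only.
    """
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    pose = np.eye(4)
    pose[:3, :3] = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    pose[:3, 3] = translation
    return pose


def retarget_trajectory(demo_obj_poses, demo_target_pose, new_target_pose):
    """Replay a demonstrated object trajectory relative to a new target pose.

    Each demonstrated world-frame pose is first expressed in the frame of the
    demonstration's target object, then mapped back to the world frame using
    the target pose observed in the new scene.
    """
    demo_target_inv = np.linalg.inv(demo_target_pose)
    return [new_target_pose @ demo_target_inv @ pose for pose in demo_obj_poses]


# Assumed demonstration: an object is lowered toward a target at x = 1 m.
demo_target = make_pose(0.0, [1.0, 0.0, 0.0])
demo_traj = [
    make_pose(0.0, [1.0, 0.0, 0.2]),
    make_pose(0.0, [1.0, 0.0, 0.1]),
]

# Assumed new scene: the target sits elsewhere and is rotated by 90 degrees.
new_target = make_pose(np.pi / 2, [0.0, 2.0, 0.0])
new_traj = retarget_trajectory(demo_traj, demo_target, new_target)
```

The retargeted poses keep the same pose relative to the target as in the demonstration, which is the property such a transfer step relies on. In practice the target pose in the new scene would come from a 6D pose estimator rather than being written by hand.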