Gawtam Chithra Ramesh, David Blanco-Mulero, Yifei Dong, Júlia Borràs, Carme Torras, Florian T. Pokorny
Submitted to the ROMADO'25 workshop at IROS
Vision-Language Models (VLMs) can describe scenes in natural language, supporting tasks such as robot planning and action grounding. However, they struggle with deformable object manipulation (DOM), where reasoning about motion, interaction, and deformation is critical. In this work, we investigate whether guiding VLMs with a DOM taxonomy can yield structured reasoning about DOM tasks. We evaluate our approach on three challenging DOM tasks: towel twisting, meat phantom transport, and cloth edge tracing. Our results demonstrate the potential of taxonomy-guided VLMs to interpret these tasks without fine-tuning or curated datasets.
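To make the idea concrete, the sketch below shows one plausible form of taxonomy-guided prompting: the taxonomy's categories are prepended to the query so the VLM's free-form description is scaffolded into structured fields. Everything here is illustrative; the taxonomy excerpt, build_prompt, and query_vlm are hypothetical placeholders, not the authors' implementation or a real API.

# Minimal sketch of taxonomy-guided VLM prompting. All names below
# (DOM_TAXONOMY, build_prompt, query_vlm) are illustrative placeholders,
# not the paper's released code.

DOM_TAXONOMY = """\
DOM taxonomy (illustrative excerpt):
- Contact: grasp points, number of grippers, contact type
- Motion: trajectory primitive (twist, fold, drag, lift, trace)
- Deformation: expected shape change (crease, stretch, none)
"""

def build_prompt(task: str) -> str:
    """Scaffold the VLM's answer with the taxonomy's categories."""
    return (
        DOM_TAXONOMY
        + f"\nTask: {task}\n"
        + "Describe the scene and the required manipulation, "
        + "addressing each taxonomy category in turn."
    )

def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for any image+text chat endpoint; wire a real
    VLM client in here (this stub only raises)."""
    raise NotImplementedError

if __name__ == "__main__":
    # Prints the structured prompt for one of the paper's three tasks.
    print(build_prompt("Twist the towel into a tight roll."))

The point of the scaffold is that the taxonomy constrains the model's otherwise free-form scene description into named categories that downstream planning can parse.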