Gawtam Chithra Ramesh¹*, David Blanco-Mulero²*, Yifei Dong¹, Júlia Borràs², Carme Torras², Florian T. Pokorny¹
*Equal contribution
¹RPL, KTH, ²Institut de Robòtica i Informàtica Industrial, CSIC-UPC
Presented at the ROMADO'25 IROS workshop | Paper
Vision-Language Models (VLMs) can describe scenes in natural language, supporting tasks such as robot planning and action grounding. However, they struggle in deformable object manipulation (DOM), where reasoning about motion, interaction, and deformation is critical. In this work, we investigate whether guiding language models with a taxonomy for DOM can provide structured reasoning about DOM tasks. We evaluate our approach on three challenging DOM tasks: towel twisting, meat phantom transport, and cloth edge tracing. Our results demonstrate the potential of taxonomy-guided VLMs to interpret these tasks without fine-tuning or curated datasets.
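For intuition, the sketch below shows what taxonomy-guided VLM prompting can look like in general: a small set of taxonomy axes is placed in the system prompt so the model classifies a scene along fixed dimensions rather than describing it freely. This is a minimal illustration, not the paper's method; the taxonomy axes, labels, prompt wording, and model name are all assumptions, and an OpenAI-compatible chat API is assumed for the query.

```python
# Minimal sketch of taxonomy-guided VLM prompting (illustrative only).
# The taxonomy axes and labels below are hypothetical placeholders,
# NOT the taxonomy used in the paper.
import base64
from openai import OpenAI

# Hypothetical DOM taxonomy axes the model is asked to reason over.
TAXONOMY = """Classify the manipulation scene along these axes:
1. Deformation: none / bending / stretching / twisting / crumpling
2. Contact: single-grasp / dual-grasp / environment-supported
3. Motion phase: approach / grasp / in-hand deformation / transport / release
Answer each axis with one label, then justify briefly."""

def describe_scene(image_path: str, task: str) -> str:
    """Query a vision-capable chat model with a taxonomy-structured
    system prompt and a single scene image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable chat model would do
        messages=[
            {"role": "system", "content": TAXONOMY},
            {"role": "user", "content": [
                {"type": "text",
                 "text": f"Task: {task}. Classify the current frame."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ]},
        ],
    )
    return response.choices[0].message.content

# e.g. describe_scene("frame_012.jpg", "towel twisting")
```

Constraining the output to fixed taxonomy labels is what makes the response machine-checkable downstream, in contrast to free-form scene descriptions.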