Gawtam Chithra Ramesh, David Blanco-Mulero, Yifei Dong, Júlia Borràs, Carme Torras, Florian T. Pokorny
Submitted to the ROMADO'25 workshop at IROS
Vision-Language Models (VLMs) can describe scenes in natural language, supporting tasks such as robot planning and action grounding. However, they struggle with deformable object manipulation (DOM), where reasoning about motion, interaction, and deformation is critical. In this work, we investigate whether guiding VLMs with a DOM taxonomy can yield structured reasoning about DOM tasks. We evaluate our approach on three challenging DOM tasks: towel twisting, meat phantom transport, and cloth edge tracing. Our results demonstrate the potential of taxonomy-guided VLMs to interpret these tasks without fine-tuning or curated datasets.
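To make the idea concrete, the sketch below shows one plausible form of taxonomy-guided prompting: the taxonomy's categories are prepended to the query so the VLM's free-form description is scaffolded into structured fields. Everything here is illustrative; the taxonomy excerpt, build_prompt, and query_vlm are hypothetical placeholders, not the authors' implementation or a real API.

# Minimal sketch of taxonomy-guided VLM prompting. All names below
# (DOM_TAXONOMY, build_prompt, query_vlm) are illustrative placeholders,
# not the paper's released code.

DOM_TAXONOMY = """\
DOM taxonomy (illustrative excerpt):
- Contact: grasp points, number of grippers, contact type
- Motion: trajectory primitive (twist, fold, drag, lift, trace)
- Deformation: expected shape change (crease, stretch, none)
"""

def build_prompt(task: str) -> str:
    """Scaffold the VLM's answer with the taxonomy's categories."""
    return (
        DOM_TAXONOMY
        + f"\nTask: {task}\n"
        + "Describe the scene and the required manipulation, "
        + "addressing each taxonomy category in turn."
    )

def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for any image+text chat endpoint; wire a real
    VLM client in here (this stub only raises)."""
    raise NotImplementedError

if __name__ == "__main__":
    # Prints the structured prompt for one of the paper's three tasks.
    print(build_prompt("Twist the towel into a tight roll."))

The point of the scaffold is that the taxonomy constrains the model's otherwise free-form scene description into named categories that downstream planning can parse.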