Hanyi Zhao*, Jinxuan Zhu*, Zihao Yan*, Yichen Li, Yuhong Deng, Xueqian Wang†
Video Overview
Overview
Multi-step cloth manipulation is a challenging problem for robots due to the high-dimensional state space and complex dynamics of cloth. Despite significant recent advances in end-to-end imitation learning of multi-step cloth manipulation skills, these methods fail to generalize to unseen tasks. Our key insight for tackling generalizable multi-step cloth manipulation is decomposition. We propose a novel pipeline that autonomously learns basic skills from long demonstrations and composes the learned basic skills to generalize to unseen tasks. Specifically, our method first discovers and learns basic skills from an existing long-demonstration benchmark using the commonsense knowledge of a large language model (LLM). Then, a high-level LLM-based task planner composes these basic skills to complete unseen tasks. Experimental results demonstrate that our method outperforms baseline methods in learning multi-step cloth manipulation skills on both seen and unseen tasks.
Overall Framework
Our framework consists of three stages.
Stage 1: Skill Discovery from Demonstrations
In the first stage, we perform skill discovery from extensive cloth manipulation demonstrations by feeding language instructions and associated prompts into a large language model (LLM). The decomposed instructions returned by the LLM, coupled with the corresponding depth images, form a language-conditioned basic skill dataset, which is subsequently used to train the basic skills.
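As a rough sketch of what one entry of this dataset might look like (the field names and the Gaussian heatmap rendering below are illustrative assumptions, not the benchmark's actual format), each sample pairs a short pick/place instruction with a depth observation and the annotated target pixel:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class BasicSkillSample:
    """One entry of the language-conditioned basic skill dataset (illustrative fields)."""
    instruction: str          # e.g. "pick the top-left corner of the fabric"
    depth_image: np.ndarray   # H x W depth observation of the workspace
    skill_type: str           # "pick" or "place"
    target_pixel: tuple       # (row, col) annotation used to supervise the heatmap

def make_heatmap_label(sample: BasicSkillSample, sigma: float = 4.0) -> np.ndarray:
    """Render the annotated pixel as a Gaussian heatmap for training."""
    h, w = sample.depth_image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    r, c = sample.target_pixel
    return np.exp(-((ys - r) ** 2 + (xs - c) ** 2) / (2.0 * sigma ** 2))
```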
Stage 2: Skill Learning with Transformer-Based Model
In the second stage, a transformer-based model learns the basic skills from this dataset. The model takes as input the language instruction and the depth image of the current observation, and outputs a heatmap indicating the pick or place position.
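The exact architecture is not detailed here, so the following is only a minimal PyTorch sketch of a language-conditioned heatmap predictor: instruction tokens and depth-image patches are fused by a transformer encoder, and the image tokens are decoded back into a dense per-pixel heatmap. The class name, token sizes, and the simple embedding-based text encoder are all placeholder assumptions.

```python
import torch
import torch.nn as nn

class LanguageConditionedHeatmapNet(nn.Module):
    """Sketch: depth image + instruction tokens -> pick/place heatmap logits."""

    def __init__(self, vocab_size=10000, d_model=128, img_size=128, patch=8):
        super().__init__()
        self.patch = patch
        self.n_patches = (img_size // patch) ** 2
        # Patchify the single-channel depth image into transformer tokens.
        self.patch_embed = nn.Conv2d(1, d_model, kernel_size=patch, stride=patch)
        # Placeholder text encoder; a pretrained language encoder could be used instead.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.n_patches, d_model))
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        # Per-patch logits, later unpatchified into a dense heatmap.
        self.head = nn.Linear(d_model, patch * patch)

    def forward(self, depth, tokens):
        # depth: (B, 1, H, W) depth image; tokens: (B, L) instruction token ids
        b, _, h, w = depth.shape
        img_tokens = self.patch_embed(depth).flatten(2).transpose(1, 2) + self.pos_embed
        txt_tokens = self.text_embed(tokens)
        fused = self.encoder(torch.cat([txt_tokens, img_tokens], dim=1))
        img_out = fused[:, tokens.shape[1]:]          # keep only the image tokens
        logits = self.head(img_out)                   # (B, n_patches, patch*patch)
        grid = h // self.patch
        heatmap = logits.view(b, grid, grid, self.patch, self.patch)
        heatmap = heatmap.permute(0, 1, 3, 2, 4).reshape(b, h, w)
        return heatmap                                # per-pixel pick/place logits
```

In practice, such a network can be supervised with a pixel-wise loss (e.g., binary cross-entropy) against a Gaussian heatmap centered on the annotated pick or place position.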
Stage 3: Task Planning and Composition
In the final stage, an LLM-based task planner composes the learned basic skills to address unseen multi-step manipulation tasks.
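As an illustration of how such a planner can be prompted (the prompt wording and the query_llm callable are hypothetical, not the system's actual interface), the planner receives the task instruction together with the list of available basic skills and returns an ordered list of basic-skill instructions:

```python
# Sketch of an LLM-based task planner that composes learned basic skills.
# `query_llm` is a placeholder for whichever LLM API is actually used.
BASIC_SKILLS = [
    "pick <part of the garment>",
    "place <target location on the workspace>",
]

PLANNER_PROMPT = """You are a planner for cloth manipulation.
Available basic skills:
{skills}
Decompose the task into an ordered list of basic-skill instructions, one per line.
Task: {task}"""

def plan_task(task: str, query_llm) -> list[str]:
    prompt = PLANNER_PROMPT.format(skills="\n".join(BASIC_SKILLS), task=task)
    response = query_llm(prompt)
    # Each returned line becomes one language instruction for the skill policy.
    return [line.strip() for line in response.splitlines() if line.strip()]
```

Each planned instruction is then passed to the learned skill model together with the current depth observation to obtain the next pick or place position.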
LLM-Based Task Planner for Extracting Fabric Manipulation Skills
Our framework leverages an LLM to extract fundamental skills from extensive demonstrations. The system receives a language instruction along with the corresponding action steps and textual prompts as input. The LLM then automatically decomposes the task into a sequence of "pick" and "place" instructions. Each of these instructions is paired with an observation image, creating a language-conditioned basic skill dataset that supports subsequent skill learning.
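A minimal sketch of this dataset-construction step, assuming demonstrations are stored as (long instruction, depth frames, action steps) tuples and that decompose_with_llm wraps the LLM call described above (both assumptions are illustrative, not the actual data format):

```python
def build_skill_dataset(demos, decompose_with_llm):
    """Pair LLM-decomposed pick/place instructions with per-step depth observations."""
    dataset = []
    for instruction, depth_frames, action_steps in demos:
        # The LLM splits the long instruction into one pick/place sub-instruction per step.
        sub_instructions = decompose_with_llm(instruction, action_steps)
        for sub_instr, frame in zip(sub_instructions, depth_frames):
            dataset.append({"instruction": sub_instr, "depth_image": frame})
    return dataset
```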
By exploiting the diverse range of tasks found in long demonstrations, our approach acquires a variety of fabric manipulation skills, covering the majority of "pick" and "place" actions on the different parts of the fabric. This comprehensive skill set forms a robust foundation for generalizable manipulation, enabling the system to handle a wide range of fabric handling scenarios.
Videos of robot executions in the real-world experiments:
Task 1: Fabric Fold - "Fold four corners of the fabric towards the center."
Task 2: Fabric Fold - "Fold the fabric in half from up to down."
Task 3: Trousers Fold - "Fold the trousers in half from up to down."
Task 4: T-Shirt Fold - "Fold the two sleeves of the T-shirt towards the middle part. Then fold it in half from up to down."
If you have any questions, please feel free to contact us at:
landertelon@gmail.com