Hanyi Zhao*, Jinxuan Zhu*, Zihao Yan*, Yichen Li, Yuhong Deng, Xueqian Wang†
Video Overview
Overview
Multi-step cloth manipulation is a challenging problem for robots due to the high-dimensional state space and complex dynamics of cloth. Despite significant recent advances in end-to-end imitation learning of multi-step cloth manipulation skills, these methods fail to generalize to unseen tasks. Our key insight for tackling generalizable multi-step cloth manipulation is decomposition. We propose a novel pipeline that autonomously learns basic skills from long demonstrations and composes the learned basic skills to generalize to unseen tasks. Specifically, our method first discovers and learns basic skills from an existing long-demonstration benchmark using the commonsense knowledge of a large language model (LLM). Then, a high-level LLM-based task planner composes these basic skills to complete unseen tasks. Experimental results demonstrate that our method outperforms baseline methods in learning multi-step cloth manipulation skills on both seen and unseen tasks.
Overall Framework
Our framework consists of three stages.
Stage 1: Skill Discovery from Demonstrations
In the first stage, we perform skill discovery from extensive cloth manipulation demonstrations by feeding language instructions and associated prompts into a large language model (LLM). The decomposed instructions returned by the LLM, coupled with the corresponding depth images, form a language-conditioned basic skill dataset, which is subsequently used to train the basic skills.
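As a rough sketch of what one entry of this dataset might look like (the field names and the Gaussian heatmap rendering below are illustrative assumptions, not the benchmark's actual format), each sample pairs a short pick/place instruction with a depth observation and the annotated target pixel:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class BasicSkillSample:
    """One entry of the language-conditioned basic skill dataset (illustrative fields)."""
    instruction: str          # e.g. "pick the top-left corner of the fabric"
    depth_image: np.ndarray   # H x W depth observation of the workspace
    skill_type: str           # "pick" or "place"
    target_pixel: tuple       # (row, col) annotation used to supervise the heatmap

def make_heatmap_label(sample: BasicSkillSample, sigma: float = 4.0) -> np.ndarray:
    """Render the annotated pixel as a Gaussian heatmap for training."""
    h, w = sample.depth_image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    r, c = sample.target_pixel
    return np.exp(-((ys - r) ** 2 + (xs - c) ** 2) / (2.0 * sigma ** 2))
```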
Stage 2: Skill Learning with Transformer-Based Model
In the second stage, a transformer-based model learns the basic skills from this dataset. The model takes as input the language instruction and the depth image of the current observation, and outputs a heatmap indicating the pick or place position.
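The exact architecture is not detailed here, so the following is only a minimal PyTorch sketch of a language-conditioned heatmap predictor: instruction tokens and depth-image patches are fused by a transformer encoder, and the image tokens are decoded back into a dense per-pixel heatmap. The class name, token sizes, and the simple embedding-based text encoder are all placeholder assumptions.

```python
import torch
import torch.nn as nn

class LanguageConditionedHeatmapNet(nn.Module):
    """Sketch: depth image + instruction tokens -> pick/place heatmap logits."""

    def __init__(self, vocab_size=10000, d_model=128, img_size=128, patch=8):
        super().__init__()
        self.patch = patch
        self.n_patches = (img_size // patch) ** 2
        # Patchify the single-channel depth image into transformer tokens.
        self.patch_embed = nn.Conv2d(1, d_model, kernel_size=patch, stride=patch)
        # Placeholder text encoder; a pretrained language encoder could be used instead.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.n_patches, d_model))
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        # Per-patch logits, later unpatchified into a dense heatmap.
        self.head = nn.Linear(d_model, patch * patch)

    def forward(self, depth, tokens):
        # depth: (B, 1, H, W) depth image; tokens: (B, L) instruction token ids
        b, _, h, w = depth.shape
        img_tokens = self.patch_embed(depth).flatten(2).transpose(1, 2) + self.pos_embed
        txt_tokens = self.text_embed(tokens)
        fused = self.encoder(torch.cat([txt_tokens, img_tokens], dim=1))
        img_out = fused[:, tokens.shape[1]:]          # keep only the image tokens
        logits = self.head(img_out)                   # (B, n_patches, patch*patch)
        grid = h // self.patch
        heatmap = logits.view(b, grid, grid, self.patch, self.patch)
        heatmap = heatmap.permute(0, 1, 3, 2, 4).reshape(b, h, w)
        return heatmap                                # per-pixel pick/place logits
```

In practice, such a network can be supervised with a pixel-wise loss (e.g., binary cross-entropy) against a Gaussian heatmap centered on the annotated pick or place position.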
Stage 3: Task Planning and Composition
In the final stage, an LLM-based task planner composes the learned basic skills to address unseen multi-step manipulation tasks.
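As an illustration of how such a planner can be prompted (the prompt wording and the query_llm callable are hypothetical, not the system's actual interface), the planner receives the task instruction together with the list of available basic skills and returns an ordered list of basic-skill instructions:

```python
# Sketch of an LLM-based task planner that composes learned basic skills.
# `query_llm` is a placeholder for whichever LLM API is actually used.
BASIC_SKILLS = [
    "pick <part of the garment>",
    "place <target location on the workspace>",
]

PLANNER_PROMPT = """You are a planner for cloth manipulation.
Available basic skills:
{skills}
Decompose the task into an ordered list of basic-skill instructions, one per line.
Task: {task}"""

def plan_task(task: str, query_llm) -> list[str]:
    prompt = PLANNER_PROMPT.format(skills="\n".join(BASIC_SKILLS), task=task)
    response = query_llm(prompt)
    # Each returned line becomes one language instruction for the skill policy.
    return [line.strip() for line in response.splitlines() if line.strip()]
```

Each planned instruction is then passed to the learned skill model together with the current depth observation to obtain the next pick or place position.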
LLM-Based Task Planner for Extracting Fabric Manipulation Skills
Our framework leverages an LLM to extract fundamental skills from extensive demonstrations. The system receives a language instruction along with the corresponding action steps and textual prompts as input. The LLM then automatically decomposes the task into a sequence of "pick" and "place" instructions. Each of these instructions is paired with an observation image, creating a language-conditioned basic skill dataset that supports subsequent skill learning.
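A minimal sketch of this dataset-construction step, assuming demonstrations are stored as (long instruction, depth frames, action steps) tuples and that decompose_with_llm wraps the LLM call described above (both assumptions are illustrative, not the actual data format):

```python
def build_skill_dataset(demos, decompose_with_llm):
    """Pair LLM-decomposed pick/place instructions with per-step depth observations."""
    dataset = []
    for instruction, depth_frames, action_steps in demos:
        # The LLM splits the long instruction into one pick/place sub-instruction per step.
        sub_instructions = decompose_with_llm(instruction, action_steps)
        for sub_instr, frame in zip(sub_instructions, depth_frames):
            dataset.append({"instruction": sub_instr, "depth_image": frame})
    return dataset
```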
By exploiting the diverse range of tasks found in long demonstrations, our approach acquires a variety of fabric manipulation skills, covering the majority of "pick" and "place" actions on the different parts of the fabric. This comprehensive skill set forms a robust foundation for generalizable manipulation, enabling the system to handle a wide range of fabric handling scenarios.
Videos of robot executions in the real-world experiments:
Task 1: Fabric Fold - "Fold four corners of the fabric towards the center."
Task 2: Fabric Fold - "Fold the fabric in half from up to down."
Task 3: Trousers Fold - "Fold the trousers in half from up to down."
Task 4: T-Shirt Fold - "Fold the two sleeves of the T-shirt towards the middle part. Then fold it in half from up to down."
If you have any questions, please feel free to contact us at:
landertelon@gmail.com