Overview
Multi-task learning of deformable object manipulation is a challenging problem in robot manipulation. Most previous works address it in a goal-conditioned way, adopting goal images to specify different tasks, which limits multi-task learning performance and cannot generalize to new tasks. We therefore adopt language instructions to specify deformable object manipulation tasks and propose a learning framework. We first design a unified Transformer-based architecture to understand multi-modal data and output picking and placing actions. In addition, we introduce the visible connectivity graph to tackle the nonlinear dynamics and complex configurations of deformable objects. Both simulated and real experiments demonstrate that the proposed method is effective and generalizes to unseen instructions and tasks. Compared with the state-of-the-art method, our method achieves higher success rates (87.2% on average) and a 75.6% shorter inference time. We also demonstrate that our method performs well in real-world experiments.
Framework
We design a unified Transformer-based model architecture that understands the multi-modal data and outputs picking and placing actions together with a task-completion prediction. We introduce a visible connectivity graph to tackle the complex configurations and dynamics of deformable objects.
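To make the idea of a graph over the visible part of a deformable object concrete, here is a minimal sketch. It is not the construction used in this work, which is not described on this page. It assumes the object is sampled as 3D points with a per-point visibility flag, and it uses a simple distance threshold as a stand-in rule for connecting points. The function name `build_visible_graph` and the parameter `radius` are hypothetical.

```python
from itertools import combinations
from math import dist


def build_visible_graph(points, visible, radius):
    """Build a graph over the visible points only (illustrative heuristic).

    points:  list of (x, y, z) tuples sampled on the object (assumed input)
    visible: list of bools, True where the point is seen by the camera
    radius:  assumed distance threshold for connecting two visible points
    """
    # Nodes are the indices of points that are currently visible.
    nodes = [i for i, is_visible in enumerate(visible) if is_visible]
    # Connect every pair of visible points that lie within `radius`.
    edges = [
        (i, j)
        for i, j in combinations(nodes, 2)
        if dist(points[i], points[j]) <= radius
    ]
    return nodes, edges


# Toy example: four points, one of which is occluded.
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.0), (1.0, 1.0, 0.0)]
visible = [True, True, False, True]
nodes, edges = build_visible_graph(points, visible, radius=0.15)
print(nodes, edges)
```

In this toy case the occluded point is excluded from the graph, and only the two nearby visible points are connected. A learned edge predictor could replace the distance rule without changing the overall structure.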
Examples of language-conditioned deformable object manipulation tasks
Seen instructions, unseen instructions, and unseen tasks are marked in black, grey, and red, respectively.
Videos of robot executions in the real-world experiments
Task: corner folding
corner2_3x.mp4
corner1_3x.mp4
Task: triangle folding
tri1_3x.mp4
tri2_3x.mp4
Task: half folding
half1_3x.mp4
half2_3x.mp4
Task: T-shirt folding
tshirt2_3x.mp4
tshirt1_3x.mp4
Task: trousers folding
trousers1_3x.mp4
trousers2_3x.mp4
If you have any questions, please feel free to contact us via email:
mok21@mails.tsinghua.edu.cn
yuhongdeng@u.nus.edu