Overview
Framework
We design a unified Transformer-based architecture that takes multi-modal data as input and outputs pick-and-place actions together with a task-completion prediction. We introduce a visible connectivity graph to handle the complex configurations and dynamics of deformable objects.
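To make the idea of a connectivity graph over visible points concrete, here is a minimal sketch. This page does not specify how the graph is built, so the construction is an assumption: it simply links any two visible surface points that lie within a fixed radius of each other. The function name `build_connectivity_graph`, the `radius` parameter, and the sample points are all hypothetical and are not taken from the authors' implementation.

```python
import numpy as np


def build_connectivity_graph(points, radius):
    """Connect every pair of visible points closer than `radius`.

    Assumption: a plain distance threshold stands in for whatever rule
    the actual method uses to decide which points are connected.

    points: (N, 3) array of visible surface points (hypothetical input).
    Returns an (E, 2) integer array of undirected edges (i, j) with i < j.
    """
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances, shape (N, N).
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Look only above the diagonal so each unordered pair appears once
    # and no point is connected to itself.
    i, j = np.nonzero(np.triu(dist < radius, k=1))
    return np.stack([i, j], axis=1)


# Four points on a flat cloth patch; the last one is far from the others.
pts = np.array([
    [0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0],
    [0.0, 0.1, 0.0],
    [1.0, 1.0, 0.0],
])
edges = build_connectivity_graph(pts, radius=0.15)
print(edges)
```

With these sample points, the three nearby points become mutually connected and the distant point stays isolated. A graph like this could then serve as structured input to a model alongside the language instruction, though how the two are fused is not described here.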
Examples of language-conditioned deformable object manipulation tasks
Seen instructions, unseen instructions, and unseen tasks are marked in black, grey, and red, respectively.
Videos of robot executions in the real-world experiments
Task: corner folding
corner2_3x.mp4, corner1_3x.mp4
Task: triangle folding
tri1_3x.mp4, tri2_3x.mp4
Task: half folding
half1_3x.mp4, half2_3x.mp4
Task: T-shirt folding
tshirt2_3x.mp4, tshirt1_3x.mp4
Task: trousers folding
trousers1_3x.mp4, trousers2_3x.mp4
If you have any questions, please feel free to contact us via:
mok21@mails.tsinghua.edu.cn
yuhongdeng@u.nus.edu