Cloth folding is a challenging problem in robot manipulation, requiring robots to fold diverse fabrics into different configurations according to human intentions. Previous work in this area falls into three primary categories: imitation learning, reinforcement learning, and geometric model-based planning. While each paradigm has its merits, they generally lack inherent multi-step reasoning ability and struggle to generalize to novel cloth appearances and tasks. To tackle these problems, our key insight is to incorporate the common-sense reasoning and generalization abilities of Large Language Models (LLMs) into cloth manipulation, while addressing the limitations of LLMs in manipulating deformable objects through an effective grounding module and a rational planning hierarchy. To this end, we present PolyFold, a novel language-conditioned bimanual cloth folding framework that leverages a parameterized polygon model as an effective abstraction and grounding module for cloth representation. Moreover, PolyFold enables LLMs to infer an intermediate-level action, namely the symmetrical fold line, while delegating the pick-and-place calculations to a fold-line-guided downstream policy learned through self-supervision on randomly generated data. Experiments on 70 cloth folding tasks and 4 cloth types show that PolyFold excels in zero-shot generalization and inherent multi-step reasoning, while operating in a sample-efficient, expert-demonstration-free manner, surpassing previous state-of-the-art vision-conditioned and language-conditioned methods. Our method can also be directly deployed in real-world scenarios.
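To make the planning hierarchy described above concrete, below is a minimal sketch of a single folding step, assuming the grounding module, the LLM query, and the fold-line-guided policy are supplied as callables; all names here are illustrative placeholders and not the released PolyFold code.

```python
# Minimal sketch of one folding step in the hierarchy described above.
# All names (ground_to_polygon, infer_fold_line, fold_line_policy) are
# hypothetical placeholders for exposition, not the released PolyFold API.
from typing import Callable, Sequence, Tuple

import numpy as np

Point = Tuple[float, float]          # 2D point in image coordinates
FoldLine = Tuple[Point, Point]       # symmetrical fold line as two endpoints


def polyfold_step(
    rgb: np.ndarray,
    instruction: str,
    ground_to_polygon: Callable[[np.ndarray], Sequence[Point]],
    infer_fold_line: Callable[[str, Sequence[Point]], FoldLine],
    fold_line_policy: Callable[[np.ndarray, FoldLine], Tuple[Sequence[Point], Sequence[Point]]],
) -> Tuple[Sequence[Point], Sequence[Point]]:
    # 1. Grounding: abstract the observed cloth into a parameterized polygon,
    #    i.e. an ordered list of keypoints (corners, sleeve tips, ...).
    polygon = ground_to_polygon(rgb)

    # 2. Reasoning: the LLM reads the instruction together with the polygon
    #    parameters and outputs an intermediate-level action, the fold line.
    fold_line = infer_fold_line(instruction, polygon)

    # 3. Execution: the fold-line-guided policy (trained with self-supervision
    #    on random folds) converts the fold line into bimanual pick and place
    #    points; multi-step tasks repeat this loop once per sub-fold.
    picks, places = fold_line_policy(rgb, fold_line)
    return picks, places
```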
Here we show real-world experiments across different task types. Tasks are denoted as <Cloth Type>-<Folding Type>-Folding, where the cloth types 'S', 'R', 'T', and 'P' refer to square, rectangle, t-shirt, and pants cloth, respectively. None of the evaluated cloth objects or tasks had been seen before.
S-Corner-Folding
"Fold all corners of the square cloth into the center."
"Bring all corners of the square towards the center."
"Fold both the top right and bottom left corners of the square towards the center."
S-Triangle-Folding
"Fold the square into a shape whose area is one fourth of its original area. The achieved shape is a triangle."
"Converge the top-right corner towards bottom-left corner and then bring the top-left corner down to meet bottom-right corner."
"Bring the bottom-right corner up to meet the top-left corner, and then fold the bottom-left corner up to meet the top-right corner."
R-Edge-to-Middle-Folding
"Fold the bottom edge of the rectangle cloth upward to the horizontal middle line, and then repeat the same for the top edge."
"Bring the top edge of the rectangle cloth downward to the horizontal middle line, and then repeat the same for the bottom edge."
"Position the right edge of the rectangular cloth leftwards, meeting it with the vertical center line, then do the same for left edge."
R-Edge-to-Opposite-Folding
"Fold the rectangular cloth in half from bottom to top and then fold it in half from left to right."
"Take the top edge of the rectangle and fold it towards the bottom edge, then fold the left edge towards the right edge."
*Square cloth is a special type of rectangular cloth, so here we also use some square cloths for evaluating the rectangle folding tasks.
"Fold the square so that it remains a square, but with side lengths half of the original."
T-Sleeve-Folding & T-Half-Folding
"Converge the left and right sleeves of t-shirt in half, letting the sleeve edges meet the armpit-shoulder lines."
"Fold the sleeves of the t-shirt inward. However, the sleeves are too long, you cannot fold them inward directly as they will exceed the main body of the garment. "
"Fold the t-shirt in half from left to right."
T-Block-Folding
"Organize this t-shirt by folding it into a rectangular block in three steps. "
"Fold the t-shirt into a neat rectangle. "
"Fold both sleeves of the t-shirt towards the center, followed by folding the t-shirt in half from bottom to top."
P-Half-Folding & P-Block-Folding
"Fold the pant into a rectangular block in two steps. "
"Fold the right leg of the pant in half from bottom to top. Then do the same for the left leg."
*As the length of the pant leg exceeds the workspace of the ABB robot, we place the pants as shown. Before feeding the image to the algorithm, we rotate it 90 degrees clockwise so that it matches the language description.
"Fold the pant in half from left to right and then fold the bottom edge of the pant upwards to meet the top."
Figure: The 70 evaluated tasks in the SoftGym simulator.
VCD uses a mesh edge prediction model to recover a mesh from the current point cloud observation. Given the mesh structure and an applied pick-and-place action, its dynamics model then predicts the future state. Here we visualize VCD's mesh edge prediction and dynamics model on four types of cloth under various initial states.
Current Observation and Pick-and-Place Action Applied
Predicted Mesh Edges
Predicted Result by Dynamics Model
Ground Truth Execution Result in Simulation
From the visualizations we can see that the VCD model achieves relatively acceptable results on simple cloth shapes, such as square cloth starting from a flattened state, but its dynamics model produces poor predictions on complex shapes and already-folded cloths. Moreover, when performing random-shooting-based planning with the dynamics model, the action space is very large and it is hard to find an accurate action. These two factors account for VCD's poor performance in the comparison experiments.
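To make the action-space issue concrete, below is a minimal sketch of random-shooting planning over pick-and-place actions with a learned dynamics model; `dynamics_model`, `goal_cost`, and the sampling ranges are assumptions for illustration, not VCD's actual implementation.

```python
# Illustrative random-shooting planner over pick-and-place actions.
# dynamics_model and goal_cost are placeholders, not VCD's released code.
import numpy as np


def random_shooting_plan(point_cloud, dynamics_model, goal_cost,
                         num_samples=500,
                         workspace=((-0.5, 0.5), (-0.5, 0.5)),
                         rng=None):
    """Sample candidate actions uniformly and keep the one whose predicted
    outcome scores best under the goal cost."""
    rng = np.random.default_rng() if rng is None else rng
    best_action, best_cost = None, np.inf
    for _ in range(num_samples):
        # Pick one observed cloth point; place anywhere in the workspace.
        pick = point_cloud[rng.integers(len(point_cloud))]
        place = np.array([rng.uniform(*workspace[0]),
                          rng.uniform(*workspace[1])])
        predicted_state = dynamics_model(point_cloud, pick, place)
        cost = goal_cost(predicted_state)
        if cost < best_cost:
            best_action, best_cost = (pick, place), cost
    # Even hundreds of samples cover a continuous pick-and-place space only
    # sparsely, so the selected action is often imprecise.
    return best_action, best_cost
```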
Wang et al. propose a method that converts the dense point clouds of the current observation and the goal state into simplified cloth meshes and uses the vertices of these simplified meshes as a reduced action space, in which model-based planning is then performed. Here we visualize the simplified meshes for the current observation and the goal configuration. The red points in the third column serve as the action space for pick points, and the blue points in the fourth column serve as the action space for place points.
Current Observation
Goal State
Simplified Mesh (Current Obs)
Simplified Mesh (Goal)
From the visualizations we can see that the mesh simplification of Wang et al. works relatively well on simple tasks with square and rectangular cloths. However, for complex shapes such as t-shirts or pants, the simplification fails, and the resulting incorrect action space leads to failures in model-based planning.
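For reference, here is a minimal sketch of planning over this reduced action space, assuming the simplified-mesh vertices, a dynamics model, and a goal cost are given; the names are placeholders rather than Wang et al.'s actual code. It also illustrates why an incorrect simplified mesh dooms the search: every candidate action is drawn from the wrong vertex sets.

```python
# Illustrative planner over the reduced action space described above:
# pick candidates come from the simplified mesh of the current observation,
# place candidates from the simplified mesh of the goal. dynamics_model and
# goal_cost are placeholders, not Wang et al.'s released code.
import itertools

import numpy as np


def plan_over_simplified_meshes(current_vertices, goal_vertices,
                                dynamics_model, goal_cost):
    """Score every (pick, place) pair of simplified-mesh vertices and return
    the best one. If the simplification is wrong, the search is confined to
    the wrong candidates no matter how good the dynamics model is."""
    best_action, best_cost = None, np.inf
    for pick, place in itertools.product(current_vertices, goal_vertices):
        predicted_state = dynamics_model(current_vertices, pick, place)
        cost = goal_cost(predicted_state, goal_vertices)
        if cost < best_cost:
            best_action, best_cost = (np.asarray(pick), np.asarray(place)), cost
    return best_action, best_cost
```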