Mobile manipulation is a critical capability for robots operating in diverse, real-world environments. However, manipulating deformable objects and materials remains a major challenge for existing robot learning algorithms. While various benchmarks have been proposed to evaluate manipulation strategies with rigid objects, there is still a notable lack of standardized benchmarks that address mobile manipulation tasks involving deformable objects. To address this gap, we introduce MoDeSuite, the first Mobile Manipulation Deformable Object task suite, designed specifically for robot learning. MoDeSuite consists of eight distinct mobile manipulation tasks covering both elastic and deformable objects, each presenting a unique challenge inspired by real-world robot applications. Success in these tasks requires effective collaboration between the robot's base and manipulator, as well as the ability to exploit the deformability of the objects. To evaluate and demonstrate the use of the proposed benchmark, we train agents with two state-of-the-art reinforcement learning algorithms and two imitation learning algorithms, report their performance in simulation, and highlight the difficulties they encounter. Furthermore, we demonstrate the practical relevance of the suite by deploying the trained policies directly in the real world on the Spot robot, showcasing the potential for sim-to-real transfer. We expect that MoDeSuite will open up a new research direction in mobile manipulation involving deformable objects.
The framework includes a variety of environments and two robotic platforms. The robot perceives the environment through multimodal observations, including RGB and depth images, proprioceptive states, and object-specific information. Based on these inputs, it generates actions to interact with the environment. The interaction dynamics are powered by the NVIDIA PhysX engine, which provides accurate modeling of rigid- and deformable-body physics. This enables realistic simulation of complex contact interactions and collisions, which is critical for tasks involving deformable materials.
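As a rough sketch of this observation/action loop, the interface might look as follows. All class and field names here are illustrative assumptions, not the actual MoDeSuite API, and the toy environment stubs out the PhysX dynamics:

```python
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative sketch of the multimodal observation interface described
# above; names and field shapes are assumptions, not the MoDeSuite API.

@dataclass
class MultimodalObs:
    rgb: List[List[List[float]]]   # H x W x 3 camera image
    depth: List[List[float]]       # H x W depth map
    proprio: List[float]           # joint states and base pose
    object_state: List[float]      # object-specific info (e.g. rod keypoints)

class ToyDeformableEnv:
    """Minimal environment loop: the agent receives a MultimodalObs and
    returns an action; in the real suite, dynamics come from PhysX."""

    def reset(self) -> MultimodalObs:
        return MultimodalObs(
            rgb=[[[0.0] * 3] * 4] * 4,   # tiny 4x4 placeholder image
            depth=[[1.0] * 4] * 4,
            proprio=[0.0] * 7,           # e.g. a 7-DoF arm
            object_state=[0.0] * 6,      # e.g. rod endpoint + midpoint
        )

    def step(self, action: List[float]) -> Tuple[MultimodalObs, float, bool]:
        obs = self.reset()               # placeholder dynamics
        reward, done = 0.0, False
        return obs, reward, done
```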
Task Table

Place: Put the long elastic rod on the table. Metric: the distance between the midpoint of the table surface and the rod's endpoint.

Bend: Bend the rod against the wall to pass through the corner. Metric: the distance between the red target cube and the rod's endpoint.

Transport: Transport the rod to the target point, passing the obstacle in the middle of the corridor. Metric: the distance between the target point and the rod's endpoint.

Drag: Drag the rod to the other side of the obstacle. Metric: the distance between the midpoint of the rod and the purple target point.

Lift: The belt hangs like a curtain; the robot must lift the belt and then reach the target point. Metric: the distance between the midpoint of the belt and its target point, plus the distance from the robot to the final target point.

Uncover: The robot must approach the table and remove the table cover. Success: the cover is removed and its handle has been pulled beyond the other side of the table.
Cover: The robot must approach the table and spread the cover over it. Success: the table is covered, with the cover's handle pulled across to the other side of the table.
Curtain: The robot must move the cloth aside and then navigate its body past the hanger without collision. Success: the robot passes through the curtain without collision.
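The distance-based metrics above can be turned into success checks by thresholding. A minimal sketch, where the 0.05 m threshold is an assumed value for illustration (the suite's actual thresholds are task-specific):

```python
import math

# Sketch of the distance-based metrics used by the rod and belt tasks.
# The 0.05 threshold is an illustrative assumption, not the suite's value.

def euclidean(p, q):
    """Euclidean distance between two 3-D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def place_success(rod_endpoint, table_midpoint, threshold=0.05):
    """Place task: rod endpoint must end up near the table surface midpoint."""
    return euclidean(rod_endpoint, table_midpoint) < threshold

def lift_success(belt_midpoint, belt_target, robot_pos, robot_target,
                 threshold=0.05):
    """Lift task combines two distances: belt midpoint to its target,
    and robot base to the final target point."""
    return (euclidean(belt_midpoint, belt_target) < threshold
            and euclidean(robot_pos, robot_target) < threshold)
```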
In our experiments, we aim to systematically evaluate the effectiveness of our task suite in training agents capable of performing mobile manipulation tasks involving deformable objects. Specifically, we study: (1) the ability of agents to learn from interaction and demonstration in simulation, (2) the impact of input modality (state-based versus image-based perception) on learning, and (3) the zero-shot transferability of learned policies from simulation to the real world without fine-tuning. Our experimental design incorporates a variety of observation types (state and image), learning paradigms (reinforcement learning and imitation learning), and evaluation metrics to assess both learning efficacy and sim-to-real performance gaps.
We evaluate the performance of the SAC and PPO algorithms on five MoDeSuite tasks: Place, Bend, Transport, Lift, and Drag. The experiments use state-based observations and are conducted on two robot platforms, Franka and Spot. Success rates from 20 evaluation trials are visualized in the bar plots. These results demonstrate that both robot morphology and algorithm choice significantly impact task performance.
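The evaluation protocol above (success rate over 20 trials) can be sketched as follows. The toy 1-D environment and the proportional-controller policy are illustrative stand-ins, not the actual MoDeSuite tasks or the SAC/PPO agents:

```python
import random

# Sketch of the 20-trial evaluation protocol; ToyEnv and the policies
# below are placeholders, not the benchmark's tasks or trained agents.

class ToyEnv:
    """Succeeds when the (1-D) rod endpoint is driven within 0.1 of the goal."""
    def __init__(self, goal=1.0, seed=0):
        self.goal = goal
        self.rng = random.Random(seed)

    def reset(self):
        self.x = self.rng.uniform(-1.0, 1.0)
        return self.x

    def step(self, action):
        self.x += action
        success = abs(self.x - self.goal) < 0.1
        return self.x, float(success), success, success

def evaluate(policy, env, n_trials=20, max_steps=50):
    """Roll out `policy` for n_trials episodes; return the success rate."""
    successes = 0
    for _ in range(n_trials):
        obs = env.reset()
        for _ in range(max_steps):
            obs, reward, done, success = env.step(policy(obs))
            if done:
                successes += int(success)
                break
    return successes / n_trials

# A proportional controller halves the gap to the goal each step,
# so it solves the toy task on every trial.
rate = evaluate(lambda x: 0.5 * (1.0 - x), ToyEnv())
```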
This section presents the success rates of Behavior Cloning and Retrieval-based methods across three challenging deformable mobile manipulation tasks: Uncover, Cover, and Curtain. All models are trained on a full dataset of 30 demonstrations and evaluated over 20 trials per task. The results provide insights into each method's generalization capabilities and effectiveness in handling realistic deformable-object interactions.
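A retrieval-based policy in the spirit of the methods above can be sketched as a nearest-neighbor lookup over demonstration (state, action) pairs. The tiny state vectors and action labels here are placeholders, not the benchmark's actual demonstration format:

```python
import math

# Sketch of a retrieval-based policy: at test time, look up the nearest
# demonstration state and replay its action. States/actions are toy
# placeholders, not the benchmark's demonstration format.

class NearestNeighborPolicy:
    """Stores (state, action) pairs from demonstrations and returns the
    action of the stored state closest to the query."""

    def __init__(self, demos):
        # demos: list of (state, action) tuples with list-valued states
        self.demos = list(demos)

    def __call__(self, state):
        def dist(s):
            return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, state)))
        nearest_state, action = min(self.demos, key=lambda sa: dist(sa[0]))
        return action
```

A real variant would retrieve from image or keypoint embeddings rather than raw low-dimensional states, but the lookup structure is the same.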
To assess the practical applicability of policies trained in MoDeSuite, we transfer the learned models to physical hardware using the Boston Dynamics Spot robot. We focus on three representative tasks: Place, Drag, and Curtain. The Place and Drag tasks, which involve manipulating elastic objects, show promising results with effective sim-to-real transfer. However, the Curtain task exposes a significant sim-to-real gap, highlighting the ongoing challenges of visual domain generalization in real-world deployment.