The food scooping task demands fine-grained control, as even small deviations can result in spillage. GRITS addresses this challenge by leveraging predicted spillage probabilities to adaptively refine trajectories, leading to safer and more reliable manipulation.
We collect 80 real-world expert demonstrations across diverse food types and quantities to train the diffusion scooping policy. To train the spillage predictor without the labor-intensive cleanup that real-world data collection would require, we collect diverse spillage and non-spillage cases in simulation, covering four primitive shapes (sphere, cube, cone, and cylinder) and diverse physical properties.
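As a rough illustration of how simulated scenes with varied shapes and physical properties might be sampled, the sketch below draws random food configurations. Only the four shape names come from the text; the property names (`size_m`, `friction`, `density_kg_m3`, `count`), their ranges, and the helper functions are hypothetical and do not describe the actual simulation setup.

```python
import random
from dataclasses import dataclass

# The four primitive shapes named in the text.
SHAPES = ("sphere", "cube", "cone", "cylinder")


@dataclass(frozen=True)
class SimFoodConfig:
    """One simulated scene description. All fields except `shape` are hypothetical."""
    shape: str
    size_m: float          # characteristic size in metres (assumed range)
    friction: float        # friction coefficient (assumed range)
    density_kg_m3: float   # density (assumed range)
    count: int             # number of food pieces in the bowl (assumed range)


def sample_config(rng: random.Random) -> SimFoodConfig:
    """Draw one configuration uniformly from the assumed ranges."""
    return SimFoodConfig(
        shape=rng.choice(SHAPES),
        size_m=rng.uniform(0.01, 0.03),
        friction=rng.uniform(0.2, 1.0),
        density_kg_m3=rng.uniform(300.0, 1200.0),
        count=rng.randint(5, 40),
    )


def sample_dataset(n: int, seed: int = 0) -> list:
    """Sample n configurations reproducibly from a seeded generator."""
    rng = random.Random(seed)
    return [sample_config(rng) for _ in range(n)]


configs = sample_dataset(200)
print(configs[0])
```

Each sampled configuration would then be rolled out in the simulator with candidate scooping trajectories and labeled as spillage or non-spillage; that labeling step is outside this sketch.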
Given an RGB-D image and an initial noisy trajectory, the diffusion policy denoises the trajectory into a refined one. A spillage predictor, which takes segmented point clouds as input to reduce the sim-to-real gap, estimates the probability of spillage for a given candidate trajectory. This probability provides a guidance signal that steers the denoising process toward safer trajectories. The robot then follows the refined trajectory using position control to scoop food items.
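The interplay between denoising and guidance described above can be sketched with a toy example. This is a minimal sketch under strong assumptions, not the GRITS implementation: the trajectory is a short list of scalar waypoints, `toy_denoiser` stands in for the learned diffusion policy, `toy_spillage_probability` stands in for the learned point-cloud predictor, and guidance is assumed to be applied as a gradient step that lowers the predicted spillage probability after each denoising step. All function names and constants are hypothetical.

```python
import math
import random


def toy_spillage_probability(trajectory, safe_value=0.0, sharpness=4.0):
    """Hypothetical predictor: risk grows with mean squared distance from a safe value."""
    deviation = sum((w - safe_value) ** 2 for w in trajectory) / len(trajectory)
    return 1.0 - math.exp(-sharpness * deviation)


def numerical_gradient(fn, trajectory, eps=1e-4):
    """Central finite-difference gradient of fn with respect to each waypoint."""
    grad = []
    for i in range(len(trajectory)):
        up = list(trajectory)
        down = list(trajectory)
        up[i] += eps
        down[i] -= eps
        grad.append((fn(up) - fn(down)) / (2.0 * eps))
    return grad


def toy_denoiser(trajectory, target, rate=0.2):
    """Placeholder for one denoising step: move each waypoint toward a target."""
    return [w + rate * (t - w) for w, t in zip(trajectory, target)]


def guided_denoise(noisy, target, steps=30, guidance_scale=0.0):
    """Alternate a denoising step with a step down the spillage-probability gradient."""
    traj = list(noisy)
    for _ in range(steps):
        traj = toy_denoiser(traj, target)
        grad = numerical_gradient(toy_spillage_probability, traj)
        traj = [w - guidance_scale * g for w, g in zip(traj, grad)]
    return traj


rng = random.Random(0)
noisy = [rng.gauss(0.0, 1.0) for _ in range(8)]   # initial noisy trajectory
target = [0.3] * 8                                # hypothetical, slightly risky mode

unguided = guided_denoise(noisy, target, guidance_scale=0.0)
guided = guided_denoise(noisy, target, guidance_scale=0.05)

print("unguided risk:", toy_spillage_probability(unguided))
print("guided risk:  ", toy_spillage_probability(guided))
```

In this toy setting the guided trajectory settles between the denoiser's target and the low-risk region, so its predicted spillage probability is lower than the unguided one. The guidance scale trades off fidelity to the demonstrated behavior against predicted safety.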
With the spillage predictor, GRITS can effectively capture the food states and dynamics at each step and guide robot motion generation accordingly to prevent risky spillage scenarios, achieving the highest success rate (82%) and the lowest spillage rate (4%).
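For reference, success and spillage rates of this kind can be computed from per-trial outcome labels as sketched below. The sketch assumes each trial receives exactly one of three labels (`success`, `spillage`, `scoop_failure`); the label names, the helper `outcome_rates`, and the toy trial list are hypothetical and are not the experimental data.

```python
from collections import Counter

# Assumed mutually exclusive outcome labels for a single scooping trial.
OUTCOMES = ("success", "spillage", "scoop_failure")


def outcome_rates(outcomes):
    """Return the fraction of trials that ended in each outcome."""
    if not outcomes:
        raise ValueError("at least one trial is required")
    unknown = set(outcomes) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    counts = Counter(outcomes)
    total = len(outcomes)
    return {name: counts[name] / total for name in OUTCOMES}


# Toy data for illustration only.
toy_trials = ["success"] * 7 + ["spillage"] * 1 + ["scoop_failure"] * 2
rates = outcome_rates(toy_trials)
print(rates)
```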
The failure cases can be categorized into two types: spillage and scoop failure. Our experiments reveal that failure cases often arise from the lack of detailed information about the physical properties of food, beyond what can be captured by visual features alone. Attributes such as hardness, viscosity, and deformability strongly influence scooping outcomes and lead to difficulties across diverse food types. To address this limitation, future work could incorporate pre-interaction strategies and multimodal sensing, including force-torque and tactile feedback, to build a more comprehensive representation of food characteristics and improve robustness.