While current robot manipulation often focuses on changing the positions of objects (e.g., pick-and-place), a wide range of real-world human manipulation involves non-rigid object state changes—such as smashing or spreading—where an object's visual and physical state evolves gradually over time. Our core insight is that many of these tasks share a common structural pattern: they involve spatially progressing, object-centric transformations that can be represented as regions transitioning from an actionable to a transformed state. Building on this insight, we integrate spatially progressing object change segmentation maps to provide dense visual affordance cues, using them both as policy observations and to automatically generate rewards reflecting the extent of visual transformation over time. Our formulation enables highly sample-efficient online reinforcement learning without demonstrations, simulation, or costly manual reward annotation. Furthermore, thanks to the abstraction into spatially transformed areas, our method allows direct generalization to new manipulated objects. We validate our SPARTA approach on a real robot for two challenging and previously unaddressed tasks—spreading and smashing—across 4 diverse real-world objects, achieving a 79% improvement in training time and a 41% improvement in accuracy over sparse-reward and visual goal-conditioned baselines.
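To make the reward idea concrete, the following is a minimal sketch of how a dense reward could be derived from consecutive segmentation maps, assuming binary masks where 1 marks transformed regions; the function name, normalization, and mask convention are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def transformation_reward(prev_mask: np.ndarray, curr_mask: np.ndarray) -> float:
    """Hypothetical dense reward: fraction of pixels newly transformed this step.

    prev_mask, curr_mask: (H, W) binary arrays, 1 = transformed, 0 = actionable.
    """
    # Pixels that switched from actionable to transformed between steps.
    newly_transformed = np.logical_and(curr_mask == 1, prev_mask == 0).sum()
    # Normalize by image size so the reward is scale-independent.
    return float(newly_transformed) / curr_mask.size
```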
At each episode step, our policy takes the current and past SPOC visual-affordance (segmentation) maps as inputs, along with the robot arm's proprioception data, and predicts a displacement action for the arm's end-effector. We train the policy using RL with a novel reward function that incentivizes the robot to keep transforming the actionable object regions as efficiently as possible. Using visual-affordance map inputs facilitates zero-shot transfer to novel objects of vastly different shape, texture, and color (e.g., tortilla or cheese vs.~bread) and novel tasks (e.g., smashing vs.~spreading).
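As a rough illustration of this observation/action interface, the sketch below stacks the current and past affordance maps with proprioception and clips the predicted end-effector displacement; the dictionary keys, number of stacked maps, and action bounds are assumptions for illustration only.

```python
import numpy as np

def build_observation(spoc_maps: list[np.ndarray], proprio: np.ndarray) -> dict:
    """Assemble a policy observation from affordance maps and arm state.

    spoc_maps: current and K past (H, W) segmentation maps, most recent last.
    proprio:   1-D array of arm proprioception (e.g., joint angles, EE pose).
    """
    return {
        "affordance_maps": np.stack(spoc_maps, axis=0).astype(np.float32),  # (K+1, H, W)
        "proprioception": proprio.astype(np.float32),
    }

def clip_displacement(action: np.ndarray, max_step: float = 0.02) -> np.ndarray:
    """Bound the predicted end-effector displacement (dx, dy, dz) for safety."""
    return np.clip(action, -max_step, max_step)
```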