Learning Visuo-Haptic Skewering Strategies for Robot-Assisted Feeding

Priya Sundaresan, Suneel Belkhale, Dorsa Sadigh

Video Attachment



Acquiring food items with a fork poses an immense challenge to a robot-assisted feeding system, due to the wide range of material properties and visual appearances present across food groups. Deformable foods necessitate different skewering strategies than firm ones, but inferring such characteristics for several previously unseen items on a plate remains nontrivial. Our key insight is to leverage visual and haptic observations during interaction with an item to rapidly and reactively plan skewering motions. We learn a generalizable, multimodal representation for a food item from raw sensory inputs which informs the optimal skewering strategy. Given this representation, we propose a zero-shot framework to sense visuo-haptic properties of a previously unseen item and reactively skewer it, all within a single interaction. Real-robot experiments with foods of varying levels of visual and textural diversity demonstrate that our multimodal policy outperforms baselines which do not exploit both visual and haptic cues or do not reactively plan. Across 6 plates of different food items, our proposed framework achieves 71% success over 69 skewering attempts total.

Approach Overview

Our proposed framework: A RetinaNet bounding box detector localizes food items on a plate. We employ SkeweringPoseNet to estimate the pose of an item on the plate and learned visual servoing for approaching the item. Next, we probe the given food item and sense visual and haptic readings upon contact. These multisensory readings serve as input to our HapticVisualNet, which predicts the optimal skewering strategy on the fly, within the same continuous interaction.
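The pipeline above can be sketched as a single acquisition loop. This is a hedged sketch, not the paper's implementation: every interface (`detector`, `pose_net`, `servo`, `probe`, `haptic_visual_net`) is a hypothetical stand-in for the corresponding component.

```python
def skewer_pipeline(rgb_image, detector, pose_net, servo, probe, haptic_visual_net):
    """One continuous acquisition attempt; every interface is an illustrative stub."""
    # 1. Localize food items on the plate (the paper uses a RetinaNet detector).
    boxes = detector(rgb_image)
    if not boxes:
        return None  # empty plate, nothing to skewer
    target = boxes[0]
    # 2. Estimate the item's pose (SkeweringPoseNet) and visually servo the fork to it.
    pose = pose_net(rgb_image, target)
    servo(pose)
    # 3. Probe: bring the fork into contact while recording a haptic history
    #    and a post-contact image.
    force_history, post_contact_image = probe()
    # 4. HapticVisualNet maps the multimodal observation to primitive likelihoods;
    #    the chosen skewer is then executed without breaking contact.
    p_vertical, p_angled = haptic_visual_net(force_history, post_contact_image)
    return "vertical" if p_vertical >= p_angled else "angled"
```

Because all steps happen within one interaction, the policy can react to the probed haptics before committing to a primitive.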

Skewering Primitive Parameterization

Action Space


We consider an action space that modulates fork pitch and fork roll for sensitivity to item geometry and deformation, respectively, as well as fork position. Provided this action space, we instantiate two manipulation primitives. In vertical skewer, the fork pierces an item in a swift, downward motion before scooping, which is preferable for rigid items like raw carrots. In angled skewer, the fork gently tilts during insertion to support a fragile, soft, or slippery item.
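A minimal parameterization of the two primitives might look like the sketch below. The field names, the default tilt angle, and the choice of which axis tilts in angled skewer are our own illustrative assumptions, not the paper's values.

```python
from dataclasses import dataclass

@dataclass
class ForkAction:
    position: tuple   # (x, y, z) skewering point on the item
    pitch_deg: float  # fork pitch, sensitive to item geometry
    roll_deg: float   # fork roll, sensitive to item deformation

def vertical_skewer(point):
    # Swift, straight-down pierce followed by a scoop; suited to rigid items.
    return ForkAction(position=point, pitch_deg=0.0, roll_deg=0.0)

def angled_skewer(point, tilt_deg=25.0):
    # Tilted insertion that supports a fragile, soft, or slippery item.
    # The 25-degree default and the pitch axis are placeholders.
    return ForkAction(position=point, pitch_deg=tilt_deg, roll_deg=0.0)
```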


Our probe-then-skewer approach operates by bringing the fork in contact with an item while recording a history of haptic readings and a post-contact image. 

To decide between the above skewering strategies upon contact, we query our HapticVisualNet, a network that takes the multisensory observations as input and outputs primitive likelihoods. Finally, keeping the fork in contact with the item, we execute either vertical skewer or angled skewer.
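As a rough illustration of a two-branch multimodal classifier, the toy forward pass below encodes the haptic history and visual features separately, fuses them, and outputs a distribution over the two primitives. The layer sizes, the `tanh` nonlinearity, and the random weights are arbitrary assumptions; the real network presumably uses learned CNN/MLP encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class HapticVisualNetSketch:
    """Toy two-branch fusion classifier; not the paper's architecture."""
    def __init__(self, force_len=64, img_feat=32, hidden=16):
        self.Wf = rng.normal(size=(force_len, hidden))  # haptic branch
        self.Wv = rng.normal(size=(img_feat, hidden))   # visual branch
        self.Wo = rng.normal(size=(2 * hidden, 2))      # fused head -> 2 primitives

    def __call__(self, force_history, image_features):
        hf = np.tanh(force_history @ self.Wf)           # encode haptic history
        hv = np.tanh(image_features @ self.Wv)          # encode post-contact image
        fused = np.concatenate([hf, hv])                # joint representation
        return softmax(fused @ self.Wo)                 # [P(vertical), P(angled)]
```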


We evaluate HapticVisualNet on the task of clearing the following six plates. Plates 1, 2, 3, and 4 consist of items within the training distribution, while Plates 5 and 6 contain unseen items.

Raw broccoli, banana, carrots, zucchini, grapes.

Pineapple, mango, watermelon, cantaloupe, dragonfruit, pear.

Raw/boiled butternut squash

Raw broccoli

Pasta, raw/boiled root veggies, dumplings

Snow peas, mochi, canned pears


We visualize rollouts of HapticVisualNet (OURS) compared to baselines (HapticOnly, VisualOnly, and SPANet) on the above 6 plates.



Plate 1: 9/10


Plate 2: 10/11




Plates 5/6: 14/25 (Generalization Tests)


HapticVisualNet Failure Modes: 



VisualOnly: This method is OURS without haptic context; it suffers on visually similar items with physically different properties (e.g., raw vs. boiled butternut squash), where predicting vertical skewer for soft items may cause damage, and angled skewer may fail to pierce hard items.

HapticOnly: This baseline is OURS without visual context; it naively learns to predict vertical skewer when force readings are high and angled skewer otherwise. This strategy breaks down for anomalous items like broccoli, which can yield low haptic readings but require vertical skewering.

SPANet: We implement SPANet [1] which proposes an open-loop (no probing) visual-only policy that draws from a taxonomy of 6 skewering strategies. Without probing, SPANet may fail to infer object properties that inform skewering, leading to consecutive failures.
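The HapticOnly failure mode described above can be caricatured as a simple peak-force threshold rule. The threshold value and the force readings are illustrative, not measured data.

```python
def haptic_only_policy(force_history, threshold=2.0):
    """Caricature of a haptic-only rule: threshold the peak probing force."""
    return "vertical" if max(force_history) > threshold else "angled"

# A broccoli-like item: the floret compresses during probing, so contact
# forces stay low even though the item needs a vertical skewer.
soft_reading = [0.2, 0.5, 0.8, 0.6]
choice = haptic_only_policy(soft_reading)  # picks "angled" -- wrong for broccoli
```

Visual context is what lets the full model override this rule for such anomalous items.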

Quantitative Evaluation

Manipulation Success Rates and Characterization of Failure Modes

HapticVisualNet achieves higher empirical skewering success with fewer and less severe failures compared to baselines.


We plot confusion matrices for HapticVisualNet and for two ablations, one without haptic context and one without visual context. HapticVisualNet obtains the highest accuracy by exploiting both modalities jointly, compared to both baselines.

(Lighter is better for diagonal, darker is better for off-diagonal)
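For reference, a confusion matrix over the two primitives can be computed with a routine like this; the 0/1 label convention (vertical/angled) is our own choice for illustration.

```python
def confusion_matrix(y_true, y_pred, n_classes=2):
    """Rows index the true primitive, columns the predicted primitive."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

# 0 = vertical skewer, 1 = angled skewer (labels illustrative)
cm = confusion_matrix([0, 0, 1, 1, 1], [0, 0, 1, 0, 1])
# cm[1][0] counts angled items mistakenly predicted as vertical
```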

Additional Physical Experiments 

We stress-test HapticVisualNet (OURS) on a plate of frozen fruits (mango, pineapple, strawberry). Our policy successfully skewers the fruit in 20/25 attempts. The observed failures are slips during probing caused by the highly rigid, slippery texture of certain frozen fruits. As expected, the network predominantly infers vertical skewering on such hard items, but occasionally predicts angled skewering, possibly due to the effects of thawing over time.

Next, we evaluate HapticVisualNet on a cluttered plate of sautéed vegetables (carrot, celery, baby corn, mushrooms) and tofu, with and without soy sauce, achieving 20/23 and 25/34 successful skewers, respectively. We note that because these items are highly out of distribution, they incur more bounding box failures and ServoNet precision errors, leading to the more frequent near misses shown at the end. Still, HapticVisualNet generalizes to charred, oiled, sauce-coated, and seasoned items with varying levels of doneness. In the future, we hope to generalize to more drastically cluttered environments and food types.