Additional Results
End-to-End Baseline:
In addition to the Acquire-Only and Heuristic approaches we empirically compare VAPORS against, we also train an end-to-end model-based RL agent for spaghetti acquisition in simulation. We train the agent using PlaNet [7] with the same objective as VAPORS, an observation space of 64x64 segmented images, and a continuous action space consisting of the 6DoF pose of the fork, as opposed to our vision-parameterized twirling and grouping primitives. Empirically, we find that the agent is unable to make any progress towards long-horizon acquisition (top video of failed twirl). We also do not observe any emergent grouping or twirling behaviors that are effective enough to pick up food; learning such behaviors would likely require significant reward engineering or a more structured action space, such as parameterized primitives. This is in sharp contrast to VAPORS (bottom video), which effectively leverages grouping to create a pile, followed by a parameterized twirl motion.
End-to-End
VAPORS
Across 10 evaluation rollouts per method and 3 random seeds, the end-to-end baseline achieves just under 20% plate clearance on average, compared to over 80% with VAPORS for the same acquisition clock time.
Random Baseline:
We additionally evaluate VAPORS against a Random Primitive baseline in simulation, which replaces VAPORS' high-level planner with random selection amongst Grouping and Acquiring at each timestep. Initially, the Random Primitive baseline performs comparably to VAPORS, while the Heuristic baseline acquires very little because it is primarily grouping noodles into a pile. Towards the later timesteps, the plate becomes increasingly sparse; here, we see that VAPORS has a strong sense of when to Group vs. Twirl, which helps it pick up the last 20% and clear the plate. Meanwhile, Random and Heuristic struggle to reach 80% clearance. The Random baseline fails to consistently gather noodles together on a sparse plate, and Heuristic has spent too much of the action budget on grouping to acquire enough noodles.
10 evaluation rollouts per method x 3 random seeds
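To make the Random Primitive baseline concrete, the following is a minimal sketch of a high-level policy that ignores the plate state and samples a primitive at each timestep. The primitive names, the uniform sampling, and the horizon of 10 are assumptions for illustration only; they are not taken from the VAPORS implementation.

```python
import random

# Hypothetical labels for the two high-level primitives named in the text.
PRIMITIVES = ("group", "acquire")


def random_primitive_policy(rng):
    """Pick a primitive uniformly at random, ignoring the plate state.

    Uniform sampling is an assumption; the text only says "random selection".
    """
    return rng.choice(PRIMITIVES)


def rollout(policy, horizon, rng):
    """Return the sequence of primitives chosen over one episode."""
    return [policy(rng) for _ in range(horizon)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
actions = rollout(random_primitive_policy, horizon=10, rng=rng)
print(actions)
```

Because the choice never depends on how sparse the plate is, such a policy has no mechanism for grouping more often late in an episode, which is consistent with the behavior described above.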
Robustness to Redundant Primitives:
We stress-test the robustness of VAPORS' high-level policy when the action space contains redundant primitives. In the simulated noodle acquisition setting, we add 3 no-op actions alongside the standard twirling and grouping primitives. On the right, we roll out the trained policy at each epoch, over 100 training epochs, and visualize the proportion of no-op actions taken per evaluation episode (action budget of 10 primitives per episode). We plot the average proportion of no-op actions across 4 policies trained with different random seeds, with the shaded regions indicating 1 standard deviation from the mean.
Initially, the policy tends to take a no-op action up to half of the time when selecting amongst the 5 primitives. After training, the policy takes a no-op action less than 10% of the time, suggesting that VAPORS is able to ignore irrelevant primitives in favor of those that make task progress.
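The plotted quantity can be sketched as follows: compute the fraction of no-op actions in each evaluation episode, then take the mean and a one-standard-deviation band across seeds. The action labels and the episode data are made up for illustration, and the use of the population standard deviation is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, pstdev

# Hypothetical labels for the 3 added no-op primitives.
NOOP_ACTIONS = {"noop_1", "noop_2", "noop_3"}


def noop_proportion(episode_actions):
    """Fraction of the episode's actions that are no-ops."""
    return sum(a in NOOP_ACTIONS for a in episode_actions) / len(episode_actions)


def summarize_across_seeds(proportions):
    """Mean across seeds and the (mean - std, mean + std) shaded band."""
    m = mean(proportions)
    s = pstdev(proportions)  # population std; an assumption
    return m, (m - s, m + s)


# Made-up evaluation episodes (10 primitives each) for 4 seeds at one epoch.
episodes_by_seed = [
    ["noop_1"] * 2 + ["twirl"] * 5 + ["group"] * 3,
    ["twirl"] * 6 + ["group"] * 4,
    ["noop_3"] * 1 + ["twirl"] * 6 + ["group"] * 3,
    ["noop_2"] * 1 + ["twirl"] * 5 + ["group"] * 4,
]

proportions = [noop_proportion(ep) for ep in episodes_by_seed]
avg, band = summarize_across_seeds(proportions)
print(avg, band)
```

Repeating this at every training epoch would give one point and one band segment per epoch on the curve.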
Expanding the Limited Set of Primitives
We fully acknowledge that VAPORS' small number of primitives is one of its biggest current limitations. That said, we are actively working on designing more parameterized primitives while reusing the visual state representations used in this work (segmentation masks and pose estimates). Here, we visualize some demos of additional primitives as part of ongoing extensions to VAPORS. These videos do not show the result of VAPORS' autonomous high-level policy in any way (primitives are manually selected in these videos). They should only be interpreted as initial proofs of concept for alternate primitive implementations.
Fettuccine Alfredo + Chicken + Broccoli
Skewering, Twirling, Grouping
Celery + Ranch
Skewering, Dipping
Mashed Potatoes
Scooping
Reward Design Ablations:
VAPORS is trained to maximize the following objective:
𝛼(PICKUP GAIN) + (1 - 𝛼)(COVERAGE LOSS)
This raises the important question of how the policy is affected by the choice of 𝛼, for which we conduct ablations in simulated spaghetti acquisition. In particular, we train VAPORS with different settings of 𝛼, a horizon of 8 discrete actions consisting of grouping or twirling, and 20 initial noodles on the plate. We evaluate each policy with 3 random seeds and 10 rollouts, and plot the plate clearance over time. As expected, for 𝛼=1, the policy is incentivized to twirl alone, leading to inefficient acquisition over time. For 𝛼=0.25, the policy takes wasteful grouping actions to greedily minimize coverage, also leading to inefficient acquisition. For 𝛼 ∈ [0.25, 0.75], we see that the policy more effectively balances the two strategies, with 𝛼=0.5 producing the most favorable results.
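As a minimal sketch of how 𝛼 trades off the two terms of the objective above, the function below computes the weighted sum for a single step. The function name, the input values, and the assumption that both terms are expressed as fractions in [0, 1] are hypothetical; how PICKUP GAIN and COVERAGE LOSS are actually measured is not specified here.

```python
def weighted_reward(pickup_gain, coverage_loss, alpha):
    """alpha * (PICKUP GAIN) + (1 - alpha) * (COVERAGE LOSS)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * pickup_gain + (1.0 - alpha) * coverage_loss


# Made-up per-step outcomes for the two primitives:
# a twirl picks up food, a group shrinks the area the food covers.
twirl_step = {"pickup_gain": 0.20, "coverage_loss": 0.05}
group_step = {"pickup_gain": 0.00, "coverage_loss": 0.30}

for alpha in (0.25, 0.5, 0.75, 1.0):
    r_twirl = weighted_reward(alpha=alpha, **twirl_step)
    r_group = weighted_reward(alpha=alpha, **group_step)
    print(f"alpha={alpha}: twirl={r_twirl:.3f}, group={r_group:.3f}")
```

With 𝛼=1 the grouping step earns no reward at all under these made-up numbers, which matches the intuition that the policy is then incentivized to twirl alone.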
Visual State Space:
For homogeneous plates with a single category of food, VAPORS can effectively leverage binary food segmentation as a representation.
In initial testing with off-the-shelf open-vocabulary segmentation models like Detic [9], we find that such models produce reasonable results when prompted with food groups like 'noodles.' With additional finetuning, we posit that these models may help scale VAPORS to more diverse plates with more food categories, and possibly circumvent the need to collect data and train our segmentation networks from scratch.
User Studies: Statistical Tests
For each criterion (efficiency, bite size, similarity to human feeding, practicality, likelihood of reuse, safety, trust, and generalization) in user-study evaluations, we conduct a 1-way ANOVA test to compare the mean ratings of each method (VAPORS vs. Acquire-Only vs. Heuristic). A p-value below 0.05 indicates a statistically significant difference.
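For readers unfamiliar with the test, here is a minimal sketch of the F statistic that a 1-way ANOVA computes for one criterion. The ratings are invented for illustration and are not the study's data. The p-value is the upper-tail probability of the F distribution with (k - 1, N - k) degrees of freedom, which a statistics library such as SciPy (`scipy.stats.f_oneway`) would return alongside the statistic.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F statistic, df_between, df_within) for a 1-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual ratings around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Invented ratings for a single criterion, one list per method.
vapors = [5, 4, 5, 4, 5]
acquire_only = [3, 2, 3, 3, 2]
heuristic = [3, 4, 3, 2, 3]

f_stat, df_b, df_w = one_way_anova_f([vapors, acquire_only, heuristic])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F value means the method means differ by more than the within-method spread would explain; the ANOVA alone does not say which pair of methods differs, which is why pairwise comparisons are reported separately.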
We hypothesized that VAPORS would garner the highest user ratings with statistical significance across these categories. Below, we report the full p-values per criterion where there is a pairwise statistically significant difference. Our hypothesis is confirmed for 7/8 criteria in noodle acquisition (Efficiency, Bite Size, Humanlike, Practicality, Reuse, Trust, Generalizability), and 6/8 in jelly bean acquisition (Efficiency, Bite Size, Practicality, Reuse, Trust, Generalizability).