Ablation: Comparing Observation Spaces
Since our approach only involves learning value functions for stabilizing behavior during non-prehensile object transport, we found that the learned value function does not need to be conditioned on the end-effector targets. We hypothesize that the end-effector rotation (R_e) is the most critical observation for the value functions to learn effectively and to generalize across different start locations. To validate this hypothesis, we conducted ablation studies under two conditions: (i) the same end-effector start location for training and testing, and (ii) different start locations for training and testing, with the start location held fixed within each trial of 20 episodes. We examine four sets of end-effector observations: (a) position, velocity, acceleration, and rotation; (b) velocity, acceleration, and rotation; (c) velocity and acceleration; and (d) rotation only.
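The four observation sets can be made concrete with a minimal sketch. The field names, their dimensions, and the quaternion rotation parameterization are illustrative assumptions, not details taken from the experiments.

```python
import numpy as np

# Hypothetical observation sets (a)-(d); the keys mirror the labels in the text.
OBS_SETS = {
    "a": ("position", "velocity", "acceleration", "rotation"),
    "b": ("velocity", "acceleration", "rotation"),
    "c": ("velocity", "acceleration"),
    "d": ("rotation",),
}


def build_observation(state, obs_set):
    """Concatenate the end-effector fields of the chosen set into one vector.

    `state` is an assumed dict mapping field names to array-likes.
    """
    fields = OBS_SETS[obs_set]
    return np.concatenate([np.asarray(state[name], dtype=float) for name in fields])


# Example state: 3-D position, velocity, acceleration, and an assumed
# unit-quaternion rotation (w, x, y, z).
example_state = {
    "position": [0.4, 0.0, 0.3],
    "velocity": [0.1, 0.0, 0.0],
    "acceleration": [0.0, 0.0, 0.0],
    "rotation": [1.0, 0.0, 0.0, 0.0],
}
```

Under these assumptions, set (d) yields a 4-dimensional input that does not contain the end-effector position, so the value function input for a given orientation is the same at every start location.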
Ablation: Pointwise Pessimism (PWP) vs. Initial-State Pessimism (ISP)
We investigate the effect of the Initial-State Pessimism (ISP) scheme in comparison to the Pointwise Pessimism (PWP) scheme for handling sparse-reward data from demonstrations. To this end, we train an ensemble of value functions on 50 demonstrations, using the same cube at the center and an experimental setup similar to that of the previous ablation studies. During inference with ISP, we compute a conservative value estimate at the initial state only, instead of inducing pessimism at all intermediate rollout steps as PWP does. We hypothesize that excessive pessimism can lead to motions that hinder performance in challenging dynamic tasks such as the robot waiter problem. This hypothesis is supported by the results shown in the bar plots in the figure below.
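The difference between the two schemes can be illustrated with a small sketch. It assumes that the conservative estimate is a lower confidence bound over the ensemble (mean minus beta times standard deviation); this form, the function names, and the parameter beta are assumptions made for illustration and may differ from the estimator used in the experiments.

```python
import numpy as np


def lower_confidence_bound(ensemble_values, beta=1.0):
    """Assumed conservative estimate: ensemble mean minus beta * ensemble std.

    `ensemble_values` has shape (K, T): K ensemble members, T rollout steps.
    """
    v = np.asarray(ensemble_values, dtype=float)
    return v.mean(axis=0) - beta * v.std(axis=0)


def rollout_values(ensemble_values, scheme, beta=1.0):
    """Per-step value estimates along one rollout under PWP or ISP."""
    v = np.asarray(ensemble_values, dtype=float)
    lcb = lower_confidence_bound(v, beta)
    if scheme == "pwp":
        # Pointwise pessimism: conservative estimate at every rollout step.
        return lcb
    if scheme == "isp":
        # Initial-state pessimism: conservative only at step 0,
        # plain ensemble mean at the intermediate steps.
        values = v.mean(axis=0)
        values[0] = lcb[0]
        return values
    raise ValueError(f"unknown scheme: {scheme!r}")


# Two hypothetical ensemble members evaluated over three rollout steps.
toy_values = np.array([[1.0, 2.0, 3.0],
                       [3.0, 2.0, 5.0]])
pwp = rollout_values(toy_values, "pwp")
isp = rollout_values(toy_values, "isp")
```

In this toy case the two schemes agree at the initial step and wherever the ensemble members agree, and PWP is lower at the intermediate steps where they disagree, which is the additional conservatism that the hypothesis above refers to.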