The videos below present policy rollouts before and after Stage 2 Self-Improvement in the BananaTable setting. The BananaTable task is created by removing all blocks from the table and replacing them with a single foam banana. This setup is far out of distribution for the policy and reward models: they have never seen a banana in the LanguageTable dataset, and they have never seen the board without the blocks. Prior work with MFAs showed only semantic generalization (e.g. performing the same pick-and-place motion but with new distractors or target locations). In contrast, the BananaTable task requires new manipulation behaviors to be learned, since the geometry of the banana makes moving it effectively significantly more difficult than moving blocks around the table.
On the left is the policy before Stage 2 Self-Improvement on the BananaTable task. This policy is the Stage 2 policy from the 80% LanguageTable setting.
On the right is the same policy after our Stage 2 Self-Improvement procedure on the BananaTable task, following approximately 8 hours of training.
Before (~63% success rate)
After (~85% success rate)
The first two videos below each contain two plots. The top plot shows the expected value of the model's prediction for "timesteps until success" over the course of the episode, and the bottom plot shows the probability distribution of the model's predictions at the current moment in time. The second two videos do not include a visualization of the probability distribution.
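To make the relationship between the two plots concrete, here is a minimal sketch of how an expected "timesteps until success" value can be derived from a predicted probability distribution. It assumes, purely for illustration, that the model outputs logits over discrete bins where bin `i` means "`i` timesteps until success"; the function names `softmax` and `expected_steps_to_go` are hypothetical and are not taken from our codebase.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_steps_to_go(probs, bin_values=None):
    """Expected value of a discrete 'timesteps until success' prediction.

    Assumption: bin i corresponds to i timesteps unless bin_values is given.
    """
    if bin_values is None:
        bin_values = range(len(probs))
    return sum(p * v for p, v in zip(probs, bin_values))


# Hypothetical logits for a single frame: uniform over 0, 1, 2, 3 timesteps.
probs = softmax([0.0, 0.0, 0.0, 0.0])  # the kind of distribution the bottom plot shows
print(expected_steps_to_go(probs))     # the kind of scalar the top plot shows -> 1.5
```

Under this assumption, evaluating `expected_steps_to_go` at every frame of an episode would produce a curve like the one in the top plot, while `probs` at a single frame corresponds to the bottom plot.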
Both videos below are concatenations of several episodes of policy rollouts, with plots underneath showing the expected value of the model's prediction for "timesteps until success" over the course of each episode. The first video shows rollouts for the 20% data Stage 1 Supervised Fine-Tuning policy, and the second video shows rollouts for the same policy after our Stage 2 Self-Improvement process.