As demonstrated in the left figure, when a random variable drawn from a bimodal distribution (blue) is added to an independent Gaussian random variable (orange), the resulting sum distribution (green, the convolution of the two) preserves its bimodal nature. Adding the Gaussian shifts the modes by the Gaussian's mean and broadens each mode according to its standard deviation. The multi-modal property is maintained as long as the Gaussian's standard deviation remains small relative to the separation between the modes.
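To make this concrete, below is a minimal numerical sketch (with illustrative means and standard deviations, not the values used in the figure): it adds small Gaussian noise to samples from a two-mode mixture and shows that the histogram of the sum keeps both modes.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 100_000

# Bimodal base distribution: mixture of two well-separated Gaussians
# (mode separation = 4.0; all parameters here are illustrative).
modes = rng.choice([-2.0, 2.0], size=n)
base = modes + 0.3 * rng.standard_normal(n)

# Independent Gaussian "residual" whose std is small relative to the mode separation.
residual = 0.5 + 0.5 * rng.standard_normal(n)  # mean 0.5, std 0.5

total = base + residual  # distribution of the sum = convolution of the two

for samples, label in [(base, "base (bimodal)"), (residual, "residual (Gaussian)"), (total, "sum")]:
    plt.hist(samples, bins=200, density=True, alpha=0.5, label=label)
plt.legend()
plt.show()
```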
To demonstrate the preservation of multi-modality in practice, we visualize action distributions from a specific state in the ManiSkill StackCube task, using Behavior Transformer as the base policy. We sampled 1,000 actions from both the base and residual policies, projected them with PCA for visualization, and plotted the resulting samples as histograms. The results are shown in the left figure.
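A simplified version of this visualization pipeline is sketched below; stand-in samplers take the place of the actual BeT and residual policies, and the action dimension is a placeholder.

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
ACTION_DIM = 8  # placeholder; the actual ManiSkill action dimension may differ

# Stand-ins for the real policies at a fixed state: the base sampler is
# multi-modal (a two-component mixture), the residual sampler is a small Gaussian.
def sample_base(n):
    centers = rng.choice([-1.0, 1.0], size=(n, 1))
    return centers + 0.1 * rng.standard_normal((n, ACTION_DIM))

def sample_residual(n):
    return 0.05 * rng.standard_normal((n, ACTION_DIM))

base_actions = sample_base(1000)
combined_actions = base_actions + sample_residual(1000)

# Fit one PCA on both sets so their projections share the same axes.
pca = PCA(n_components=1).fit(np.concatenate([base_actions, combined_actions]))
for actions, label in [(base_actions, "base"), (combined_actions, "base + residual")]:
    plt.hist(pca.transform(actions).ravel(), bins=50, density=True, alpha=0.5, label=label)
plt.legend()
plt.xlabel("first principal component")
plt.show()
```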
The results in the left figure demonstrate that BeT's forward pass is significantly faster than its backward pass, with this gap becoming more pronounced as model size increases. This confirms that the backward pass constitutes the major training time bottleneck.
Implementation Details:
Batch Size: 1024
GPU: NVIDIA GeForce RTX 2080 Ti
Results averaged over 100 independent runs
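For reference, a minimal timing sketch in the spirit of this benchmark is shown below; the stand-in Transformer encoder, input shapes, and loss are placeholders rather than the actual BeT architecture and training objective.

```python
import time
import torch
import torch.nn as nn

# Placeholder model: a small Transformer encoder standing in for BeT.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True), num_layers=6
).cuda()
x = torch.randn(1024, 10, 256, device="cuda")  # (batch, sequence, feature); shapes are illustrative

def timed(fn, n_runs=100):
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs

fwd_time = timed(lambda: model(x))

def forward_backward():
    model.zero_grad(set_to_none=True)
    model(x).mean().backward()  # dummy scalar loss, only for timing

total_time = timed(forward_backward)
print(f"forward: {fwd_time * 1e3:.2f} ms, backward (approx.): {(total_time - fwd_time) * 1e3:.2f} ms")
```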
Following reviewer PZbK's suggestion, we have implemented and tested the GAIL + MLP baseline. The results in the left figures show that this baseline achieves a 0% success rate on StackCube and about a 20% success rate on TurnFaucet after 3M environment interactions. These results are expected given that the demonstrations were collected using task and motion planning (for StackCube) and model predictive control (for TurnFaucet), resulting in naturally multi-modal action distributions. They suggest that simple MLPs may be insufficient for capturing multi-modal distributions and highlight the need for large policy models to effectively utilize multi-modal demonstrations.
We fixed the entropy coefficient during the warm-start phase and enabled auto-tuning during subsequent fine-tuning. Results are shown in the left figure. Of six independent runs, three still blew up upon unfreezing while the other three remained stable. This indicates that this unfreezing strategy does not effectively address the training instability associated with warm-starting.
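For clarity, a minimal sketch of this schedule under a SAC-style setup is shown below; the constants (WARMSTART_STEPS, TARGET_ENTROPY, FIXED_ALPHA) and the log_prob argument are placeholders, not our actual hyperparameters.

```python
import math
import torch

# Placeholder hyperparameters for illustration only.
WARMSTART_STEPS = 100_000
TARGET_ENTROPY = -8.0   # e.g., -action_dim
FIXED_ALPHA = 0.2

log_alpha = torch.tensor(math.log(FIXED_ALPHA), requires_grad=True)
alpha_optim = torch.optim.Adam([log_alpha], lr=3e-4)

def entropy_coefficient(step, log_prob):
    """Return alpha: frozen during warm-start, auto-tuned (SAC-style) afterwards."""
    if step < WARMSTART_STEPS:
        return FIXED_ALPHA  # entropy coefficient held fixed while warm-starting
    # Standard SAC auto-tuning: adjust alpha so policy entropy tracks TARGET_ENTROPY.
    alpha_loss = -(log_alpha * (log_prob + TARGET_ENTROPY).detach()).mean()
    alpha_optim.zero_grad()
    alpha_loss.backward()
    alpha_optim.step()
    return log_alpha.exp().item()
```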