Section 1: Ablation Study on Key Contributions
Summary:
MENTOR_w/o_MoE consistently outperforms MENTOR_w/o_TP_MoE, and MENTOR_w/o_TP outperforms MENTOR_w/o_TP_MoE in 4 out of 5 tasks, indicating that the MoE architecture and Task-oriented Perturbation each contribute independently to improved policy learning.
However, both MENTOR_w/o_TP and MENTOR_w/o_MoE remain less sample-efficient and weaker in final performance than the full MENTOR model, underscoring the complementary nature of the two components in enhancing learning efficiency and robustness.
Section 2: Evaluation in Multi-task Environment
Multitask in simulation: evaluation on the MT5 suite (Door-Open, Drawer-Open, Window-Open, Drawer-Close, and Window-Close).
Multitask in the real world: evaluation on the Peg Insertion task (Star, Triangle, and Arrow pegs).
Section 3: Evaluation with ViT
We implemented a vision transformer (ViT) encoder following the setup from previous work. Specifically, we replaced the CNN visual encoder in the DrM baseline with a ViT encoder that processes 84×84 images as 12×12 patches; the patch embeddings have a dimension of 128, and the encoder stacks 4 transformer layers with 4 attention heads each. To avoid running out of GPU memory, we set the batch size to 32.
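For concreteness, the sketch below shows a PyTorch encoder with exactly these hyperparameters (84×84 input, 12×12 patches, 128-dimensional embeddings, 4 layers, 4 heads). It is a minimal illustration rather than our exact implementation; the positional-encoding, normalization, and pooling choices here are assumptions.

```python
import torch
import torch.nn as nn

class ViTEncoder(nn.Module):
    """Minimal ViT encoder matching the stated hyperparameters."""
    def __init__(self, img_size=84, patch_size=12, embed_dim=128,
                 depth=4, num_heads=4, in_chans=3):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2   # 7 * 7 = 49 patches
        # Patchify with a strided conv: each 12x12 patch -> 128-d embedding
        self.patch_embed = nn.Conv2d(in_chans, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        # Learned positional embeddings (an assumption; prior work varies)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                     # x: (B, 3, 84, 84)
        x = self.patch_embed(x)               # (B, 128, 7, 7)
        x = x.flatten(2).transpose(1, 2)      # (B, 49, 128)
        x = self.transformer(x + self.pos_embed)
        return x.mean(dim=1)                  # (B, 128) mean-pooled feature
```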
Due to time constraints, we did not finish the full training run on the Hammer task. However, the ViT-based policy did not lead to better performance compared with DrM.
UPDATE: We have conducted experiments using the ViT-based method with the same batch size as MENTOR (bsz=256) on the Hammer (Sparse) task. ViT-bs256 surpasses both DrM and ViT-bs32 but remains significantly less efficient than MENTOR, and even than MENTOR_w/o_MoE and MENTOR_w/o_TP, as shown in Section 1.
Section 4: Random Disturbances in Simulation
In the Assembly task from Meta-World, we randomly reset the target location (the pillar position) during rollouts. An agent trained without ever encountering this type of disturbance still achieves a 9/10 success rate (versus 10/10 in the absence of any disturbance).
[Rollout videos: five example episodes under random target disturbances (four SUCCESS, one FAILURE).]
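A minimal sketch of this evaluation protocol, assuming the classic 4-tuple Gym step API; `relocate_pillar` is a hypothetical, caller-supplied helper, since the pillar position lives in Meta-World's environment internals and the exact attribute names vary across versions.

```python
import numpy as np

def eval_with_disturbance(env, policy, relocate_pillar,
                          num_episodes=10, disturb_prob=0.1):
    """Roll out `policy`, randomly relocating the target mid-episode."""
    successes = 0
    for _ in range(num_episodes):
        obs, done = env.reset(), False
        while not done:
            obs, reward, done, info = env.step(policy(obs))
            if np.random.rand() < disturb_prob:
                relocate_pillar(env)   # hypothetical: teleport the pillar
        successes += int(info.get("success", 0))
    return successes / num_episodes    # e.g., 0.9 for the agent above
```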
Section 5: Alleviating Gradient Conflicts in a Single Task
Section 6: Dormancy Comparison (MoE and MLP)
A neural network's dormant ratio is an effective index of an agent's skill-acquisition capability: a lower dormant ratio indicates better learning ability. As illustrated in Section 3.1 of the original paper and Section 5 of this rebuttal website, the MoE structure can indeed enhance the agent's learning capability by alleviating parameter sharing and the gradient conflicts it induces. It is therefore reasonable that MoE agents exhibit lower dormancy than MLP agents.
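For reference, the dormant ratio can be computed as below, following the standard notion of τ-dormant neurons that DrM builds on: a neuron counts as dormant if its mean absolute activation, normalized by the layer average, falls below a threshold τ. This is a sketch; the threshold value and averaging details are assumptions rather than MENTOR's exact code.

```python
import torch

@torch.no_grad()
def dormant_ratio(layer_activations, tau=0.025):
    """layer_activations: list of (batch, num_neurons) post-activation
    tensors, one per layer. Returns the fraction of tau-dormant neurons."""
    dormant, total = 0, 0
    for h in layer_activations:
        score = h.abs().mean(dim=0)             # per-neuron mean |activation|
        score = score / (score.mean() + 1e-9)   # normalize by the layer mean
        dormant += (score <= tau).sum().item()
        total += score.numel()
    return dormant / total
```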
Section 7: Ablation Study on MoE Hyperparameters
The results indicate that, for the Hammer (Sparse) task, the optimal number of experts is 8 with top_k set to 4. With top_k fixed at 4, there are no significant performance differences between 4, 8, and 32 experts, suggesting that 4 experts suffice to learn this task. The ablation on top_k further validates our hypothesis, as reducing top_k to 2 yields a worse learning curve. When both the number of experts and top_k are set to 1, the MoE degrades to a standard MLP, which gives the worst performance among all configurations.
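To make these hyperparameters concrete, here is a minimal sketch of a top-k gated MoE layer at the best-found setting (num_experts=8, top_k=4). The gating and expert designs are illustrative assumptions and need not match MENTOR's implementation; note that num_experts=1 with top_k=1 collapses the layer to a plain MLP, the degenerate case described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Soft top-k mixture-of-experts over simple MLP experts."""
    def __init__(self, in_dim, hidden_dim, out_dim, num_experts=8, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(in_dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                          nn.Linear(hidden_dim, out_dim))
            for _ in range(num_experts)])

    def forward(self, x):                            # x: (B, in_dim)
        weights, idx = self.gate(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)         # renormalize over top-k
        # Dense evaluation of all experts (fine for a sketch, not at scale)
        all_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, D)
        picked = all_out.gather(
            1, idx.unsqueeze(-1).expand(-1, -1, all_out.size(-1)))  # (B, k, D)
        return (weights.unsqueeze(-1) * picked).sum(dim=1)          # (B, D)
```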