MuJoCo

Overall Performance

Results on six MuJoCo control environments. For each environment, the top panel shows the training reward over training steps, and the bottom panel shows the switching cost on a log scale.

Ablation Study

We vary the switching interval of the non-adaptive switching criterion, where FIX_n means the deployed policy is switched every n steps. A larger n reduces the switching cost but may cause training to fail. An interval of 1000 appears to be appropriate for this criterion.
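As a rough illustration of the FIX_n criterion, the sketch below switches the deployed policy whenever the step counter is a multiple of n and counts the resulting switches. The function names and the choice to count steps from 1 are assumptions made for this example, not details taken from the experiments.

```python
def should_switch_fixed(step: int, interval: int) -> bool:
    """Non-adaptive criterion: switch every `interval` steps (steps counted from 1)."""
    return step > 0 and step % interval == 0


def count_switches(total_steps: int, interval: int) -> int:
    """Number of policy switches incurred over `total_steps` training steps."""
    return sum(should_switch_fixed(t, interval) for t in range(1, total_steps + 1))


if __name__ == "__main__":
    # A larger interval yields fewer switches over the same number of steps.
    for n in (100, 1000, 10000):
        print(f"FIX_{n}: {count_switches(100_000, n)} switches")
```

Under this criterion the switching cost is simply the total number of steps divided by n (rounded down), which is why it decreases as n grows regardless of how much the policy has actually changed.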

We vary the similarity threshold of the feature-based criterion. A smaller threshold reduces the switching cost but may hurt performance.
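One way such a feature-based check could look is sketched below. It assumes the similarity is the mean cosine similarity between the features produced by the deployed and online policies on the same batch of states, and that a switch is triggered when this value drops below the threshold. The similarity measure, the function names, and the toy feature vectors are all assumptions for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_feature_similarity(deployed_feats, online_feats):
    """Average cosine similarity over a batch of paired feature vectors."""
    sims = [cosine_similarity(u, v) for u, v in zip(deployed_feats, online_feats)]
    return sum(sims) / len(sims)


def should_switch_feature(deployed_feats, online_feats, threshold):
    """Switch when the online features have drifted below the similarity threshold."""
    return mean_feature_similarity(deployed_feats, online_feats) < threshold


if __name__ == "__main__":
    # Hypothetical features for two states under the deployed and online policies.
    deployed = [[1.0, 0.0], [0.0, 1.0]]
    online = [[1.0, 0.0], [1.0, 1.0]]
    print(should_switch_feature(deployed, online, threshold=0.9))
    print(should_switch_feature(deployed, online, threshold=0.8))
```

With this formulation, lowering the threshold requires the features to drift further before a switch is triggered, which matches the observation that a smaller threshold reduces the switching cost.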

We vary the KL threshold of the policy-based criterion. A larger threshold reduces the switching cost but may hurt performance.
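A minimal sketch of a KL-based check follows. It assumes both policies output diagonal Gaussian action distributions (a common choice for continuous control), that the divergence is computed as KL(deployed || online) for a single state, and that a switch is triggered once it exceeds the threshold. The direction of the KL, the per-state evaluation, and all names are assumptions for illustration.

```python
import math


def diag_gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL(p || q) between two diagonal Gaussians, summed over action dimensions."""
    kl = 0.0
    for mp, sp, mq, sq in zip(mu_p, std_p, mu_q, std_q):
        kl += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
    return kl


def should_switch_kl(deployed, online, threshold):
    """Switch when the online policy has diverged from the deployed one by more than `threshold`.

    `deployed` and `online` are (mean, std) pairs for the same state.
    """
    (mu_d, std_d), (mu_o, std_o) = deployed, online
    return diag_gaussian_kl(mu_d, std_d, mu_o, std_o) > threshold


if __name__ == "__main__":
    # Hypothetical one-dimensional action distributions for a single state.
    deployed = ([0.0], [1.0])
    online = ([1.0], [1.0])
    print(diag_gaussian_kl(*deployed, *online))
    print(should_switch_kl(deployed, online, threshold=0.1))
    print(should_switch_kl(deployed, online, threshold=1.0))
```

Here the relationship to the threshold is the reverse of the feature-based case: raising the KL threshold tolerates a larger divergence before switching, so the switching cost falls as the threshold grows.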