Experiments

Comparison to State-of-the-Art: Model-Free

Learning curves of MB-MPO and four state-of-the-art model-free methods in six different MuJoCo environments with a horizon of 200 time steps. MB-MPO matches the asymptotic performance of the model-free methods with two orders of magnitude fewer samples.

To reproduce the experimental results, run the following scripts in our code repository:

python experiments/run_scripts/mf_comparison/mb_mpo_train.py

python experiments/run_scripts/mf_comparison/trpo_train.py

python experiments/run_scripts/mf_comparison/ppo_train.py

python experiments/run_scripts/mf_comparison/ddpg_train.py

python experiments/run_scripts/mf_comparison/acktr_train.py

The experimental data is available at the following link:

https://www.dropbox.com/sh/yiv8e4rn9iyw3od/AAB836WmDi9ilohBzGHVhLM6a?dl=0

Comparison to State-of-the-Art: Model-Based

Learning curves of MB-MPO and two model-based (MB) methods in six different MuJoCo environments with a horizon of 200 time steps. MB-MPO achieves better asymptotic performance and faster convergence than the previous MB methods.

To reproduce the experimental results, run the following scripts in our code repository:

python experiments/run_scripts/mb_comparison/mb_mpo_train.py

python experiments/run_scripts/mb_comparison/mb_mpc_train.py

python experiments/run_scripts/mb_comparison/model_ensemble_trpo_train.py

The experimental data is available at the following link:

https://www.dropbox.com/sh/5espmbm1kngn19z/AAB9U8ohGJPQ1Yb2LYWt3RnMa?dl=0

Robustness to Imperfect Models

Comparison of MB-MPO and model-ensemble trust-region policy optimization (ME-TRPO) using five biased and noisy dynamics models in the HalfCheetah environment with a horizon of 100 time steps. In every iteration, a bias is sampled for each dynamics model uniformly from the denoted interval. Throughout the iteration, we add Gaussian noise centered at the sampled bias with a standard deviation of 0.1 to the predictions of the dynamics model.
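
For clarity, the corruption scheme can be written in a few lines. The following is a minimal sketch under stated assumptions: predictions are NumPy arrays, and the names sample_bias, corrupt, and bias_range are hypothetical placeholders for the corresponding pieces in the run scripts.

import numpy as np

rng = np.random.default_rng(0)

def sample_bias(bias_range):
    # Once per iteration and per dynamics model: draw a bias uniformly
    # from the denoted interval, e.g. bias_range = (-0.5, 0.5).
    return rng.uniform(*bias_range)

def corrupt(prediction, bias, noise_std=0.1):
    # Throughout the iteration, every next-state prediction receives
    # Gaussian noise centered at the sampled bias with std 0.1.
    return prediction + rng.normal(loc=bias, scale=noise_std, size=prediction.shape)

The bias is held fixed within an iteration and re-sampled when the next iteration begins, while the Gaussian noise is drawn anew for every prediction.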

To reproduce the experimental results, run the following scripts in our code repository:

python experiments/run_scripts/bad_models_exps/run_mb_mpo_bad_models.py

python experiments/run_scripts/bad_models_exps/run_model_ensemble_trpo_bad_models.py

The experimental data is available at the following link:

https://www.dropbox.com/sh/0qtk18h4mclgul6/AACMOCYHFnosduLYD4QcZSHqa?dl=0

Robustness to Compounding Errors

Comparison of our method with and without adaptation. Shown are the average returns during training with three different random seeds on the HalfCheetah environment with a horizon of 1000 time steps.
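
As a reminder of what the ablation removes: the adaptation is the inner, MAML-style policy-gradient step taken per dynamics model before the meta-update. The following is a minimal sketch, assuming the per-model gradient estimates are already available; adapt and inner_grads are hypothetical names, and plain gradient ascent stands in for the actual inner optimizer:

import numpy as np

def adapt(theta, inner_grad, alpha=1e-3):
    # One inner adaptation step per dynamics model:
    # theta_k' = theta + alpha * gradient of the return under model k.
    return theta + alpha * inner_grad

theta = np.zeros(4)                                      # meta-policy parameters
inner_grads = [np.full(4, g) for g in (0.5, -0.2, 1.0)]  # one estimate per model

with_adaptation = [adapt(theta, g) for g in inner_grads]
without_adaptation = [theta for _ in inner_grads]        # ablation: meta-policy used directly

Without adaptation, the same meta-policy is rolled out under every model, so errors of each learned model compound along the full 1000-step horizon instead of being partially absorbed by the per-model inner step.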

Hyperparameter Study

Hyperparameter study in the HalfCheetah environment of (a) the inner learning rate, (b) the number of dynamics models in the ensemble, and (c) the number of meta-gradient steps before collecting real environment samples and refitting the dynamics models.
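
The run scripts below each sweep one of these three quantities. The following is a minimal sketch of how such a grid could be enumerated; the parameter names and values are hypothetical and do not reflect the exact grids in the scripts:

from itertools import product

# Hypothetical grids for the three studied hyperparameters.
sweep = {
    "inner_lr": [1e-4, 1e-3, 1e-2],  # (a) inner learning rate
    "num_models": [5, 10, 20],       # (b) dynamics models in the ensemble
    "meta_steps": [10, 30, 50],      # (c) meta-gradient steps per refit
}

# Enumerate every configuration in the grid; each one would launch a run.
for values in product(*sweep.values()):
    config = dict(zip(sweep.keys(), values))
    print(config)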

To reproduce the experimental results, run the following scripts in our publicly available code repository:

python experiments/run_scripts/hyperparam_exps/run_mb_mpo_hyperparam_study_fast_lr.py

python experiments/run_scripts/hyperparam_exps/run_mb_mpo_hyperparam_study_maml_iter.py

python experiments/run_scripts/hyperparam_exps/run_mb_mpo_hyperparam_study_num_models.py

The experimental data is available at the following link:

https://www.dropbox.com/sh/0qtk18h4mclgul6/AACMOCYHFnosduLYD4QcZSHqa?dl=0