Evaluating Model-Based Planning and Planner Amortization for Continuous Control

Abstract:

There is a widespread intuition that model-based control methods should be able to surpass the data efficiency of model-free approaches. In this paper we attempt to evaluate this intuition on various challenging locomotion tasks. We take a hybrid approach, combining model predictive control (MPC) with a learned model and model-free policy learning; the learned policy serves as a proposal for MPC. We show that MPC with learned proposals and models (trained on the fly or transferred from related tasks) can significantly improve performance and data efficiency with respect to model-free methods. However, we find that well-tuned model-free agents are strong baselines even for high DoF control problems. Finally, we show that it is possible to distil a model-based planner into a policy that amortizes the planning computation without any loss of performance.
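As a rough illustration of the hybrid approach described above, the sketch below shows a generic sampling-based MPC loop that draws candidate action sequences from a learned policy proposal and scores them under a learned dynamics model. The names (proposal_policy, dynamics_model, reward_fn) and the simple best-of-N scoring are illustrative assumptions, not the paper's exact planner.

```python
import numpy as np

def mpc_plan(state, proposal_policy, dynamics_model, reward_fn,
             horizon=10, num_samples=64):
    """Minimal sketch of sampling-based MPC with a policy proposal.

    Draws num_samples action sequences from the learned proposal, rolls
    them out under the learned dynamics model, scores them with the task
    reward, and returns the first action of the best sequence
    (receding-horizon control). All objects here are hypothetical.
    """
    best_return, best_first_action = -np.inf, None
    for _ in range(num_samples):
        s, total_return, first_action = state, 0.0, None
        for t in range(horizon):
            a = proposal_policy.sample(s)       # proposal = learned policy
            if t == 0:
                first_action = a
            total_return += reward_fn(s, a)     # score with the task reward
            s = dynamics_model.predict(s, a)    # learned dynamics rollout
        if total_return > best_return:
            best_return, best_first_action = total_return, first_action
    return best_first_action
```

At execution time only the first action of the chosen sequence is applied, after which planning is repeated from the next state; distillation (the +BC variants below) amortizes this computation into the proposal itself.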

Videos of amortized policy performance:

Below we present videos showing the behaviour of amortized policies learned from scratch using MPC+MPO+BC on the various tasks presented in the paper (best results from Figure 2 in the main text). Each video contains 5 episodes. For the go-to-target-pose (GTTP) tasks we execute the mean of the amortized policy's action distribution. For the Walking tasks, which have little randomization at episode initialization, we show four stochastic trajectories sampled from the amortized policy, followed by an execution of the policy mean (the final episode in each video).

op3_gttp_amortizedpolicy_mean.mp4

OP3 GTTP

(fixed silhouette is the target pose)

ant_gttp_amortizedpolicy_mean.mp4

Ant GTTP

(fixed silhouette is the target pose)

op3_walk_forward_amortizedpolicy.mp4

OP3 Walking Forward

ant_walk_forward_amortizedpolicy.mp4

Ant Walking Forward

op3_walk_backward_amortizedpolicy.mp4

OP3 Walking Backward

ant_walk_backward_amortizedpolicy.mp4

Ant Walking Backward

Behaviours sampled from the task-agnostic proposal:

Below we show videos of stochastic samples from the task-agnostic proposal (which depends only on proprioceptive observations) that we use for the planner results (Section 5.1) and the transfer experiments (Sections 5.3 & 5.4). As this proposal lacks task-specific information, it can only learn an average behaviour for the GTTP tasks, which amounts to walking in random directions. On the Walking tasks the proposal does learn to walk reasonably fast, but it can easily collide with walls because it does not have access to the relative target direction and, consequently, cannot correct for heading errors. As before, we show 5 episodes per task.
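For concreteness, the hypothetical snippet below illustrates what "depends only on proprioceptive observations" means in practice: task-specific observations (such as the relative target pose or direction) are simply withheld from the proposal's input. The key names are assumptions for illustration only.

```python
# Hypothetical observation filter: the task-agnostic proposal sees only
# proprioception (joints, IMU), never task observations such as the
# relative target pose or target direction.
PROPRIO_KEYS = ("joint_positions", "joint_velocities", "gyro", "accelerometer")

def proprioceptive_only(observation: dict) -> dict:
    """Drop task-specific entries so the proposal cannot condition on them."""
    return {k: v for k, v in observation.items() if k in PROPRIO_KEYS}
```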

op3_gttp_uninformedproposal.mp4

OP3 GTTP

(fixed silhouette is the target pose)

ant_gttp_uninformedproposal.mp4

Ant GTTP

(fixed silhouette is the target pose)

op3_walk_forward_uninformedproposal.mp4

OP3 Walking Forward

ant_walk_forward_uninformedproposal.mp4

Ant Walking Forward

op3_walk_backward_uninformedproposal.mp4

OP3 Walking Backward

ant_walk_backward_uninformedproposal.mp4

Ant Walking Backward

Comparison of different MPC/BC variants on the OP3 GTTP task:

Lastly, we compare the behaviours of the different learning-from-scratch variants (MPO, MPO+BC, MPC+MPO, MPC+MPO+BC) described in the paper (Section 5.2) on the OP3 GTTP task. This gives a sense of the relative performance of the variants and provides a visual counterpart to the results in Figure 2 (bottom left). We show 5 episodes each from the amortized policies learned by the different variants, and also present videos of the actor behaviour for the two MPC variants.
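As a reminder of how the MPC variants collect data, the hedged sketch below shows an actor that executes the planner's action with probability p_plan (0.5 in the videos below) and otherwise samples from the learned proposal, with a comment indicating the extra behavioural-cloning term used in the +BC variants. The names and the exact loss form are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def act(state, proposal_policy, planner, p_plan=0.5, rng=np.random):
    """Illustrative actor: execute the MPC planner's action with
    probability p_plan, otherwise sample from the learned proposal."""
    if rng.random() < p_plan:
        return planner.plan(state)           # MPC action computed with the model
    return proposal_policy.sample(state)     # amortized proposal action

# In the +BC variants the proposal is additionally regressed towards the
# planner's actions with a behavioural-cloning term, e.g.
#     L_BC(theta) = -log pi_theta(a_planner | s),
# added alongside the MPO policy loss; this is what amortizes the planner.
```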

mpcmpobc_amortizedpolicy_mean.mp4

MPC + MPO + BC - Amortized Policy (Mean)

With the addition of the BC objective, the amortized policy is as good as, or slightly better than, the actor (dashed green line in Fig 2).

mpcmpobc_actorwithmpc_pplan0.5.mp4

MPC + MPO + BC - Actor with MPC (p_plan=0.5)

(solid green line in Fig 2)

mpcmpo_amortizedpolicy_mean.mp4

MPC + MPO - Amortized Policy (Mean)

Without the BC objective, the amortized policy is significantly worse than the actor in this setting (dashed red line in Fig 2).

mpcmpo_actorwithmpc_pplan0.5.mp4

MPC + MPO - Actor with MPC (p_plan=0.5)

(solid red line in Fig 2)

mpo_amortizedpolicy_mean.mp4

MPO - Policy (Mean)

(solid orange line in Fig 2)

mpobc_amortizedpolicy_mean.mp4

MPO + BC - Policy (Mean)

(solid blue line in Fig 2)