MOTO: Offline to Online Fine-tuning for Model-Based Robot Learning
Anonymous Authors
[Figure legends — Baselines: MOTO, Behavior Cloning, IQL, DreamerV2. Ablations: No Epistemic Uncertainty, No Policy Regularization, Full MOTO.]
Behaviors on the D4RL Franka Kitchen Environment. Behavior cloning manages to solve 1-2 tasks, but diverges quickly and cannot recover. The model-free IQL algorithm achieves some success, but also diverges and has limited capability to self-correct. The model-based DreamerV2 algorithm manages to solve the task occasionally, but learns unsafe reward-hacking behaviors. The MOTO algorithm and its ablations solve the task in an efficient and safe manner.
Results and Ablations
Kitchen
Top: MOTO outperforms both model-based offline and online learning algorithms, as well as model-free methods designed for offline, online, and hybrid RL settings. Right: We perform full ablations on the design choices involved in the MOTO algorithm. All components contribute positively to final performance and outperform naively training the baseline DreamerV2 algorithm from offline data.
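The two ablated components can be illustrated with a minimal sketch. This is not the paper's implementation; the function names, the ensemble-disagreement form of the uncertainty penalty, and the behavior-cloning form of the policy regularizer are assumptions chosen to make the two ablation axes concrete.

```python
import numpy as np

def penalized_reward(reward_preds, beta=1.0):
    """Epistemic-uncertainty penalty (sketch, hypothetical interface).

    reward_preds: array of shape (ensemble, batch) holding reward
    predictions from an ensemble of learned model heads. Disagreement
    (std across members) serves as an epistemic-uncertainty estimate
    and is subtracted from the mean prediction. Setting beta=0
    corresponds to the "No Epistemic Uncertainty" ablation.
    """
    mean_r = reward_preds.mean(axis=0)
    disagreement = reward_preds.std(axis=0)
    return mean_r - beta * disagreement

def regularized_actor_loss(imagined_value, logprob_dataset_actions, alpha=0.1):
    """Actor objective combining imagined-return maximization with a
    behavior-cloning term on dataset actions (sketch). Setting alpha=0
    corresponds to the "No Policy Regularization" ablation.
    """
    return -(imagined_value.mean() + alpha * logprob_dataset_actions.mean())
```

With full disagreement (`beta > 0`) the penalized reward keeps the policy within regions the offline model predicts confidently, while the regularizer keeps the actor close to dataset behavior early in fine-tuning.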
Meta World
Results on 10 Meta World tasks. All environments have image-based observations and sparse rewards. Prior datasets contain 10 trajectories from a scripted policy. Model-based methods perform well, with both MOTO and Dreamer reaching an 80% average success rate within 500 online episodes. Over the same period, model-free methods struggle: they receive supervision for both representation and policy learning only through a sparse reward signal and cannot generalize over the entire distribution of randomized object positions.
Model Generalization
Top: We evaluate the trained predictive model at the end of the offline pre-training phase on the "partial" task. Under the expert rollout, the model correctly predicts the microwave, kettle, bottom burner, and light switch in their solved configuration, even though the training dataset does not contain a configuration with those four objects.
Left: We evaluate the model's generalization capabilities at the end of the offline pre-training phase. It correctly predicts rewards of up to 4 on successful episodes in the "partial" task, even though the maximum reward in the dataset is 3. When performing rollouts in the learned model, the policy completes all four subtasks and reaches rewards of up to 4.