MOTO: Offline to Online Fine-tuning for Model-Based Robot Learning
Anonymous Authors
[Figure legends — Baselines: MOTO, Behavior Cloning, IQL, DreamerV2. Ablations: No Epistemic Uncertainty, No Policy Regularization, Full MOTO.]
Behaviors on the D4RL Franka Kitchen Environment. Behavior cloning manages to solve 1-2 tasks, but diverges quickly and cannot recover. The model-free IQL algorithm achieves some success, but also diverges and has limited capability to self-correct. The model-based DreamerV2 algorithm manages to solve the task occasionally, but learns unsafe reward-hacking behaviors. The MOTO algorithm and its ablations solve the task in an efficient and safe manner.
Results and Ablations
Kitchen
Top: MOTO outperforms both model-based offline and online learning algorithms, as well as model-free methods designed for offline, online, and hybrid RL settings. Right: We perform full ablations on the design choices involved in the MOTO algorithm. All components contribute positively to final performance and outperform naively training the baseline DreamerV2 algorithm from offline data.
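The two ablated components can be illustrated with a minimal sketch. This is not the paper's implementation; the function names, the ensemble-disagreement form of the uncertainty penalty, and the behavior-cloning form of the policy regularizer are assumptions chosen to make the two ablation axes concrete.

```python
import numpy as np

def penalized_reward(reward_preds, beta=1.0):
    """Epistemic-uncertainty penalty (sketch, hypothetical interface).

    reward_preds: array of shape (ensemble, batch) holding reward
    predictions from an ensemble of learned model heads. Disagreement
    (std across members) serves as an epistemic-uncertainty estimate
    and is subtracted from the mean prediction. Setting beta=0
    corresponds to the "No Epistemic Uncertainty" ablation.
    """
    mean_r = reward_preds.mean(axis=0)
    disagreement = reward_preds.std(axis=0)
    return mean_r - beta * disagreement

def regularized_actor_loss(imagined_value, logprob_dataset_actions, alpha=0.1):
    """Actor objective combining imagined-return maximization with a
    behavior-cloning term on dataset actions (sketch). Setting alpha=0
    corresponds to the "No Policy Regularization" ablation.
    """
    return -(imagined_value.mean() + alpha * logprob_dataset_actions.mean())
```

With full disagreement (`beta > 0`) the penalized reward keeps the policy within regions the offline model predicts confidently, while the regularizer keeps the actor close to dataset behavior early in fine-tuning.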
Meta World
Results on 10 Meta World tasks. All environments have image-based observations and sparse rewards. Prior datasets contain 10 trajectories from a scripted policy. Model-based methods perform well, with both MOTO and Dreamer reaching an 80% average success rate within 500 online episodes. Over the same period, model-free methods struggle: they receive supervision for both representation and policy learning only through a sparse reward signal and cannot generalize over the entire distribution of randomized object positions.
Model Generalization
Top: We evaluate the trained predictive model at the end of the offline pre-training phase on the "partial" task. Under the expert rollout, the model correctly predicts the microwave, kettle, bottom burner, and light switch in their solved configuration, even though the training dataset does not contain a configuration with those four objects.
Left: We evaluate the model's generalization capabilities at the end of the offline pre-training phase. It correctly predicts rewards of up to 4 on successful episodes in the "partial" task, even though the maximum reward in the dataset is 3. When performing rollouts in the learned model, the policy completes all four subtasks and reaches rewards of up to 4.