This page is supplementary material for the paper:
"Augmented World Models Facilitate Zero-Shot Dynamics Generalization From a Single Offline Environment".
Here we show examples of learned behaviors for our baseline, Model-Based Offline Policy Optimization (MOPO), as well as our method, Augmented World Models (AugWM).
All policies were trained on the same offline dataset, HalfCheetah Mixed from D4RL.
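For reference, the following is a minimal sketch of loading this dataset with the d4rl package; the dataset id "halfcheetah-medium-replay-v0" (D4RL's current name for the mixed replay buffer) and the use of qlearning_dataset are assumptions, not necessarily the exact pipeline used for these results.

```python
# Minimal sketch: load the HalfCheetah mixed (medium-replay) buffer from D4RL.
import gym
import d4rl  # importing d4rl registers its offline-RL environments with gym

env = gym.make("halfcheetah-medium-replay-v0")  # dataset id is an assumption
dataset = d4rl.qlearning_dataset(env)
# dataset is a dict of numpy arrays ("observations", "actions",
# "next_observations", "rewards", "terminals"): the transitions used
# for offline training of both MOPO and AugWM.
print(dataset["observations"].shape, dataset["actions"].shape)
```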
First, we show the policies in the original environment (damping = 1, mass = 1), where both perform well.
Reward = 2399
Reward = 2779
Now we consider the case where damping and mass are both lower. This could happen if, for example, a newer robot were made with lighter materials (a sketch of how such a shifted environment can be constructed follows below).
With a lighter body, the MOPO agent uses too much force and flips onto its front. However, the AugWM agent actually performs even better than in the original environment, exploiting the lighter torso.
Reward = 1159
Reward = 2864
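These shifted test environments can be obtained by rescaling the simulator's physical parameters. Below is a minimal sketch, assuming a mujoco_py-backed HalfCheetah-v2 and in-place edits to body_mass and dof_damping; the scale values shown are illustrative and may not match the exact settings used for these videos.

```python
# Minimal sketch: build a HalfCheetah variant with rescaled mass and damping.
import gym


def make_shifted_halfcheetah(mass_scale=0.75, damping_scale=0.75):
    """Return a HalfCheetah env whose body masses and joint damping are rescaled."""
    env = gym.make("HalfCheetah-v2")
    model = env.unwrapped.model            # mujoco_py model handle (assumed backend)
    model.body_mass[:] *= mass_scale       # lighter (or heavier) links
    model.dof_damping[:] *= damping_scale  # weaker (or stronger) joint damping
    return env
```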
Next, we consider a setting where damping and mass both differ from their values in the training environment.
Once again, the MOPO agent is unable to control itself and flips over, while AugWM is unfazed.
Reward = 823
Reward = 2462
Finally, what if the dynamics change during an episode?
There are many reasons why this could happen: for example, the robot could be damaged, or could move to a different room in the real world.
To test this, we consider a 1500-step rollout in which the first 500 steps take place in the base test environment (damping = 1, mass = 1), before the dynamics shift to mass = 0.75, damping = 0.5 (sketched in the code below).
As we see, the MOPO agent fails after the shift, whereas the AugWM agent is able to adjust its context, and thus alter its behavior, to make use of the lighter torso.
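To make the protocol concrete, here is a minimal sketch of such a mid-episode shift, assuming a generic policy(obs) -> action callable, mujoco_py-style model attributes as above, and a TimeLimit raised to at least 1500 steps; the AugWM context-update mechanism itself is described in the paper and is not reproduced here.

```python
def rollout_with_shift(env, policy, total_steps=1500, shift_step=500,
                       mass_scale=0.75, damping_scale=0.5):
    """Run one long episode whose dynamics change partway through.

    Assumes the env's TimeLimit allows at least `total_steps` steps and that
    the underlying MuJoCo model exposes body_mass / dof_damping (mujoco_py).
    """
    obs = env.reset()
    total_reward = 0.0
    for t in range(total_steps):
        if t == shift_step:
            # Shift the dynamics mid-episode: lighter body, weaker damping.
            model = env.unwrapped.model
            model.body_mass[:] *= mass_scale
            model.dof_damping[:] *= damping_scale
        action = policy(obs)
        obs, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```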